[jira] [Commented] (HIVEMALL-242) Drop support for Spark 2.1
[ https://issues.apache.org/jira/browse/HIVEMALL-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773985#comment-16773985 ]

Takeshi Yamamuro commented on HIVEMALL-242:
-------------------------------------------

Yea, I think so, too. I'll drop v2.1 and support v2.4.

> Drop support for Spark 2.1
> --------------------------
>
>              Key: HIVEMALL-242
>              URL: https://issues.apache.org/jira/browse/HIVEMALL-242
>          Project: Hivemall
>       Issue Type: Task
> Affects Versions: 0.5.2
>         Reporter: Makoto Yui
>         Assignee: Takeshi Yamamuro
>         Priority: Minor
>           Labels: spark
>          Fix For: 0.6.0
>
>
> We can drop Spark 2.1 support in Hivemall: Spark 2.1 requires Java 7, while Spark 2.2 or later requires Java 8 or later.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
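The Java-version constraint above is ultimately a build setting. As a hedged sketch (the exact plugin placement in Hivemall's pom.xml is an assumption), dropping Spark 2.1 would let the build pin Java 8 as the minimum via the standard `maven-compiler-plugin`:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- Spark 2.2+ requires Java 8, so 1.8 becomes the floor once 2.1 is dropped -->
    <source>1.8</source>
    <target>1.8</target>
  </configuration>
</plugin>
```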
[jira] [Updated] (HIVEMALL-225) Upgrade spark from v2.3.0 to v2.3.2
[ https://issues.apache.org/jira/browse/HIVEMALL-225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-225:
--------------------------------------
    Labels: spark  (was: )

> Upgrade spark from v2.3.0 to v2.3.2
> -----------------------------------
>
>          Key: HIVEMALL-225
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-225
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Priority: Trivial
>       Labels: spark
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-224) Support brickhouse functions for hivemall-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-224:
--------------------------------------
    Labels: spark  (was: )

> Support brickhouse functions for hivemall-spark
> -----------------------------------------------
>
>          Key: HIVEMALL-224
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-224
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Priority: Major
>       Labels: spark
>
>
> Add these functions in HivemallOps:
> https://github.com/apache/incubator-hivemall/commit/1e1b77ea4724c48f56dd1f3aa15027506558dee1

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (HIVEMALL-225) Upgrade spark from v2.3.0 to v2.3.2
Takeshi Yamamuro created HIVEMALL-225:
--------------------------------------

       Summary: Upgrade spark from v2.3.0 to v2.3.2
           Key: HIVEMALL-225
           URL: https://issues.apache.org/jira/browse/HIVEMALL-225
       Project: Hivemall
    Issue Type: Improvement
      Reporter: Takeshi Yamamuro

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (HIVEMALL-224) Support brickhouse functions for hivemall-spark
Takeshi Yamamuro created HIVEMALL-224:
--------------------------------------

       Summary: Support brickhouse functions for hivemall-spark
           Key: HIVEMALL-224
           URL: https://issues.apache.org/jira/browse/HIVEMALL-224
       Project: Hivemall
    Issue Type: Improvement
      Reporter: Takeshi Yamamuro

Add these functions in HivemallOps:
https://github.com/apache/incubator-hivemall/commit/1e1b77ea4724c48f56dd1f3aa15027506558dee1

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Description:

In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant data during model construction, both to simplify models and to shorten training times; e.g., scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this selection is a time-consuming process if the training data have a large number of columns and rows (for example, the number of columns frequently goes over 1,000 in real business use cases).

The objective of this ticket is to implement plan rewriting rules in Spark Catalyst that filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:

!fig1.png!

In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to it. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data in the data-extraction phase, both the data extraction itself and the following feature selection would run more efficiently. In the example above, we actually need not join the relation R3, because all of its columns are filtered out in feature selection. Also, the join processing should be faster if we could sample directly from the input data (R1 and R2). The optimized workflow is as follows:

!fig2.png!

This optimization might be achieved by rewriting the plan tree for data extraction as follows:

!fig3.png!

Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect data statistics for input data in data sources, the major task of this ticket is to add plan rewriting rules that filter meaningful training data before feature selection. As a pretty simple first step, Spark might get a rule that filters out columns with low variances (corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. The Spark optimizer could then push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2, 3]. I will make pull requests as sub-tasks and track relevant activities (research and other OSS functionalities) in this ticket.

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
[3] Z. Zhao, R. Christensen, F. Li, X. Hu, K. Yi, Random Sampling over Joins Revisited, Proceedings of SIGMOD, 2018.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
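The "simple first step" described in the ticket, dropping low-variance columns before any expensive downstream work, can be sketched outside of Catalyst. The following is a minimal, self-contained Python illustration of that variance-threshold filter (the analogue of scikit-learn's `VarianceThreshold`); the column names, sample values, and threshold are illustrative assumptions, not taken from the ticket:

```python
def column_variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def filter_low_variance_columns(table, threshold=0.0):
    """Keep only columns whose variance exceeds `threshold`.

    `table` maps a column name to its list of numeric values. In the
    ticket's setting, dropping such columns early (via an implicit
    Project node pushed down toward the data sources) is what lets the
    optimizer skip scanning, and possibly joining, relations that
    contribute nothing to feature selection.
    """
    return {
        name: values
        for name, values in table.items()
        if column_variance(values) > threshold
    }

training_data = {
    "v1": [1.0, 2.0, 3.0, 4.0],   # informative column, kept
    "v2": [5.0, 5.0, 5.0, 5.0],   # constant -> variance 0, filtered out
    "v3": [0.1, 0.2, 0.1, 0.2],   # informative column, kept
}
selected = filter_low_variance_columns(training_data, threshold=0.0)
print(sorted(selected))  # ['v1', 'v3']
```

In Catalyst terms, the surviving column names would become the projection list of the implicitly added `Project` node, which the existing pushdown rules can then move below joins and into the leaf relations.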
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment: fig3.png
                fig2.png

> Plan rewriting rules to filter meaningful training data before feature
> selections
> ----------------------------------------------------------------------
>
>          Key: HIVEMALL-181
>          URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>      Project: Hivemall
>   Issue Type: Improvement
>     Reporter: Takeshi Yamamuro
>     Assignee: Takeshi Yamamuro
>     Priority: Major
>       Labels: spark
>  Attachments: fig1.png, fig2.png, fig3.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment: (was: fig2.png)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig2.png fig3.png

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
> Attachments: fig1.png, fig2.png, fig3.png
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant features during model construction, both to simplify models and to shorten training times; e.g., scikit-learn provides several APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). However, this selection becomes a time-consuming process when the training data have a large number of columns and rows (for example, the number of columns frequently goes over 1,000 in real business use cases).
> The objective of this ticket is to implement plan rewriting rules in Spark Catalyst that filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:
> !fig1.png!
> In the example workflow above, one prepares the raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various data sources (HDFS, S3, JDBC, ...); then sampling and feature selection are applied to choose a relevant subset (the red box) of the raw data. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data during the data extraction phase, both the extraction itself and the following feature selection would run more efficiently.
> In the example above, we actually need not join the relation R3 because all of its columns are filtered out in feature selection. Also, the join processing should be faster if we could sample data directly from the input data (R1 and R2). The optimized workflow is as follows:
> !fig2.png!
> This optimization might be achieved by rewriting the plan tree for data extraction as follows:
> !fig3.png!
> Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect statistics on input data in data sources, the major task of this ticket is to add plan rewriting rules that filter meaningful training data before feature selection.
> As a pretty simple first task, Spark might have a rule that filters out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
> I will make pull requests as sub-tasks and track relevant activities (papers and other OSS functionalities) in this ticket.
>
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?, Proceedings of the VLDB Endowment, Volume 11, Issue 3, Pages 366-379, 2017.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
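The variance-threshold pruning and the resulting join avoidance described in the ticket can be sketched, outside of Spark, in a few lines of plain Python. The relation and column names below are illustrative, loosely mirroring R1/R2/R3 from the figures; this is only a sketch of the idea, not the proposed Catalyst rule itself:

```python
from statistics import pvariance

def low_variance_columns(table, threshold=0.0):
    """Names of columns whose population variance is <= threshold
    (the idea behind scikit-learn's VarianceThreshold)."""
    return {name for name, values in table.items()
            if pvariance(values) <= threshold}

def prunable_payload(table, key="k"):
    """True if every non-key column of the relation would be dropped,
    so the join against this relation can be skipped entirely."""
    payload = {c for c in table if c != key}
    return payload <= low_variance_columns(table)

# Illustrative input relations keyed on "k".
r1 = {"k": [1, 2, 3], "v1": [0.1, 0.9, 0.4]}
r2 = {"k": [1, 2, 3], "v2": [2.0, 8.0, 1.0]}
r3 = {"k": [1, 2, 3], "v3": [7.0, 7.0, 7.0],   # every payload column of R3
      "v4": [3.0, 3.0, 3.0]}                   # is constant, hence prunable

joins_needed = [name for name, t in [("R1", r1), ("R2", r2), ("R3", r3)]
                if not prunable_payload(t)]
print(joins_needed)  # -> ['R1', 'R2']; R3 never needs to be joined
```

In Catalyst terms, the first step corresponds to implicitly adding a `Project` over the user plan, and the second to letting column pruning eliminate the now-unreferenced join input.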
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig3.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig2.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig1.png)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: (updated; full text quoted in the first notification above)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig3.png
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: (updated; full text quoted in the first notification above)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig2.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Attachments: fig1.png, fig2.png, fig3.png > > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
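[Editor's illustration] The low-variance rule described above would live inside Spark Catalyst as a Scala optimizer rule; the sketch below only illustrates the column-pruning logic itself (the analogue of scikit-learn's `VarianceThreshold`) in plain Python, with hypothetical names:

```python
# Illustrative sketch only: the ticket proposes doing this inside Spark
# Catalyst as a plan rewriting rule; here the same column-pruning logic
# (the `VarianceThreshold` analogue) is shown in plain Python.

def variance(values):
    """Population variance of a numeric column."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def prune_low_variance_columns(rows, threshold=0.0):
    """Return the column names whose variance exceeds `threshold`.

    `rows` is a list of dicts (column name -> numeric value). A Catalyst
    rule would instead add a `Project` node keeping only these columns,
    which the optimizer can then push down toward `LogicalRelation`.
    """
    if not rows:
        return []
    kept = []
    for col in rows[0].keys():
        if variance([r[col] for r in rows]) > threshold:
            kept.append(col)
    return kept

rows = [
    {"v1": 1.0, "v2": 5.0, "v3": 0.0},
    {"v1": 2.0, "v2": 5.0, "v3": 0.0},
    {"v1": 3.0, "v2": 5.0, "v3": 0.0},
]
# v2 and v3 are constant, so only v1 survives the filter.
print(prune_low_variance_columns(rows))  # -> ['v1']
```

If such pruning fires before the join in the workflow above, a relation like R3 whose columns are all pruned never needs to be scanned or joined at all.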
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Attachments: fig1.png, fig2.png, fig3.png > > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques for choosing a subset of relevant data in model construction, both to simplify models and to shorten training times; e.g., scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this selection is an excessively time-consuming process if training data have a large number of columns and rows (for example, the number of columns can frequently go over 1,000 in real business use cases). An objective of this ticket is to implement plan rewriting rules in Spark Catalyst to filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training; !fig1.png! In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) in various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to them. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data in the data-extraction phase, we could make both the data extraction itself and the following feature selection more efficient. In the example above, we actually need not join the relation R3 because all the columns in that relation are filtered out in feature selection. Also, the join processing should be faster if we could sample data directly from the input data (R1 and R2). 
This optimized workflow is as follows; This optimization might be achieved by rewriting a plan tree for data extraction as follows; Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework to collect data statistics for input data in data sources, the major tasks of this ticket are to add plan rewriting rules to filter meaningful training data before feature selections. As a pretty simple task, Spark might have a rule to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). 
An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a pretty simple example, Spark might be able to filter out columns with low variances (This process is corresponding to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node in the top of an user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: (was: fig1.png) > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction, both to > simplify models and to shorten training times; e.g., scikit-learn has > some APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this > selection is an excessively time-consuming process if training data have a large number > of columns and rows (for example, the number of columns can frequently go > over 1,000 in real business use cases). > An objective of this ticket is to implement plan rewriting rules in Spark > Catalyst to filter meaningful training data before feature selection. We > assume the workflow below, from data extraction to model training; > !fig1.png! > In the example workflow above, one prepares raw training data, R(v1, v2, v3, > v4) in the figure, by joining and projecting input data (R1, R2, and R3) in > various datasources (HDFS, S3, JDBC, ...); then, to choose a relevant subset > (the red box) of the raw data, sampling and feature selection are applied to them. > In real business use cases, it sometimes happens that raw training data have > many meaningless columns for historical reasons (e.g., redundant > schema designs). So, if we could filter out these meaningless data in the > data-extraction phase, we could make both the data extraction > itself and the following feature selection more efficient. 
In the example above, we actually > need not join the relation R3 because all the columns in that relation are > filtered out in feature selection. Also, the join processing should be faster > if we could sample data directly from the input data (R1 and R2). This > optimized workflow is as follows; > > This optimization might be achieved by rewriting a plan tree for data > extraction as follows; > > Since Spark already has a pluggable optimizer interface > (extendedOperatorOptimizationRules) and a framework to collect data > statistics for input data in data sources, the major tasks of this ticket are > to add plan rewriting rules to filter meaningful training data before feature > selections. > As a pretty simple task, Spark might have a rule to filter out columns with > low variances (this process corresponds to `VarianceThreshold` in > scikit-learn) by implicitly adding a `Project` node on top of a user > plan. Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Attachment: fig1.png > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. 
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter meaningful training data before feature selections (was: Plan rewrting rules to filter meaningful training data before feature selections) > Plan rewriting rules to filter meaningful training data before feature > selections > - > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-198) Fix obsolete documentations for hivemall-on-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-198: -- Summary: Fix obsolete documentations for hivemall-on-spark (was: Fix documentations for hivemall-on-spark) > Fix obsolete documentations for hivemall-on-spark > - > > Key: HIVEMALL-198 > URL: https://issues.apache.org/jira/browse/HIVEMALL-198 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Some of the documentation for hivemall-on-spark is obsolete, so we should fix > it before the next release. > https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-198) Fix obsolete documentations for hivemall-on-spark
[ https://issues.apache.org/jira/browse/HIVEMALL-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-198: -- Labels: spark (was: ) > Fix obsolete documentations for hivemall-on-spark > - > > Key: HIVEMALL-198 > URL: https://issues.apache.org/jira/browse/HIVEMALL-198 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Some of the documentation for hivemall-on-spark is obsolete, so we should fix > it before the next release. > https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-198) Fix documentations for hivemall-on-spark
Takeshi Yamamuro created HIVEMALL-198: - Summary: Fix documentations for hivemall-on-spark Key: HIVEMALL-198 URL: https://issues.apache.org/jira/browse/HIVEMALL-198 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Some of the documentation for hivemall-on-spark is obsolete, so we should fix it before the next release. https://hivemall.incubator.apache.org/userguide/spark/getting_started/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVEMALL-181) Plan rewrting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425080#comment-16425080 ] Takeshi Yamamuro commented on HIVEMALL-181: --- Great work! Next time, please give me the details of the work offline (thanks for the link; I'll check it later myself). Anyway, in this ticket, I'd like to focus on the integration of the Spark optimizer and some of the techniques for feature selection. > Plan rewrting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. 
> References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-185) Add an optimizer rule to push down a Sample plan node into fact tables
Takeshi Yamamuro created HIVEMALL-185: - Summary: Add an optimizer rule to push down a Sample plan node into fact tables Key: HIVEMALL-185 URL: https://issues.apache.org/jira/browse/HIVEMALL-185 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Sampling is a common technique to extract part of the data in joined relations (fact tables and dimension tables) for training data. The optimizer in Spark cannot push down a Sample plan node into larger fact tables because this node is non-deterministic. But, by using RI (referential integrity) constraints, we could push this node down into fact tables in some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
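[Editor's illustration] A hedged toy sketch of why the pushdown above can be safe: under a key-foreign-key join with referential integrity, every fact row matches exactly one dimension row, so a deterministic per-row sampling predicate commutes with the join. Plain Python with made-up data; a real Sample node is non-deterministic, which is exactly the obstacle the ticket mentions.

```python
# Toy illustration (plain Python, not Catalyst): with a key-foreign-key
# join and referential integrity, every fact row matches exactly one
# dimension row, so a deterministic per-row sampling predicate on the
# fact table commutes with the join.

def keep(row_id, rate=0.5):
    # Deterministic stand-in for Bernoulli sampling (real Sample nodes
    # are non-deterministic, which is what blocks the pushdown today).
    return (row_id * 2654435761) % 100 < rate * 100

fact = [  # (row_id, fk, value)
    (1, "a", 10), (2, "b", 20), (3, "a", 30), (4, "c", 40),
]
dim = {"a": "dim-a", "b": "dim-b", "c": "dim-c"}  # pk -> payload

# Sample after the join (what the current plan effectively does).
joined = [(rid, fk, v, dim[fk]) for rid, fk, v in fact]
sample_after = [row for row in joined if keep(row[0])]

# Push the sample below the join (the proposed rewrite): only the
# surviving fact rows are ever joined against the dimension table.
sample_before = [(rid, fk, v, dim[fk]) for rid, fk, v in fact if keep(rid)]

assert sample_after == sample_before
print(len(sample_after), "rows survive either way")
```

The win is that `sample_before` joins far fewer fact rows, which matters when the fact table dominates the join cost.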
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Labels: spark (was: ) > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful for filtering out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring to a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. > In Proceedings of the Royal Society of London A: Mathematical, Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information > Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
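[Editor's illustration] The ticket cites a nearest-neighbor MI estimator [1] for continuous data; the sketch below is only a simple plug-in (histogram) estimator for two discrete columns, to illustrate the quantity that a threshold such as the proposed `spark.sql.optimizer.featureSelection.mutualInfoThreshold` would gate on:

```python
# Simple plug-in (histogram) MI estimator for two discrete columns.
# This is NOT the nearest-neighbor estimator referenced in [1]; it only
# illustrates the mutual-information quantity itself.
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two equal-length columns."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Perfectly dependent columns: MI equals the entropy of the column, so a
# duplicate column would be flagged as "similar" and filtered out.
a = [0, 0, 1, 1]
print(mutual_information(a, a))  # log(2) ~= 0.693

# Independent columns: MI is 0, so the pair shares no signal.
b = [0, 1, 0, 1]
print(mutual_information(a, b))  # 0.0
```

An ANALYZE-time implementation would compute such pairwise scores once per relation and let the optimizer consult them, rather than recomputing MI per query.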
[jira] [Updated] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
[ https://issues.apache.org/jira/browse/HIVEMALL-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-184: -- Description: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful for filtering out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. was: Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful to filter out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. 
[1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. > Add an optimizer rule to filter out columns by using Mutual Information > --- > > Key: HIVEMALL-184 > URL: https://issues.apache.org/jira/browse/HIVEMALL-184 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > Mutual Information (MI) is an indicator to find and quantify dependencies > between variables, so the indicator is useful for filtering out columns in > feature selection. Nearest-neighbor distances are frequently used to estimate > MI [1], so we could use the distances to compute MI between columns for each > relation when running an ANALYZE command. Then, we could filter out "similar" > columns in the optimizer phase by referring to a new threshold (e.g. > `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). > In another story, we need to consider a light-weight way to update MI when > re-running an ANALYZE command. A recent study [2] proposed a sophisticated > technique to compute MI for dynamic data. > [1] Dafydd Evans, A computationally efficient estimator for mutual > information. In Proceedings of the Royal Society of London A: Mathematical, > Physical > and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. > [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual > Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-184) Add an optimizer rule to filter out columns by using Mutual Information
Takeshi Yamamuro created HIVEMALL-184: - Summary: Add an optimizer rule to filter out columns by using Mutual Information Key: HIVEMALL-184 URL: https://issues.apache.org/jira/browse/HIVEMALL-184 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Mutual Information (MI) is an indicator to find and quantify dependencies between variables, so the indicator is useful for filtering out columns in feature selection. Nearest-neighbor distances are frequently used to estimate MI [1], so we could use the distances to compute MI between columns for each relation when running an ANALYZE command. Then, we could filter out "similar" columns in the optimizer phase by referring to a new threshold (e.g. `spark.sql.optimizer.featureSelection.mutualInfoThreshold`). In another story, we need to consider a light-weight way to update MI when re-running an ANALYZE command. A recent study [2] proposed a sophisticated technique to compute MI for dynamic data. [1] Dafydd Evans, A computationally efficient estimator for mutual information. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, Vol. 464. The Royal Society, 1203–1215, 2008. [2] Michael Vollmer et al., On Complexity and Efficiency of Mutual Information Estimation on Static and Dynamic Data, Proceedings of EDBT, 2018. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewrting rules to filter meaningful training data before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewrting rules to filter meaningful training data before feature selections (was: Plan rewrting rules to filter meaningful training data before future selections) > Plan rewrting rules to filter meaningful training data before feature > selections > > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques for choosing a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is an excessively time-consuming process if training data have a large number > of columns (the number can frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node on top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
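The low-variance pruning the ticket describes can be illustrated outside Spark. Below is a minimal Python sketch of the idea: compute per-column variances and keep only columns above a threshold, which is what scikit-learn's `VarianceThreshold` does and what an implicitly inserted `Project` node would express in a Spark plan. Function names and the threshold value are illustrative, not Hivemall or Spark APIs.

```python
# Sketch of the low-variance column filter described in HIVEMALL-181
# (analogous to scikit-learn's VarianceThreshold). Names are illustrative.

def column_variances(rows):
    """Population variance of each column in a row-major table."""
    n = len(rows)
    ncols = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(ncols)]
    return [sum((r[c] - means[c]) ** 2 for r in rows) / n
            for c in range(ncols)]

def prune_low_variance(rows, threshold=0.0):
    """Keep only columns whose variance exceeds `threshold`.

    This mimics the `Project` node the optimizer rule would insert on top
    of the user plan: downstream feature selection sees fewer columns.
    """
    variances = column_variances(rows)
    keep = [c for c, v in enumerate(variances) if v > threshold]
    return keep, [[r[c] for c in keep] for r in rows]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 1.0],
]
# Column 1 is constant (zero variance), so it is dropped.
keep, pruned = prune_low_variance(rows, threshold=0.05)
```

In Spark, the payoff comes from the optimizer pushing the resulting projection down to the scan (e.g., `LogicalRelation`), so the pruned columns are never materialized at all.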
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a pretty simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a pretty simple > example, Spark might be able to filter out columns with low variances (this > process corresponds to `VarianceThreshold` in scikit-learn) by > implicitly adding a `Project` node at the top of a user plan. Then, the > Spark optimizer might push down this `Project` node into leaf nodes (e.g., > `LogicalRelation`) and the plan execution could be significantly faster. > Moreover, more sophisticated techniques
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter meaningful training data before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > meaningful training data before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is one of the useful techniques to choose a subset of relevant data in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. 
scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter meaningful training data before future selections (was: Plan rewriting rules to filter out meaningful training data before future selections) > Plan rewriting rules to filter meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is one of the useful > techniques to choose a subset of relevant data in model construction for > simplification of models and shorter training times. scikit-learn has some > APIs for feature selection > ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this > selection is too time-consuming if training data have a large number > of columns (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf > nodes (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe > to avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningful training data before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Summary: Plan rewriting rules to filter out meaningful training data before future selections (was: Plan rewriting rules to filter out meaningless columns before future selections) > Plan rewriting rules to filter out meaningful training data before future > selections > --- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. 
Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. > [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to > avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421824#comment-16421824 ] Takeshi Yamamuro commented on HIVEMALL-183: --- Spark currently does not support FK constraints, so we need to track a Spark Jira ticket to support RIC functionalities in https://issues.apache.org/jira/browse/SPARK-19842 > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in the > paper [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
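The join-pruning idea in Kumar et al. [1] can be caricatured as a tuple-ratio heuristic: when the training table has many more rows than the dimension table its foreign key references, the key alone tends to carry the dimension table's signal, so the join can be skipped. The sketch below is a loose, hypothetical illustration of that decision shape, not the paper's exact decision rule; the function name and threshold are illustrative only.

```python
# Hypothetical sketch of a join-avoidance check, loosely inspired by the
# tuple-ratio heuristic in Kumar et al. (SIGMOD 2016). The threshold is
# an assumed placeholder, not the paper's derived bound.

def can_avoid_join(n_fact_rows, n_dim_rows, tuple_ratio_threshold=20.0):
    """Return True if joining the dimension table is unlikely to help accuracy.

    n_fact_rows: rows in the table holding the foreign key (training data).
    n_dim_rows:  rows in the referenced dimension table.
    """
    return (n_fact_rows / n_dim_rows) >= tuple_ratio_threshold

# e.g. 1,000,000 training rows referencing a 10,000-row dimension table:
safe = can_avoid_join(1_000_000, 10_000)
```

An optimizer rule built on such a check would prune the join from the logical plan and keep only the foreign-key column as a feature; the referential-integrity metadata it needs is what SPARK-19842 tracks.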
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Description: An objective of this ticket is to implement the proposed technique in a paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. was: An objective of this ticket is to implement the proposed technique in a paper below; without significantly reducing ML accuracy > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in a paper > [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Description: An objective of this ticket is to implement the proposed technique in the paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. was: An objective of this ticket to implement the proposed technique in a paper [1] below; [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > An objective of this ticket is to implement the proposed technique in the > paper [1] below; > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
[ https://issues.apache.org/jira/browse/HIVEMALL-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-183: -- Labels: spark (was: ) > Add an optimizer rule to prune joins without significantly reducing ML > accuracy > > > Key: HIVEMALL-183 > URL: https://issues.apache.org/jira/browse/HIVEMALL-183 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-183) Add an optimizer rule to prune joins without significantly reducing ML accuracy
Takeshi Yamamuro created HIVEMALL-183: - Summary: Add an optimizer rule to prune joins without significantly reducing ML accuracy Key: HIVEMALL-183 URL: https://issues.apache.org/jira/browse/HIVEMALL-183 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-182) Add an optimizer rule to filter out columns with low variances
[ https://issues.apache.org/jira/browse/HIVEMALL-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-182: -- Labels: spark (was: ) > Add an optimizer rule to filter out columns with low variances > -- > > Key: HIVEMALL-182 > URL: https://issues.apache.org/jira/browse/HIVEMALL-182 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVEMALL-182) Add an optimizer rule to filter out columns with low variances
[ https://issues.apache.org/jira/browse/HIVEMALL-182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro reassigned HIVEMALL-182: - Assignee: Takeshi Yamamuro > Add an optimizer rule to filter out columns with low variances > -- > > Key: HIVEMALL-182 > URL: https://issues.apache.org/jira/browse/HIVEMALL-182 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningless columns before future selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. 
scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. As a simple example, Spark > might be able to filter out columns with low variances (this process > corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a > `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and the plan execution could be significantly > faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before future selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Labels: spark (was: ) > Plan rewriting rules to filter out meaningless columns before future selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features > in model construction for simplification of models and shorter training times. > scikit-learn has some APIs for feature selection > (http://scikit-learn.org/stable/modules/feature_selection.html), but > this selection is too time-consuming if training data have a large > number of columns > (the number could frequently go over 1,000 in business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. > As a simple example, Spark might be able to filter out columns with low > variances (this process corresponds to `VarianceThreshold` in > scikit-learn) > by implicitly adding a `Project` node at the top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and > the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers > and other OSS functionalities) > in this ticket to track them. > References: > [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join > or Not to Join?: Thinking Twice about Joins before Feature Selection, > Proceedings of SIGMOD, 2016. 
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to > avoid when learning high-capacity classifiers?, Proceedings of the VLDB > Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
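The HIVEMALL-181 description above proposes pruning low-variance columns before the expensive feature-selection step, in the spirit of scikit-learn's `VarianceThreshold`. The following pure-Python sketch illustrates that filtering rule on a list-of-dicts stand-in for a table; the function names and data layout are hypothetical, not Hivemall's or Spark's API.

```python
# Illustration of the pruning idea in HIVEMALL-181 (hypothetical helpers):
# drop columns whose variance is at or below a threshold before running a
# more expensive feature-selection step, like scikit-learn's VarianceThreshold.

def column_variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def prune_low_variance(rows, threshold=0.0):
    """Return the column names whose variance exceeds `threshold`.

    `rows` is a list of dicts sharing the same keys, standing in for the
    columns a `Project` node would keep in a Spark plan. In practice a small
    positive threshold is safer against floating-point noise.
    """
    kept = []
    for col in rows[0].keys():
        if column_variance([r[col] for r in rows]) > threshold:
            kept.append(col)
    return kept

rows = [
    {"f1": 1.0, "f2": 5.0, "f3": 2.0},
    {"f1": 1.0, "f2": 3.0, "f3": 2.0},
    {"f1": 1.0, "f2": 4.0, "f3": 2.0},
]
print(prune_low_variance(rows))  # f1 and f3 are constant, so only f2 survives
```

In the Spark setting sketched in the ticket, the surviving column list would become an implicit `Project` node that the optimizer can then push down toward `LogicalRelation` leaves.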
[jira] [Updated] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before feature selections
[ https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-181: -- Description: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. was: In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. 
scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. > Plan rewriting rules to filter out meaningless columns before feature selections > -- > > Key: HIVEMALL-181 > URL: https://issues.apache.org/jira/browse/HIVEMALL-181 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > > In machine learning and statistics, feature selection is a useful technique to > choose a subset of relevant features in model construction for simplification > of models and shorter training times. 
scikit-learn has some APIs for feature > selection (http://scikit-learn.org/stable/modules/feature_selection.html), > but this selection is too time-consuming a process if training data have a > large number of columns (the number could frequently go over 1,000 in > business use cases). > An objective of this ticket is to add new optimizer rules in Spark to filter > out meaningless columns before feature selection. > As a simple example, Spark might be able to filter out columns with low > variances (This process corresponds to `VarianceThreshold` in > scikit-learn) > by implicitly adding a `Project` node on top of a user plan. > Then, the Spark optimizer might push down this `Project` node into leaf nodes > (e.g., `LogicalRelation`) and > the plan execution could be significantly faster. > Moreover, more sophisticated techniques have been proposed in [1, 2]. > I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them.
[jira] [Created] (HIVEMALL-181) Plan rewriting rules to filter out meaningless columns before feature selections
Takeshi Yamamuro created HIVEMALL-181: - Summary: Plan rewriting rules to filter out meaningless columns before feature selections Key: HIVEMALL-181 URL: https://issues.apache.org/jira/browse/HIVEMALL-181 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro In machine learning and statistics, feature selection is a useful technique to choose a subset of relevant features in model construction for simplification of models and shorter training times. scikit-learn has some APIs for feature selection (http://scikit-learn.org/stable/modules/feature_selection.html), but this selection is too time-consuming a process if training data have a large number of columns (the number could frequently go over 1,000 in business use cases). An objective of this ticket is to add new optimizer rules in Spark to filter out meaningless columns before feature selection. As a simple example, Spark might be able to filter out columns with low variances (This process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on top of a user plan. Then, the Spark optimizer might push down this `Project` node into leaf nodes (e.g., `LogicalRelation`) and the plan execution could be significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2]. I will make pull requests as sub-tasks and put relevant activities (papers and other OSS functionalities) in this ticket to track them. References: [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016. [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-180: -- Issue Type: Improvement (was: Sub-task) Parent: (was: HIVEMALL-152) > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-180: -- Labels: spark (was: ) > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVEMALL-180) Drop the Spark-2.0 support
[ https://issues.apache.org/jira/browse/HIVEMALL-180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro reassigned HIVEMALL-180: - Assignee: Takeshi Yamamuro > Drop the Spark-2.0 support > -- > > Key: HIVEMALL-180 > URL: https://issues.apache.org/jira/browse/HIVEMALL-180 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: spark > Fix For: 0.5.2 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-180) Drop the Spark-2.0 support
Takeshi Yamamuro created HIVEMALL-180: - Summary: Drop the Spark-2.0 support Key: HIVEMALL-180 URL: https://issues.apache.org/jira/browse/HIVEMALL-180 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Fix For: 0.5.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVEMALL-179) Support Spark 2.3
[ https://issues.apache.org/jira/browse/HIVEMALL-179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-179: -- Labels: spark (was: ) > Support Spark 2.3 > - > > Key: HIVEMALL-179 > URL: https://issues.apache.org/jira/browse/HIVEMALL-179 > Project: Hivemall > Issue Type: Improvement >Reporter: Makoto Yui >Assignee: Takeshi Yamamuro >Priority: Blocker > Labels: spark > Fix For: 0.5.2 > > > Support Spark 2.3 (while deprecating old spark support?) > https://spark.apache.org/releases/spark-release-2-3-0.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVEMALL-136) Support train_classifier and train_regressor for Spark
Takeshi Yamamuro created HIVEMALL-136: - Summary: Support train_classifier and train_regressor for Spark Key: HIVEMALL-136 URL: https://issues.apache.org/jira/browse/HIVEMALL-136 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro This ticket is to support GeneralRegressorUDTF and GeneralClassifierUDTF. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVEMALL-134) Create Standalone API for Scala/Java
[ https://issues.apache.org/jira/browse/HIVEMALL-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086880#comment-16086880 ] Takeshi Yamamuro commented on HIVEMALL-134: --- You mean we run HiveUDF w/o hive? > Create Standalone API for Scala/Java > --- > > Key: HIVEMALL-134 > URL: https://issues.apache.org/jira/browse/HIVEMALL-134 > Project: Hivemall > Issue Type: Wish >Reporter: Makoto Yui > > A standalone API of Hivemall would be useful for a standalone application with > enough local memory. > A good example of a standalone API is Smile's Scala API. > https://haifengl.github.io/smile/ > https://github.com/haifengl/smile/tree/master/scala/src/main/scala/smile -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro closed HIVEMALL-116. - Resolution: Fixed > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, we need to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-133) Support spark-v2.2 in the hivemall-spark module
Takeshi Yamamuro created HIVEMALL-133: - Summary: Support spark-v2.2 in the hivemall-spark module Key: HIVEMALL-133 URL: https://issues.apache.org/jira/browse/HIVEMALL-133 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Since Spark-v2.2 is available now, we should support it in the /spark module. https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html?utm_campaign=Engineering%20Blog_content=57373960_medium=social_source=twitter -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-133) Support spark-v2.2 in the hivemall-spark module
[ https://issues.apache.org/jira/browse/HIVEMALL-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-133: -- Labels: spark (was: ) > Support spark-v2.2 in the hivemall-spark module > > > Key: HIVEMALL-133 > URL: https://issues.apache.org/jira/browse/HIVEMALL-133 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > Since Spark-v2.2 is available now, we should support it in the /spark module. > https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html?utm_campaign=Engineering%20Blog_content=57373960_medium=social_source=twitter -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-129) Support wrapper implementation for python in pyspark
Takeshi Yamamuro created HIVEMALL-129: - Summary: Support wrapper implementation for python in pyspark Key: HIVEMALL-129 URL: https://issues.apache.org/jira/browse/HIVEMALL-129 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Priority: Minor The master only supports a wrapper implementation for Scala, but most users use pyspark in Spark. So, it might help to implement the wrapper for python. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-129) Support wrapper implementation for python in pyspark
[ https://issues.apache.org/jira/browse/HIVEMALL-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-129: -- Labels: spark (was: ) > Support wrapper implementation for python in pyspark > > > Key: HIVEMALL-129 > URL: https://issues.apache.org/jira/browse/HIVEMALL-129 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Minor > Labels: spark > > The master only supports a wrapper implementation for Scala, but most users use > pyspark in Spark. So, it might help to implement the wrapper for python. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-117) Add hivemall in SparkPackages
[ https://issues.apache.org/jira/browse/HIVEMALL-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-117: -- Labels: spark (was: ) > Add hivemall in SparkPackages > - > > Key: HIVEMALL-117 > URL: https://issues.apache.org/jira/browse/HIVEMALL-117 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: spark > > We might add hivemall in SparkPackages after it is released in Apache: > https://spark-packages.org/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-117) Add hivemall in SparkPackages
Takeshi Yamamuro created HIVEMALL-117: - Summary: Add hivemall in SparkPackages Key: HIVEMALL-117 URL: https://issues.apache.org/jira/browse/HIVEMALL-117 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro We might add hivemall in SparkPackages after it is released in Apache: https://spark-packages.org/ -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-116: -- Description: We currently have documentation about DataFrame in Spark. So, we need to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html was: We currently have documentation about DataFrame in Spark. So, it helps to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, we need to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-116) Add documentation about SQL in Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-116: -- Labels: documentation spark (was: ) > Add documentation about SQL in Spark > > > Key: HIVEMALL-116 > URL: https://issues.apache.org/jira/browse/HIVEMALL-116 > Project: Hivemall > Issue Type: Sub-task >Reporter: Takeshi Yamamuro > Labels: documentation, spark > > We currently have documentation about DataFrame in Spark. So, it helps to add > documentation for SQL in Spark. > https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html > https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVEMALL-116) Add documentation about SQL in Spark
Takeshi Yamamuro created HIVEMALL-116: - Summary: Add documentation about SQL in Spark Key: HIVEMALL-116 URL: https://issues.apache.org/jira/browse/HIVEMALL-116 Project: Hivemall Issue Type: Sub-task Reporter: Takeshi Yamamuro We currently have documentation about DataFrame in Spark. So, it helps to add documentation for SQL in Spark. https://hivemall.incubator.apache.org/userguide/spark/binaryclass/a9a_df.html https://hivemall.incubator.apache.org/userguide/spark/regression/e2006_df.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVEMALL-104) Support deterministic sampling in HivemallOps
[ https://issues.apache.org/jira/browse/HIVEMALL-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-104: -- Labels: spark (was: ) > Support deterministic sampling in HivemallOps > - > > Key: HIVEMALL-104 > URL: https://issues.apache.org/jira/browse/HIVEMALL-104 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > This feature seems to be beneficial in terms of plan optimization: > https://issues.apache.org/jira/browse/SPARK-14166 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-104) Support deterministic sampling in HivemallOps
Takeshi Yamamuro created HIVEMALL-104: - Summary: Support deterministic sampling in HivemallOps Key: HIVEMALL-104 URL: https://issues.apache.org/jira/browse/HIVEMALL-104 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro This feature seems to be beneficial in terms of plan optimization: https://issues.apache.org/jira/browse/SPARK-14166 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-103) Upgrading spark-v2.1.0 to v2.1.1
Takeshi Yamamuro created HIVEMALL-103: - Summary: Upgrading spark-v2.1.0 to v2.1.1 Key: HIVEMALL-103 URL: https://issues.apache.org/jira/browse/HIVEMALL-103 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro v2.1.1 has been released: https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-103) Upgrading spark-v2.1.0 to v2.1.1
[ https://issues.apache.org/jira/browse/HIVEMALL-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-103: -- Labels: spark (was: ) > Upgrading spark-v2.1.0 to v2.1.1 > > > Key: HIVEMALL-103 > URL: https://issues.apache.org/jira/browse/HIVEMALL-103 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > v2.1.1 has been released: > https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-102) Support upcoming Spark v2.2.0
[ https://issues.apache.org/jira/browse/HIVEMALL-102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-102: -- Labels: spark (was: ) > Support upcoming Spark v2.2.0 > - > > Key: HIVEMALL-102 > URL: https://issues.apache.org/jira/browse/HIVEMALL-102 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > The Spark community is currently voting on a v2.2 release: > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-102) Support upcoming Spark v2.2.0
Takeshi Yamamuro created HIVEMALL-102: - Summary: Support upcoming Spark v2.2.0 Key: HIVEMALL-102 URL: https://issues.apache.org/jira/browse/HIVEMALL-102 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro The Spark community is currently voting on a v2.2 release: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-99) Cross-compilation of XGBoost using Docker
[ https://issues.apache.org/jira/browse/HIVEMALL-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986583#comment-15986583 ] Takeshi Yamamuro commented on HIVEMALL-99: -- go for it! > Cross-compilation of XGBoost using Docker > - > > Key: HIVEMALL-99 > URL: https://issues.apache.org/jira/browse/HIVEMALL-99 > Project: Hivemall > Issue Type: Improvement >Reporter: Makoto Yui >Assignee: ITO Ryuichi >Priority: Minor > > hivemall-xgboost jar should include native libraries such as x86-64 and else. > (cc: [~maropu], [~amaya]) > We can use dockcross [1] following the way in Xerial [2]. > [1] https://github.com/dockcross/dockcross > [2] https://github.com/xerial/snappy-java/tree/master/docker -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (HIVEMALL-44) Support Top-K joins for DataFrame/Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved HIVEMALL-44. -- Resolution: Fixed Assignee: Takeshi Yamamuro > Support Top-K joins for DataFrame/Spark > --- > > Key: HIVEMALL-44 > URL: https://issues.apache.org/jira/browse/HIVEMALL-44 > Project: Hivemall > Issue Type: New Feature >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Blocker > Labels: Spark > > In Hivemall, `each_top_k` is useful for practical use cases. On the other > hand, there are some cases where we need to join tables and then compute Top-K > entries. You know we can compute this query by using regular joins + > `each_top_k`. However, we have space to improve this query more; that is, we > compute Top-K entries while processing joins. This optimization avoids a > substantial amount of I/O for joins. > An example query is as follows; > {code} > val inputDf = Seq( > ("user1", 1, 0.3, 0.5), > ("user2", 2, 0.1, 0.1), > ("user3", 3, 0.8, 0.0), > ("user4", 1, 0.9, 0.9), > ("user5", 3, 0.7, 0.2), > ("user6", 1, 0.5, 0.4), > ("user7", 2, 0.6, 0.8) > ).toDF("userId", "group", "x", "y") > val masterDf = Seq( > (1, "pos1-1", 0.5, 0.1), > (1, "pos1-2", 0.0, 0.0), > (1, "pos1-3", 0.3, 0.3), > (2, "pos2-3", 0.1, 0.3), > (2, "pos2-3", 0.8, 0.8), > (3, "pos3-1", 0.1, 0.7), > (3, "pos3-1", 0.7, 0.1), > (3, "pos3-1", 0.9, 0.0), > (3, "pos3-1", 0.1, 0.3) > ).toDF("group", "position", "x", "y") > // Compute top-1 rows for each group > val distance = sqrt( > pow(inputDf("x") - masterDf("x"), lit(2.0)) + > pow(inputDf("y") - masterDf("y"), lit(2.0)) > ) > val top1Df = inputDf.join_top_k( > lit(1), masterDf, inputDf("group") === masterDf("group"), > distance.as("score") > ) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
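The HIVEMALL-44 ticket above computes Top-K entries while processing the join instead of joining first and ranking afterwards. The pure-Python sketch below shows the core idea with a hash join whose probe side keeps only a size-K heap of the best-scoring matches per row; it follows the data shape of the ticket's example but is an illustration of the technique, not Hivemall's actual `ShuffledHashJoinTopKExec`.

```python
import heapq
from math import sqrt

def top_k_join(input_rows, master_rows, k):
    """Join on `group` and keep only the k closest master rows per input row."""
    # Build a hash table on the join key, as a shuffled hash join would.
    table = {}
    for m in master_rows:
        table.setdefault(m["group"], []).append(m)
    results = []
    for row in input_rows:
        # Min-heap of (-distance, position): the root is the worst match kept,
        # so the heap always holds the k smallest distances seen so far.
        heap = []
        for m in table.get(row["group"], []):
            d = sqrt((row["x"] - m["x"]) ** 2 + (row["y"] - m["y"]) ** 2)
            if len(heap) < k:
                heapq.heappush(heap, (-d, m["position"]))
            else:
                heapq.heappushpop(heap, (-d, m["position"]))
        for neg_d, pos in heap:
            results.append((row["userId"], pos, -neg_d))
    return results

input_rows = [{"userId": "user1", "group": 1, "x": 0.3, "y": 0.5}]
master_rows = [
    {"group": 1, "position": "pos1-1", "x": 0.5, "y": 0.1},
    {"group": 1, "position": "pos1-2", "x": 0.0, "y": 0.0},
    {"group": 1, "position": "pos1-3", "x": 0.3, "y": 0.3},
]
# pos1-3 is the closest master row to user1, so it is the top-1 match.
print(top_k_join(input_rows, master_rows, k=1))
```

Because non-top-K pairs are discarded as the join runs, the full per-group cross product is never materialized, which is the I/O saving the ticket describes.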
[jira] [Commented] (HIVEMALL-47) Support codegen for ShuffledHashJoinTopKExec
[ https://issues.apache.org/jira/browse/HIVEMALL-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928278#comment-15928278 ] Takeshi Yamamuro commented on HIVEMALL-47: -- Resolved by https://github.com/apache/incubator-hivemall/pull/37 > Support codegen for ShuffledHashJoinTopKExec > > > Key: HIVEMALL-47 > URL: https://issues.apache.org/jira/browse/HIVEMALL-47 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: spark > > https://github.com/apache/incubator-hivemall/blob/master/spark/spark-2.1/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinTopKExec.scala#L32 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (HIVEMALL-65) Update define-all.spark and import-packages.spark
[ https://issues.apache.org/jira/browse/HIVEMALL-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved HIVEMALL-65. -- Resolution: Fixed > Update define-all.spark and import-packages.spark > - > > Key: HIVEMALL-65 > URL: https://issues.apache.org/jira/browse/HIVEMALL-65 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: Spark > > Some declarations in define-all.spark and import-packages.spark are incorrect > and duplicated. > e.g. train_arowh: > https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-89) Support to_csv/from_csv in HivemallOps
Takeshi Yamamuro created HIVEMALL-89: Summary: Support to_csv/from_csv in HivemallOps Key: HIVEMALL-89 URL: https://issues.apache.org/jira/browse/HIVEMALL-89 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro It is useful to support to_csv/from_csv for Spark (See SPARK-15463 for related discussion) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-26) Add documentation about Hivemall on Apache Spark
[ https://issues.apache.org/jira/browse/HIVEMALL-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880620#comment-15880620 ] Takeshi Yamamuro commented on HIVEMALL-26: -- We will keep this ticket open until all the documentation for spark is filled in. > Add documentation about Hivemall on Apache Spark > > > Key: HIVEMALL-26 > URL: https://issues.apache.org/jira/browse/HIVEMALL-26 > Project: Hivemall > Issue Type: Sub-task >Reporter: Makoto Yui >Assignee: Takeshi Yamamuro > Labels: Documentation, Spark > > Our user guide should have entries about Hivemall on Spark on > http://hivemall.incubator.apache.org/userguide/ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-65) Update define-all.spark and import-packages.spark
[ https://issues.apache.org/jira/browse/HIVEMALL-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865530#comment-15865530 ] Takeshi Yamamuro commented on HIVEMALL-65: -- We also need to check spark versions and load proper functions in these scripts. > Update define-all.spark and import-packages.spark > - > > Key: HIVEMALL-65 > URL: https://issues.apache.org/jira/browse/HIVEMALL-65 > Project: Hivemall > Issue Type: Bug >Reporter: Takeshi Yamamuro > Labels: Spark > > Some declarations in define-all.spark and import-packages.spark are incorrect > and duplicated. > e.g. train_arowh: > https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-68) Use TaskContext in RowIdUDF
Takeshi Yamamuro created HIVEMALL-68: Summary: Use TaskContext in RowIdUDF Key: HIVEMALL-68 URL: https://issues.apache.org/jira/browse/HIVEMALL-68 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Priority: Minor We could use TaskContext via Java reflection for generating unique IDs. https://github.com/apache/incubator-hivemall/pull/44#issuecomment-279294472 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-68) Use TaskContext in RowIdUDF
[ https://issues.apache.org/jira/browse/HIVEMALL-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-68: - Labels: Spark (was: ) > Use TaskContext in RowIdUDF > --- > > Key: HIVEMALL-68 > URL: https://issues.apache.org/jira/browse/HIVEMALL-68 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro >Priority: Minor > Labels: Spark > > We could use TaskContext via Java reflection for generating unique IDs. > https://github.com/apache/incubator-hivemall/pull/44#issuecomment-279294472 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
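The HIVEMALL-68 ticket above suggests using TaskContext to generate unique row IDs. One common scheme, sketched here in plain Python under the assumption that a partition id is available (in Spark it would come from `TaskContext.get().partitionId()`), packs the partition id into the high bits and a per-task counter into the low bits so that tasks never need to coordinate. The helper names and bit split are hypothetical, not RowIdUDF's actual code.

```python
def make_rowid_generator(partition_id, counter_bits=40):
    """Return a closure yielding IDs of the form (partition_id << counter_bits) | counter.

    Hypothetical sketch of the idea behind RowIdUDF: the high bits identify
    the task/partition, the low bits count rows emitted by that task, so IDs
    are unique cluster-wide without any locking or coordination.
    """
    counter = 0

    def next_id():
        nonlocal counter
        rowid = (partition_id << counter_bits) | counter
        counter += 1
        return rowid

    return next_id

gen0 = make_rowid_generator(0)  # e.g. the task running on partition 0
gen1 = make_rowid_generator(1)  # e.g. the task running on partition 1
print(gen0(), gen0(), gen1())  # partition 1's IDs start at 1 << 40
```

The choice of 40 counter bits caps each task at 2^40 rows while leaving room for 2^23 partitions in a 63-bit positive long; any split works as long as both sides stay within their budgets.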
[jira] [Created] (HIVEMALL-66) Remove wrapper classes for Hive UDFs
Takeshi Yamamuro created HIVEMALL-66: Summary: Remove wrapper classes for Hive UDFs Key: HIVEMALL-66 URL: https://issues.apache.org/jira/browse/HIVEMALL-66 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro Since the latest Spark does not support Map and List as return types in Hive UDFs, we have some GenericUDF wrapper classes in the spark module. But the Spark community has started discussing support for these types. If these types are supported in Spark, we can remove these wrapper classes. Reference: https://github.com/apache/spark/pull/16886 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-65) Update define-all.spark and import-packages.spark
Takeshi Yamamuro created HIVEMALL-65: Summary: Update define-all.spark and import-packages.spark Key: HIVEMALL-65 URL: https://issues.apache.org/jira/browse/HIVEMALL-65 Project: Hivemall Issue Type: Bug Reporter: Takeshi Yamamuro Some declarations in define-all.spark and import-packages.spark are incorrect and duplicated. e.g. train_arowh: https://github.com/maropu/incubator-hivemall/blob/AddScriptForSparkShell/resources/ddl/define-all.spark#L32-L36 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-62) Support a function to convert a comma-separated string into typed data and vice versa
Takeshi Yamamuro created HIVEMALL-62: Summary: Support a function to convert a comma-separated string into typed data and vice versa Key: HIVEMALL-62 URL: https://issues.apache.org/jira/browse/HIVEMALL-62 Project: Hivemall Issue Type: New Feature Reporter: Takeshi Yamamuro Priority: Minor Currently, Spark does not have this feature (IMO it will not appear as a first-class one in Spark), but it is useful for ETL before ML processing. e.g.)
{code}
scala> val ds1 = Seq("""1,abc""").toDS()
ds1: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val schema = new StructType().add("a", IntegerType).add("b", StringType)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true), StructField(b,StringType,true))

scala> val ds2 = ds1.select(from_csv($"value", schema))
ds2: org.apache.spark.sql.DataFrame = [csvtostruct(value): struct<a:int,b:string>]

scala> ds2.printSchema
root
 |-- csvtostruct(value): struct (nullable = true)
 |    |-- a: integer (nullable = true)
 |    |-- b: string (nullable = true)

scala> ds2.show
+------------------+
|csvtostruct(value)|
+------------------+
|           [1,abc]|
+------------------+
{code}
A related discussion is here: https://github.com/apache/spark/pull/13300#issuecomment-261962773 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (HIVEMALL-48) Support codegen for EachTopK
[ https://issues.apache.org/jira/browse/HIVEMALL-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859668#comment-15859668 ] Takeshi Yamamuro commented on HIVEMALL-48: -- A prototype is here: https://github.com/apache/incubator-hivemall/compare/master...maropu:HIVEMALL-48 > Support codegen for EachTopK > > > Key: HIVEMALL-48 > URL: https://issues.apache.org/jira/browse/HIVEMALL-48 > Project: Hivemall > Issue Type: New Feature >Reporter: Takeshi Yamamuro > Labels: spark > > https://github.com/apache/incubator-hivemall/blob/master/spark/spark-2.1/src/main/scala/org/apache/spark/sql/catalyst/expressions/EachTopK.scala#L124 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-55) Drop off the Spark v1.6 support before next Hivemall GA release
[ https://issues.apache.org/jira/browse/HIVEMALL-55?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-55: - Labels: Spark (was: ) > Drop off the Spark v1.6 support before next Hivemall GA release > --- > > Key: HIVEMALL-55 > URL: https://issues.apache.org/jira/browse/HIVEMALL-55 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVEMALL-55) Drop off the Spark v1.6 support before next Hivemall GA release
Takeshi Yamamuro created HIVEMALL-55: Summary: Drop off the Spark v1.6 support before next Hivemall GA release Key: HIVEMALL-55 URL: https://issues.apache.org/jira/browse/HIVEMALL-55 Project: Hivemall Issue Type: Improvement Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (HIVEMALL-54) Add an easy-to-use script for spark-shell
[ https://issues.apache.org/jira/browse/HIVEMALL-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated HIVEMALL-54: - Labels: Spark (was: ) > Add an easy-to-use script for spark-shell > > > Key: HIVEMALL-54 > URL: https://issues.apache.org/jira/browse/HIVEMALL-54 > Project: Hivemall > Issue Type: Improvement >Reporter: Takeshi Yamamuro > Labels: Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] (HIVEMALL-46) Make it simpler to upgrade Spark versions
Takeshi Yamamuro created an issue Hivemall / HIVEMALL-46 Make it simpler to upgrade Spark versions Issue Type: Improvement Assignee: Unassigned Created: 31/Jan/17 12:14 Priority: Major Reporter: Takeshi Yamamuro To support upcoming Spark releases, we currently need to copy many files from `spark/spark-2.X` to `spark/spark-2.Y` and then fix the compile errors that happen there. Although this works, the copying makes the amount of code blow up. So we need to clean up the source code structure (e.g., APIs) to easily follow upcoming Spark releases. This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)