[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133934#comment-16133934
 ] 

ASF GitHub Bot commented on MADLIB-1146:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/172


> Elastic Net fails when used without normalization with grouping
> ---
>
> Key: MADLIB-1146
> URL: https://issues.apache.org/jira/browse/MADLIB-1146
> Project: Apache MADlib
>  Issue Type: Bug
>Reporter: Cooper Sloan
>Assignee: Nandish Jayaram
>Priority: Minor
>
> ```
> DROP TABLE IF EXISTS house_en,house_en_summary;
> SELECT madlib.elastic_net_train(
> 'lin_housing_wi',
> 'house_en',
> 'y',
> 'x',
> 'gaussian',
> 0.5,
> 0.5,
> False,
> 'grp_by_col',
> 'fista',
> '',
> NULL,
> 1,
> 1e-6
> );
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en" does not 
> exist, skipping
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en_summary" does 
> not exist, skipping
> DROP TABLE
> psql:/Users/csloan/elastic_net.sql:17: ERROR:  KeyError: 'select_grp'
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "elastic_net_train", line 27, in 
> excluded, max_iter, tolerance)
>   PL/Python function "elastic_net_train", line 467, in elastic_net_train
>   PL/Python function "elastic_net_train", line 502, in 
> _internal_elastic_net_train
>   PL/Python function "elastic_net_train", line 24, in 
> _elastic_net_gaussian_fista_train
>   PL/Python function "elastic_net_train", line 171, in 
> _elastic_net_fista_train
>   PL/Python function "elastic_net_train", line 297, in 
> _elastic_net_fista_train_compute
>   PL/Python function "elastic_net_train", line 83, in 
> _elastic_net_generate_result
> PL/Python function "elastic_net_train"
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1073) Graph - Phase 1 measures

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133772#comment-16133772
 ] 

ASF GitHub Bot commented on MADLIB-1073:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/173

Measures: Use outer join for in-out degrees computation

JIRA: MADLIB-1073

Commit 06788cc added the graph measure functions described in the JIRA.
This commit fixes a bug that commit introduced in the graph_vertex_degrees
function: the results omitted vertices that had either zero in-degree
or zero out-degree.
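
The effect of the join type can be illustrated with a small Python sketch. This is not the MADlib implementation: the function names and the edge list are hypothetical, and dictionaries stand in for the SQL tables. Counting in-degrees and out-degrees separately and combining them the way an inner join would keeps only vertices present on both sides, whereas an outer-join style combination keeps every vertex and fills the missing side with 0.

```python
from collections import Counter

def degrees_inner(edges):
    """Combine in/out degree counts the way an inner join would (hypothetical helper)."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    # Only vertices that appear in BOTH counts survive the join.
    both = in_deg.keys() & out_deg.keys()
    return {v: (in_deg[v], out_deg[v]) for v in both}

def degrees_outer(edges):
    """Combine the same counts the way a full outer join would (hypothetical helper)."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    # Every vertex seen on either side is kept; a missing count becomes 0.
    either = in_deg.keys() | out_deg.keys()
    return {v: (in_deg.get(v, 0), out_deg.get(v, 0)) for v in either}

# Vertex 0 has no incoming edges and vertex 2 has no outgoing edges.
edges = [(0, 1), (0, 2), (1, 2)]
inner_result = degrees_inner(edges)
outer_result = degrees_outer(edges)
print(inner_result)
print(outer_result)
```

With this edge list, the inner-join variant reports only vertex 1, while the outer-join variant reports all three vertices as (in-degree, out-degree) pairs.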

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/in_out_degrees

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit f3697fdaaebeb851dfa23a0503c2c143c54f7f69
Author: Rahul Iyer 
Date:   2017-08-18T23:19:39Z

Measures: Use outer join for in-out degrees computation

JIRA: MADLIB-1073

Commit 06788cc added the graph measure functions described in the JIRA.
This commit fixes a bug that commit introduced in the graph_vertex_degrees
function: the results omitted vertices that had either zero in-degree
or zero out-degree.




> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.12
>
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests





[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133294#comment-16133294
 ] 

ASF GitHub Bot commented on MADLIB-1146:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/172

Elastic_net: Fix grouping without normalization bug

JIRA: MADLIB-1146

Selecting grouping columns into the output table failed when
grouping was used but the data was NOT scaled. This commit
fixes it.
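
The `KeyError: 'select_grp'` in the report is the kind of failure Python raises when a string template references a key that was filled in on only one code path. The sketch below is a hypothetical reconstruction for illustration only, not the MADlib source: the function names, the SQL template, and the assumption that the key was set only when normalization was enabled are all invented here.

```python
def build_select_buggy(grouping_col, normalization):
    """Hypothetical: 'select_grp' is only set on some code paths."""
    args = {"tbl": "house_en"}
    if not grouping_col:
        args["select_grp"] = ""
    elif normalization:
        args["select_grp"] = grouping_col + ", "
    # Grouping WITHOUT normalization falls through: the key is never set,
    # so the format call below raises KeyError: 'select_grp'.
    return "SELECT {select_grp}coef FROM {tbl}".format(**args)

def build_select_fixed(grouping_col, normalization):
    """Hypothetical fix: the key depends on grouping only, not on scaling."""
    args = {"tbl": "house_en"}
    args["select_grp"] = grouping_col + ", " if grouping_col else ""
    return "SELECT {select_grp}coef FROM {tbl}".format(**args)

try:
    build_select_buggy("grp_by_col", False)
    failure = None
except KeyError as err:
    failure = err.args[0]

print(failure)
print(build_select_fixed("grp_by_col", False))
```

The design point is simply that every key a template uses should be populated regardless of which optional features are switched on.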

Closes #172

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib bugfix/MADlib_1146

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/172.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #172


commit 54c09a64b98a070cd4f18c85bf47bf77a2546264
Author: Nandish Jayaram 
Date:   2017-08-18T17:10:58Z

Elastic_net: Fix grouping without normalization bug

JIRA: MADLIB-1146

Selecting grouping columns into the output table failed when
grouping was used but the data was NOT scaled. This commit
fixes it.

Closes #172






[jira] [Commented] (MADLIB-1134) Neural Networks - MLP - Phase 2

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126051#comment-16126051
 ] 

ASF GitHub Bot commented on MADLIB-1134:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/162


> Neural Networks - MLP - Phase 2
> ---
>
> Key: MADLIB-1134
> URL: https://issues.apache.org/jira/browse/MADLIB-1134
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Neural Networks
>Reporter: Frank McQuillan
>Assignee: Cooper Sloan
> Fix For: v1.12
>
>
> Follow on from https://issues.apache.org/jira/browse/MADLIB-413
> Story
> As a MADlib developer, I want to get the second-phase implementation of NN going 
> with training and prediction functions, so that I can use it to build toward an 
> MVP version for GA.
> Features to add:
> * weights for inputs
> * logic for n_tries
> * normalize inputs
> * L2 regularization
> * learning rate policy
> * warm start





[jira] [Commented] (MADLIB-1119) Train-test split

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126041#comment-16126041
 ] 

ASF GitHub Bot commented on MADLIB-1119:


GitHub user cooper-sloan opened a pull request:

https://github.com/apache/incubator-madlib/pull/166

Sample: test_train_split

JIRA: MADLIB-1119

Add utility to sample test and train
data from an input table.
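
As a rough illustration of the semantics proposed in the JIRA issue quoted further down in this message, here is a minimal Python sketch of a stratified split without replacement. It is not the code in this pull request: `train_test_split_rows` is a hypothetical helper, the `seed` argument and the rounding rule are assumptions, and only the parameter names `train_proportion`, `test_proportion`, and `grouping_col` come from the proposed interface.

```python
import random
from collections import defaultdict

def train_test_split_rows(rows, train_proportion, test_proportion=None,
                          grouping_col=None, seed=0):
    """Split a list of dict rows into (train, test), sampling each group independently."""
    if test_proportion is None:
        # Default from the proposal: the complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    rng = random.Random(seed)

    # Bucket rows by group; with no grouping column everything is one group.
    groups = defaultdict(list)
    for row in rows:
        key = row[grouping_col] if grouping_col is not None else None
        groups[key].append(row)

    train, test = [], []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_train = int(round(train_proportion * len(shuffled)))
        n_test = int(round(test_proportion * len(shuffled)))
        # Without replacement, train and test cannot share rows.
        n_test = min(n_test, len(shuffled) - n_train)
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:n_train + n_test])
    return train, test

# Two groups of 20 rows each; 0.4 train and 0.1 test keeps half of the input.
rows = [{"id": i, "grp": "a" if i < 20 else "b"} for i in range(40)]
train, test = train_test_split_rows(rows, 0.4, 0.1, grouping_col="grp")
print(len(train), len(test))
```

With proportions 0.4 and 0.1, only half of the 40 input rows are emitted, which matches the subsample-plus-split use case described in the issue notes.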

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cooper-sloan/incubator-madlib test_train_split

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/166.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #166


commit beafde5d3d218ba1d0b9f95843ee2186eb621edb
Author: Cooper Sloan 
Date:   2017-08-11T22:19:43Z

Sample: test_train_split

JIRA: MADLIB-1119

Add utility to sample test and train
data from an input table.




> Train-test split
> 
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets 
> including grouping support, so that I can use the result sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split ( 
>    source_table,
>    output_table,
>    train_proportion,
>    test_proportion,          -- optional
>    grouping_col,             -- optional
>    with_replacement,         -- optional
>    target_cols,              -- optional
>    separate_output_tables    -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table.   A new INTEGER column on the right 
> called 'split' will identify 1 for train set and 0 for test set,
> unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> The output table contains all the  columns present in the source 
> table unless otherwise specified  in the 'target_cols' parameter below. 
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement to the train
> proportion (1-'train_proportion').  If the 'grouping_col' 
> parameter is specified below,  each group will be sampled 
> independently using the  train proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsample and train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half of the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to 

[jira] [Commented] (MADLIB-1094) Elastic Net fails when used without normalization

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122503#comment-16122503
 ] 

ASF GitHub Bot commented on MADLIB-1094:


GitHub user cooper-sloan opened a pull request:

https://github.com/apache/incubator-madlib/pull/164

Elastic Net: Fix normalization issue

MADLIB-1094 and MADLIB-1146

avg in psql is numerically unstable.
Data scaling was not occurring when
grouping was used.
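
The commit message does not say how the average was made stable, so the following Python sketch is only a generic illustration of one way a sum-then-divide average can misbehave and how an incremental mean avoids it. The function names and the input values are hypothetical, and Python floats stand in for the database's double precision type.

```python
import math

def naive_mean(values):
    """Sum everything first, then divide; the running total can overflow."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def running_mean(values):
    """Incremental update: the accumulator stays near the magnitude of the data."""
    mean = 0.0
    for n, v in enumerate(values, start=1):
        mean += (v - mean) / n
    return mean

# Each value is representable, but their sum exceeds the double range.
big = [1e308, 1e308]
naive = naive_mean(big)
stable = running_mean(big)
print(naive, stable)
```

In Python the overflowing sum silently becomes `inf`; a database may instead report an out-of-range error when a floating-point result overflows.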

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cooper-sloan/incubator-madlib elastic_net_normalization

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #164


commit 0b00513bf20e7f0b9032b267472321bd6cfc4355
Author: Cooper Sloan 
Date:   2017-08-10T19:04:04Z

Elastic Net: Fix normalization issue

MADLIB-1094 and MADLIB-1146

avg in psql is numerically unstable.
Data scaling was not occurring when
grouping was used.




> Elastic Net fails when used without normalization
> -
>
> Key: MADLIB-1094
> URL: https://issues.apache.org/jira/browse/MADLIB-1094
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Regularized Regression
>Reporter: Nandish Jayaram
>Priority: Minor
> Fix For: v1.12
>
>
> Using Elastic Net with the normalization/standardize flag turned off (for 
> Gaussian IGD) results in failure, with the following error:
> {code:sql}
> madlib-pg94=# SELECT madlib.elastic_net_train(
> 'houses1',
> 'houses_en',
> 'array[tax, bath, size]',
> 'gaussian',
> 0.5,
> 0.1, 
> FALSE,  -- Standardize 
> NULL,
> 'igd',
> '',
> NULL,
> 1,1e-6);
> ERROR:  spiexceptions.NumericValueOutOfRange: value out of range: overflow
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "elastic_net_train", line 23, in 
> return elastic_net.elastic_net_train(**globals())
>   PL/Python function "elastic_net_train", line 332, in elastic_net_train
>   PL/Python function "elastic_net_train", line 42, in 
> __elastic_net_gaussian_igd_train
>   PL/Python function "elastic_net_train", line 268, in __elastic_net_igd_train
>   PL/Python function "elastic_net_train", line 373, in 
> __elastic_net_igd_train_compute
>   PL/Python function "elastic_net_train", line 69, in 
> __elastic_net_generate_result
>   PL/Python function "elastic_net_train", line 154, in 
> __compute_log_likelihood
> PL/Python function "elastic_net_train"
> {code}





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122334#comment-16122334
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/158


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122243#comment-16122243
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/149/





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122207#comment-16122207
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  
Jenkins ok to test




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122200#comment-16122200
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/148/



> Reduce size of elastic net install check table
> --
>
> Key: MADLIB-1118
> URL: https://issues.apache.org/jira/browse/MADLIB-1118
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Module: Regularized Regression
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
> IC is taking too long for elastic net.  I would suggest we reduce the size of 
> the input data table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122152#comment-16122152
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user edespino commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  
Jenkins please retest.




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122029#comment-16122029
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
Jenkins please retest. 




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121994#comment-16121994
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user edespino commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
@iyerr3 - Thanks for the review. Very much appreciated. Additionally, I 
should have reviewed the `install-check` options. Thanks for the info.




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121990#comment-16121990
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
The change looks good. 

A few comments: 
- An alternative to changing the threshold is to reduce the maximum number of 
iterations. Even with the lower threshold, quicker completion is not 
guaranteed. 
- The log file can be accessed even when the test passes by adding the `-vl` 
option to the `madpack install-check` command. The options stand for `-v: 
Verbose` and `-l: Keep logs`. 
- The install-check itself also reports the run time for execution of the 
whole file. However, `\timing` is needed if the run time of individual queries 
is desired. 

@njayaram2 Any idea why the asserts on `log_likelihood` are commented out? 




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120865#comment-16120865
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
LGTM, since we don't assert on the `relative_error` of `log_likelihood` 
in elastic_net anyway.




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120552#comment-16120552
 ] 

ASF GitHub Bot commented on MADLIB-1118:


Github user edespino commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
For future reference, this is how I reviewed the elastic_net install-check 
execution:

* Updated the following file: `/src/ports/postgres/modules/elastic_net/test/elastic_net_install_check.sql_in`
  * Added `\timing` to the top of the file.
  * Added `SELECT ASSERT (FALSE, 'Deliberately forced failure');` to the 
bottom of the file to force a failure condition. This allowed me to review 
the timing information in the log files from the test execution.
* From the build directory, ran `make install` to push the updated install-check 
file to the installation directory.
* Ran only the elastic_net test suite (using Postgres): 
`/usr/local/madlib/bin/madpack -s madlib -p postgres install-check -t elastic_net`

I then varied the elastic_net_train tolerance values and reran the scenario, 
reviewing the recorded `Time:` values each time.
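
Collecting the `Time:` values by hand gets tedious across several runs. Below is a minimal Python sketch that pulls them out of a log, assuming the usual `Time: <number> ms` lines that psql prints when `\timing` is on. The helper name and the sample log text are hypothetical and are not taken from an actual MADlib log.

```python
import re

# Matches lines such as "Time: 12.500 ms" as printed by psql with \timing on.
TIME_LINE = re.compile(r"^Time:\s+([0-9]+(?:\.[0-9]+)?)\s+ms", re.MULTILINE)

def query_times_ms(log_text):
    """Return the per-query elapsed times, in milliseconds, found in a log."""
    return [float(value) for value in TIME_LINE.findall(log_text)]

# Invented sample; a real run would read the install-check log file instead.
sample_log = """\
CREATE TABLE
Time: 12.500 ms
SELECT 1
Time: 30.250 ms
"""

times = query_times_ms(sample_log)
total_ms = sum(times)
print(times, total_ms)
```

Comparing the totals between runs with different tolerance values gives the same picture as reading the log manually.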




[jira] [Commented] (MADLIB-1118) Reduce size of elastic net install check table

2017-08-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119491#comment-16119491
 ] 

ASF GitHub Bot commented on MADLIB-1118:


GitHub user edespino opened a pull request:

https://github.com/apache/incubator-madlib/pull/163

MADLIB-1118. Change tolerance to 1e-2 (from 1e-6)

This changes the elapsed execution time from 10171 milliseconds
to 2252 milliseconds on a Mac with PostgreSQL 9.6.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/edespino/incubator-madlib MADLIB-1138

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #163


commit 8c3bc61047f8a4cab61dd239502d08ede415316f
Author: Ed Espino 
Date:   2017-08-09T06:40:13Z

MADLIB-1118. Change tolerance to 1e-2 (from 1e-6)

This changes the elapsed execution time from 10171 milliseconds
to 2252 milliseconds on a Mac with PostgreSQL 9.6.






[jira] [Commented] (MADLIB-1109) Do not fail certain modules when optimizer_control GUC is set to off in Greenplum

2017-08-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118634#comment-16118634
 ] 

ASF GitHub Bot commented on MADLIB-1109:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/157


> Do not fail certain modules when optimizer_control GUC is set to off in 
> Greenplum
> -
>
> Key: MADLIB-1109
> URL: https://issues.apache.org/jira/browse/MADLIB-1109
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
> If the optimizer_control GUC is set to off in Greenplum, the following 
> install checks will fail, and these MADlib functions will not work:  
> * decision tree
> * random forest
> * LDA
> * k-Means
> * PMML export for decision tree
> * PMML export for random forest
> There may be others, but these are the ones I am aware of. 
> The parameter optimizer_control 
> https://gpdb.docs.pivotal.io/43130/ref_guide/config_params/guc-list.html#optimizer_control
> controls whether the server configuration parameter optimizer 
> https://gpdb.docs.pivotal.io/43130/ref_guide/config_params/guc-list.html#optimizer
> can be changed. The parameter optimizer controls whether the GPORCA optimizer 
> is enabled when running SQL queries.





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117591#comment-16117591
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  
+1
LGTM




[jira] [Commented] (MADLIB-1134) Neural Networks - MLP - Phase 2

2017-08-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116970#comment-16116970
 ] 

ASF GitHub Bot commented on MADLIB-1134:


GitHub user cooper-sloan opened a pull request:

https://github.com/apache/incubator-madlib/pull/162

MLP: Multilayer Perceptron Phase 2

JIRA: MADLIB-1134

Weights, warm start, n_tries,
regularization, learning rate policy,
standardization and tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cooper-sloan/incubator-madlib mlp_phase2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/162.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #162


commit 0d008d9995b1ffb5b35271318b878b396375456a
Author: Cooper Sloan 
Date:   2017-06-17T00:41:07Z

MLP: Multilayer Perceptron Phase 2

JIRA: MADLIB-1134

Weights, warm start, n_tries,
regularization, learning rate policy,
standardization and tests.




> Neural Networks - MLP - Phase 2
> ---
>
> Key: MADLIB-1134
> URL: https://issues.apache.org/jira/browse/MADLIB-1134
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Neural Networks
>Reporter: Frank McQuillan
>Assignee: Cooper Sloan
> Fix For: v1.12
>
>
> Follow on from https://issues.apache.org/jira/browse/MADLIB-413
> Story
> As a MADlib developer, I want to get the 2nd-phase implementation of NN going 
> with training and prediction functions, so that I can use it to build toward an 
> MVP version for GA.
> Features to add:
> * weights for inputs
> * logic for n_tries
> * normalize inputs
> * L2 regularization
> * learning rate policy
> * warm start
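
The "logic for n_tries" item above can be illustrated with a minimal sketch. It assumes n_tries means training several times from different random initial weights and keeping the run with the lowest loss. The one-weight linear model, the helper names train_toy_model and best_of_n_tries, and all default values are hypothetical stand-ins for illustration, not MADlib's MLP code. The L2 term shows where the regularization item would enter the loss.

```python
import random


def train_toy_model(rng, data, n_iter=50, lr=0.05, l2=0.01):
    """Fit y ~ w * x by gradient descent from a random start.

    Hypothetical stand-in for one MLP training run; returns (loss, w).
    """
    w = rng.uniform(-1.0, 1.0)  # random initial weight for this try
    n = len(data)
    for _ in range(n_iter):
        # Gradient of mean squared error plus the L2 penalty l2 * w^2.
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / n + 2.0 * l2 * w
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in data) / n + l2 * w * w
    return loss, w


def best_of_n_tries(data, n_tries=3, seed=0):
    """Train n_tries times with different seeds and keep the lowest-loss run."""
    results = [train_toy_model(random.Random(seed + i), data)
               for i in range(n_tries)]
    return min(results, key=lambda r: r[0])
```

A warm start would, under the same assumptions, replace the random initial weight with the weight returned by an earlier run.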





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113241#comment-16113241
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user edespino commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  
I have filed MADLIB-1144 to track the removal of the DISCLAIMER file once 
the MADlib registered trademark is fully transferred to ASF.


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113210#comment-16113210
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user edespino commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/158#discussion_r131218538
  
--- Diff: tool/jenkins/rat_check.sh ---
@@ -22,15 +22,16 @@
 set -exu
 
 workdir=`pwd`
+reponame=incubator-madlib
--- End diff --

That is correct. When the repo moves, this will be addressed here and in 
the Jenkins projects through MADLIB-1142 (filed yesterday).


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113196#comment-16113196
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user rvs commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  
@njayaram2 great idea -- why don't you just reuse the file and put 
something like this there:

Be advised that the registered trademark for MADlib is in the process of 
being transferred to the Apache Software Foundation. Please refer to 
https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-125 for more details.


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113183#comment-16113183
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/158#discussion_r131215480
  
--- Diff: CMakeLists.txt ---
@@ -189,7 +189,6 @@ install(
 )
 install(
 FILES
-"${CMAKE_CURRENT_SOURCE_DIR}/DISCLAIMER"
--- End diff --

Might be a good idea to keep this file, and use it to point out the 
disclaimer regarding MADlib's trademark. This is coming from the fact that 
MADlib's trademark does not belong to ASF yet, and it was suggested we make 
this clear in both our homepage and releases.
Tagging @rvs for more on this.


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113151#comment-16113151
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/158#discussion_r131209570
  
--- Diff: tool/jenkins/rat_check.sh ---
@@ -22,15 +22,16 @@
 set -exu
 
 workdir=`pwd`
+reponame=incubator-madlib
--- End diff --

This should be `madlib` once the repo name is changed.


> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Ed Espino
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109906#comment-16109906
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/139/



> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109873#comment-16109873
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/138/



> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109846#comment-16109846
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/137/



> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1133) TLP graduation - remove references to "incubating" in source tree

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109785#comment-16109785
 ] 

ASF GitHub Bot commented on MADLIB-1133:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/158
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/136/



> TLP graduation  - remove references to "incubating" in source tree
> --
>
> Key: MADLIB-1133
> URL: https://issues.apache.org/jira/browse/MADLIB-1133
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
>  Source tree incubation references. Run
>"ack -i incubat" 
> command on the master branch and make appropriate updates as per the output





[jira] [Commented] (MADLIB-1109) Do not fail certain modules when optimizer_control GUC is set to off in Greenplum

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109467#comment-16109467
 ] 

ASF GitHub Bot commented on MADLIB-1109:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/157

Multiple: Check optimizer_control before updating optimizer

JIRA: MADLIB-1109

This is applicable only for the Greenplum and HAWQ platforms:

We disable/enable ORCA using the 'optimizer' GUC in some functions for
performance reasons. GPDB/HAWQ has another GUC 'optimizer_control' which
allows the user to disable updates to the 'optimizer' GUC. Updating
'optimizer' when 'optimizer_control = off' leads to an ugly error.

This commit adds a check for the value of 'optimizer_control' and
updates 'optimizer' only if 'optimizer_control = on'.
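
The check described above might look roughly like the following sketch. It is not the code from this pull request. The helper name set_optimizer_if_allowed and the execute callable (anything that takes a SQL string and returns rows as a list of dicts, in the style of plpy.execute) are assumptions made for illustration. The sketch also assumes it is only called on Greenplum or HAWQ, where the 'optimizer_control' GUC exists.

```python
def set_optimizer_if_allowed(execute, enable):
    """Set the 'optimizer' GUC only when 'optimizer_control' permits it.

    execute: hypothetical callable taking a SQL string and returning a
             list of dict-like rows (an assumption, modelled on plpy.execute).
    enable:  True to turn ORCA on, False to turn it off.
    Returns True if 'optimizer' was updated, False if the update was skipped.
    """
    rows = execute("SHOW optimizer_control")
    # Read the single value of the first row without relying on its column name.
    control = str(next(iter(rows[0].values()))) if rows else "off"
    if control.lower() != "on":
        # Updates are disabled by the user; leave 'optimizer' untouched.
        return False
    execute("SET optimizer = {0}".format("on" if enable else "off"))
    return True
```

A caller that wants to restore the previous setting afterwards could use the return value to decide whether a restore is needed at all.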

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/optimizer_control

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #157


commit b86eab834e0f56f5fcb501bf1ef50556000afe8b
Author: Rahul Iyer 
Date:   2017-08-01T18:01:05Z

Multiple: Check optimizer_control before updating optimizer

JIRA: MADLIB-1109

This is applicable only for the Greenplum and HAWQ platforms:

We disable/enable ORCA using the 'optimizer' GUC in some functions for
performance reasons. GPDB/HAWQ has another GUC 'optimizer_control' which
allows the user to disable updates to the 'optimizer' GUC. Updating
'optimizer' when 'optimizer_control = off' leads to an ugly error.

This commit adds a check for the value of 'optimizer_control' and
updates 'optimizer' only if 'optimizer_control = on'.




> Do not fail certain modules when optimizer_control GUC is set to off in 
> Greenplum
> -
>
> Key: MADLIB-1109
> URL: https://issues.apache.org/jira/browse/MADLIB-1109
> Project: Apache MADlib
>  Issue Type: Task
>  Components: All Modules
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
> If the optimizer_control GUC is set to off in Greenplum, the following 
> install checks will fail, and these MADlib functions will not work:  
> * decision tree
> * random forest
> * LDA
> * k-Means
> * PMML export for decision tree
> * PMML export for random forest
> There may be others, but these are the ones I am aware of. 
> The parameter optimizer_control 
> https://gpdb.docs.pivotal.io/43130/ref_guide/config_params/guc-list.html#optimizer_control
> controls whether the server configuration parameter optimizer 
> https://gpdb.docs.pivotal.io/43130/ref_guide/config_params/guc-list.html#optimizer
> can be changed. The parameter optimizer controls whether the GPORCA optimizer 
> is enabled when running SQL queries.





[jira] [Commented] (MADLIB-1101) Graph - weakly connected components helper functions

2017-07-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105627#comment-16105627
 ] 

ASF GitHub Bot commented on MADLIB-1101:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/155


> Graph - weakly connected components helper functions
> 
>
> Key: MADLIB-1101
> URL: https://issues.apache.org/jira/browse/MADLIB-1101
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Context 
> Follow on from 
> https://issues.apache.org/jira/browse/MADLIB-1071
> Story
> As a data scientist, I want to use helper functions related to weakly 
> connected components, so that I don't have to query the result table myself 
> which is less efficient and subject to error.
> List of helper functions roughly in priority order:
> 1) biggest connected component
> 2) number of nodes per connected component (histogram)
> 3) whether two nodes belong to same or different connected components
> 4) count of connected cpt clusters
> 5) Set of all nodes which can be reached (have a path) from a specified vertex





[jira] [Commented] (MADLIB-1101) Graph - weakly connected components helper functions

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102439#comment-16102439
 ] 

ASF GitHub Bot commented on MADLIB-1101:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/155

Feature: Weakly connected components helper functions

JIRA: MADLIB-1101

Add several helper functions that quickly return various
useful stats based on the connected components learned by the
madlib.weakly_connected_components() function. Five helper functions
are added as part of this story, along with docs and updated install
check. The helper functions are:
- graph_wcc_largest_cpt(): finds largest components
- graph_wcc_histogram(): finds number of vertices in each component
- graph_wcc_vertex_check(): finds all components that have a given
pair of vertices in them.
- graph_wcc_num_cpts(): finds total number of components.
- graph_wcc_reachable_vertices(): finds all vertices reachable
within a component for a given source vertex.

All these functions are implemented to handle grouping columns too
if the WCC's output table was created with grouping_cols.
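
To make the intent of the five helpers concrete, here is a rough Python sketch of what each one computes. It assumes the WCC result is available as a mapping from vertex id to component id. The function names are illustrative analogues only: they are not the SQL signatures added in this PR, and grouping columns are not handled.

```python
from collections import Counter


def wcc_histogram(component_of):
    """Number of vertices in each component, keyed by component id."""
    return Counter(component_of.values())


def wcc_largest_cpt(component_of):
    """Ids of the component(s) with the most vertices (ties are all returned)."""
    hist = wcc_histogram(component_of)
    if not hist:
        return []
    biggest = max(hist.values())
    return sorted(c for c, n in hist.items() if n == biggest)


def wcc_vertex_check(component_of, v1, v2):
    """Component id shared by v1 and v2, or None if they are in different ones."""
    c1, c2 = component_of.get(v1), component_of.get(v2)
    return c1 if c1 is not None and c1 == c2 else None


def wcc_num_cpts(component_of):
    """Total number of distinct components."""
    return len(set(component_of.values()))


def wcc_reachable_vertices(component_of, src):
    """All other vertices in the same weakly connected component as src."""
    cid = component_of[src]
    return sorted(v for v, c in component_of.items() if c == cid and v != src)
```

With a mapping such as {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}, vertex 4 can reach only vertex 5, and component 1 is the largest.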

Closes #155

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib features/wcc_helper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #155


commit 85e89ef1857ed432f295991a6037aa5732714911
Author: Nandish Jayaram 
Date:   2017-07-18T16:31:09Z

Feature: Weakly connected components helper functions

JIRA: MADLIB-1101

Add several helper functions that quickly return various
useful stats based on the connected components learned by the
madlib.weakly_connected_components() function. Five helper functions
are added as part of this story, along with docs and updated install
check. The helper functions are:
- graph_wcc_largest_cpt(): finds largest components
- graph_wcc_histogram(): finds number of vertices in each component
- graph_wcc_vertex_check(): finds all components that have a given
pair of vertices in them.
- graph_wcc_num_cpts(): finds total number of components.
- graph_wcc_reachable_vertices(): finds all vertices reachable
within a component for a given source vertex.

All these functions are implemented to handle grouping columns too
if the WCC's output table was created with grouping_cols.

Closes #155




> Graph - weakly connected components helper functions
> 
>
> Key: MADLIB-1101
> URL: https://issues.apache.org/jira/browse/MADLIB-1101
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Context 
> Follow on from 
> https://issues.apache.org/jira/browse/MADLIB-1071
> Story
> As a data scientist, I want to use helper functions related to weakly 
> connected components, so that I don't have to query the result table myself 
> which is less efficient and subject to error.
> List of helper functions roughly in priority order:
> 1) biggest connected component
> 2) number of nodes per connected component (histogram)
> 3) whether two nodes belong to same or different connected components
> 4) count of connected cpt clusters
> 5) Set of all nodes which can be reached (have a path) from a specified vertex





[jira] [Commented] (MADLIB-1138) Add basic code coverage support

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096793#comment-16096793
 ] 

ASF GitHub Bot commented on MADLIB-1138:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/151


> Add basic code coverage support
> ---
>
> Key: MADLIB-1138
> URL: https://issues.apache.org/jira/browse/MADLIB-1138
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Ed Espino
>
> For developers, add a cmake configuration option, ENABLE_COVERAGE, which 
> introduces gcov compilation and linking options (-fprofile-arcs 
> -ftest-coverage). Two supporting make targets are introduced:
> * GenCoverageReport - Capture gcov counters and generate report
> * ResetCoverageCounters - Zero the gcov counters
> Features:
> * Counters will be captured in build/CodeCoverage.info file. System and Third 
> party metrics will be filtered out of coverage info file and stored in 
> CodeCoverage-filtered.info
> * HTML report will be created in build/CodeCoverageReport directory
> Usage:
> * cmake -DENABLE_COVERAGE=ON ..
> * 
> * make GenCoverageReport
> * 
> * make ResetCoverageCounters
> * 
> * make GenCoverageReport





[jira] [Commented] (MADLIB-1138) Add basic code coverage support

2017-07-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16096790#comment-16096790
 ] 

ASF GitHub Bot commented on MADLIB-1138:


Github user orhankislal commented on the issue:

https://github.com/apache/incubator-madlib/pull/151
  
Tested on OS X and Linux. LGTM


> Add basic code coverage support
> ---
>
> Key: MADLIB-1138
> URL: https://issues.apache.org/jira/browse/MADLIB-1138
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Ed Espino
>
> For developers, add a cmake configuration option, ENABLE_COVERAGE, which 
> introduces gcov compilation and linking options (-fprofile-arcs 
> -ftest-coverage). Two supporting make targets are introduced:
> * GenCoverageReport - Capture gcov counters and generate report
> * ResetCoverageCounters - Zero the gcov counters
> Features:
> * Counters will be captured in build/CodeCoverage.info file. System and Third 
> party metrics will be filtered out of coverage info file and stored in 
> CodeCoverage-filtered.info
> * HTML report will be created in build/CodeCoverageReport directory
> Usage:
> * cmake -DENABLE_COVERAGE=ON ..
> * 
> * make GenCoverageReport
> * 
> * make ResetCoverageCounters
> * 
> * make GenCoverageReport





[jira] [Commented] (MADLIB-1138) Add basic code coverage support

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093594#comment-16093594
 ] 

ASF GitHub Bot commented on MADLIB-1138:


Github user cooper-sloan commented on the issue:

https://github.com/apache/incubator-madlib/pull/151
  
LGTM.  Documentation 
[here](https://cwiki.apache.org/confluence/display/MADLIB/Code+Coverage+Guide)


> Add basic code coverage support
> ---
>
> Key: MADLIB-1138
> URL: https://issues.apache.org/jira/browse/MADLIB-1138
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Ed Espino
>
> For developers, add a cmake configuration option, ENABLE_COVERAGE, which 
> introduces gcov compilation and linking options (-fprofile-arcs 
> -ftest-coverage). Two supporting make targets are introduced:
> * GenCoverageReport - Capture gcov counters and generate report
> * ResetCoverageCounters - Zero the gcov counters
> Features:
> * Counters will be captured in build/CodeCoverage.info file. System and Third 
> party metrics will be filtered out of coverage info file and stored in 
> CodeCoverage-filtered.info
> * HTML report will be created in build/CodeCoverageReport directory
> Usage:
> * cmake -DENABLE_COVERAGE=ON ..
> * 
> * make GenCoverageReport
> * 
> * make ResetCoverageCounters
> * 
> * make GenCoverageReport





[jira] [Commented] (MADLIB-1025) MADlib does not compile with gcc 6.2

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093495#comment-16093495
 ] 

ASF GitHub Bot commented on MADLIB-1025:


Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/101


> MADlib does not compile with gcc 6.2
> 
>
> Key: MADLIB-1025
> URL: https://issues.apache.org/jira/browse/MADLIB-1025
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Build System
>Reporter: Rahul Iyer
>Assignee: Nandish Jayaram
>Priority: Minor
> Fix For: v1.12
>
>
> Compiling with gcc 6.2.0 gives the below error.
> {code}
> [ 84%] Building CXX object 
> src/ports/postgres/9.5/CMakeFiles/madlib_postgresql_9_5.dir/__/__/__/modules/elastic_net/elastic_net_gaussian_fista.cpp.o
> In file included from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:5:0:
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_optimizer_igd.hpp:
>  In static member function 'static madlib::dbconnector::postgres::AnyType 
> madlib::modules::elastic_net::Igd<
> Model>::igd_transition(madlib::dbconnector::postgres::AnyType&, const 
> madlib::dbconnector::postgres::Allocator&)':
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_optimizer_igd.hpp:69:46:
>  error: call of overloaded 
> 'log(madlib::modules::HandleTraits rayHandle >::ReferenceToUInt32&)' is ambiguous
>  state.p = 2 * log(state.dimension);
>   ^
> In file included from 
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:45:0,
>  from /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/math.h:36,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/SparseData.h:24,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/sparse_vector.h:10,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/dbconnector.hpp:39,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:2:
> /usr/local/Cellar/gcc/6.2.0/lib/gcc/6/gcc/x86_64-apple-darwin15.6.0/6.2.0/include-fixed/math.h:402:15:
>  note: candidate: double log(double)
>  extern double log(double);
>^~~
> In file included from 
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/math.h:36:0,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/SparseData.h:24,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/sparse_vector.h:10,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/dbconnector.hpp:39,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:2:
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:365:3: note: candidate: 
> long double std::log(long double)
>log(long double __x)
>^~~
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:361:3: note: candidate: 
> float std::log(float)
>log(float __x)
>^~~
> make[3]: *** 
> [src/ports/postgres/9.5/CMakeFiles/madlib_postgresql_9_5.dir/__/__/__/modules/elastic_net/elastic_net_binomial_igd.cpp.o]
>  Error 1
> make[3]: *** Waiting for unfinished jobs
> make[2]: *** 
> [src/ports/postgres/9.5/CMakeFiles/madlib_postgresql_9_5.dir/all] Error 2
> make[1]: *** [all] Error 2
> {code}





[jira] [Commented] (MADLIB-1102) Graph - Breadth First Search / Traversal

2017-07-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092735#comment-16092735
 ] 

ASF GitHub Bot commented on MADLIB-1102:


GitHub user rashmi815 opened a pull request:

https://github.com/apache/incubator-madlib/pull/153

Graph: BFS algorithm design docs

BFS algorithm design docs created.
The BFS algorithm itself was already merged in commit 
[8c9b955](https://github.com/apache/incubator-madlib/commit/8c9b955cd2e3150ad935ad1581e164670723184f).
BFS algorithm JIRA: MADLIB-1102

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rashmi815/incubator-madlib master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #153


commit a9a02c21b69d3aa8377f3caf62ffed5f900053dd
Author: Rashmi Raghu 
Date:   2017-07-19T07:47:35Z

BFS design docs.




> Graph - Breadth First Search / Traversal
> 
>
> Key: MADLIB-1102
> URL: https://issues.apache.org/jira/browse/MADLIB-1102
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Rashmi Raghu
>Assignee: Rashmi Raghu
> Fix For: v1.12
>
>
> Story
> As a MADlib user and developer, I want to implement Breadth First Search / 
> Traversal for a graph. BFS is also a core part of the connected components 
> graph algorithm.
> Acceptance:
> 1) Interface defined
> 2) Design doc updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References:
> [0] [https://en.wikipedia.org/wiki/Breadth-first_search] 
> "Breadth-first search (BFS) is an algorithm for traversing or searching tree 
> or graph data structures. It starts at the tree root (or some arbitrary node 
> of a graph, sometimes referred to as a 'search key'[1]) and explores the 
> neighbor nodes first, before moving to the next level neighbors."
> [1] [http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/]
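
For readers unfamiliar with the traversal quoted above, a minimal in-memory sketch follows. It assumes the graph is given as a dict mapping each vertex to a list of its neighbors; the function name bfs_distances is hypothetical, and the sketch says nothing about how the in-database MADlib implementation is structured.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Breadth-first traversal from source.

    adjacency: dict mapping a vertex to an iterable of neighbor vertices.
    Returns a dict of each reachable vertex to its hop count from source.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()  # FIFO order gives level-by-level exploration
        for nbr in adjacency.get(node, ()):
            if nbr not in dist:  # first visit is along a shortest path
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist
```

Running this from every not-yet-visited vertex of an undirected view of the graph yields its connected components, which is the link to the connected components algorithm mentioned in the story.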





[jira] [Commented] (MADLIB-1138) Add basic code coverage support

2017-07-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088486#comment-16088486
 ] 

ASF GitHub Bot commented on MADLIB-1138:


GitHub user edespino opened a pull request:

https://github.com/apache/incubator-madlib/pull/151

MADLIB-1138. Add basic code coverage support:

For developers, add a cmake configuration option, ENABLE_COVERAGE,
which introduces gcov compilation and linking
options (-fprofile-arcs -ftest-coverage). Two supporting make targets
are introduced:

 * GenCoverageReport - Capture gcov counters and generate report
 * ResetCoverageCounters - Zero the gcov counters

Features:

* Counters will be captured in build/CodeCoverage.info file. System
  and Third party metrics will be filtered out of coverage info file
  and stored in CodeCoverage-filtered.info
* HTML report will be created in build/CodeCoverageReport directory

Usage:
* cmake -DENABLE_COVERAGE=ON ..
* 
* make GenCoverageReport
* ... To view report, open build/CodeCoverageReport/index.html in browser ...
* make ResetCoverageCounters
*  ... Run another test ...
* make GenCoverageReport

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/edespino/incubator-madlib MADLIB-1138

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #151


commit 4d4e9e4f6973e9c9bcede7e233e3c4f5c4e90524
Author: Ed Espino 
Date:   2017-07-15T06:06:08Z

MADLIB-1138. Add basic code coverage support:

For developers, add a cmake configuration option ENABLE_COVERAGE
which will introduce gcov compilation and linking
options (-fprofile-arcs -ftest-coverage). Two supporting make targets
are introduced:

 * GenCoverageReport - Capture gcov counters and generate report
 * ResetCoverageCounters - Zero the gcov counters

Features:

* Counters will be captured in build/CodeCoverage.info file. System
  and Third party metrics will be filtered out of coverage info file
  and stored in CodeCoverage-filtered.info

* HTML report will be created in build/CodeCoverageReport directory

Usage:
* cmake -DENABLE_COVERAGE=ON ..
* 
* make GenCoverageReport
* 
* make ResetCoverageCounters
* 
* make GenCoverageReport




> Add basic code coverage support
> ---
>
> Key: MADLIB-1138
> URL: https://issues.apache.org/jira/browse/MADLIB-1138
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Ed Espino
>
> For developers, add a cmake configuration option ENABLE_COVERAGE which 
> will introduce gcov compilation and linking options (-fprofile-arcs 
> -ftest-coverage). Two supporting make targets are introduced:
> * GenCoverageReport - Capture gcov counters and generate report
> * ResetCoverageCounters - Zero the gcov counters
> Features:
> * Counters will be captured in build/CodeCoverage.info file. System and Third 
> party metrics will be filtered out of coverage info file and stored in 
> CodeCoverage-filtered.info
> * HTML report will be created in build/CodeCoverageReport directory
> Usage:
> * cmake -DENABLE_COVERAGE=ON ..
> * 
> * make GenCoverageReport
> * 
> * make ResetCoverageCounters
> * 
> * make GenCoverageReport





[jira] [Commented] (MADLIB-1102) Graph - Breadth First Search / Traversal

2017-07-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084890#comment-16084890
 ] 

ASF GitHub Bot commented on MADLIB-1102:


Github user rashmi815 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/141


> Graph - Breadth First Search / Traversal
> 
>
> Key: MADLIB-1102
> URL: https://issues.apache.org/jira/browse/MADLIB-1102
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Rashmi Raghu
>Assignee: Rashmi Raghu
> Fix For: v1.12
>
>
> Story
> As a MADlib user and developer, I want to implement Breadth First Search / 
> Traversal for a graph. BFS is also a core part of the connected components 
> graph algorithm.
> Acceptance:
> 1) Interface defined
> 2) Design doc updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References:
> [0] [https://en.wikipedia.org/wiki/Breadth-first_search] 
> "Breadth-first search (BFS) is an algorithm for traversing or searching tree 
> or graph data structures. It starts at the tree root (or some arbitrary node 
> of a graph, sometimes referred to as a 'search key'[1]) and explores the 
> neighbor nodes first, before moving to the next level neighbors."
> [1] [http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/]





[jira] [Commented] (MADLIB-413) Neural Networks - MLP

2017-07-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082476#comment-16082476
 ] 

ASF GitHub Bot commented on MADLIB-413:
---

GitHub user cooper-sloan opened a pull request:

https://github.com/apache/incubator-madlib/pull/149

MLP: Multilayer Perceptron

JIRA: MADLIB-413

Add train and predict for multilayer perceptron.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cooper-sloan/incubator-madlib mlp_phase1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #149


commit 3693c70178ea74fb3cb742715c4091ddcc265bdc
Author: Cooper Sloan 
Date:   2017-06-17T00:41:07Z

MLP: Multilayer Perceptron

JIRA: MADLIB-413

Add train and predict for multilayer perceptron.




> Neural Networks - MLP
> -
>
> Key: MADLIB-413
> URL: https://issues.apache.org/jira/browse/MADLIB-413
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Neural Networks
>Reporter: Caleb Welton
>Assignee: Cooper Sloan
> Fix For: v1.12
>
>
> Multilayer perceptron with backpropagation
> Modules:
> * mlp_classification
> * mlp_regression
> Interface
> {code}
> source_table VARCHAR,
> output_table VARCHAR,
> independent_varname VARCHAR, -- Column name for input features; should be a real-valued array
> dependent_varname VARCHAR, -- Column name for target values; should be a real-valued array of size 1 or greater
> hidden_layer_sizes INTEGER[], -- Number of units per hidden layer (can be empty or NULL, in which case there are no hidden layers)
> optimizer_params VARCHAR, -- Specified below
> weights VARCHAR, -- Column name for weights, which weight the loss for each input vector; the column should contain positive real values
> activation_function VARCHAR, -- One of 'sigmoid' (default), 'tanh', 'relu', or any prefix (e.g. 't', 's')
> grouping_cols
> )
> {code}
> where
> {code}
> optimizer_params: -- eg "step_size=0.5, n_tries=5"
> {
> step_size DOUBLE PRECISION, -- Learning rate
> n_iterations INTEGER, -- Number of iterations per try
> n_tries INTEGER, -- Total number of training cycles, with random initializations to avoid local minima
> tolerance DOUBLE PRECISION, -- Distance between weights below which training stops (otherwise training runs until n_iterations is reached)
> }
> {code}
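
The `optimizer_params` string and the prefix matching for `activation_function` described above could be handled roughly as follows. This is a minimal Python sketch, not the code from the pull request; the helper names `parse_optimizer_params` and `resolve_activation` are hypothetical, and only the parameter names and activation names come from the interface listing.

```python
def parse_optimizer_params(text):
    """Parse a string such as 'step_size=0.5, n_tries=5' into a dict."""
    # Parameter names and types are taken from the interface description.
    casts = {
        "step_size": float,
        "n_iterations": int,
        "n_tries": int,
        "tolerance": float,
    }
    params = {}
    for item in text.split(","):
        if not item.strip():
            continue  # tolerate empty input or a trailing comma
        name, _, value = item.partition("=")
        name = name.strip()
        if name not in casts:
            raise ValueError("unknown optimizer parameter: %r" % name)
        params[name] = casts[name](value.strip())
    return params


def resolve_activation(name=None):
    """Expand a prefix such as 't' to a full activation function name."""
    choices = ["sigmoid", "tanh", "relu"]
    if not name:
        return "sigmoid"  # default per the interface description
    matches = [c for c in choices if c.startswith(name.lower())]
    if len(matches) != 1:
        raise ValueError("cannot resolve activation: %r" % name)
    return matches[0]


print(parse_optimizer_params("step_size=0.5, n_tries=5"))
print(resolve_activation("t"))
```

Because the three activation names start with different letters, any non-empty prefix resolves to at most one of them.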





[jira] [Commented] (MADLIB-1083) Graph - add grouping to connected components

2017-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069214#comment-16069214
 ] 

ASF GitHub Bot commented on MADLIB-1083:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/147

Feature: Add grouping to weakly connected components

JIRA: MADLIB-1083

Add grouping support to weakly connected components. Make necessary
changes in the queries involved, docs, and install check. 
@orhankislal 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib 
feature/graph_wcc_grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #147


commit 32d71e6d9ee16bd84c4ef268d0f7493a1fa2bb63
Author: Nandish Jayaram 
Date:   2017-06-29T17:17:01Z

Feature: Add grouping to weakly connected components

JIRA: MADLIB-1083

Add grouping support to weakly connected components. Make necessary
changes in the queries involved, docs, and install check.




> Graph - add grouping to connected components
> 
>
> Key: MADLIB-1083
> URL: https://issues.apache.org/jira/browse/MADLIB-1083
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.12
>
>
> Add a grouping column to the edge table so that the connected components 
> algorithm runs separately for each group.
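
Running connected components separately per group can be pictured with the Python sketch below. It is only an illustration under assumed inputs (rows of `(group, src, dest)`), not the SQL used in the pull request; the function name `components_by_group` is hypothetical. Each group's edges are processed on their own, and every vertex ends up labelled with the smallest vertex id reachable from it when edge direction is ignored.

```python
from collections import defaultdict


def components_by_group(rows):
    """rows: iterable of (group, src, dest). Returns {group: {vertex: label}}."""
    edges_by_group = defaultdict(list)
    for group, src, dest in rows:
        edges_by_group[group].append((src, dest))

    result = {}
    for group, edges in edges_by_group.items():
        # Start with every vertex labelled by its own id.
        label = {v: v for edge in edges for v in edge}
        changed = True
        while changed:
            changed = False
            # Propagate the smaller label across each edge until stable.
            for src, dest in edges:
                smallest = min(label[src], label[dest])
                if label[src] != smallest or label[dest] != smallest:
                    label[src] = label[dest] = smallest
                    changed = True
        result[group] = label
    return result


rows = [("a", 1, 2), ("a", 2, 3), ("b", 1, 2), ("b", 3, 4)]
print(components_by_group(rows))
```

Note that vertex id 1 appears in both groups but is treated independently in each, which is the point of the grouping column.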





[jira] [Commented] (MADLIB-1071) Graph - weakly connect components

2017-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067497#comment-16067497
 ] 

ASF GitHub Bot commented on MADLIB-1071:


Github user njayaram2 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/144


> Graph - weakly connect components
> -
>
> Key: MADLIB-1071
> URL: https://issues.apache.org/jira/browse/MADLIB-1071
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement weakly connected components (ref 
> [0]) in an efficient and scalable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References
> [0] https://en.wikipedia.org/wiki/Connectivity_(graph_theory)
> "A directed graph is called weakly connected if replacing all of its directed 
> edges with undirected edges produces a connected (undirected) graph."
> [1] Grail paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grail deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grail repo with page rank example
> https://github.com/UWQuickstep/Grail
> https://github.com/UWQuickstep/Grail/blob/master/analytics/wcc.sql
> (weakly connected components implementation)





[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065765#comment-16065765
 ] 

ASF GitHub Bot commented on MADLIB-986:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/143


> Stratified sampling
> ---
>
> Key: MADLIB-986
> URL: https://issues.apache.org/jira/browse/MADLIB-986
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>  Labels: starter
> Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the 
> number of rows in each group, so that I can do model building on the sampled 
> data sets.
> The MVP for this story is:
> * sample proportion is global, i.e., single fractional value between 0 and 1
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> Proposed Interface
> {code}
> stratified_sample ( 
>source_table,
>output_table,
>proportion,
>grouping_col, -- optional
>with_replacement, -- optional
>target_cols -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table that contains the sampled data. 
> The output table contains all the columns present in the source table 
> unless otherwise specified in the 'target_cols' parameter below.
> proportion
> FLOAT8 in the range (0,1).  The size of the sample in each stratum will 
> be taken in proportion to the size of the stratum. 
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> no grouping is used so the sampling is non-stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> {code}
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2]. 
>  
> Please review existing MADlib sample functions [3] to see if these can be 
> used as a basis, or built on, for this stratified sample story. 
> References
> [2] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling





[jira] [Commented] (MADLIB-1130) Create a README/HOWTO for anybody interested in reviewing a release

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065760#comment-16065760
 ] 

ASF GitHub Bot commented on MADLIB-1130:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/145


> Create a README/HOWTO for anybody interested in reviewing a release
> ---
>
> Key: MADLIB-1130
> URL: https://issues.apache.org/jira/browse/MADLIB-1130
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Roman Shaposhnik
>Assignee: Roman Shaposhnik
>






[jira] [Commented] (MADLIB-1130) Create a README/HOWTO for anybody interested in reviewing a release

2017-06-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065702#comment-16065702
 ] 

ASF GitHub Bot commented on MADLIB-1130:


Github user fmcquillan99 commented on the issue:

https://github.com/apache/incubator-madlib/pull/145
  
LGTM 


> Create a README/HOWTO for anybody interested in reviewing a release
> ---
>
> Key: MADLIB-1130
> URL: https://issues.apache.org/jira/browse/MADLIB-1130
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Roman Shaposhnik
>Assignee: Roman Shaposhnik
>






[jira] [Commented] (MADLIB-1130) Create a README/HOWTO for anybody interested in reviewing a release

2017-06-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063934#comment-16063934
 ] 

ASF GitHub Bot commented on MADLIB-1130:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-madlib/pull/145
  
Can one of the admins verify this patch?


> Create a README/HOWTO for anybody interested in reviewing a release
> ---
>
> Key: MADLIB-1130
> URL: https://issues.apache.org/jira/browse/MADLIB-1130
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Roman Shaposhnik
>Assignee: Roman Shaposhnik
>






[jira] [Commented] (MADLIB-1130) Create a README/HOWTO for anybody interested in reviewing a release

2017-06-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063926#comment-16063926
 ] 

ASF GitHub Bot commented on MADLIB-1130:


GitHub user rvs opened a pull request:

https://github.com/apache/incubator-madlib/pull/145

MADLIB-1130. Create a README/HOWTO for anybody interested in reviewing a release

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvs/incubator-madlib master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/145.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #145


commit 836d25f6afca1cad8fdccde69551e3455d06be29
Author: Roman Shaposhnik 
Date:   2017-06-26T22:50:54Z

MADLIB-1130. Create a README/HOWTO for anybody interested in reviewing a 
release




> Create a README/HOWTO for anybody interested in reviewing a release
> ---
>
> Key: MADLIB-1130
> URL: https://issues.apache.org/jira/browse/MADLIB-1130
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Roman Shaposhnik
>Assignee: Roman Shaposhnik
>






[jira] [Commented] (MADLIB-1071) Graph - weakly connect components

2017-06-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060105#comment-16060105
 ] 

ASF GitHub Bot commented on MADLIB-1071:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/144

Feature: Weakly Connected Components

JIRA: MADLIB-1071

Implement a new module in graph, that finds all weakly connected
components of a directed graph. A weakly connected component is a
subgraph where every node has a path to every other node, ignoring
edge directions.
This does not have grouping support yet, although the interface
has it defined already.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib feature/graph_wcc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/144.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #144


commit a737d8f006f8c394eccf02274ae837fa72140c42
Author: Nandish Jayaram 
Date:   2017-06-22T21:49:51Z

Feature: Weakly Connected Components

JIRA: MADLIB-1071

Implement a new module in graph, that finds all weakly connected
components of a directed graph. A weakly connected component is a
subgraph where every node has a path to every other node, ignoring
edge directions.
This does not have grouping support yet, although the interface
has it defined already.




> Graph - weakly connect components
> -
>
> Key: MADLIB-1071
> URL: https://issues.apache.org/jira/browse/MADLIB-1071
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement weakly connected components (ref 
> [0]) in an efficient and scalable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References
> [0] https://en.wikipedia.org/wiki/Connectivity_(graph_theory)
> "A directed graph is called weakly connected if replacing all of its directed 
> edges with undirected edges produces a connected (undirected) graph."
> [1] Grail paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grail deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grail repo with page rank example
> https://github.com/UWQuickstep/Grail
> https://github.com/UWQuickstep/Grail/blob/master/analytics/wcc.sql
> (weakly connected components implementation)
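
The definition quoted in reference [0] (treat every directed edge as undirected, then find connected pieces) can be sketched with a union-find structure. This is an in-memory Python illustration only and is not how the MADlib module or the Grail SQL is written; the function name `weakly_connected_components` and the edge-list input are assumptions for the example.

```python
def weakly_connected_components(edges):
    """Map each vertex to a component id (the smallest vertex id in it)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dest in edges:
        # Edge direction is ignored: both endpoints join the same set.
        root_a, root_b = find(src), find(dest)
        if root_a != root_b:
            # Keep the smaller id as root so labels are deterministic.
            lo, hi = sorted((root_a, root_b))
            parent[hi] = lo

    return {v: find(v) for v in list(parent)}


edges = [(1, 2), (3, 2), (4, 5)]
print(weakly_connected_components(edges))
```

Vertices 1 and 3 are not connected by any directed path, but both point at vertex 2, so all three fall into one weakly connected component.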





[jira] [Commented] (MADLIB-986) Stratified sampling

2017-06-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058421#comment-16058421
 ] 

ASF GitHub Bot commented on MADLIB-986:
---

GitHub user orhankislal opened a pull request:

https://github.com/apache/incubator-madlib/pull/143

Sample: Add stratified sampling

JIRA: MADLIB-986

Add stratified sampling with the following options.
- With or without grouping
- With or without replacement
- A specific set of target columns or all of them

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/incubator-madlib 
feature/strs_take2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #143


commit 6ef23fc00cf06ac027f69229d7cf0cf444a7f456
Author: Orhan Kislal 
Date:   2017-06-21T23:07:08Z

Sample: Add stratified sampling

JIRA: MADLIB-986

Add stratified sampling with the following options.
- With or without grouping
- With or without replacement
- A specific set of target columns or all of them




> Stratified sampling
> ---
>
> Key: MADLIB-986
> URL: https://issues.apache.org/jira/browse/MADLIB-986
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>  Labels: starter
> Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the 
> number of rows in each group, so that I can do model building on the sampled 
> data sets.
> The MVP for this story is:
> * sample proportion is global, i.e., single fractional value between 0 and 1
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> Proposed Interface
> {code}
> stratified_sample ( 
>source_table,
>output_table,
>proportion,
>grouping_col, -- optional
>with_replacement, -- optional
>target_cols -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table that contains the sampled data. 
> The output table contains all the columns present in the source table 
> unless otherwise specified in the 'target_cols' parameter below.
> proportion
> FLOAT8 in the range (0,1).  The size of the sample in each stratum will 
> be taken in proportion to the size of the stratum. 
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> no grouping is used so the sampling is non-stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> {code}
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2]. 
>  
> Please review existing MADlib sample functions [3] to see if these can be 
> used as a basis, or built on, for this stratified sample story. 
> References
> [2] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling
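
The proposed interface above can be pictured with a small in-memory Python sketch. It is not the implementation from the pull request: rows are assumed to be a list of dicts, `target_cols` is left out, the `seed` argument is added only to make the example repeatable, and rounding each stratum's sample size up with `ceil` is an assumption that the issue does not specify.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_col=None,
                      with_replacement=False, seed=None):
    """Sample `proportion` of the rows from each stratum of a list of dicts."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the range (0, 1)")
    rng = random.Random(seed)

    # With no grouping column, all rows fall into a single stratum.
    strata = defaultdict(list)
    for row in rows:
        key = row[grouping_col] if grouping_col is not None else None
        strata[key].append(row)

    sample = []
    for members in strata.values():
        size = math.ceil(proportion * len(members))  # rounding is an assumption
        if with_replacement:
            sample.extend(rng.choices(members, k=size))
        else:
            sample.extend(rng.sample(members, size))
    return sample


rows = [{"id": i, "grp": "a" if i < 4 else "b"} for i in range(10)]
picked = stratified_sample(rows, 0.5, grouping_col="grp", seed=0)
print(len(picked))
```

With a proportion of 0.5, the four-row group contributes two rows and the six-row group contributes three, so each stratum is represented in proportion to its size.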





[jira] [Commented] (MADLIB-1102) Graph - Breadth First Search / Traversal

2017-06-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051020#comment-16051020
 ] 

ASF GitHub Bot commented on MADLIB-1102:


GitHub user rashmi815 opened a pull request:

https://github.com/apache/incubator-madlib/pull/141

Graph: Add Breadth-first Search algorithm with grouping support

JIRA: MADLIB-1102

Graph: Add Breadth-first Search algorithm with grouping support.
This algorithm searches or traverses connected nodes in a graph in 
breadth-first order starting at a user-specified origin node.

Documentation and install-check to follow in subsequent commits

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rashmi815/incubator-madlib feature/bfs_v02

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #141


commit 5069dd95d25d1f62190913b13a207e86d62cef13
Author: Rashmi Raghu 
Date:   2017-06-15T20:25:31Z

JIRA: MADLIB-1102

Add Breadth-first Search algorithm with grouping support.
This algorithm searches or traverses connected nodes in a graph in 
breadth-first order starting at a user-specified origin node.

Documentation and install-check to follow in subsequent commits




> Graph - Breadth First Search / Traversal
> 
>
> Key: MADLIB-1102
> URL: https://issues.apache.org/jira/browse/MADLIB-1102
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Rashmi Raghu
>Assignee: Rashmi Raghu
> Fix For: v1.12
>
>
> Story
> As a MADlib user and developer, I want to implement Breadth First Search / 
> Traversal for a graph. BFS is also a core part of the connected components 
> graph algorithm.
> Acceptance:
> 1) Interface defined
> 2) Design doc updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References:
> [0] [https://en.wikipedia.org/wiki/Breadth-first_search] 
> "Breadth-first search (BFS) is an algorithm for traversing or searching tree 
> or graph data structures. It starts at the tree root (or some arbitrary node 
> of a graph, sometimes referred to as a 'search key'[1]) and explores the 
> neighbor nodes first, before moving to the next level neighbors."
> [1] [http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/]
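
The traversal described in the issue above can be sketched in a few lines of Python. This is an illustrative in-memory version, not the SQL-based module from the pull request, and it leaves out grouping; the function name `bfs_distances` and the `(src, dest)` edge-list format are assumptions made for the example.

```python
from collections import defaultdict, deque


def bfs_distances(edges, source):
    """Return {vertex: hop count from source} for a directed edge list."""
    # Build an adjacency list from (src, dest) pairs.
    neighbors = defaultdict(list)
    for src, dest in edges:
        neighbors[src].append(dest)

    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        # Visit all direct neighbors before moving one level deeper.
        for nxt in neighbors[vertex]:
            if nxt not in distances:
                distances[nxt] = distances[vertex] + 1
                queue.append(nxt)
    return distances


edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (6, 7)]
print(bfs_distances(edges, 1))
```

Vertices 6 and 7 are not reachable from vertex 1, so they do not appear in the result for that origin.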





[jira] [Commented] (MADLIB-1117) Add "columns to process per pass" as an optional param for summary()

2017-06-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043374#comment-16043374
 ] 

ASF GitHub Bot commented on MADLIB-1117:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/138


> Add "columns to process per pass" as an optional param for summary()
> 
>
> Key: MADLIB-1117
> URL: https://issues.apache.org/jira/browse/MADLIB-1117
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Sketch-based Estimators
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.12
>
>
> Context
> The summary() function
> http://madlib.incubator.apache.org/docs/latest/group__grp__summary.html
> currently processes 15 columns per pass to keep memory usage below a 1 GB 
> limit.  This limit is somewhat arbitrary, since memory usage depends on many 
> things, including the data set and which params in summary() are set.  If more 
> columns per pass could be used, summary() would run faster.
> Story
> As a MADlib developer, I want to add "columns to process per pass" as an 
> optional param for summary() function.  Default: use 15 columns (which is the 
> current setting).  Suggested param name:  "columns_per_pass" though if you 
> have a better name, that's fine.
> Acceptance
> 1) Add new optional parameter and update docs.  Please add a note so it is 
> clear what this control does.
> 2) Write and pass tests.
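
What the proposed control does can be shown with a short Python sketch: the column list is cut into batches, and each batch would be processed in one pass. The helper name `column_batches` is hypothetical and this is not the code in summary(); only the parameter name `columns_per_pass` and the default of 15 come from the issue.

```python
def column_batches(columns, columns_per_pass=15):
    """Yield successive lists of at most `columns_per_pass` column names."""
    if columns_per_pass < 1:
        raise ValueError("columns_per_pass must be a positive integer")
    for start in range(0, len(columns), columns_per_pass):
        yield columns[start:start + columns_per_pass]


cols = ["c%d" % i for i in range(40)]
print([len(batch) for batch in column_batches(cols)])
print([len(batch) for batch in column_batches(cols, columns_per_pass=20)])
```

For a 40-column table, the default gives three passes, while a setting of 20 gives two, at the cost of holding more per-column state in memory during each pass.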



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1120) Promote cardinality estimators to top level modules

2017-06-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043375#comment-16043375
 ] 

ASF GitHub Bot commented on MADLIB-1120:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/139


> Promote cardinality estimators to top level modules
> ---
>
> Key: MADLIB-1120
> URL: https://issues.apache.org/jira/browse/MADLIB-1120
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Sketch-based Estimators
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to promote the cardinality estimators
> http://madlib.incubator.apache.org/docs/latest/group__grp__sketches.html
> to top level modules, so that they are more visible to users.
> Acceptance
> 1) What changes are required for the software?
> 2) Define interface changes, if any required.
> 3) Update docs and indicate clearly that these are UDAs, not stored 
> procedures. Might be good to add an example as well.





[jira] [Commented] (MADLIB-1120) Promote cardinality estimators to top level modules

2017-06-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16039845#comment-16039845
 ] 

ASF GitHub Bot commented on MADLIB-1120:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/139

Sketch: Promote sketch methods to top-level

JIRA: MADLIB-1120

This commit fixes some of the documentation for sketch and moves the
module out of "Early stage development".

Closes #139

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
feature/sketch_top_level

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/139.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #139


commit 6a672d48683f6997ff16831bb11841263a54de9e
Author: Rahul Iyer 
Date:   2017-06-06T23:09:30Z

Sketch: Promote sketch methods to top-level

JIRA: MADLIB-1120

This commit fixes some of the documentation for sketch and moves the
module out of "Early stage development".

Closes #139




> Promote cardinality estimators to top level modules
> ---
>
> Key: MADLIB-1120
> URL: https://issues.apache.org/jira/browse/MADLIB-1120
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Sketch-based Estimators
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to promote the cardinality estimators
> http://madlib.incubator.apache.org/docs/latest/group__grp__sketches.html
> to top level modules, so that they are more visible to users.
> Acceptance
> 1) What changes are required for the software?
> 2) Define interface changes, if any required.
> 3) Update docs and indicate clearly that these are UDAs, not stored 
> procedures. Might be good to add an example as well.
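
To illustrate why a cardinality estimator fits the aggregate (UDA) model, here is a simplified Flajolet-Martin style sketch in Python. It is a teaching example under stated assumptions, not MADlib's implementation: it uses a single 32-bit bitmap and an MD5-based hash chosen for convenience, the function names are hypothetical, and a single bitmap gives only a rough estimate.

```python
import hashlib


def fm_bitmap(values):
    """Fold values into a Flajolet-Martin bitmap stored as one integer."""
    bitmap = 0
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash of the value
        # Index of the least significant 1 bit (31 if the hash is 0).
        rho = (h & -h).bit_length() - 1 if h else 31
        bitmap |= 1 << rho
    return bitmap


def fm_estimate(bitmap):
    """Estimate the distinct count from the lowest unset bit of the bitmap."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351  # standard Flajolet-Martin correction factor


part_one = fm_bitmap(range(0, 100))
part_two = fm_bitmap(range(100, 200))
merged = part_one | part_two  # per-segment states combine with bitwise OR
print(merged == fm_bitmap(range(200)), fm_estimate(merged))
```

The state is a small bitmap that ignores duplicates and can be merged with a bitwise OR, which is what lets this kind of estimator run as an aggregate over rows spread across segments.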





[jira] [Commented] (MADLIB-1100) PageRank default threshold value seems to be 1e-5

2017-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029704#comment-16029704
 ] 

ASF GitHub Bot commented on MADLIB-1100:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/137


> PageRank default threshold value seems to be 1e-5
> -
>
> Key: MADLIB-1100
> URL: https://issues.apache.org/jira/browse/MADLIB-1100
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Graph
>Reporter: Nandish Jayaram
> Fix For: v1.12
>
>
> The threshold parameter in MADlib's pagerank is supposed to default to 
> 1/(num_of_vertices * 100), as mentioned in the docs. But the default value 
> seems to be 1e-5 instead. This causes PageRank to converge right after the 
> first iteration for larger graphs (e.g. a graph with a million nodes).
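
To put numbers on the mismatch described above, here is a minimal Python sketch (not MADlib's code). It assumes ranks start at a uniform 1/N and that convergence is judged by comparing per-vertex rank changes against the threshold; `documented_threshold` is a hypothetical helper name.

```python
def documented_threshold(num_vertices):
    """Default threshold as stated in the docs: 1 / (num_of_vertices * 100)."""
    return 1.0 / (num_vertices * 100)

num_vertices = 1000000                  # the "million nodes graph" from the report
initial_rank = 1.0 / num_vertices       # assumption: uniform starting rank
observed_default = 1e-5                 # value the report says is actually used
expected_default = documented_threshold(num_vertices)

# Typical per-vertex changes are on the order of the rank itself (about 1/N),
# so a threshold above 1/N is easily satisfied after a single iteration.
stops_too_early = initial_rank < observed_default
print(expected_default, observed_default, stops_too_early)
```

Under these assumptions the documented default is about three orders of magnitude tighter than 1e-5 for a graph of this size, which is consistent with the early convergence reported.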





[jira] [Commented] (MADLIB-1104) Improve efficiency of summary function

2017-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025411#comment-16025411
 ] 

ASF GitHub Bot commented on MADLIB-1104:


Github user fmcquillan99 commented on the issue:

https://github.com/apache/incubator-madlib/pull/135
  
https://issues.apache.org/jira/browse/MADLIB-1104


> Improve efficiency of summary function
> --
>
> Key: MADLIB-1104
> URL: https://issues.apache.org/jira/browse/MADLIB-1104
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Descriptive Statistics
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> The summary function
> http://madlib.incubator.apache.org/docs/latest/group__grp__summary.html
> uses some of the cardinality estimators in
> http://madlib.incubator.apache.org/docs/latest/group__grp__sketches.html
> Is there a way to improve run-time performance of these modules?





[jira] [Commented] (MADLIB-1108) Fix issues with the download page

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021688#comment-16021688
 ] 

ASF GitHub Bot commented on MADLIB-1108:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib-site/pull/7


> Fix issues with the download page
> -
>
> Key: MADLIB-1108
> URL: https://issues.apache.org/jira/browse/MADLIB-1108
> Project: Apache MADlib
>  Issue Type: Improvement
>Reporter: Roman Shaposhnik
>
> sebb identified a few issues with the Download page
> The website looks very nice.
> However there are some problems with the download page.
> The links all appear to point to
> https://dist.apache.org/repos/dist/release/incubator/madlib/
> This is not allowed; download links for current releases must point to
> the ASF mirror system.
> The dist release directory has not been tidied up; only the latest
> release(s) should be present on the mirror system.
> It's unfair on the 3rd party mirrors to expect them to carry old releases.
> There is no information on the download page about how to check sigs or
> hashes, and no link to the KEYS file.
> KEYS, sigs and hashes should be linked from the following tree
> https://www.apache.org/dist/incubator/madlib/
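
The description above notes that the page gives no guidance on checking hashes. As a rough illustration of what such a check involves, the Python sketch below compares a file's SHA-512 digest with a published value. The choice of SHA-512, the bare hex digest format, and the helper names are assumptions made for this example; signature verification against the KEYS file is a separate step and is not shown.

```python
import hashlib
import os
import tempfile

def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hex):
    """Compare the file's digest with a published bare hex digest."""
    return sha512_of(path) == published_hex.strip().lower()

# Demo with a throwaway file standing in for a downloaded release artifact.
payload = b"stand-in for a release artifact"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

published = hashlib.sha512(payload).hexdigest()
hash_ok = matches_published_hash(demo_path, published)
tampered_ok = matches_published_hash(demo_path, "0" * 128)
os.remove(demo_path)
print(hash_ok, tampered_ok)
```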





[jira] [Commented] (MADLIB-1108) Fix issues with the download page

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021511#comment-16021511
 ] 

ASF GitHub Bot commented on MADLIB-1108:


Github user fmcquillan99 commented on the issue:

https://github.com/apache/incubator-madlib-site/pull/7
  
Checked the new download page from this PR for formatting, and checked that the 
new download links work for current and previous ASF incubating releases.


![image](https://cloud.githubusercontent.com/assets/10538173/26366865/78fbaf08-3fa1-11e7-9a54-a3d7ea0a83be.png)

For sebb's comment:

"The dist release directory has not been tidied up; only the latest
release(s) should be present on the mirror system.
It's unfair on the 3rd party mirrors to expect them to carry old releases. "

I am not sure what the issue is here, since MADlib looks pretty much like 
the other incubating projects I looked at regarding directory structure on dist:

https://archive.apache.org/dist/incubator/madlib/

https://archive.apache.org/dist/incubator/ranger/
https://archive.apache.org/dist/incubator/kudu/
etc.






[jira] [Commented] (MADLIB-1108) Fix issues with the download page

2017-05-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021389#comment-16021389
 ] 

ASF GitHub Bot commented on MADLIB-1108:


GitHub user rvs opened a pull request:

https://github.com/apache/incubator-madlib-site/pull/7

MADLIB-1108. Fix issues with the download page



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvs/incubator-madlib-site asf-site

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib-site/pull/7.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7


commit 924263246bbe82bcd5b413822eb56ab4c33e62be
Author: Roman Shaposhnik 
Date:   2017-05-23T15:25:47Z

MADLIB-1108. Fix issues with the download page






[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005862#comment-16005862
 ] 

ASF GitHub Bot commented on MADLIB-1097:


Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/131


> Random Forest does not allow NULL values in features
> 
>
> Key: MADLIB-1097
> URL: https://issues.apache.org/jira/browse/MADLIB-1097
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Random Forest
>Reporter: Nandish Jayaram
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.12
>
>
> Running forest_train() with features that have NULL values results in the 
> following error:
> {code}
> psql:/tmp/madlib.LkFR_5/recursive_partitioning/test/random_forest.sql_in.tmp:79: ERROR:  spiexceptions.InvalidParameterValue: Function "_rf_cat_imp_score(bytea8,integer[],double precision[],integer[],integer,double precision,boolean,double precision[])": Invalid type conversion. Null where not expected.
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 605, in forest_train
>   PL/Python function "forest_train", line 1052, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {code}
> The following are the input table and parameters used:
> {code:sql}
> CREATE TABLE dt_golf (
> id integer NOT NULL,
> "OUTLOOK" text,
> temperature double precision,
> humidity double precision,
> windy boolean,
> class text
> ) ;
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, false, 'Don''t Play'),
> (2, 'sunny', 80, 90, true, 'Don''t Play'),
> (3, 'overcast', 83, 78, false, 'Play'),
> (4, 'rain', NULL, 96, false, 'Play'),
> (5, 'rain', 68, 80, NULL, 'Play'),
> (6, 'rain', 65, 70, true, 'Don''t Play'),
> (7, 'overcast', 64, 65, true, 'Play'),
> (8, 'sunny', 72, 95, false, 'Don''t Play'),
> (9, 'sunny', 69, 70, false, 'Play'),
> (10, 'rain', 75, 80, false, 'Play'),
> (11, 'sunny', 75, 70, true, 'Play'),
> (12, 'overcast', 72, 90, true, 'Play'),
> (13, 'overcast', 81, 75, false, 'Play'),
> (14, 'rain', 71, 80, true, 'Don''t Play');
> SELECT forest_train(
>   'dt_golf'::TEXT,             -- source table
>   'train_output'::TEXT,        -- output model table
>   'id'::TEXT,                  -- id column
>   'class'::TEXT,               -- response
>   'windy, temperature'::TEXT,  -- features
>   NULL::TEXT,                  -- exclude columns
>   NULL::TEXT,                  -- no grouping
>   5,                           -- num of trees
>   1,                           -- num of random features
>   TRUE::BOOLEAN,               -- importance
>   1::INTEGER,                  -- num_permutations
>   10::INTEGER,                 -- max depth
>   1::INTEGER,                  -- min split
>   1::INTEGER,                  -- min bucket
>   8::INTEGER,                  -- number of bins per continuous variable
>   'max_surrogates=0',
>   FALSE
>   );
> {code}





[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005863#comment-16005863
 ] 

ASF GitHub Bot commented on MADLIB-965:
---

Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/132


> RF and DT should accept array input for feature vector
> --
>
> Key: MADLIB-965
> URL: https://issues.apache.org/jira/browse/MADLIB-965
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Decision Tree, Module: Random Forest
>Reporter: Rashmi Raghu
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.12
>
> Attachments: DT and RF work1.ipynb
>
>
> We were trying to test whether the RF module could handle a column containing 
> an array of features as input (instead of each feature in a separate column). 
> The result was an error message, but the message does not make the source of 
> the error clear (i.e. is it because of the array feature input column or 
> something else). The example table, query and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
> id integer NOT NULL,
> "OUTLOOK" text,
> temperature double precision,
> humidity double precision,
> windy text,
> class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as 
> select id, array[temperature, humidity] as input_array, class
> from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train(
>   'dt_golf_array',  -- source table
>   'train_output',   -- output model table
>   'id',             -- id column
>   'class',          -- response
>   'input_array',    -- features
>   NULL,             -- exclude columns
>   NULL,             -- grouping columns
>   20::integer,      -- number of trees
>   1::integer,       -- number of random features
>   TRUE::boolean,    -- variable importance
>   1::integer,       -- num_permutations
>   8::integer,       -- max depth
>   3::integer,       -- min split
>   1::integer,       -- min bucket
>   10::integer       -- number of splits per continuous variable
>   );
> NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 
> 'id' as the Greenplum Database data distribution key for this table.
> HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make 
> sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ** Error **
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}





[jira] [Commented] (MADLIB-965) RF and DT should accept array input for feature vector

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005650#comment-16005650
 ] 

ASF GitHub Bot commented on MADLIB-965:
---

GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/132

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently array columns are not allowed as features in decision tree and
random forest train functions. This commit adds support for a mixed list
of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded to treat each element
of the array as a feature.
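
The expansion described above can be pictured with a short Python sketch. This is only an illustration of the idea, not the code in the pull request: `expand_features` and the `name[index]` labelling are hypothetical, and the sample row reuses values from the `dt_golf` example in the issue.

```python
def expand_features(feature_columns, row):
    """Flatten a mixed list of scalar and array columns into one feature list.

    Returns (names, values); array elements get 1-based suffixed names.
    """
    names, values = [], []
    for column in feature_columns:
        cell = row[column]
        if isinstance(cell, (list, tuple)):
            # An array column contributes one feature per element.
            for index, element in enumerate(cell, start=1):
                names.append("%s[%d]" % (column, index))
                values.append(element)
        else:
            # A scalar column contributes exactly one feature.
            names.append(column)
            values.append(cell)
    return names, values

# Row shaped like dt_golf_array, plus one scalar column for the mixed case.
row = {"input_array": [85.0, 85.0], "windy": "false"}
names, values = expand_features(["input_array", "windy"], row)
print(names)
print(values)
```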

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
feature/dt_array_feature_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #132


commit 2f1ddee5ab957684988dac575627760a1dfd67bb
Author: Rahul Iyer 
Date:   2017-05-09T21:50:52Z

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently array columns are not allowed as features in decision tree and
random forest train functions. This commit adds support for a mixed list
of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded to treat each element
of the array as a feature.





[jira] [Commented] (MADLIB-1097) Random Forest does not allow NULL values in features

2017-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005621#comment-16005621
 ] 

ASF GitHub Bot commented on MADLIB-1097:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/131

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added `filter_null` string obtained from decision_tree.py into the OOB
view to exclude rows that have NULL dependent values.
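
As a rough picture of what filtering NULL dependent values means, the Python sketch below builds a predicate and applies it to a query over the source table. The actual `filter_null` string comes from decision_tree.py and its form is not shown in this message, so `build_filter_null`, `oob_view_query` and the resulting SQL are hypothetical.

```python
def build_filter_null(dependent_column):
    """Build a SQL predicate that keeps rows with a non-NULL dependent value."""
    quoted = dependent_column.replace('"', '""')   # escape embedded quotes
    return '"%s" IS NOT NULL' % quoted

def oob_view_query(source_table, dependent_column):
    """Compose a SELECT over the source table with the NULL filter applied."""
    return "SELECT * FROM {table} WHERE {predicate}".format(
        table=source_table, predicate=build_filter_null(dependent_column))

# Table and column names taken from the dt_golf example in the issue.
query = oob_view_query("dt_golf", "class")
print(query)
```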

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/rf_null_dep_values

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #131


commit 9b45ecaaadb9e0d4999dc49e72df8a97cb7692d2
Author: Rahul Iyer 
Date:   2017-05-04T00:07:55Z

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added `filter_null` string obtained from decision_tree.py into the OOB
view to exclude rows that have NULL dependent values.






[jira] [Commented] (MADLIB-1098) Corrections for MADlib naming consistency

2017-05-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997225#comment-15997225
 ] 

ASF GitHub Bot commented on MADLIB-1098:


Github user rvs commented on the issue:

https://github.com/apache/incubator-madlib/pull/130
  
Good point. Let me update the PR.


> Corrections for MADlib naming consistency
> -
>
> Key: MADLIB-1098
> URL: https://issues.apache.org/jira/browse/MADLIB-1098
> Project: Apache MADlib
>  Issue Type: Improvement
>Reporter: Rashmi Raghu
>Assignee: Rashmi Raghu
>Priority: Minor
> Fix For: v1.11
>
>
> Several locations (e.g. Read Me screen / Intro screen and others) which 
> contain the name MADlib should be changed to 'Apache MADlib (Incubating)'. 
> This is based on observations from the community on the dev and user mailing 
> lists (see below for excerpts from those discussions).
> 
> Copying relevant excerpts from Ed's email:
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201705.mbox/%3CCAHAuQDyn-drvZ64%2B4MYL2A%2BbDdea%3DtTew2boSz8ChdUPH2Aj_Q%40mail.gmail.com%3E
> > ==
> > Source miscellaneous: HAWQ_Install.txt
> >
> >   Observation:
> >
> >   - The file references the product name as "MADlib" and not "Apache
> > MADlib (Incubating)". Is this file still valid?
> >
> > ==
> > CONVENIENCE BINARIES
> > --
> >
> > --
> > Mac Installer DMG file: apache-madlib-1.11-incubating-bin-Darwin.dmg
> > --
> >
> >   Observation:
> >
> >   - The DMG(apache-madlib-1.11-incubating-bin-Darwin.dmg) contains a
> > pkg file named "madlib-1.11-Darwin.pkg". Shouldn't it be called
> > "apache-madlib-1.11-incubating-Darwin.pkg"?
> >
> > Similarly, the DMG base folder name is madlib-1.11.Darwin.
> >
> > Mac Installer Package
> >
> > o Introduction screen
> >
> >   Observation:
> >
> >   - The introduction screen identifies the product name as
> > "MADlib". Shouldn't there be a mention of the project name being
> > "Apache MADlib (Incubating)"?
> >
> > o Read Me screen
> >
> >   Observation:
> >
> >   - Similar to the initial screen, there is no mention of the Apache
> > project except for the link to the project's wiki.
> >
> > o Remaining screens look reasonable (with exception of no Apache
> >   references).
> >
> > o The default application window name is "Install MADlib"
> >
> > Observation:
> >
> >   - Similar to the Introduction screen, should the name be "Install Apache
> > MADlib (Incubating)"?
> >
> >   - Look for other opportunities to reference the product name as
> > "Apache MADlib (Incubating)".
> >
> > --
> > Linux RPM: apache-madlib-1.11-incubating-bin-Linux.rpm
> > --
> >
> >   Observation:
> >
> >   - It appears the SPEC file used (possibly generated) references the
> > product name as "madlib".  Again, shouldn't there be references to
> > the product name as "Apache MADlib" scattered about?
> > Unfortunately, I am not sure if this should change or not. It
> > might help for someone on the team to review other Apache projects
> > convenience binary RPMs to see if something should be
> > addressed. The podling's mentor might be able to provide
> > additional direction as well.
> >
> > This can be seen in the following "rpm -qi madlib" output:
> >
> > [root@e0f4d3349d2d MADlib]# rpm -qi madlib
> > Name        : madlib
> > Version     : 1.11
> > Release     : 1
> > Architecture: x86_64
> > Install Date: Wed May  3 04:00:10 2017
> > Group       : Development/Libraries
> > Size        : 83575356
> > License     : ASL 2.0
> > Signature   : (none)
> > Source RPM  : madlib-1.11-1.src.rpm
> > Build Date  : Tue May  2 19:03:21 2017
> > Build Host  : gpdb1.eng.pivotal.io
> > Relocations : /usr/local
> > Vendor      : MADlib
> > Summary     : Open-Source Library for Scalable in-Database Analytics
> > Description :
> > MADlib is an open-source library for scalable in-database analytics. It
> > provides data-parallel implementations of mathematical, statistical and
> > machine learning methods for structured and unstructured data.
> >
> > The MADlib mission: to foster widespread development of 

[jira] [Commented] (MADLIB-1098) Corrections for MADlib naming consistency

2017-05-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997165#comment-15997165
 ] 

ASF GitHub Bot commented on MADLIB-1098:


GitHub user rvs opened a pull request:

https://github.com/apache/incubator-madlib/pull/130

MADLIB-1098. Corrections for MADlib naming consistency



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rvs/incubator-madlib master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/130.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #130


commit eeed91b570120fe4d47cc2f2f07ed1aa304acc14
Author: Roman Shaposhnik 
Date:   2017-05-04T18:16:42Z

MADLIB-1098. Corrections for MADlib naming consistency





[jira] [Commented] (MADLIB-1092) Elastic Net gives inconsistent results with grouping

2017-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989660#comment-15989660
 ] 

ASF GitHub Bot commented on MADLIB-1092:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/126


> Elastic Net gives inconsistent results with grouping
> 
>
> Key: MADLIB-1092
> URL: https://issues.apache.org/jira/browse/MADLIB-1092
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Regularized Regression
>Reporter: Nandish Jayaram
> Fix For: v1.11
>
>
> Elastic net train seems to be giving incorrect results when used with 
> grouping.
> Steps:
> - Run elastic net (train) on a table and obtain a model (M1). 
> - Create a new table with all rows in the original input table and assign 
> group value 1 for it.
> - Replicate the rows in the table and assign group value 2 for the replicated 
> rows.
> - Run the elastic net train function with grouping while keeping the same 
> optimization parameters for the function.
> Result:
> - The model (for each group) when run with grouping is different from the 
> model M1.
> - The model for both the groups is the same, but not same as M1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1086) Unnest 2-D array by one level (i.e. into rows of 1-D arrays)

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985339#comment-15985339
 ] 

ASF GitHub Bot commented on MADLIB-1086:


Github user rashmi815 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/116


> Unnest 2-D array by one level (i.e. into rows of 1-D arrays)
> 
>
> Key: MADLIB-1086
> URL: https://issues.apache.org/jira/browse/MADLIB-1086
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Utilities
>Reporter: Frank McQuillan
>Assignee: Rashmi Raghu
>Priority: Minor
> Fix For: v1.11
>
>
> Context
> Currently k-means returns the following
> {code}
> centroids| 
> {{13.75333,1.905,2.425,16.06667,90.3,2.805,2.98,0.29,2.005,5.406633,1.041667,
>  3.318333,1020.833},
>
> {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}}
> cluster_variance | {122999.110416013,30561.74805}
> objective_fn | 153560.858466013
> frac_reassigned  | 0
> num_iterations   | 3
> {code}
> Story
> As a data scientist, I want to unnest a 2-D array by one level (i.e. into rows 
> of 1-D arrays) in K-means, so that I can get one centroid per row for 
> follow-on operations.
> Acceptance
> 1) Add function to array operations
> http://madlib.incubator.apache.org/docs/latest/group__grp__array.html
> 2) Add an example in k-means
>  http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> to demonstrate usage
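
A minimal Python sketch of the requested behavior, using the two centroids
from the k-means output quoted above. The function name `unnest_2d_to_rows`
and the 1-based row id are assumptions made for illustration; they are not
taken from the MADlib implementation.

```python
def unnest_2d_to_rows(array_2d):
    """Yield (row_id, row) pairs, one per 1-D slice of a 2-D array.

    Hypothetical helper: the name and the 1-based row ids are illustrative.
    """
    width = len(array_2d[0]) if array_2d else 0
    for row_id, row in enumerate(array_2d, start=1):
        if len(row) != width:
            raise ValueError("input is not a rectangular 2-D array")
        yield row_id, list(row)


# The two centroids from the k-means output in the issue description.
centroids = [
    [13.75333, 1.905, 2.425, 16.06667, 90.3, 2.805, 2.98, 0.29, 2.005,
     5.406633, 1.041667, 3.318333, 1020.833],
    [14.255, 1.9325, 2.5025, 16.05, 110.5, 3.055, 2.9775, 0.2975, 1.845,
     6.2125, 0.9975, 3.365, 1378.75],
]

# One centroid per row, ready for follow-on operations.
rows = list(unnest_2d_to_rows(centroids))
```

Each element of `rows` pairs a row id with one 1-D centroid, which is the
"one centroid per row" shape the story asks for.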



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1092) Elastic Net gives inconsistent results with grouping

2017-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985188#comment-15985188
 ] 

ASF GitHub Bot commented on MADLIB-1092:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/126

Bugfix: Elastic net gives inconsistent result

JIRA: MADLIB-1092

- Elastic net used to consider the number of rows as the total number
of rows in the table even when grouping was used. This fix changes
that to consider the number of rows in a group while computing IGD.
- Elastic net used to consider mean and standard deviation for both
independent and dependent variables based on the entire table even
when grouping was used. This is now computed based on a group,
which is used to compute the scaled data when standardize=TRUE
for Gaussian IGD.
- One approximation still remains. During gradient computation (C++),
the mean subtracted from every value in the independent variable (for
each dimension) is computed over the entire table and not per group.
This approximation was adopted since it is messy to pass group-specific
mean values for every row in the table to the C++ layer.
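
The effect described in the first bullet can be illustrated with a
one-feature problem. This is a sketch, not MADlib code: it assumes the
objective (1/(2N)) * sum((y - x*w)^2) + lambda * (alpha*|w| + (1-alpha)/2 * w^2)
with no intercept, for which the minimizer has a closed form regardless of
the optimizer used. All function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal step for the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_1d(xs, ys, lambda_value, alpha, n_rows):
    """Closed-form minimizer for one feature, no intercept.

    n_rows is the N used to scale the squared-error term; passing the wrong
    N mimics the pre-fix behavior described above.
    """
    sxy = sum(x * y for x, y in zip(xs, ys)) / n_rows
    sxx = sum(x * x for x in xs) / n_rows
    return soft_threshold(sxy, lambda_value * alpha) / (sxx + lambda_value * (1 - alpha))


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# Model M1: no grouping, N is the table size.
m1 = elastic_net_1d(xs, ys, 0.5, 0.5, n_rows=len(xs))

# Two groups holding identical copies of the rows.
groups = {1: (xs, ys), 2: (xs, ys)}
total_rows = sum(len(gx) for gx, _ in groups.values())

# Pre-fix: every group is scaled by the row count of the whole table.
before_fix = {g: elastic_net_1d(gx, gy, 0.5, 0.5, total_rows)
              for g, (gx, gy) in groups.items()}
# Post-fix: every group is scaled by its own row count.
after_fix = {g: elastic_net_1d(gx, gy, 0.5, 0.5, len(gx))
             for g, (gx, gy) in groups.items()}
```

With the table-wide N, both groups agree with each other but not with M1,
which matches the symptom reported in the issue; with the per-group N, both
groups reproduce M1.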

@iyerr3 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib 
bugfix/elastic_net_grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #126


commit 92bbd3d08d457c5c7096aadf1403fc5e9df6ed7a
Author: Nandish Jayaram 
Date:   2017-04-24T16:46:03Z

Bugfix: Elastic net gives inconsistent result

JIRA: MADLIB-1092

- Elastic net used to consider the number of rows as the total number
of rows in the table even when grouping was used. This fix changes
that to consider the number of rows in a group while computing IGD.
- Elastic net used to consider mean and standard deviation for both
independent and dependent variables based on the entire table even
when grouping was used. This is now computed based on a group,
which is used to compute the scaled data when standardize=TRUE
for Gaussian IGD.
- One approximation still remains. During gradient computation (C++),
the mean subtracted from every value in the independent variable (for
each dimension) is computed over the entire table and not per group.
This approximation was adopted since it is messy to pass group-specific
mean values for every row in the table to the C++ layer.




> Elastic Net gives inconsistent results with grouping
> 
>
> Key: MADLIB-1092
> URL: https://issues.apache.org/jira/browse/MADLIB-1092
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Regularized Regression
>Reporter: Nandish Jayaram
> Fix For: v1.11
>
>
> Elastic net train seems to be giving incorrect results when used with 
> grouping.
> Steps:
> - Run elastic net (train) on a table and obtain a model (M1). 
> - Create a new table with all rows in the original input table and assign 
> group value 1 for it.
> - Replicate the rows in the table and assign group value 2 for the replicated 
> rows.
> - Run the elastic net train function with grouping while keeping the same 
> optimization parameters for the function.
> Result:
> - The model (for each group) when run with grouping is different from the 
> model M1.
> - The model for both the groups is the same, but not same as M1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1095) Use populated parts of feature vector even if it contains one or more NULL entries

2017-04-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984196#comment-15984196
 ] 

ASF GitHub Bot commented on MADLIB-1095:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/125

DT: Include rows with NULL features in training

JIRA: MADLIB-1095

This commit enables the capability of decision tree to include rows with
NULL feature values in the training dataset. Features that have NULL
values are not used during the training of the respective row,
but the features with non-null values can be used.
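
A rough Python sketch of the difference in row handling. It is an
illustration under assumptions, not the decision tree code: NULL is modeled
as `None`, and the helper names are hypothetical.

```python
def rows_used_before(rows):
    """Old behavior: a row with any NULL feature is dropped entirely."""
    return [r for r in rows if all(v is not None for v in r)]


def feature_values_after(rows, feature_index):
    """New behavior: a row contributes to every feature where it is non-NULL."""
    return [r[feature_index] for r in rows if r[feature_index] is not None]


rows = [
    (1.0, 10.0),
    (2.0, None),   # NULL in feature 1, but feature 0 is still usable
    (None, 30.0),  # NULL in feature 0, but feature 1 is still usable
]

kept_before = rows_used_before(rows)
feature0_after = feature_values_after(rows, 0)
feature1_after = feature_values_after(rows, 1)
```

In this toy input, the old rule keeps one row out of three, while the new
rule lets each feature see two values.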

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_null_rows

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #125


commit 7d41ee5f091c5aa56580095b555a6722b519f009
Author: Rahul Iyer 
Date:   2017-04-26T05:15:35Z

DT: Include rows with NULL features in training

JIRA: MADLIB-1095

This commit enables the capability of decision tree to include rows with
NULL feature values in the training dataset. Features that have NULL
values are not used during the training of the respective row,
but the features with non-null values can be used.




> Use populated parts of feature vector even if it contains one or more NULL 
> entries
> --
>
> Key: MADLIB-1095
> URL: https://issues.apache.org/jira/browse/MADLIB-1095
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: Decision Tree
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.11
>
>
> Context 
> Currently in DT/RF if the feature vector contains any NULLs, the whole row 
> will be ignored in the training data.  This is not ideal, especially in the 
> case where training data is sparse.
> Story
> As a data scientist, I want the DT/RF modules to use the non-NULL parts of 
> the feature vector, and not discard the whole row, so that I can get better 
> accuracy for classification/regression in the case of sparse data.
> Acceptance
> TBD



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1057) Reduce memory footprint for DT

2017-04-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983698#comment-15983698
 ] 

ASF GitHub Bot commented on MADLIB-1057:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/120


> Reduce memory footprint for DT
> --
>
> Key: MADLIB-1057
> URL: https://issues.apache.org/jira/browse/MADLIB-1057
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Decision Tree
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.11
>
>
> Follow on from spike 
> https://issues.apache.org/jira/browse/MADLIB-1035
> Step 1
> As a madlib developer I want to recreate the RF memory issue (reported in 
> https://issues.apache.org/jira/browse/MADLIB-1035). 
> The current datasets we have are 
> dt_adult : 32K rows 14 columns
> ecommerce : 1M rows 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table 
> has ~1300 features). Randomly filling them might help diagnosing the issue 
> but ideally we would want a somewhat sensible dataset. The problem seems to 
> involve relatively short trees (depth 5) which means a random dataset will 
> probably fill the whole tree which might not be true for a structured dataset.
> Step 2
> Refactoring DT for a smaller memory footprint.
> Tree Accumulator has 2 matrices for continuous and categorical variables. 
> The whole structure is recreated at every level. 
> Every matrix has 2^i rows (i is the level)
> The categorical matrix size depends on the total number of categories 
> (weather : {sunny, cloudy, rainy}, isWeekend : {true, false} means this total 
> is 3+2=5) 
> The continuous matrix size depends on the number of cont. features * the 
> number of bins.
> Tree accumulator works like an array not a linked list. Even if the output is 
> not a complete tree, the tree accumulator creates rows for nonexistent 
> branches in proper order and fills them with 0 values. 
> The refactored version would create a small index table that has the same 
> number of rows as the old tree accumulator (a complete tree) but only a 
> single index column that points to the new tree accumulator row. 
> This will allow us to keep most of the internal function interfaces the same but 
> the code to access (read/write) the tree accumulator will have to change.
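
The sizes in Step 2 can be made concrete with a small Python sketch. All
names are hypothetical, the -1 marker for missing nodes is an assumption,
and the layout is simplified from the description above; it is not the
actual accumulator code.

```python
def dense_rows(level):
    """Rows per matrix in the current accumulator: 2^i at level i."""
    return 2 ** level


def categorical_width(categories_per_feature):
    """Categorical matrix width: total number of categories."""
    return sum(categories_per_feature)


def continuous_width(n_continuous_features, n_bins):
    """Continuous matrix width: continuous features times bins."""
    return n_continuous_features * n_bins


def build_index(level, existing_nodes):
    """Map each position of the complete tree to a compact row, or -1.

    existing_nodes holds the positions (0 .. 2^level - 1) that actually
    exist at this level; only those get a row in the compact accumulator.
    """
    index = [-1] * dense_rows(level)
    for compact_row, position in enumerate(sorted(existing_nodes)):
        index[position] = compact_row
    return index


# weather has 3 categories and isWeekend has 2, as in the example above.
cat_width = categorical_width([3, 2])
# Level 5 where only 3 of the 32 possible nodes exist.
index = build_index(5, {0, 7, 20})
compact_rows = sum(1 for row in index if row >= 0)
```

The index keeps one entry per position of the complete tree, so callers can
still address nodes by their old position, while the wide statistics rows
are allocated only for nodes that exist.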



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1076) Review LICENSE file and README.md

2017-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977609#comment-15977609
 ] 

ASF GitHub Bot commented on MADLIB-1076:


Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/123#discussion_r112568270
  
--- Diff: licenses/MADlib.txt ---
@@ -1,10 +0,0 @@
-Portions of this software Copyright (c) 2010-2013 by EMC Corporation.  All 
rights reserved.
--- End diff --

Thanks, Roman. 
Symlink would be the best option if we have to keep the file. 

Alternatively, we can change the 
`"${CMAKE_SOURCE_DIR}/licenses/MADlib.txt"` in 
`deploy/PackageMaker/CMakeLists.txt` to `"${CMAKE_SOURCE_DIR}/LICENSE"` and 
remove this file. 


> Review LICENSE file and README.md
> -
>
> Key: MADLIB-1076
> URL: https://issues.apache.org/jira/browse/MADLIB-1076
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Documentation
>Reporter: Frank McQuillan
>Assignee: Roman Shaposhnik
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> LICENSE
>   Shouldn't the components with files in licenses/third_party be
>   referenced in LICENSE file?
> Boost_Software_License_v1.txt
> Eigen_v3.1.2.txt
> PyXB_v1.2.3.txt
> PyYAML_v3.10.txt
> Python_License_v2.7.1.txt
> UseLATEX_v1.9.4.txt
> _M_widen_init.txt
> argparse_v1.2.1.txt
> From README.md, I only saw an incomplete reference to the third party 
> components.
>   Third Party Components
>   MADlib incorporates material from the following third-party components
>   
>   argparse 1.2.1 "provides an easy, declarative interface for creating 
> command line tools"
>   Boost 1.47.0 (or newer) "provides peer-reviewed portable C++ source 
> libraries"
>   Eigen 3.2.2 "is a C++ template library for linear algebra"
>   PyYAML 3.10 "is a YAML parser and emitter for Python"
>   PyXB 1.2.4 "is a Python library for XML Schema Bindings"
> {code}
> To dos:
> 1) Confirm that LICENSE file is up to date
> 2) Update README.md with any required 3rd party/licensing clarifications



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1076) Review LICENSE file and README.md

2017-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977402#comment-15977402
 ] 

ASF GitHub Bot commented on MADLIB-1076:


Github user rvs commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/123#discussion_r112552157
  
--- Diff: src/CMakeLists.txt ---
@@ -18,10 +18,10 @@ set(BITBUCKET_BASE_URL
 "${MADLIB_REDIRECT_PREFIX}https://bitbucket.org"
 CACHE STRING
 "Base URL for Bitbucket projects. May be overridden for testing 
purposes.")
-set(GITHUB_MADLIB_BASE_URL
+set(EIGEN_BASE_URL
--- End diff --

Will do. Thanks!


> Review LICENSE file and README.md
> -
>
> Key: MADLIB-1076
> URL: https://issues.apache.org/jira/browse/MADLIB-1076
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Documentation
>Reporter: Frank McQuillan
>Assignee: Roman Shaposhnik
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> LICENSE
>   Shouldn't the components with files in licenses/third_party be
>   referenced in LICENSE file?
> Boost_Software_License_v1.txt
> Eigen_v3.1.2.txt
> PyXB_v1.2.3.txt
> PyYAML_v3.10.txt
> Python_License_v2.7.1.txt
> UseLATEX_v1.9.4.txt
> _M_widen_init.txt
> argparse_v1.2.1.txt
> From README.md, I only saw an incomplete reference to the third party 
> components.
>   Third Party Components
>   MADlib incorporates material from the following third-party components
>   
>   argparse 1.2.1 "provides an easy, declarative interface for creating 
> command line tools"
>   Boost 1.47.0 (or newer) "provides peer-reviewed portable C++ source 
> libraries"
>   Eigen 3.2.2 "is a C++ template library for linear algebra"
>   PyYAML 3.10 "is a YAML parser and emitter for Python"
>   PyXB 1.2.4 "is a Python library for XML Schema Bindings"
> {code}
> To dos:
> 1) Confirm that LICENSE file is up to date
> 2) Update README.md with any required 3rd party/licensing clarifications



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1076) Review LICENSE file and README.md

2017-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977347#comment-15977347
 ] 

ASF GitHub Bot commented on MADLIB-1076:


Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/123
  
Jenkins, OK to test. 


> Review LICENSE file and README.md
> -
>
> Key: MADLIB-1076
> URL: https://issues.apache.org/jira/browse/MADLIB-1076
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Documentation
>Reporter: Frank McQuillan
>Assignee: Roman Shaposhnik
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> LICENSE
>   Shouldn't the components with files in licenses/third_party be
>   referenced in LICENSE file?
> Boost_Software_License_v1.txt
> Eigen_v3.1.2.txt
> PyXB_v1.2.3.txt
> PyYAML_v3.10.txt
> Python_License_v2.7.1.txt
> UseLATEX_v1.9.4.txt
> _M_widen_init.txt
> argparse_v1.2.1.txt
> From README.md, I only saw an incomplete reference to the third party 
> components.
>   Third Party Components
>   MADlib incorporates material from the following third-party components
>   
>   argparse 1.2.1 "provides an easy, declarative interface for creating 
> command line tools"
>   Boost 1.47.0 (or newer) "provides peer-reviewed portable C++ source 
> libraries"
>   Eigen 3.2.2 "is a C++ template library for linear algebra"
>   PyYAML 3.10 "is a YAML parser and emitter for Python"
>   PyXB 1.2.4 "is a Python library for XML Schema Bindings"
> {code}
> To dos:
> 1) Confirm that LICENSE file is up to date
> 2) Update README.md with any required 3rd party/licensing clarifications



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1076) Review LICENSE file and README.md

2017-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977256#comment-15977256
 ] 

ASF GitHub Bot commented on MADLIB-1076:


Github user fmcquillan99 commented on the issue:

https://github.com/apache/incubator-madlib/pull/123
  
Looks good, thanks for the PR.

One double check:  the LICENSE file you are proposing refers explicitly to 
libstemmer, useLatex and pyyaml, but does not call out other 3rd party 
components by name.  Is that your intention?


> Review LICENSE file and README.md
> -
>
> Key: MADLIB-1076
> URL: https://issues.apache.org/jira/browse/MADLIB-1076
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Documentation
>Reporter: Frank McQuillan
>Assignee: Roman Shaposhnik
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> LICENSE
>   Shouldn't the components with files in licenses/third_party be
>   referenced in LICENSE file?
> Boost_Software_License_v1.txt
> Eigen_v3.1.2.txt
> PyXB_v1.2.3.txt
> PyYAML_v3.10.txt
> Python_License_v2.7.1.txt
> UseLATEX_v1.9.4.txt
> _M_widen_init.txt
> argparse_v1.2.1.txt
> From README.md, I only saw an incomplete reference to the third party 
> components.
>   Third Party Components
>   MADlib incorporates material from the following third-party components
>   
>   argparse 1.2.1 "provides an easy, declarative interface for creating 
> command line tools"
>   Boost 1.47.0 (or newer) "provides peer-reviewed portable C++ source 
> libraries"
>   Eigen 3.2.2 "is a C++ template library for linear algebra"
>   PyYAML 3.10 "is a YAML parser and emitter for Python"
>   PyXB 1.2.4 "is a Python library for XML Schema Bindings"
> {code}
> To dos:
> 1) Confirm that LICENSE file is up to date
> 2) Update README.md with any required 3rd party/licensing clarifications



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1057) Reduce memory footprint for DT

2017-04-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973311#comment-15973311
 ] 

ASF GitHub Bot commented on MADLIB-1057:


Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/117
  
No, that's a separate JIRA: MADLIB-1057. This one is just about
setting the defaults to a more reasonable value considering the data that
users have shared.

The commit is a little more than just changing two numbers since I updated
the way these defaults are set. Previously they were set in overloaded
function declarations (in SQL). I changed this to set the defaults in the main
function definition, eliminating redundancy.

Thanks,
Rahul



> Reduce memory footprint for DT
> --
>
> Key: MADLIB-1057
> URL: https://issues.apache.org/jira/browse/MADLIB-1057
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Decision Tree
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.11
>
>
> Follow on from spike 
> https://issues.apache.org/jira/browse/MADLIB-1035
> Step 1
> As a madlib developer I want to recreate the RF memory issue (reported in 
> https://issues.apache.org/jira/browse/MADLIB-1035). 
> The current datasets we have are 
> dt_adult : 32K rows 14 columns
> ecommerce : 1M rows 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table 
> has ~1300 features). Randomly filling them might help diagnose the issue 
> but ideally we would want a somewhat sensible dataset. The problem seems to 
> involve relatively short trees (depth 5) which means a random dataset will 
> probably fill the whole tree which might not be true for a structured dataset.
> Step 2
> Refactoring DT for a smaller memory footprint.
> Tree Accumulator has 2 matrices for continuous and categorical variables. 
> The whole structure is recreated at every level. 
> Every matrix has 2^i rows (i is the level)
> The categorical matrix size depends on the total number of categories 
> (weather : {sunny, cloudy, rainy}, isWeekend : {true, false} means this total 
> is 3+2=5) 
> The continuous matrix size depends on the number of cont. features * the 
> number of bins.
> Tree accumulator works like an array not a linked list. Even if the output is 
> not a complete tree, the tree accumulator creates rows for nonexistent 
> branches in proper order and fills them with 0 values. 
> The refactored version would create a small index table that has the same 
> number of rows as the old tree accumulator (a complete tree) but only a 
> single index column that points to the new tree accumulator row. 
> This will allow us to keep most of the internal function interfaces the same but 
> the code to access (read/write) the tree accumulator will have to change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1086) Unnest 2-D array by one level (i.e. into rows of 1-D arrays)

2017-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971914#comment-15971914
 ] 

ASF GitHub Bot commented on MADLIB-1086:


GitHub user rashmi815 opened a pull request:

https://github.com/apache/incubator-madlib/pull/116

Unnest 2d array

Array Operations: Add function to unnest 2-D arrays into rows of 1-D arrays

JIRA:  MADLIB-1086

Function to unnest 2-D array by one level (i.e. into rows of 1-D arrays).
This is needed, for instance, in K-means, so that we can get one centroid 
per row for follow on operations.
- Added function to array operations
- Added an example in k-means to demonstrate usage

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rashmi815/incubator-madlib unnest_2d_array

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #116


commit 18e562813702d12d620594598f471161a990fbbd
Author: Rashmi Raghu 
Date:   2017-04-15T00:08:17Z

Unnest function, install-check tests completed. Initial docs included

commit 2a4baffa29c8f976d3260931c1790cfc125e91f4
Author: Rashmi Raghu 
Date:   2017-04-15T06:20:01Z

Refactored names of function output columns

commit a3eae964adc84382fa674e4d95c486f472b14099
Author: Rashmi Raghu 
Date:   2017-04-17T23:45:32Z

Updated docs (array_ops and k-means) and minor update to install-check tests




> Unnest 2-D array by one level (i.e. into rows of 1-D arrays)
> 
>
> Key: MADLIB-1086
> URL: https://issues.apache.org/jira/browse/MADLIB-1086
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Utilities
>Reporter: Frank McQuillan
>Assignee: Rashmi Raghu
>Priority: Minor
> Fix For: v1.11
>
>
> Context
> Currently k-means returns the following
> {code}
> centroids| 
> {{13.75333,1.905,2.425,16.06667,90.3,2.805,2.98,0.29,2.005,5.406633,1.041667,
>  3.318333,1020.833},
>
> {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}}
> cluster_variance | {122999.110416013,30561.74805}
> objective_fn | 153560.858466013
> frac_reassigned  | 0
> num_iterations   | 3
> {code}
> Story
> As a data scientist, I want to unnest a 2-D array by one level (i.e. into rows 
> of 1-D arrays) in K-means, so that I can get one centroid per row for 
> follow-on operations.
> Acceptance
> 1) Add function to array operations
> http://madlib.incubator.apache.org/docs/latest/group__grp__array.html
> 2) Add an example in k-means
>  http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> to demonstrate usage



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1081) Graph - add grouping to shortest path

2017-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971444#comment-15971444
 ] 

ASF GitHub Bot commented on MADLIB-1081:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/113


> Graph - add grouping to shortest path
> -
>
> Key: MADLIB-1081
> URL: https://issues.apache.org/jira/browse/MADLIB-1081
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>Priority: Minor
> Fix For: v1.11
>
>
> * Add a GROUP BY column to the edge table
> * Because wants to run SSSP on the different server graphs defined for users, 
> i.e., group by userID



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1078) Skip install check for PMML modules

2017-04-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969553#comment-15969553
 ] 

ASF GitHub Bot commented on MADLIB-1078:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/115


> Skip install check for PMML modules
> ---
>
> Key: MADLIB-1078
> URL: https://issues.apache.org/jira/browse/MADLIB-1078
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> BUILD, INSTALL and INSTALL-CHECK
> I was able to build the package and successfully ran MADlib
> install-check against PostgreSQL 9.6.2.
> Issue: There is no obvious reference to the PostgreSQL libxml
>dependency in dev documentation. The madpack install-check
>has failures (see below) if "--with-libxml" configure
>option is not specified for PostgreSQL.
>install-check errors encountered due to PostgreSQL
>configuration without "--with-libxml" option:
>  psql:/tmp/madlib.0UIPlZ/pmml/test/table_to_pmml.sql_in.tmp:73: 
> ERROR:  unsupported XML feature
>  DETAIL:  This functionality requires the server to be built with 
> libxml support.
>  HINT:  You need to rebuild PostgreSQL using --with-libxml.
>  CONTEXT:  while creating return value
>  PL/Python function "pmml"
> {code}
> Story
> As a MADlib installer, I want IC tests that use lib-xml to be skipped, so 
> that my install is clean and I do not have to wonder if there is a problem.  
> For now this is only PMML modules, so just skip those ICs.
> More: if a PMML export is called that requires lib-xml, then it will fail 
> with a run-time db error, which is fine. This is not a commonly used function.
> Acceptance:
> 1) madpack install succeeds and IC passes even though PG does not have libxml 
> installed



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1078) Skip install check for PMML modules

2017-04-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969419#comment-15969419
 ] 

ASF GitHub Bot commented on MADLIB-1078:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/115

Task: Skip install-check for pmml

JIRA: MADLIB-1078

Skip install-check for pmml when run without the '-t' option. We
can still run install-check for pmml if the '-t' option is
specified.
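
The selection rule can be sketched as follows. This is a hypothetical
illustration in Python, not the madpack source: the function name, the
`skipped_by_default` parameter, and the module names other than `pmml` are
assumptions.

```python
def modules_to_check(all_modules, requested=None, skipped_by_default=("pmml",)):
    """Return the modules whose install-check should run.

    requested models the value of the '-t' option (None when it is absent).
    """
    if requested:
        # An explicit '-t' list is honored as given, including pmml.
        return [m for m in all_modules if m in requested]
    return [m for m in all_modules if m not in skipped_by_default]


all_modules = ["elastic_net", "graph", "pmml"]
default_run = modules_to_check(all_modules)
explicit_run = modules_to_check(all_modules, requested=["pmml"])
```

Keeping the skip list as a default rather than removing the tests means the
pmml install-check stays available to anyone whose PostgreSQL was built with
libxml support.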

@iyerr3 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib 
task/install_check/skip-pmml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #115


commit e1b9ea91a9add21f6325385876909436a48bcaf2
Author: Nandish Jayaram 
Date:   2017-04-14T18:45:35Z

Task: Skip install-check for pmml

JIRA: MADLIB-1078

Skip install-check for pmml when run without the '-t' option. We
can still run install-check for pmml if the '-t' option is
specified.




> Skip install check for PMML modules
> ---
>
> Key: MADLIB-1078
> URL: https://issues.apache.org/jira/browse/MADLIB-1078
> Project: Apache MADlib
>  Issue Type: Task
>  Components: Build System
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.11
>
>
> Comments from Ed Espino on 1.10 RC-2 review on thread
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201703.mbox/%3CCAHAuQDzarS7K4u-rOsLLhbwSHCyFn5cKSyjLinE%2BZ%3DjSpU59qw%40mail.gmail.com%3E
> {code}
> BUILD, INSTALL and INSTALL-CHECK
> I was able to build the package and successfully ran MADlib
> install-check against PostgreSQL 9.6.2.
> Issue: There is no obvious reference to the PostgreSQL libxml
>dependency in dev documentation. The madpack install-check
>has failures (see below) if "--with-libxml" configure
>option is not specified for PostgreSQL.
>install-check errors encountered due to PostgreSQL
>configuration without "--with-libxml" option:
>  psql:/tmp/madlib.0UIPlZ/pmml/test/table_to_pmml.sql_in.tmp:73: 
> ERROR:  unsupported XML feature
>  DETAIL:  This functionality requires the server to be built with 
> libxml support.
>  HINT:  You need to rebuild PostgreSQL using --with-libxml.
>  CONTEXT:  while creating return value
>  PL/Python function "pmml"
> {code}
> Story
> As a MADlib installer, I want IC tests that use lib-xml to be skipped, so 
> that my install is clean and I do not have to wonder if there is a problem.  
> For now this is only PMML modules, so just skip those ICs.
> More: if a PMML export is called that requires lib-xml, then it will fail 
> with a run-time db error, which is fine. This is not a commonly used function.
> Acceptance:
> 1) madpack install succeeds and IC passes even though PG does not have libxml 
> installed



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-1082) Graph - add grouping to page rank

2017-04-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968341#comment-15968341
 ] 

ASF GitHub Bot commented on MADLIB-1082:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/112


> Graph - add grouping to page rank
> -
>
> Key: MADLIB-1082
> URL: https://issues.apache.org/jira/browse/MADLIB-1082
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
>Priority: Minor
> Fix For: v1.11
>
>
> Add grouping column to edge table to support separate page rank calculations 
> by group





[jira] [Commented] (MADLIB-1082) Graph - add grouping to page rank

2017-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965176#comment-15965176
 ] 

ASF GitHub Bot commented on MADLIB-1082:


Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/112#discussion_r111041205
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -158,44 +313,198 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 # https://en.wikipedia.org/wiki/PageRank#Damping_factor
 
 # The query below computes the PageRank of each node using the 
above formula.
+# A small explanatory note on ignore_group_clause:
+# This is used only when grouping is set. This essentially will 
have
+# the condition that will help skip the PageRank computation on 
groups
+# that have converged.
 plpy.execute("""
 CREATE TABLE {message} AS
-SELECT {edge_temp_table}.{dest} AS {vertex_id},
-
SUM({v1}.pagerank/{out_cnts}.{out_cnts_cnt})*{damping_factor}+{random_prob} AS 
pagerank
+SELECT {grouping_cols_select} {edge_temp_table}.{dest} AS 
{vertex_id},
+
SUM({v1}.pagerank/{out_cnts}.{out_cnts_cnt})*{damping_factor}+{random_jump_prob}
 AS pagerank
 FROM {edge_temp_table}
-INNER JOIN {cur} ON 
{edge_temp_table}.{dest}={cur}.{vertex_id}
-INNER JOIN {out_cnts} ON 
{out_cnts}.{vertex_id}={edge_temp_table}.{src}
-INNER JOIN {cur} AS {v1} ON 
{v1}.{vertex_id}={edge_temp_table}.{src}
-GROUP BY {edge_temp_table}.{dest}
-""".format(**locals()))
+INNER JOIN {cur} ON {cur_join_clause}
+INNER JOIN {out_cnts} ON {out_cnts_join_clause}
+INNER JOIN {cur} AS {v1} ON {v1_join_clause}
+{vertices_per_group_inner_join}
+{ignore_group_clause}
+GROUP BY {grouping_cols_select} {edge_temp_table}.{dest}
+""".format(grouping_cols_select=edge_grouping_cols_select+', '
+if grouping_cols else '',
+
random_jump_prob='MIN({vpg}.{random_prob})'.format(**locals())
+if grouping_cols else random_probability,
+vertices_per_group_inner_join="""INNER JOIN 
{vertices_per_group}
+AS {vpg} ON {vpg_join_clause}""".format(**locals())
+if grouping_cols else '',
+ignore_group_clause=' WHERE '+get_ignore_groups(
+summary_table, edge_temp_table, grouping_cols_list)
+if iteration_num>0 and grouping_cols else '',
+**locals()))
 # If there are nodes that have no incoming edges, they are not 
captured in the message table.
 # Insert entries for such nodes, with random_prob.
 plpy.execute("""
 INSERT INTO {message}
-SELECT {vertex_id}, {random_prob}::DOUBLE PRECISION AS 
pagerank
-FROM {cur}
-WHERE {vertex_id} NOT IN (
+SELECT {grouping_cols_select} {cur}.{vertex_id}, 
{random_jump_prob} AS pagerank
+FROM {cur} {vpg_from_clause}
+WHERE {vpg_where_clause} {vertex_id} NOT IN (
 SELECT {vertex_id}
 FROM {message}
+{message_grp_where}
 )
-""".format(**locals()))
-# Check for convergence will be done as part of grouping support 
for pagerank:
-# https://issues.apache.org/jira/browse/MADLIB-1082. So, the 
threshold parameter
-# is a dummy variable at the moment, the PageRank computation 
happens for
-# {max_iter} number of times.
+{ignore_group_clause}
+GROUP BY {grouping_cols_select} {cur}.{vertex_id}
+""".format(grouping_cols_select=cur_grouping_cols_select+','
+if grouping_cols else '',
+vpg_from_clause=', {vertices_per_group} AS 
{vpg}'.format(**locals())
+if grouping_cols else '',
+vpg_where_clause='{vpg_cur_join_clause} AND 
'.format(**locals())
+if grouping_cols else '',
+message_grp_where='WHERE {message_grp}'.format(**locals())
+if grouping_cols else '',
+
random_jump_prob='MIN({vpg}.{random_prob})'.format(**locals())
+if grouping_cols else random_probability,
+  

[jira] [Commented] (MADLIB-1082) Graph - add grouping to page rank

2017-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965174#comment-15965174
 ] 

ASF GitHub Bot commented on MADLIB-1082:


Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/112#discussion_r111033110
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -158,44 +313,198 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 # https://en.wikipedia.org/wiki/PageRank#Damping_factor
 
 # The query below computes the PageRank of each node using the 
above formula.
+# A small explanatory note on ignore_group_clause:
+# This is used only when grouping is set. This essentially will 
have
+# the condition that will help skip the PageRank computation on 
groups
+# that have converged.
 plpy.execute("""
 CREATE TABLE {message} AS
-SELECT {edge_temp_table}.{dest} AS {vertex_id},
-
SUM({v1}.pagerank/{out_cnts}.{out_cnts_cnt})*{damping_factor}+{random_prob} AS 
pagerank
+SELECT {grouping_cols_select} {edge_temp_table}.{dest} AS 
{vertex_id},
+
SUM({v1}.pagerank/{out_cnts}.{out_cnts_cnt})*{damping_factor}+{random_jump_prob}
 AS pagerank
 FROM {edge_temp_table}
-INNER JOIN {cur} ON 
{edge_temp_table}.{dest}={cur}.{vertex_id}
-INNER JOIN {out_cnts} ON 
{out_cnts}.{vertex_id}={edge_temp_table}.{src}
-INNER JOIN {cur} AS {v1} ON 
{v1}.{vertex_id}={edge_temp_table}.{src}
-GROUP BY {edge_temp_table}.{dest}
-""".format(**locals()))
+INNER JOIN {cur} ON {cur_join_clause}
+INNER JOIN {out_cnts} ON {out_cnts_join_clause}
+INNER JOIN {cur} AS {v1} ON {v1_join_clause}
+{vertices_per_group_inner_join}
+{ignore_group_clause}
+GROUP BY {grouping_cols_select} {edge_temp_table}.{dest}
+""".format(grouping_cols_select=edge_grouping_cols_select+', '
+if grouping_cols else '',
+
random_jump_prob='MIN({vpg}.{random_prob})'.format(**locals())
+if grouping_cols else random_probability,
+vertices_per_group_inner_join="""INNER JOIN 
{vertices_per_group}
+AS {vpg} ON {vpg_join_clause}""".format(**locals())
+if grouping_cols else '',
+ignore_group_clause=' WHERE '+get_ignore_groups(
+summary_table, edge_temp_table, grouping_cols_list)
+if iteration_num>0 and grouping_cols else '',
+**locals()))
 # If there are nodes that have no incoming edges, they are not 
captured in the message table.
 # Insert entries for such nodes, with random_prob.
 plpy.execute("""
 INSERT INTO {message}
-SELECT {vertex_id}, {random_prob}::DOUBLE PRECISION AS 
pagerank
-FROM {cur}
-WHERE {vertex_id} NOT IN (
+SELECT {grouping_cols_select} {cur}.{vertex_id}, 
{random_jump_prob} AS pagerank
+FROM {cur} {vpg_from_clause}
+WHERE {vpg_where_clause} {vertex_id} NOT IN (
 SELECT {vertex_id}
 FROM {message}
+{message_grp_where}
 )
-""".format(**locals()))
-# Check for convergence will be done as part of grouping support 
for pagerank:
-# https://issues.apache.org/jira/browse/MADLIB-1082. So, the 
threshold parameter
-# is a dummy variable at the moment, the PageRank computation 
happens for
-# {max_iter} number of times.
+{ignore_group_clause}
+GROUP BY {grouping_cols_select} {cur}.{vertex_id}
+""".format(grouping_cols_select=cur_grouping_cols_select+','
+if grouping_cols else '',
+vpg_from_clause=', {vertices_per_group} AS 
{vpg}'.format(**locals())
+if grouping_cols else '',
+vpg_where_clause='{vpg_cur_join_clause} AND 
'.format(**locals())
+if grouping_cols else '',
+message_grp_where='WHERE {message_grp}'.format(**locals())
+if grouping_cols else '',
+
random_jump_prob='MIN({vpg}.{random_prob})'.format(**locals())
+if grouping_cols else random_probability,
+  

[jira] [Commented] (MADLIB-1081) Graph - add grouping to shortest path

2017-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961304#comment-15961304
 ] 

ASF GitHub Bot commented on MADLIB-1081:


GitHub user orhankislal opened a pull request:

https://github.com/apache/incubator-madlib/pull/113

Graph: Add grouping support to SSSP

JIRA: MADLIB-1081

- This commit adds grouping support for SSSP as well as its path function.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/incubator-madlib 
feature/graph_sssp_gr_take2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #113


commit ced9164dbd0caac45835bb4b3dabe91ae4e5b505
Author: Orhan Kislal 
Date:   2017-04-07T18:41:35Z

Graph: Add grouping support to SSSP

JIRA: MADLIB-1081

- This commit adds grouping support for SSSP as well as its path function.




> Graph - add grouping to shortest path
> -
>
> Key: MADLIB-1081
> URL: https://issues.apache.org/jira/browse/MADLIB-1081
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>Priority: Minor
> Fix For: v1.11
>
>
> * Add a GROUP BY column to the edge table
> * Because wants to run SSSP on the different server graphs defined for users, 
> i.e., group by userID





[jira] [Commented] (MADLIB-1082) Graph - add grouping to page rank

2017-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959532#comment-15959532
 ] 

ASF GitHub Bot commented on MADLIB-1082:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/112

Feature: Add grouping support for PageRank

MADLIB-1082

- Add grouping support for pagerank, which will compute a PageRank
probability distribution for the graph represented by each group.
- Add convergence test, so that PageRank computation terminates
if the pagerank value of no node changes beyond a threshold across
two consecutive iterations (or max_iters number of iterations are
done, whichever happens first). In case of grouping, the algorithm
terminates only after all groups have converged.
- Create a summary table apart from the output table that records
the number of iterations required for convergence. Iterations
required for convergence of each group is recorded when grouping
is used. This implementation also ensures that we don't compute
PageRank for groups that have already converged.
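The behaviour listed above (per-group PageRank, a convergence threshold, and a per-group iteration count) can be sketched in plain Python. This is only an illustration under stated assumptions, not the SQL-based implementation in the pull request: the function name pagerank_by_group, the (group, src, dest) edge tuples and the choice random_prob = (1 - damping) / n are all hypothetical.

```python
from collections import defaultdict


def pagerank_by_group(edges, damping=0.85, threshold=1e-6, max_iter=100):
    """Hypothetical sketch: edges is an iterable of (group, src, dest).

    Returns (ranks, iterations): ranks[group][vertex] is the PageRank value,
    iterations[group] is the number of iterations that group was run for.
    """
    graphs = defaultdict(list)
    for grp, src, dest in edges:
        graphs[grp].append((src, dest))

    ranks, iterations = {}, {}
    # Each group is iterated on its own, so a converged group is never
    # recomputed while other groups are still running.
    for grp, grp_edges in graphs.items():
        vertices = sorted({v for edge in grp_edges for v in edge})
        n = len(vertices)
        out_cnt = defaultdict(int)
        for src, _ in grp_edges:
            out_cnt[src] += 1

        random_prob = (1.0 - damping) / n  # assumed random-jump term
        cur = {v: 1.0 / n for v in vertices}
        it = 0
        for it in range(1, max_iter + 1):
            # Vertices without incoming edges keep only random_prob.
            nxt = {v: random_prob for v in vertices}
            for src, dest in grp_edges:
                nxt[dest] += damping * cur[src] / out_cnt[src]
            # Converged when no vertex moved by more than the threshold.
            converged = all(abs(nxt[v] - cur[v]) <= threshold for v in vertices)
            cur = nxt
            if converged:
                break
        ranks[grp] = cur
        iterations[grp] = it
    return ranks, iterations
```

Calling pagerank_by_group([('g1', 'a', 'b'), ('g1', 'b', 'c'), ('g1', 'c', 'a'), ('g2', 'x', 'y')]) returns one rank dictionary per group, and the iteration counts play the role of the summary table described above.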

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib 
feature/pagerank_grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #112


commit 5b95581e0dd9981086f17097e46a59376ce0b603
Author: Nandish Jayaram 
Date:   2017-04-01T00:03:50Z

Feature: Add grouping support for PageRank

MADLIB-1082

- Add grouping support for pagerank, which will compute a PageRank
probability distribution for the graph represented by each group.
- Add convergence test, so that PageRank computation terminates
if the pagerank value of no node changes beyond a threshold across
two consecutive iterations (or max_iters number of iterations are
done, whichever happens first). In case of grouping, the algorithm
terminates only after all groups have converged.
- Create a summary table apart from the output table that records
the number of iterations required for convergence. Iterations
required for convergence of each group is recorded when grouping
is used. This implementation also ensures that we don't compute
PageRank for groups that have already converged.




> Graph - add grouping to page rank
> -
>
> Key: MADLIB-1082
> URL: https://issues.apache.org/jira/browse/MADLIB-1082
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
>Priority: Minor
> Fix For: v1.11
>
>
> Add grouping column to edge table to support separate page rank calculations 
> by group





[jira] [Commented] (MADLIB-1066) Pivoting - support array and svec output

2017-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951426#comment-15951426
 ] 

ASF GitHub Bot commented on MADLIB-1066:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/108


> Pivoting - support array and svec output
> 
>
> Key: MADLIB-1066
> URL: https://issues.apache.org/jira/browse/MADLIB-1066
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Utilities
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.11
>
>
> Background
> Follow on to these JIRAs
> https://issues.apache.org/jira/browse/MADLIB-908
> https://issues.apache.org/jira/browse/MADLIB-1004
> this capability is to carry over some good ideas from
> https://issues.apache.org/jira/browse/MADLIB-1038
> Story
> Support array output format to allow > 1600 output columns (or PostgreSQL 
> limit).  i.e., many MADlib algos take array input so pivot should support 
> array output.  Base this on how it is done in encoding categorical variables 
> http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> Add 'output_type' to interface:
> {code}
> pivot(
> source_table,
> output_table,
> index,
> pivot_cols,
> pivot_values,
> aggregate_func,
> fill_value,
> keep_null,
> output_col_dictionary,
> output_type  -- New
> )
> {code}
> where
> {code}
> output_type (optional)
> VARCHAR. default: 'column'. This parameter controls the output format.  If 
> 'column', a column is created for each output variable. PostgreSQL limits the 
> number of columns in a table. If the total number of columns exceeds the 
> limit, then make this parameter either 'array' to combine the indicator 
> columns into an array or 'svec' to cast the array output to 'madlib.svec' 
> type.
> Since the array output for any single tuple would be sparse, the 'svec' 
> output would be most efficient for storage. The 'array' output is useful if 
> the array is used for post-processing, including concatenating with other 
> non-categorical features.
> A dictionary will be created when 'output_type' is 'array' or 'svec' to 
> define an index into the array. The dictionary table will be given the name 
> of the 'output_table' appended by '_dictionary'.
> {code}
> See code in
> http://madlib.incubator.apache.org/docs/latest/group__grp__encode__categorical.html
> Need to support NULL (= default 'column').  Also 'a', 'Array' and 'arr' 
> should be interpreted as 'array'.  Same idea with 'column' and 'svec'.
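The case-insensitive prefix matching requested at the end of the description could look roughly like the following. It is a sketch only: normalize_output_type is a hypothetical helper name, and rejecting unrecognised or ambiguous values with ValueError is an assumption rather than something the issue specifies.

```python
def normalize_output_type(value):
    """Map a user-supplied output_type to 'column', 'array' or 'svec'.

    NULL (None) falls back to the default 'column'; any case-insensitive
    prefix such as 'a', 'arr' or 'Array' resolves to the full name.
    """
    choices = ('column', 'array', 'svec')
    if value is None:
        return 'column'
    key = value.strip().lower()
    matches = [c for c in choices if c.startswith(key)]
    # An empty string matches every choice and an unknown word matches none;
    # both are rejected here (assumed behaviour).
    if len(matches) != 1:
        raise ValueError("unsupported output_type: %r" % (value,))
    return matches[0]
```

Because the three names start with different letters, a single character is already enough to pick one of them.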





[jira] [Commented] (MADLIB-1069) Graph - page rank

2017-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937156#comment-15937156
 ] 

ASF GitHub Bot commented on MADLIB-1069:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/109

Feature: PageRank

JIRA: MADLIB-1069

- Introduces a new module that computes the PageRank of all nodes
in a directed graph.
- Implements the original PageRank algorithm that assumes a random
surfer model (https://en.wikipedia.org/wiki/PageRank#Damping_factor)
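For reference, the random surfer model is usually written as below. The symbols are the conventional ones and are not taken from the commit: d is the damping factor, N the number of vertices, In(v) the set of vertices with an edge into v, and L(u) the out-degree of u.

```latex
PR(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{L(u)}
```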

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib 
feature/pagerank_pysql

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/109.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #109


commit 3fe103fc92d74048eb2db17dcfd93e1ebd827229
Author: Nandish Jayaram 
Date:   2017-03-16T19:02:40Z

Feature: PageRank

JIRA: MADLIB-1069

- Introduces a new module that computes the PageRank of all nodes
in a directed graph.
- Implements the original PageRank algorithm that assumes a random
surfer model (https://en.wikipedia.org/wiki/PageRank#Damping_factor)




> Graph - page rank
> -
>
> Key: MADLIB-1069
> URL: https://issues.apache.org/jira/browse/MADLIB-1069
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
> Fix For: v1.11
>
>
> Story
> As a MADlib developer, I want to implement page rank in an efficient and 
> scalable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References
> [1] Grails paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grails deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grails repo with page rank example
> https://github.com/UWQuickstep/Grail
> https://github.com/UWQuickstep/Grail/blob/master/analytics/pagerank.sql
> [4] PDL tools implementation
> http://pivotalsoftware.github.io/PDLTools/group__grp__pagerank__alg.html





[jira] [Commented] (MADLIB-920) Build process on Apache infrastructure

2017-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15924581#comment-15924581
 ] 

ASF GitHub Bot commented on MADLIB-920:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/103


> Build process on Apache infrastructure
> --
>
> Key: MADLIB-920
> URL: https://issues.apache.org/jira/browse/MADLIB-920
> Project: Apache MADlib
>  Issue Type: Test
>  Components: Build System
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.11
>
>
> As a MADlib developer, I would like to build a complete docker file + image 
> with instructions on how to start/build/test MADlib, so that community 
> developers can use this to test.  Please put the file in src code, put 
> instructions on wiki.
> We are doing this in part because access to Apache Build system for MADlib 
> gives access to all incubating projects, and we don't want to facilitate that.
> Needed for graduation to TLP.
> The target platforms are:
> 1) PostgreSQL 9.6
> 2) gpdb 4.3.x (later 5.x)
> Please indicate if the gpdb part is taking a ton of time; if so, maybe we live 
> with PG for now.





[jira] [Commented] (MADLIB-1024) CREATE EXTENSION madlib fails

2017-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886542#comment-15886542
 ] 

ASF GitHub Bot commented on MADLIB-1024:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/106


> CREATE EXTENSION madlib fails
> -
>
> Key: MADLIB-1024
> URL: https://issues.apache.org/jira/browse/MADLIB-1024
> Project: Apache MADlib
>  Issue Type: Bug
>Reporter: Martin Jensen
>Assignee: Rahul Iyer
> Fix For: v1.11
>
> Attachments: Dockerfile
>
>
> When installing madlib with PGXN on PG 9.5.4, Ubuntu 16.04 I get the 
> following error when trying to create the extension:
> ERROR:  aggregate cannot accept shell type public.svec
> ** Error **
> ERROR: aggregate cannot accept shell type public.svec
> SQL state: 42P13





[jira] [Commented] (MADLIB-1024) CREATE EXTENSION madlib fails

2017-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883680#comment-15883680
 ] 

ASF GitHub Bot commented on MADLIB-1024:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/106

Build: Fix module sort order for PGXN installation

JIRA: MADLIB-1024

PGXN installation involves creating a single extension sql file that
contains all the SQL commands run during MADlib deployment. The modules
added into this extension file are to be placed in the right order,
taking dependencies into account.

MADlib has a function that compares a given file path with topologically
sorted modules to decide the order of concatenation to extension file.
This comparison is faulty since the module name was searched for in the
whole path, leading to false positive with modules that have another
module name as substring.  The specific bug was related to 'svec_util'
being flagged in same order as 'svec'.

This commit fixes this issue taking advantage of the file path names being
of the form '.../modules//...', hence comparing the
complete module name.
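The difference between the faulty comparison and the fixed one can be illustrated with a small sketch. Everything here is hypothetical (function names, the sample path, the sample ordering); it only assumes that the path component directly after 'modules' names the module, and it is not the code changed by this commit.

```python
def substring_rank(path, sorted_modules):
    """Faulty variant: the first module whose name occurs anywhere in the path."""
    for position, module in enumerate(sorted_modules):
        if module in path:
            return position
    return len(sorted_modules)


def component_rank(path, sorted_modules):
    """Fixed variant: compare only the complete component after 'modules'."""
    parts = path.split('/')
    try:
        module = parts[parts.index('modules') + 1]
    except (ValueError, IndexError):
        # No 'modules' component, or nothing after it: sort last.
        return len(sorted_modules)
    if module in sorted_modules:
        return sorted_modules.index(module)
    return len(sorted_modules)


# Hypothetical dependency order in which 'svec' precedes 'svec_util'.
order = ['svec', 'linalg', 'svec_util']
path = 'src/ports/postgres/modules/svec_util/svec_util.sql_in'
wrong = substring_rank(path, order)  # matches 'svec' inside 'svec_util'
right = component_rank(path, order)  # matches the whole name 'svec_util'
```

With the substring test the svec_util file is ranked together with svec, which is the false positive described above; comparing the whole path component places it at its own position.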

Closes #106

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/module_sort_order

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #106


commit b106e064edf5d3d03631d222a51953d0d21015d4
Author: Rahul Iyer 
Date:   2017-02-24T22:32:32Z

Build: Fix module sort order for PGXN installation

JIRA: MADLIB-1024

PGXN installation involves creating a single extension sql file that
contains all the SQL commands run during MADlib deployment. The modules
added into this extension file are to be placed in the right order,
taking dependencies into account.

MADlib has a function that compares a given file path with topologically
sorted modules to decide the order of concatenation to extension file.
This comparison is faulty since the module name was searched for in the
whole path, leading to false positive with modules that have another
module name as substring.  The specific bug was related to 'svec_util'
being flagged in same order as 'svec'.

This commit fixes this issue taking advantage of the file path names being
of the form '.../modules//...', hence comparing the
complete module name.

Closes #106




> CREATE EXTENSION madlib fails
> -
>
> Key: MADLIB-1024
> URL: https://issues.apache.org/jira/browse/MADLIB-1024
> Project: Apache MADlib
>  Issue Type: Bug
>Reporter: Martin Jensen
>Assignee: Orhan Kislal
> Fix For: v1.11
>
> Attachments: Dockerfile
>
>
> When installing madlib with PGXN on PG 9.5.4, Ubuntu 16.04 I get the 
> following error when trying to create the extension:
> ERROR:  aggregate cannot accept shell type public.svec
> ** Error **
> ERROR: aggregate cannot accept shell type public.svec
> SQL state: 42P13





[jira] [Commented] (MADLIB-1065) CMake: Give informative error message when no files found in serverdir

2017-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872758#comment-15872758
 ] 

ASF GitHub Bot commented on MADLIB-1065:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/102


> CMake: Give informative error message when no files found in serverdir 
> ---
>
> Key: MADLIB-1065
> URL: https://issues.apache.org/jira/browse/MADLIB-1065
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Build System
>Reporter: Rahul Iyer
>Priority: Minor
> Fix For: v1.11
>
>
> Error reported by user trying to compile MADlib on Ubuntu against Postgresql 
> 9.6. The postgres installation was missing server headers, which led to an 
> error during {{cmake}}. 
> The only error shown was: 
> {{-- Could NOT find PostgreSQL (missing:  POSTGRESQL_EXECUTABLE) }}. 
> This can be improved to give the actual source of error, with instructions on 
> how to fix. 





[jira] [Commented] (MADLIB-920) Build process on Apache infrastructure

2017-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868708#comment-15868708
 ] 

ASF GitHub Bot commented on MADLIB-920:
---

GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/103

Docker Build for MADlib

JIRA: MADLIB-920

- Add docker files that would help developers download a docker image
with Postgres-9.6 and MADlib dependencies installed. A developer's
local source code changes can be built on this image's container
to quickly build and run install-checks. Requires docker installed
on the developer's environment.
- Add a bash script (jenkins_build.sh) that would be a starting
point towards getting a Jenkins build for MADlib master branch.
- This is work under heavy development and we would want to add in
similar support for Greenplum as well in the future. There is a
placeholder Dockerfile for GPDB in this commit, must be modified
to get it working.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib docker-build

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit d9d9d296ec50272b2ce50ad1e975c78574ceaed2
Author: Nandish Jayaram 
Date:   2017-02-15T22:21:17Z

Docker Build for MADlib

JIRA: MADLIB-920

- Add docker files that would help developers download a docker image
with Postgres-9.6 and MADlib dependencies installed. A developer's
local source code changes can be built on this image's container
to quickly build and run install-checks. Requires docker installed
on the developer's environment.
- Add a bash script (jenkins_build.sh) that would be a starting
point towards getting a Jenkins build for MADlib master branch.
- This is work under heavy development and we would want to add in
similar support for Greenplum as well in the future. There is a
placeholder Dockerfile for GPDB in this commit, must be modified
to get it working.




> Build process on Apache infrastructure
> --
>
> Key: MADLIB-920
> URL: https://issues.apache.org/jira/browse/MADLIB-920
> Project: Apache MADlib
>  Issue Type: Test
>  Components: Build System
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.11
>
>
> As a MADlib developer, I would like to build a complete docker file + image 
> with instructions on how to start/build/test MADlib, so that community 
> developers can use this to test.  Please put the file in src code, put 
> instructions on wiki.
> We are doing this in part because access to Apache Build system for MADlib 
> gives access to all incubating projects, and we don't want to facilitate that.
> Needed for graduation to TLP.
> The target platforms are:
> 1) PostgreSQL 9.6
> 2) gpdb 4.3.x (later 5.x)
> Please indicate if the gpdb part is taking a ton of time; if so, maybe we live 
> with PG for now.





[jira] [Commented] (MADLIB-1025) MADlib does not compile with gcc 6.2

2017-02-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862109#comment-15862109
 ] 

ASF GitHub Bot commented on MADLIB-1025:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/101

Multiple: Add casting to allow compilation with GCC 6+

JIRA: MADLIB-1025

GCC 6+ introduced stricter rules for implicit casting where loss of
information is possible.

Closes #101

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib 
bugfix/gcc6_error_fixes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #101


commit 8e03ffda2e8a4b2f2ab2baf01c2e6727345c91be
Author: Rahul Iyer 
Date:   2017-02-11T00:29:08Z

Multiple: Add casting to allow compilation in GCC 6+

JIRA: MADLIB-1025

GCC 6+ introduced stricter rules for implicit casting where loss of
information is possible.

Closes #101




> MADlib does not compile with gcc 6.2
> 
>
> Key: MADLIB-1025
> URL: https://issues.apache.org/jira/browse/MADLIB-1025
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Build System
>Reporter: Rahul Iyer
>Assignee: Nandish Jayaram
>Priority: Minor
> Fix For: v2.0
>
>
> Compiling with gcc 6.2.0 gives the below error.
> {code}
> [ 84%] Building CXX object 
> src/ports/postgres/9.5/CMakeFiles/madlib_postgresql_9_5.dir/__/__/__/modules/elastic_net/elastic_net_gaussian_fista.cpp.o
> In file included from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:5:0:
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_optimizer_igd.hpp:
>  In static member function 'static madlib::dbconnector::postgres::AnyType 
> madlib::modules::elastic_net::Igd<
> Model>::igd_transition(madlib::dbconnector::postgres::AnyType&, const 
> madlib::dbconnector::postgres::Allocator&)':
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_optimizer_igd.hpp:69:46:
>  error: call of overloaded 
> 'log(madlib::modules::HandleTraits rayHandle >::ReferenceToUInt32&)' is ambiguous
>  state.p = 2 * log(state.dimension);
>   ^
> In file included from 
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:45:0,
>  from /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/math.h:36,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/SparseData.h:24,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/sparse_vector.h:10,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/dbconnector.hpp:39,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:2:
> /usr/local/Cellar/gcc/6.2.0/lib/gcc/6/gcc/x86_64-apple-darwin15.6.0/6.2.0/include-fixed/math.h:402:15:
>  note: candidate: double log(double)
>  extern double log(double);
>^~~
> In file included from 
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/math.h:36:0,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/SparseData.h:24,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/../../../../methods/svec/src/pg_gp/sparse_vector.h:10,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/ports/postgres/dbconnector/dbconnector.hpp:39,
>  from 
> /var/folders/rm/g9tb1s_53wb86s5_nrsdbxphgn/T/tmp8WXq3S/madlib-1.9.1/src/modules/elastic_net/elastic_net_binomial_igd.cpp:2:
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:365:3: note: candidate: 
> long double std::log(long double)
>log(long double __x)
>^~~
> /usr/local/Cellar/gcc/6.2.0/include/c++/6.2.0/cmath:361:3: note: candidate: 
> float std::log(float)
>log(float __x)
>^~~
> make[3]: *** 
> 

[jira] [Commented] (MADLIB-1018) Fix K-means support for array input for data points

2017-02-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849007#comment-15849007
 ] 

ASF GitHub Bot commented on MADLIB-1018:


Github user orhankislal closed the pull request at:

https://github.com/apache/incubator-madlib/pull/89


> Fix K-means support for array input for data points
> ---
>
> Key: MADLIB-1018
> URL: https://issues.apache.org/jira/browse/MADLIB-1018
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: k-Means Clustering
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.10
>
>
> For k-means, normally you should be able to do array[col1, col2…] for the 2nd 
> parameter, but that does not work.  This JIRA is to be able to support 
> array[col1, col2…].
> {code}
> expr_point
> TEXT. The name of the column with point coordinates.
> {code}
> {code}
> SELECT madlib.kmeans_random('customers_train',
>'array[creditamount, accountbalance]',
>3
>  );
> {code}
> produces
> {code}
> ---
> InternalError Traceback (most recent call last)
>  in ()
> > 1 get_ipython().run_cell_magic(u'sql', u'', u"\nSELECT 
> madlib.kmeans_random('customers_train',\n   'array[creditamount, 
> accountbalance]',\n   3\n );\n")
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc
>  in run_cell_magic(self, magic_name, line, cell)
>2291 magic_arg_s = self.var_expand(line, stack_depth)
>2292 with self.builtin_trap:
> -> 2293 result = fn(magic_arg_s, cell)
>2294 return result
>2295 
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc 
> in (f, *a, **k)
> 191 # but it's overkill for just that one bit of state.
> 192 def magic_deco(arg):
> --> 193 call = lambda f, *a, **k: f(*a, **k)
> 194 
> 195 if callable(arg):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc 
> in (f, *a, **k)
> 191 # but it's overkill for just that one bit of state.
> 192 def magic_deco(arg):
> --> 193 call = lambda f, *a, **k: f(*a, **k)
> 194 
> 195 if callable(arg):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
>  78 return self._persist_dataframe(parsed['sql'], conn, 
> user_ns)
>  79 try:
> ---> 80 result = sql.run.run(conn, parsed['sql'], self, user_ns)
>  81 return result
>  82 except (ProgrammingError, OperationalError) as e:
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/run.pyc in 
> run(conn, sql, config, user_namespace)
> 270 raise Exception("ipython_sql does not support 
> transactions")
> 271 txt = sqlalchemy.sql.text(statement)
> --> 272 result = conn.session.execute(txt, user_namespace)
> 273 try:
> 274 conn.session.execute('commit')
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in execute(self, object, *multiparams, **params)
> 912 type(object))
> 913 else:
> --> 914 return meth(self, multiparams, params)
> 915 
> 916 def _execute_function(self, func, multiparams, params):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.pyc
>  in _execute_on_connection(self, connection, multiparams, params)
> 321 
> 322 def _execute_on_connection(self, connection, multiparams, params):
> --> 323 return connection._execute_clauseelement(self, multiparams, 
> params)
> 324 
> 325 def unique_params(self, *optionaldict, **kwargs):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in _execute_clauseelement(self, elem, multiparams, params)
>1008 compiled_sql,
>1009 distilled_params,
> -> 1010 compiled_sql, distilled_params
>1011 )
>1012 if self._has_events or self.engine._has_events:
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in _execute_context(self, dialect, constructor, statement, parameters, *args)
>1144 parameters,
>1145 cursor,
> -> 1146 context)

[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847838#comment-15847838
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
Go ahead and make the commit. I have a couple of changes to make and will open 
a PR on your branch for those changes.


> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: starter
> Fix For: v1.10
>
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.
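
The naive approach described in the issue can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it works on in-memory lists, uses Euclidean distance and majority voting, and the names `euclidean` and `knn_predict` are hypothetical. It is not the MADlib implementation, which runs inside the database.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 0 < k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Brute force: measure the distance from the query to every training point.
    distances = [
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and return the most common label among them.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

There is no training step here: all of the work happens at prediction time, which is the "lazy" behaviour the description mentions, and each query costs one pass over the full training set.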



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847815#comment-15847815
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
Hi NJ, Orhan,
I am done adding the following validation cases:

- Check whether the train and test tables are valid
- Check whether the specified columns are present in these tables
- Check whether k > 0
- Check whether k <= number of rows in the train table
- Check whether the feature columns are of array type
- Check whether NULL values are present in these feature columns
- Check whether the id column of the test table is an integer
- Check whether the label is valid (float, integer, boolean)

I will be committing these changes tomorrow.
Please let me know if I am missing anything.



Auon
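
The validation cases listed in this comment could look roughly like the following Python sketch. It is an assumption-laden illustration, not the code in the pull request: tables are modelled as lists of dicts rather than database relations, and the names `validate_knn_inputs`, `_require`, and the default column names are hypothetical.

```python
def _require(condition, message):
    """Raise ValueError with `message` when `condition` is false."""
    if not condition:
        raise ValueError(message)


def validate_knn_inputs(train_rows, test_rows, k,
                        feature_col="features", label_col="label", id_col="id"):
    """Run the checks in order; raise on the first failure, else return None."""
    _require(train_rows and test_rows, "train and test tables must be non-empty")
    _require(all(feature_col in r and label_col in r for r in train_rows),
             "train table is missing a required column")
    _require(all(feature_col in r and id_col in r for r in test_rows),
             "test table is missing a required column")
    # bool is a subclass of int in Python, so exclude it explicitly for k and ids.
    _require(isinstance(k, int) and not isinstance(k, bool) and k > 0,
             "k must be a positive integer")
    _require(k <= len(train_rows), "k must not exceed the number of train rows")
    for row in list(train_rows) + list(test_rows):
        features = row[feature_col]
        _require(isinstance(features, (list, tuple)),
                 "feature column must be an array")
        _require(all(value is not None for value in features),
                 "feature column must not contain NULL values")
    _require(all(isinstance(r[id_col], int) and not isinstance(r[id_col], bool)
                 for r in test_rows),
             "test id column must be an integer")
    _require(all(isinstance(r[label_col], (bool, int, float)) for r in train_rows),
             "label must be float, integer, or boolean")
```

Failing fast on the first violated check keeps the error message specific to one problem at a time.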





> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: starter
> Fix For: v1.10
>
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] (MADLIB-927) Initial implementation of k-NN

2017-01-30 Thread ASF GitHub Bot (JIRA)

ASF GitHub Bot commented on MADLIB-927

Re: Initial implementation of k-NN

Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/81

On your feature branch, try:
```
git fetch
git push --force-with-lease
```

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (MADLIB-927) Initial implementation of k-NN

2017-01-30 Thread ASF GitHub Bot (JIRA)

ASF GitHub Bot commented on MADLIB-927

Re: Initial implementation of k-NN

Github user njayaram2 commented on the issue:

https://github.com/apache/incubator-madlib/pull/81

I don't see any conflicts and it looks like the rebase is fine. Try pushing it 
to your branch.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843758#comment-15843758
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
Hey NJ,
I think the rebase is not happening in the desired way. I first pulled the 
changes from the Apache repo to my local master.
Output:

haidar@haidar-XPS-L501X:~/MADLIB-AUON/GIT/Madlib/incubator-madlib$ git log 
--graph --decorate --oneline --all
*   c069a42 (origin/features/knn) Merge pull request #1 from 
orhankislal/features/knn
|\  
| * d9fb5c0 KNN: Documentation updates
|/  
* 9a01440 JIRA: MADLIB-927 Documentation Added
* 29969c2 License added:Assertions added
* 573edc4 changes in knn function of knn_sql.in:distance calculation 
optimized:error messages
* 22db2e1 JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
* b1a8d10 KNN Added
| * 0e00a27 (HEAD, origin/master, origin/HEAD, master) Include 
boost::format in MathToolkit_impl.hpp.
| * f7cb980 Madpack: Add password into connection args
| * 29acc53 Documentation: Fix misc errors
| * faec6be Reverses the changes to the madlib.mode function to maintain 
backwards compatibility
| * 13203ba Update dateformat in multiple install-checks
| * 9d04b7d Minor fixes
| * 8e5da2f Association Rules: Add rule counts and limit itemset size 
feature
| * e384c1f RF: Fixes the online help and example
| * 498c559 Graph: SSSP
| * 02a7ef4 PCA: Add grouping support to PCA
| * e0439ed Madpack: Disable psqlrc when executing queries
| * c564e31 Build: Update madpack versioning to include _ and +
| * 3cf3f67 Build: Exclude AggCheckCallContext for GPDB5
| * e75a944 Elastic Net: Add CV examples, clean user docs
| * 6f12264 CV: Fix order of validation output table columns
| * e1f37bb Utilities: Fix incorrect flag for distribution
| * 02f4602 DT and RF: Adds verbose option for the dot output format.
| * c56b209 Build: Correct madlib version in gppkg spec file
| * e43b449 New module: Encode categorical variables
| * d2289b0 Fixes the kmeans_state related bug
| * 6021f67 Minor error message corrections
| * b045f7e Adds cluster variance to kmeans for PivotalR support.
| * 6939fd6 Elastic net: Add cross validation
| * 38d1e87 Fix post process for gppkg to link to hyphenated directories
|/  
* 6138b00 Elastic Net: Add grouping support
* 21bec82 Build: Ensure gppkg version does not contain hyphen
* 82e56a4 Build: Fix version used in rpm installation
* 150459d Madpack: Disable unittest flag
* 39efdb9 Build: Fix madpack revision parsing
* ac1bcfa Assoc rules: Clean + elaborate documentation



 I then checked out my features/knn branch and ran 'git rebase master' but 
it showed: 
git rebase master
First, rewinding head to replay your work on top of it...
Applying: KNN Added
Using index info to reconstruct a base tree...
M   src/config/Modules.yml
:135: space before tab in indent.
DROP TABLE IF EXISTS pg_temp.knn_label;
:136: space before tab in indent.
CREATE TABLE pg_temp.knn_label(pid integer, predlabel float);
:138: trailing whitespace.

:142: trailing whitespace.

:159: trailing whitespace.

warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
Falling back to patching base and 3-way merge...
Auto-merging src/config/Modules.yml
Applying: JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
Applying: changes in knn function of knn_sql.in:distance calculation 
optimized:error messages
Applying: License added:Assertions added
Applying: JIRA: MADLIB-927 Documentation Added
Applying: KNN: Documentation updates


And after that my repo looks like:

git log --graph --decorate --oneline --all
* 9cc0b0a (HEAD, features/knn) KNN: Documentation updates
* 8be68b9 JIRA: MADLIB-927 Documentation Added
* 35d976d License added:Assertions added
* 67b466f changes in knn function of knn_sql.in:distance calculation 
optimized:error messages
* a718a1e JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
* 6922da1 KNN Added
* 0e00a27 (origin/master, origin/HEAD, master) Include boost::format in 
MathToolkit_impl.hpp.
* f7cb980 Madpack: Add password into connection args
* 29acc53 Documentation: Fix misc errors
* faec6be Reverses the changes to the madlib.mode function to maintain 
backwards compatibility
* 13203ba Update dateformat in multiple install-checks
* 9d04b7d Minor fixes
* 8e5da2f Association Rules: Add rule counts and limit itemset size feature
* e384c1f RF: Fixes the online help and example
   

[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843611#comment-15843611
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
Cool. I will have a look and start with the implementations.
Thanks NJ!


> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843371#comment-15843371
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
I think you have already covered a lot of validation cases, @njayaram2. I 
will work on that, and if I get stuck somewhere I will let you know. Meanwhile, 
could you please point me to the Python files that have examples of the 
functions you were talking about? That will save me a lot of time.
Thanks!


> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2017-01-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840874#comment-15840874
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:

https://github.com/apache/incubator-madlib/pull/81
  
Sure NJ. But I will be free from my work after 5 tomorrow. Would that work 
for you?


> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

