[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...

2019-01-10 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/343
  
@hpandeycodeit Can you please rebase this branch on master so that we can 
run a full test? Thanks!


---


[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...

2019-01-09 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/343
  
Jenkins Ok to test


---


[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...

2018-12-27 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/343
  
@hpandeycodeit It seems a couple of dev check test cases failed in the 
Jenkins build. Can you please look into them and fix? Thanks!


---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-11-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/337#discussion_r232721653
  
--- Diff: src/madpack/diff_udo.sql ---
@@ -0,0 +1,81 @@

+--
+-- Licensed to the Apache Software Foundation (ASF) under one
+-- or more contributor license agreements.  See the NOTICE file
+-- distributed with this work for additional information
+-- regarding copyright ownership.  The ASF licenses this file
+-- to you under the Apache License, Version 2.0 (the
+-- "License"); you may not use this file except in compliance
+-- with the License.  You may obtain a copy of the License at
+
+--   http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing,
+-- software distributed under the License is distributed on an
+-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+-- KIND, either express or implied.  See the License for the
+-- specific language governing permissions and limitations
+-- under the License.

+--
+
+SET client_min_messages to ERROR;
+\x on
+
+CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+    return "NULL"
+return argstr.replace(schema_name + ".", '')
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+    return "NULL"
+return argstr.replace(schema_name + ".", 'schema_madlib.')
+$$ LANGUAGE plpythonu;
+
+
+CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text,
+ type_filter text)
+RETURNS VOID AS
+$$
+import plpy
+
+plpy.execute("""
+create table {table_name} AS
+SELECT *
+FROM (
+SELECT n.nspname AS "Schema",
+   o.oprname AS name,
+   filter_schema(o.oprcode::text, '{schema_name}') AS oprcode,
+   alter_schema(pg_catalog.format_type(o.oprleft, NULL), '{schema_name}') AS oprleft,
+   alter_schema(pg_catalog.format_type(o.oprright, NULL), '{schema_name}') AS oprright,
+   alter_schema(pg_catalog.format_type(o.oprresult, NULL), '{schema_name}') AS rettype
+FROM pg_catalog.pg_operator o
+LEFT JOIN pg_catalog.pg_namespace n ON n.oid = o.oprnamespace
+WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$'
--- End diff --

Got it. Thanks for the explanation!
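For reference, the `filter_schema`/`alter_schema` helpers in the diff above reduce to plain string replacement; a minimal standalone Python sketch of the same logic (outside PL/Python, with the function names taken from the diff and the example identifiers made up for illustration):

```python
def filter_schema(argstr, schema_name):
    # Strip the schema qualifier entirely, e.g. "madlib.foo" -> "foo".
    if argstr is None:
        return "NULL"
    return argstr.replace(schema_name + ".", "")

def alter_schema(argstr, schema_name):
    # Rewrite the schema qualifier to the generic "schema_madlib." placeholder.
    if argstr is None:
        return "NULL"
    return argstr.replace(schema_name + ".", "schema_madlib.")

print(filter_schema("madlib.array_add", "madlib"))  # array_add
print(alter_schema("madlib.float8[]", "madlib"))    # schema_madlib.float8[]
```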


---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-11-08 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/337#discussion_r231972206
  
--- Diff: src/madpack/diff_udo.sql ---
@@ -0,0 +1,81 @@

+--
+-- Licensed to the Apache Software Foundation (ASF) under one
+-- or more contributor license agreements.  See the NOTICE file
+-- distributed with this work for additional information
+-- regarding copyright ownership.  The ASF licenses this file
+-- to you under the Apache License, Version 2.0 (the
+-- "License"); you may not use this file except in compliance
+-- with the License.  You may obtain a copy of the License at
+
+--   http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing,
+-- software distributed under the License is distributed on an
+-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+-- KIND, either express or implied.  See the License for the
+-- specific language governing permissions and limitations
+-- under the License.

+--
+
+SET client_min_messages to ERROR;
+\x on
+
+CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+    return "NULL"
+return argstr.replace(schema_name + ".", '')
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+    return "NULL"
+return argstr.replace(schema_name + ".", 'schema_madlib.')
+$$ LANGUAGE plpythonu;
+
+
+CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text,
+ type_filter text)
+RETURNS VOID AS
+$$
+import plpy
+
+plpy.execute("""
+create table {table_name} AS
+SELECT *
+FROM (
+SELECT n.nspname AS "Schema",
+   o.oprname AS name,
+   filter_schema(o.oprcode::text, '{schema_name}') AS oprcode,
+   alter_schema(pg_catalog.format_type(o.oprleft, NULL), '{schema_name}') AS oprleft,
+   alter_schema(pg_catalog.format_type(o.oprright, NULL), '{schema_name}') AS oprright,
+   alter_schema(pg_catalog.format_type(o.oprresult, NULL), '{schema_name}') AS rettype
+FROM pg_catalog.pg_operator o
+LEFT JOIN pg_catalog.pg_namespace n ON n.oid = o.oprnamespace
+WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$'
--- End diff --

Can you please explain this WHERE clause a bit? I am a bit confused about 
the 'OPERATOR' part.
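For background: `OPERATOR(pg_catalog.~)` is PostgreSQL's schema-qualified operator syntax; it pins the regex-match operator `~` to the one defined in `pg_catalog`, so the query behaves the same regardless of `search_path`. The right-hand side is just an anchored regex, which behaves like this (illustrative Python, not part of the PR; 'madlib' stands in for the schema name):

```python
import re

# nspname OPERATOR(pg_catalog.~) '^({schema_name})$' keeps only namespaces
# whose name is exactly schema_name: the pattern is anchored at both ends.
pattern = r"^(madlib)$"

print(bool(re.match(pattern, "madlib")))        # exact name matches
print(bool(re.match(pattern, "madlib_extra")))  # anchored, so this does not
```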


---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-11-08 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/337#discussion_r231953942
  
--- Diff: src/madpack/create_changelist.py ---
@@ -237,6 +325,13 @@
 print "Something went wrong! The changelist might be wrong/corrupted."
 raise
 finally:
-os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt "
-  "/tmp/madlib_tmp_udt.txt /tmp/madlib_tmp_cl.yaml "
--- End diff --

What is this madlib_tmp_cl.yaml file? Why don't we remove it now?
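As an aside, a common way to keep this kind of scratch-file cleanup from drifting out of sync with the files actually created is a private temp directory removed in `finally`. A hedged sketch (illustrative only, not the PR's code; the file names are made up):

```python
import os
import tempfile

# Create scratch files under a private temp directory rather than fixed
# /tmp/madlib_tmp_* paths, then remove the whole directory in finally.
tmp_dir = tempfile.mkdtemp(prefix="madlib_tmp_")
tmp_files = [os.path.join(tmp_dir, name)
             for name in ("nm.txt", "udf.txt", "udt.txt", "cl.yaml")]
try:
    for path in tmp_files:
        with open(path, "w") as f:
            f.write("scratch\n")
finally:
    # Every created file is tracked in tmp_files, so nothing is forgotten.
    for path in tmp_files:
        if os.path.exists(path):
            os.remove(path)
    os.rmdir(tmp_dir)

print(os.path.exists(tmp_dir))  # False: nothing left behind
```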


---


[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04

2018-10-25 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/332#discussion_r228394441
  
--- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 ---
@@ -0,0 +1,108 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+FROM ubuntu:16.04
+
+### Get necessary libraries to add postgresql apt repository
+RUN apt-get update && apt-get install -y lsb-core software-properties-common wget
+
+### Add postgresql apt repository
+RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main" && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - 
+
+### Have to update after getting new repository
+RUN apt-get update
+
+### Get postgres10 and postgres specific add-ons
+RUN apt-get install -y postgresql-10 \
+   postgresql-client-10 \
+   postgresql-plpython-10 \ 
+   postgresql-server-dev-10 \
+   libpq-dev \
+   build-essential \
+   openssl \
+   libssl-dev \
+   libboost-all-dev \
+   m4 \
+   vim \
+   pgxnclient \
+   flex \
+   bison \
+   graphviz
+
+### Reset pg_hba.conf file to allow no password prompt
+### Dockerfile doesn't support heredoc, like cat > /etc/postgresql/10/main/pg_hba.conf <<-EOF,
+### and this echo and \n\ are a workaround to write the file
+RUN echo " \n\
+# Database administrative login by Unix domain socket \n\
+local   all all trust \n\
+
+# TYPE  DATABASEUSERADDRESS METHOD \n\
+
+# "local" is for Unix domain socket connections only \n\
+local   all all trust \n\
+# IPv4 local connections: \n\
+hostall all 127.0.0.1/32trust \n\
+# IPv6 local connections: \n\
+hostall all ::1/128 trust \n\
+" > /etc/postgresql/10/main/pg_hba.conf
+
+### We need to set nproc to unlimited to be able to run scripts as
+### the user 'postgres'. This is actually useful when we try to setup
+### and start a Postgres server.
+RUN echo " * soft nproc unlimited " > /etc/security/limits.d/postgres-limits.conf
+
+
+### Always start postgres server when login
+RUN echo "service postgresql start" >> ~/.bashrc
+
+### Build custom CMake with SSQL support
+RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \
+  tar -zxvf cmake-3.6.1.tar.gz && \
+  cd cmake-3.6.1 && \
+  sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 -DCMAKE_USE_OPENSSL=ON/g' bootstrap && \
+  ./configure &&  \
+  make -j2 && \
+  make install && \
+  cd ..
+
+### Install doxygen-1.8.13:
+RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz && \
--- End diff --

`make doc` is optional when people install madlib, and that's why I removed 
the `make doc` line before. But I think it is OK to bake doxygen into the 
image so people can run `make doc` when needed.


---


[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04

2018-10-25 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/332#discussion_r228394202
  
--- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 ---
@@ -0,0 +1,108 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+FROM ubuntu:16.04
+
+### Get necessary libraries to add postgresql apt repository
+RUN apt-get update && apt-get install -y lsb-core software-properties-common wget
+
+### Add postgresql apt repository
+RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main" && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - 
+
+### Have to update after getting new repository
+RUN apt-get update
+
+### Get postgres10 and postgres specific add-ons
+RUN apt-get install -y postgresql-10 \
+   postgresql-client-10 \
+   postgresql-plpython-10 \ 
+   postgresql-server-dev-10 \
+   libpq-dev \
+   build-essential \
+   openssl \
+   libssl-dev \
+   libboost-all-dev \
+   m4 \
+   vim \
+   pgxnclient \
+   flex \
+   bison \
+   graphviz
+
+### Reset pg_hba.conf file to allow no password prompt
+### Dockerfile doesn't support heredoc, like cat > /etc/postgresql/10/main/pg_hba.conf <<-EOF,
+### and this echo and \n\ are a workaround to write the file
+RUN echo " \n\
+# Database administrative login by Unix domain socket \n\
+local   all all trust \n\
+
+# TYPE  DATABASEUSERADDRESS METHOD \n\
+
+# "local" is for Unix domain socket connections only \n\
+local   all all trust \n\
+# IPv4 local connections: \n\
+hostall all 127.0.0.1/32trust \n\
+# IPv6 local connections: \n\
+hostall all ::1/128 trust \n\
+" > /etc/postgresql/10/main/pg_hba.conf
+
+### We need to set nproc to unlimited to be able to run scripts as
+### the user 'postgres'. This is actually useful when we try to setup
+### and start a Postgres server.
+RUN echo " * soft nproc unlimited " > /etc/security/limits.d/postgres-limits.conf
+
+
+### Always start postgres server when login
+RUN echo "service postgresql start" >> ~/.bashrc
+
+### Build custom CMake with SSQL support
+RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \
+  tar -zxvf cmake-3.6.1.tar.gz && \
+  cd cmake-3.6.1 && \
+  sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 -DCMAKE_USE_OPENSSL=ON/g' bootstrap && \
+  ./configure &&  \
+  make -j2 && \
+  make install && \
+  cd ..
+
+### Install doxygen-1.8.13:
+RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz && \
+  tar xf doxygen-1.8.13.src.tar.gz && \
+  cd doxygen-1.8.13 && \
+  mkdir build && \
+  cd build && \
+  cmake -G "Unix Makefiles" .. && \
+  make && \
+  make install
+
+### Optional: install LaTex
+### uncomment the following 'RUN apt-get' line to bake LaTex into the image
+### Note: if you run the following line, please tag the image as
+### madlib/postgres_10:LaTex, and don't tag it as latest
+# RUN apt-get install -y texlive-full
--- End diff --

This is for building the design doc (PDF). LaTeX is pretty big, so we build 
separate images and people can pull the LaTeX one when needed.


---


[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04

2018-10-19 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/332

Update Dockerfile to use ubuntu 16.04

This commit adds a new dockerfile to bake postgres 10.5 on ubuntu
16.04. Also updates docker_start.sh and README to pull the new docker image 
instead
of the old one (Postgres9.6 on Ubuntu 8.9).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib update_dockerfile

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/332.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #332


commit e745949b270bb633d9d02ca328363a1929478ea0
Author: Jingyi Mei 
Date:   2018-10-19T14:43:07Z

Update dockerfile to use ubuntu 16.04

This commit adds a new dockerfile to bake postgres 10.5 on ubuntu
16.04. Also updates docker_start.sh and README to pull the new docker image 
instead
of the old one (Postgres9.6 on Ubuntu 8.9).




---


[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...

2018-10-01 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/326#discussion_r221714858
  
--- Diff: src/ports/postgres/modules/pmml/test/pmml.ic.sql_in ---
@@ -0,0 +1,119 @@
+/* --- *//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* --- */
+DROP TABLE IF EXISTS abalone CASCADE;
+
+CREATE TABLE abalone (
+id integer,
+sex text,
+length double precision,
+diameter double precision,
+height double precision,
+whole double precision,
+shucked double precision,
+viscera double precision,
+shell double precision,
+rings integer
+);
+
+INSERT INTO abalone VALUES
+(3151, 'F', 0.655027, 0.505004, 0.165008, 1.36699, 0.583519, 0.351479, 0.396019, 10),
+(2026, 'F', 0.550044, 0.469973, 0.149994, 0.920485, 0.381005, 0.243494, 0.267516, 10),
+(3751, 'I', 0.434998, 0.375, 0.110001, 0.41548, 0.170012, 0.0759981, 0.14499, 8),
+(720, 'I', 0.149994, 0.16, 0.0250014, 0.0149994, 0.00449966, 0.0048, 0.0050001, 2),
+(1635, 'F', 0.574956, 0.469973, 0.154999, 1.1161, 0.509008, 0.237989, 0.340024, 10),
+(2648, 'I', 0.5, 0.390013, 0.125, 0.582963, 0.293984, 0.132006, 0.160504, 8),
+(1796, 'F', 0.57996, 0.429993, 0.170012, 1.47998, 0.65347, 0.32401, 0.41548, 10),
+(209, 'F', 0.525022, 0.41498, 0.170012, 0.832518, 0.275523, 0.168511, 0.309998, 13),
+(1451, 'I', 0.455016, 0.33502, 0.135009, 0.501001, 0.274021, 0.0995051, 0.106497, 7),
+(1108, 'I', 0.510009, 0.380004, 0.115005, 0.515458, 0.214997, 0.113504, 0.166009, 8),
+(3675, 'F', 0.594973, 0.450011, 0.165008, 1.08096, 0.489991, 0.252502, 0.279026, 12),
+(2108, 'F', 0.675044, 0.550044, 0.179993, 1.688499989, 0.562055, 0.370496, 0.599978, 15),
+(3312, 'F', 0.479982, 0.380004, 0.135009, 0.507006, 0.191504, 0.13651, 0.154999, 12),
+(882, 'M', 0.655027, 0.520018, 0.165008, 1.40948, 0.585965, 0.290981, 0.405027, 9),
+(3402, 'M', 0.479982, 0.395018, 0.149994, 0.681495, 0.214496, 0.140514, 0.2495, 18),
+(829, 'I', 0.409976, 0.325011, 0.16, 0.394017, 0.20799, 0.0655027, 0.105997, 6),
+(1305, 'M', 0.535031, 0.434998, 0.149994, 0.716971, 0.347476, 0.14449, 0.194006, 9),
+(3613, 'M', 0.599978, 0.46002, 0.179993, 1.1399, 0.422987, 0.257507, 0.364991, 10),
+(1068, 'I', 0.340024, 0.265013, 0.0800017, 0.201512, 0.0899967, 0

---


[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...

2018-10-01 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/326#discussion_r221712133
  
--- Diff: methods/stemmer/src/pg_gp/test/porter_stemmer.sql_in ---
@@ -0,0 +1,38 @@
+/* --- *//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* --- */
+
+CREATE TABLE token_tbl ( id integer,
--- End diff --

I noticed the install check and dev check test cases are exactly the same as 
each other. Would it be better to differentiate them?


---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-28 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/325
  
@iyerr3 

> I'm getting confused here, so maybe an example would help.
> 
> My understanding of @orhankislal 's comment was that say `page_rank` and 
`bfs` both use the same table called "vertex" then we will have to drop this 
table at the beginning of each of the test files. This may then collide with a 
user-created "vertex" table - hence this commit.

> My suggestions to solve the problem:
> 
> * Use a function-specific name (`"pagerank_vertex"`) instead of a generic 
name with (`"vertex"`) no drop by in the beginning.

This doesn't seem to solve the potential conflict with a user-created table. 
Besides, in some IC/DC files we drop tables in between and create tables with 
the same name again. In this case the second drop/creation will fail if the 
user has already created the table.
> * Create a single `"vertex"` table (with no drop by statement) and use 
that for all graph functions

This may cause some assertions based on the number of returned records to 
fail, as I mentioned in the previous comment, and it also requires auditing 
and combining test data.
> * Schema-qualify the data table with the install-check schema name.
> 
(not sure if I totally understand this) The install-check schema is based on 
module, not algo; for example, pagerank, hits and sssp use the same IC schema, 
and there will be issues when two SQL files drop/create the same table.

> Please note I'm not suggesting pooling **all** install-check data into a 
single file (at least not as a solution to this problem).

I understood your point. What I was trying to say was that multiple algos in 
the same module share the same data creation file.




---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-28 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/325
  
@iyerr3 The install-check schema is by module, and we may have multiple 
files for one module. 

For your proposed solution, what do you expect the single source of table 
creation to be? Do you mean auditing all the table creation, combining all the 
data into one big table, and having every SQL file use that same table? For 
example, for pagerank and hits in the `graph` module, we would only create one 
vertex and one edge table with all the special cases/data, and always use 
those two tables for all the SQL files. Besides, how do we deal with special 
test cases where we only need one row or a few rows in the table instead of a 
bigger table? Also, in some dev checks we assert that a specific SQL statement 
returns a specific number of rows, and those assertions will fail because of 
the change of input table.

The other concern is that this solution requires lots of auditing and changes 
in every SQL file. Developers need to understand how to add a new row in the 
single table-creation file and how SQL files under the same module work 
together. For example, if someone wants to add a test case for HITS under 
graph, instead of adding a new case in the hits SQL file, they need to go to 
another file to add the table, then go back to the hits SQL file to add the 
test case. That sounds a bit confusing and inconvenient to me.


---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-27 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/325
  
> @orhankislal Since this is a really big change, could you please add more 
info in the commit on why this change is necessary and the best available 
solution?

+1


---


[GitHub] madlib pull request #323: Build: Add single quote while setting AppendOnly g...

2018-09-24 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/323

Build: Add single quote while setting AppendOnly guc

Commit 3db98babe3326fb5e2cd16d0639a2bef264f4b04 added a context manager
for setting appendonly to false for all madlib modules. The commit was
missing a quote around the `gp_default_storage_options` guc because of
which the set command always failed. This commit adds a quote while
setting the guc.

Co-authored-by: Nikhil Kak


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib test_ao_table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/323.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #323


commit ebf528f413fc94d9fdeb40045011cb1b374bf12b
Author: Jingyi Mei 
Date:   2018-09-24T18:23:55Z

Build: Add single quote while setting AppendOnly guc

Commit #3db98babe3326fb5e2cd16d0639a2bef264f4b04 added a context manager
for setting appendonly to false for all madlib modules. The commit was
missing a quote around the `gp_default_storage_options` guc because of
which the set command always failed. This commit adds a quote while
setting the guc.

Co-authored-by: Nikhil Kak

commit b4aabdd1ef0deec4ec89e7acb8dd0c51692a97c9
Author: Jingyi Mei 
Date:   2018-09-24T18:48:51Z

Build: Remove primary key constraint in IC/DC

Some of the install-check/dev-check tests were setting primary key
constraints. If the user sets the GUC
gp_default_storage_options='appendonly=true', IC/DC will fail while
table creation because appendonly tables don't support primary key. This
commit removes these constrains since they are unnecessary for those
test cases.

Co-authored-by: Nikhil Kak 




---


[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

2018-07-27 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/301#discussion_r205911065
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
@@ -273,7 +274,7 @@ SELECT tree_train('dt_golf'::text, -- source table
  'train_output'::text,-- output model table
  'id'::text,  -- id column
'temperature::double precision'::text,   -- response
- 'humidity, windy, "Cont_features"'::text,   -- features
+ 'humidity, windy2, "Cont_features"'::text,   -- features
--- End diff --

We can leave a comment here explaining why we use windy2, whose values are 
all 'false'.


---


[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

2018-07-27 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/301#discussion_r205910952
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in ---
@@ -282,13 +283,16 @@ SELECT tree_train('dt_golf'::text, -- source table
  6::integer,-- min split
  2::integer,-- min bucket
3::integer,-- number of bins per continuous variable
- 'cp=0.01, n_folds=2'  -- cost-complexity pruning parameter
+ 'cp=0.01, n_folds=2',  -- cost-complexity pruning parameter
+ 'null_as_category=True'
  );
 
 SELECT _print_decision_tree(tree) from train_output;
 SELECT tree_display('train_output', False);
 SELECT impurity_var_importance FROM train_output;
+SELECT cat_levels_in_text, cat_n_levels, impurity_var_importance FROM train_output;
--- End diff --

Would it be better to assert that cat_n_levels == {1} instead of just querying?


---


[GitHub] madlib pull request #298: misc 1.15 user doc updates

2018-07-25 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/298#discussion_r205301880
  
--- Diff: doc/mainpage.dox.in ---
@@ -100,13 +86,14 @@ complete matrix stored as a distributed table.
 @defgroup grp_matrix Matrix Operations
 
 @defgroup grp_matrix_factorization Matrix Factorization
-@brief Matrix Factorization methods including Singular Value Decomposition and Low-rank Matrix Factorization
+Linear algebra methods that factorize a matrix
--- End diff --

Is it possible to keep two copies here, one with @brief and the other 
without, so that they show up in both places on the webpage?


---


[GitHub] madlib issue #298: misc 1.15 user doc updates

2018-07-25 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/298
  
Jenkins OK to test


---


[GitHub] madlib pull request #:

2018-07-25 Thread jingyimei
Github user jingyimei commented on the pull request:


https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29832374
  
In src/madpack/madpack.py:
In src/madpack/madpack.py on line 973:
Makes sense. 


---


[GitHub] madlib pull request #:

2018-07-24 Thread jingyimei
Github user jingyimei commented on the pull request:


https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29823247
  
In src/madpack/madpack.py:
In src/madpack/madpack.py on line 973:
Instead of "versions do not match", it would be clearer if we mention that 
the schema doesn't exist or MADlib is not installed. 


---


[GitHub] madlib issue #294: Pagerank: Remove duplicate entries from grouping output

2018-07-16 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/294
  
The current master has another issue:
When you run
```
DROP TABLE IF EXISTS vertex, "EDGE";
CREATE TABLE vertex(
id INTEGER
);
CREATE TABLE "EDGE"(
src INTEGER,
dest INTEGER,
user_id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2);
INSERT INTO "EDGE" VALUES
(0, 1, 1),
(0, 2, 1),
(1, 2, 1),
(2, 1, 1),
(0, 1, 2);


DROP TABLE IF EXISTS pagerank_ppr_grp_out;
DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
SELECT pagerank(
'vertex', -- Vertex table
'id', -- Vertix id column
'"EDGE"', -- "EDGE" table
'src=src, dest=dest', -- "EDGE" args
'pagerank_ppr_grp_out', -- Output table of PageRank
NULL, -- Default damping factor (0.85)
NULL, -- Default max iters (100)
NULL, -- Default Threshold 
'user_id');
```

you will get the following result:
```
madlib=# select * from pagerank_ppr_grp_out order by user_id, id;
 user_id | id |     pagerank
---------+----+-------------------
       1 |  0 |              0.05
       1 |  0 |              0.05
       1 |  1 | 0.614906399170753
       1 |  2 | 0.614906399170753
       2 |  0 |             0.075
       2 |  1 |           0.13875
(6 rows)
```

where for user_id=1 the pagerank scores don't sum to 1 as they should. This 
PR actually fixes this issue and gives the right numbers. However, the dev 
check didn't have a case to catch this issue before. I suggest adding this 
corner case to dev check to guard against future regressions.
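The suggested corner case boils down to asserting that each group's PageRank scores sum to 1 within floating-point tolerance. A sketch of that check in Python, fed with the buggy output quoted above (illustrative only, not dev-check code):

```python
from collections import defaultdict
from math import isclose

def pagerank_sums(rows):
    """Sum the pagerank column per grouping value (user_id)."""
    totals = defaultdict(float)
    for user_id, _node_id, score in rows:
        totals[user_id] += score
    return dict(totals)

# The buggy user_id=1 group from the output above: id=0 appears twice,
# so the scores cannot sum to 1.
buggy_group = [(1, 0, 0.05), (1, 0, 0.05),
               (1, 1, 0.614906399170753), (1, 2, 0.614906399170753)]
totals = pagerank_sums(buggy_group)
print(isclose(totals[1], 1.0, abs_tol=1e-6))  # False: duplicates inflate the sum
```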


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201489373
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance(
 uint64_t maj_count = getMajorityCount(node_index);
 uint64_t min_count =  node_count - maj_count;
 
+if (min_count == 0) {
--- End diff --

We can leave a comment here so we can address it later.


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201501719
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -1291,38 +1300,64 @@ def _create_group_table(
 schema_madlib, output_table_name, oob_error_table,
 importance_table, cat_features_info_table, grp_key_to_grp_cols,
 grouping_cols, tree_terminated):
-""" Ceate the group table for random forest"""
+""" Create the group table for random forest"""
+
+cat_var_importance_str = ''
+con_var_importance_str = ''
+impurity_var_importance_str = ''
+left_join_importance_table_str = ''
+join_impurity_table_str = ''
+
+if importance_table:
+impurity_var_importance_table_name = unique_string(desp='impurity')
+plpy.execute("""
+CREATE TEMP TABLE {impurity_var_importance_table_name} AS
+SELECT
+gid,
+{schema_madlib}.array_avg(impurity_var_importance, False) AS impurity_var_importance
+FROM {output_table_name}
+GROUP BY gid
+""".format(**locals()))
+
+cat_var_importance_str = ", cat_var_importance AS oob_cat_var_importance,"
+con_var_importance_str = "con_var_importance AS oob_con_var_importance,"
+impurity_var_importance_str = "impurity_var_importance"
+left_join_importance_table_str = """LEFT OUTER JOIN {importance_table}
+USING (gid)""".format(importance_table=importance_table)
+join_impurity_table_str = """JOIN {impurity_var_importance_table_name} USING (gid)""".format(impurity_var_importance_table_name=impurity_var_importance_table_name)
+
 grouping_cols_str = ('' if grouping_cols is None
  else grouping_cols + ",")
 group_table_name = add_postfix(output_table_name, "_group")
+
 sql_create_group_table = """
 CREATE TABLE {group_table_name} AS
 SELECT
 gid,
 {grouping_cols_str}
-grp_finished as success,
+grp_finished AS success,
 cat_n_levels,
 cat_levels_in_text,
-oob_error,
-cat_var_importance,
-con_var_importance
+oob_error
+{cat_var_importance_str}
+{con_var_importance_str}
+{impurity_var_importance_str}
 FROM
 {oob_error_table}
 JOIN
 {grp_key_to_grp_cols}
 USING (gid)
 JOIN (
 SELECT
-unnest($1) as grp_key,
-unnest($2) as grp_finished
+unnest($1) AS grp_key,
+unnest($2) AS grp_finished
 ) tree_terminated
 USING (grp_key)
 JOIN
 {cat_features_info_table}
 USING (gid)
-LEFT OUTER JOIN
-{importance_table}
-USING (gid)
+{left_join_importance_table_str}
--- End diff --

We can name it oob_importance_table_* to explicitly distinguish it from 
the impurity importance table.
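The `{schema_madlib}.array_avg` call in the query above averages the per-tree importance vectors element-wise across all trees in a group. As an illustration of that semantics only (not the MADlib C implementation), a minimal sketch:

```python
def array_avg(arrays):
    # Element-wise average of equal-length vectors, analogous to what
    # madlib.array_avg computes over impurity_var_importance per gid.
    n = len(arrays)
    return [sum(col) / float(n) for col in zip(*arrays)]

# Two trees, each with importance scores for two features.
print(array_avg([[2.0, 4.0], [4.0, 8.0]]))  # [3.0, 6.0]
```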


---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-10 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/289#discussion_r201462356
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/random_forest.py_in ---
@@ -616,23 +628,20 @@ def forest_train(
 _calculate_oob_error(schema_madlib, oob_prediction_table,
  oob_error_table, id_col_name,
  is_classification)
-
-importance_table = unique_string()
-sql_create_empty_imp_tbl = """
-CREATE TEMP TABLE {importance_table}
-(
-gid integer,
-cat_var_importance float8[],
-con_var_importance float8[]
-);
-""".format(**locals())
-
plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl)
-plpy.execute(sql_create_empty_imp_tbl)
-
-# we populate the importance_table only if variable 
importance is to be
-# calculated, otherwise we use an empty table which will 
be used later
-# for an outer join.
+importance_table = ''
 if importance:
+importance_table = unique_string()
+sql_create_empty_imp_tbl = """
+CREATE TEMP TABLE {importance_table}
+(
+gid integer,
+cat_var_importance float8[],
+con_var_importance float8[]
+);
+""".format(**locals())
+
plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl)
--- End diff --

Do we need this notice?


---


[GitHub] madlib pull request #285: Madpack: fix install/reinstall not giving proper e...

2018-07-01 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/285

Madpack: fix install/reinstall not giving proper error message

Previously, uninstalling or reinstalling on a database that does not have 
MADlib already installed fails as expected. However, the info messages do 
not mention this failure, and madpack keeps going to prepare database 
objects to install. This commit fixes the issue: if MADlib is not 
installed in the database, madpack reports that nothing was found to 
uninstall/reinstall and stops.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bug_reinstall

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #285


commit 677e76d202f620b71159367f1a1fd3aaab5513fc
Author: Jingyi Mei 
Date:   2018-07-02T00:49:24Z

Madpack: fix install/reinstall not giving proper error message

Previously, uninstalling or reinstalling on a database that does not have 
MADlib already installed fails as expected. However, the info messages do 
not mention this failure, and madpack keeps going to prepare database 
objects to install. This commit fixes the issue: if MADlib is not 
installed in the database, madpack reports that nothing was found to 
uninstall/reinstall and stops.




---


[GitHub] madlib pull request #273: Minibatch Preprocessing: fix dependent var with sp...

2018-06-13 Thread jingyimei
Github user jingyimei closed the pull request at:

https://github.com/apache/madlib/pull/273


---


[GitHub] madlib issue #273: Minibatch Preprocessing: fix dependent var with special c...

2018-06-13 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/273
  
Will handle those in another PR, closing this one.


---


[GitHub] madlib issue #274: Handling special characters in MLP and Encode Categorical...

2018-06-13 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/274
  
Will handle those in another PR, closing this one.


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-13 Thread jingyimei
Github user jingyimei closed the pull request at:

https://github.com/apache/madlib/pull/274


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193525716
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

Agree


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193503140
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

This sounds a bit more obscure, but there are always corner cases we 
can't cover with this approach.
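The `string_to_array` approach discussed above can be sketched as follows. This is an illustrative reconstruction, not the MADlib implementation; the function name and default `array_type` are taken from the quoted diff:

```python
def py_list_to_sql_string(values, array_type="text[]"):
    # Build a SQL array expression from a Python list while tolerating
    # commas, single quotes, and '$' inside elements.
    if not values:
        return "'{{ }}'::{0}".format(array_type)
    # Join with a delimiter unlikely to occur in real data, then let
    # PostgreSQL split it back with string_to_array.  Dollar-quoting
    # ($$...$$) avoids escaping embedded single quotes; elements that
    # themselves contain ',,,' or '$$' remain unhandled corner cases.
    joined = ",,,".join(str(v) for v in values)
    return "string_to_array($${0}$$, ',,,')::{1}".format(joined, array_type)

print(py_list_to_sql_string(["ele'one", "M,M", "M$M"]))
```

The generated expression lets PostgreSQL do the splitting, so a value like `M,M` survives intact where a plain `'{...}'` literal would break it into two elements.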


---


[GitHub] madlib pull request #273: Minibatch Preprocessing: fix dependent var with sp...

2018-05-25 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/273

Minibatch Preprocessing: fix dependent var with special character

JIRA:MADLIB-1237

Previously, minibatch processing would error out when the specified
dependent variable has special characters within its values. We
fixed this in two places:
1. in the query with the WHERE condition, we use $$ to quote the string
   instead of ' ' for string equality.
2. in the query with creating an array column, instead of using
   SELECT '{ele'with*special_char, 'M,M', 'M$M'}'::text[], we call
   SELECT string_to_array(''ele'with*special_char', 'M"M', 'M$M'', 
',')::text[]

Install check test cases also get updated.
Co-Authored-by: Jingyi Mei <j...@pivotal.io>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bug_minibatch_preprocessor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #273


commit 3a55cafd7fad1e87996b4584f7e0431529da7a52
Author: Jingyi Mei <jmei@...>
Date:   2018-05-23T23:29:54Z

Minibatch Preprocessing: fix dependent var with special character

JIRA:MADLIB-1237

Previously, minibatch processing would error out when the specified
dependent variable has special characters within its values. We
fixed this in two places:
1. in the query with the WHERE condition, we use $$ to quote the string
   instead of ' ' for string equality.
2. in the query with creating an array column, instead of using
   SELECT '{ele'with*special_char, 'M,M', 'M$M'}'::text[], we call
   SELECT string_to_array(''ele'with*special_char', 'M"M', 'M$M'', 
',')::text[]

Install check test cases also get updated.
Co-Authored-by: Jingyi Mei <j...@pivotal.io>
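The first fix above (dollar-quoting in the WHERE clause) can be sketched as follows; `build_where_clause` is a hypothetical helper for illustration, not MADlib code:

```python
def build_where_clause(col, value):
    # Dollar-quoting ($$...$$) lets the literal contain single quotes,
    # commas, etc. without manual escaping, unlike standard '...' quoting.
    # Values that themselves contain '$$' are still a corner case.
    return "WHERE {0} = $${1}$$".format(col, value)

print(build_where_clause("dep_var", "ele'with*special_char"))
# WHERE dep_var = $$ele'with*special_char$$
```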




---


[GitHub] madlib issue #265: Release: Add v1.14 release notes

2018-04-19 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/265
  
LGTM


---


[GitHub] madlib pull request #265: Release: Add v1.14 release notes

2018-04-19 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/265#discussion_r182843994
  
--- Diff: RELEASE_NOTES ---
@@ -9,6 +9,56 @@ commit history located at 
https://github.com/apache/madlib/commits/master.
 
 Current list of bugs and issues can be found at 
https://issues.apache.org/jira/browse/MADLIB.
 
---
+MADlib v1.14:
+
+Release Date: 2018-April-28
+
+New features:
+* New module - Balanced datasets: A sampling module to balance 
classification
+datasets by resampling using various techniques including 
undersampling,
+oversampling, uniform sampling or user-defined proportion sampling
+(MADLIB-1168)
+* Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor 
function
+necessary to create batches from the data (MADLIB-1200, MADLIB-1206)
+* k-NN: Added weighted averaging/voting by distance (MADLIB-1181)
+* Summary: Added additional stats: number of positive, negative, zero 
values and
+95% confidence intervals for the mean (MADLIB-1167)
+* Encode categorical: Updated to produce lower-case column names when 
possible
+(MADLIB-1202)
+* MLP: Added support for already one-hot encoded categorical dependent 
variable
+in a classification task (MADLIB-1222)
+* Pagerank: Added option for personalized vertices that allows higher 
weightage
+for a subset of vertices which will have a higher jump probability as
+compared to other vertices and a random surfer is more likely to
+jump to these personalization vertices (MADLIB-1084)
+
+Bug fixes:
+- Fixed issue with invalid calls of construct_array that led to 
problems
+in Postgresql 10 (MADLIB-1185)
+- Added newline between file concatenation during PGXN install 
(MADLIB-1194)
+- Fixed upgrade issues in knn (MADLIB-1197)
+- Added fix to ensure RF variable importance are always non-negative
+- Fixed inconsistency in LDA output and improved usability
+(MADLIB-1160, MADLIB-1201)
+- Fixed MLP and RF predict for models trained in earlier versions to
+ensure missing optional parameters are given appropriate default 
values
+(MADLIB-1207)
+- Fixed a scenario in DT where no features exist due to categorical 
columns
+with single level being dropped led to the database crashing
+- Fixed step size initialization in MLP based on learning rate policy
+(MADLIB-1212)
+- Fixed PCA issue that leads to failure when grouping column is a TEXT 
type
+(MADLIB-1215)
+- Fixed cat levels output in DT when grouping is enabled (MADLIB-1218)
+- Fixed and simplified initialization of model coefficients in MLP
+- Removed source table dependency for predicting regression models in 
MLP
+(MADLIB-1223)
+- Print loss of first iteration in MLP (MADLIB-1228)
+
--- End diff --

We should mention the MADLIB-1209 neural net related bug fix.


---


[GitHub] madlib pull request #256: Minibatch Preprocessing: change default buffer siz...

2018-04-05 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/256

Minibatch Preprocessing: change default buffer size formula for grouping

This commit changes the previous calculation formula for default buffer
size. Previously, we used num_rows_processed/num_of_segments to indicate
data distribution in each segment. To adjust this to a grouping
scenario, we use avg_num_rows_processed/num_of_segment to indicate data
distribution when there is more than one group of data. Other code changes
are due to this change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib 
feature/minibatch-preprocessing-default_buffer_size

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/256.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #256


commit b33b00504ef46c7084b60f9c54d8f6797660542b
Author: Jingyi Mei <jmei@...>
Date:   2018-04-04T00:50:57Z

Minibatch Preprocessing: change default buffer size formula to fit grouping

This commit changes the previous calculation formula for default buffer
size. Previously, we used num_rows_processed/num_of_segments to indicate
data distribution in each segment. To adjust this to a grouping
scenario, we use avg_num_rows_processed/num_of_segment to indicate data
distribution when there is more than one group of data. Other code changes
are due to this change.
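The revised heuristic described above can be sketched as follows. The function name, the ceiling, and the minimum of 1 are illustrative assumptions, not MADlib's exact formula; the key point is using the average group size rather than the total row count:

```python
import math

def default_buffer_size(num_rows_per_group, num_segments):
    # With grouping, each group is distributed across all segments, so the
    # per-segment row count is estimated from the *average* group size
    # (avg_num_rows_processed / num_of_segments), not from the total.
    avg_rows = sum(num_rows_per_group) / float(len(num_rows_per_group))
    rows_per_segment = avg_rows / num_segments
    return max(1, int(math.ceil(rows_per_segment)))

print(default_buffer_size([1000, 3000], 4))  # avg 2000 rows / 4 segments -> 500
```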




---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177916814
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 
0.1,
 ) FROM pagerank_gr_out WHERE user_id=2;
 
 
+-- Tests for Personalized Page Rank
+
+-- Test without grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_out;
+DROP TABLE IF EXISTS pagerank_ppr_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
 'id',-- Vertex id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ NULL, -- Grouping column
+'{1,3}'); -- Personalized Nodes
+
+
+-- View the PageRank of all vertices, sorted by their scores.
+SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124,
--- End diff --

Is this 0.00124 based on the current test result? Can we make it smaller?


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177899442
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the 
normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = 
ANY(ARRAY{nodes_of_interest})""".format(
+**locals())
+random_prob_grp = 1.0 - damping_factor
+init_prob_grp = 1.0 / len(nodes_of_interest)
+else:
+random_prob_grp  = 
"""{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION
+ """.format(**locals())
+init_prob_grp  =  """1/COUNT(__vertices__)::DOUBLE 
PRECISION""".format(
+**locals())
+
 plpy.execute("DROP TABLE IF EXISTS 
{0}".format(vertices_per_group))
 plpy.execute("""CREATE TEMP TABLE {vertices_per_group} AS
 SELECT {distinct_grp_table}.*,
-1/COUNT(__vertices__)::DOUBLE PRECISION AS {init_pr},
-{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION
-AS {random_prob}
+{init_prob_grp} AS {init_pr},
+{random_prob_grp} as {random_prob}
 FROM {distinct_grp_table} INNER JOIN (
 SELECT {grouping_cols}, {src} AS __vertices__
 FROM {edge_temp_table}
 UNION
 SELECT {grouping_cols}, {dest} FROM 
{edge_temp_table}
 ){subq}
-ON {grouping_where_clause}
+ON {grouping_where_clause} {where_clause_ppr}
--- End diff --

Put {where_clause_ppr} on a new line.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177912288
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +615,55 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
--- End diff --

This "personalized_nodes" table doesn't get created anywhere before this 
point.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177897977
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the 
normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = 
ANY(ARRAY{nodes_of_interest})""".format(
+**locals())
+random_prob_grp = 1.0 - damping_factor
+init_prob_grp = 1.0 / len(nodes_of_interest)
--- End diff --

Is len(nodes_of_interest) == total_ppr_nodes? If so, we can reuse it so 
that we don't need another O(n) pass.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177910146
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the 
normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = 
ANY(ARRAY{nodes_of_interest})""".format(
--- End diff --

After consulting with QP, `__vertices__ = ANY(ARRAY{nodes_of_interest})` 
works exactly the same as `__vertices__ in (nodes_of_interest)`, this may look 
simpler.  

Besides, since we use this condition in multiple places, I am wondering if 
a join clause is faster - we create a temp table that saves special node ids 
and we join this temp table with vertex table by vertex id - QP suggested to 
try both and see which one runs faster.
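For illustration, the two equivalent predicate forms mentioned above can be generated like this (a sketch only; `__vertices__` and the node list are placeholders from the quoted diff):

```python
nodes_of_interest = [1, 3]

# In PostgreSQL, "col = ANY(array)" and "col IN (list)" are semantically
# equivalent for a fixed list of values; the IN form reads more simply.
any_clause = "__vertices__ = ANY(ARRAY{0})".format(nodes_of_interest)
in_clause = "__vertices__ IN ({0})".format(
    ", ".join(str(n) for n in nodes_of_interest))

print(any_clause)  # __vertices__ = ANY(ARRAY[1, 3])
print(in_clause)   # __vertices__ IN (1, 3)
```

The third option raised above, joining against a temp table of node ids, trades predicate evaluation for a join and is worth benchmarking as suggested.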


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177851780
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,62 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
-
-def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
- out_table, damping_factor, max_iter, threshold, 
grouping_cols, **kwargs):
+# Validate against the given set of nodes for Personalized Page Rank
+if nodes_of_interest:
+nodes_of_interest_count = len(nodes_of_interest)
+vertices_count = plpy.execute("""
+   SELECT count(DISTINCT({vertex_id})) AS cnt FROM 
{vertex_table}
+   WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest})
+   """.format(**locals()))[0]["cnt"]
+# Check to see if the given set of nodes exist in vertex table
+if vertices_count != len(nodes_of_interest):
+plpy.error("PageRank: Invalid value for {0}, must be a subset 
of the vertex_table".format(
--- End diff --

This query catches several invalid scenarios, including duplicate nodes in 
nodes_of_interest. In the error message, maybe we can say "Invalid value for 
{0}, must be a subset of the vertex_table without duplicate nodes".
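The duplicate-detection behavior follows from comparing a distinct match count against the raw input length. A minimal Python sketch of the same logic (hypothetical helper, not the module's code):

```python
def validate_nodes_of_interest(nodes_of_interest, vertex_ids):
    # Mirrors SELECT count(DISTINCT vertex_id) ... WHERE vertex_id = ANY(...):
    # the distinct match count is compared with len(nodes_of_interest), so
    # duplicates in the input and nodes missing from the vertex table both
    # make the counts differ.
    matched = len(set(vertex_ids) & set(nodes_of_interest))
    return matched == len(nodes_of_interest)

print(validate_nodes_of_interest([1, 3], [1, 2, 3, 4]))  # True  (valid)
print(validate_nodes_of_interest([1, 1], [1, 2, 3, 4]))  # False (duplicate)
print(validate_nodes_of_interest([1, 9], [1, 2, 3, 4]))  # False (missing)
```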


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177894976
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the 
normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
--- End diff --

`if nodes_of_interest:`  or `if total_ppr_nodes > 0:` 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177915601
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -647,6 +778,26 @@ SELECT * FROM pagerank_out ORDER BY user_id, pagerank 
DESC;
 -- View the summary table to find the number of iterations required for
 -- convergence for each group.
 SELECT * FROM pagerank_out_summary;
+
+-- Compute the Personalized PageRank:
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
   'id', -- Vertex id column
+   'edge',   -- Edge table
   'src=src, dest=dest', -- Comma delimited string of 
edge arguments
+   'pagerank_out',   -- Output table of PageRank
+NULL,-- Default damping factor 
(0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold
+NULL,-- No Grouping
--- End diff --

Move those NULLs one space to the left.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177914251
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +186,37 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
--- End diff --

Is it better to check `if nodes_of_interest` before calling 
get_query_params_for_ppr instead of checking it in get_query_params_for_ppr?


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177914961
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -551,14 +680,16 @@ def pagerank_help(schema_madlib, message, **kwargs):
 message.lower() in ("usage", "help", "?"):
 help_string = "Get from method below"
 help_string = get_graph_usage(schema_madlib, 'PageRank',
-"""out_table TEXT, -- Name of the output table for PageRank
+  """out_table TEXT, -- Name of 
the output table for PageRank
 damping_factor DOUBLE PRECISION, -- Damping factor in random surfer 
model
  -- (DEFAULT = 0.85)
 max_iter  INTEGER, -- Maximum iteration number (DEFAULT = 100)
 threshold DOUBLE PRECISION, -- Stopping criteria (DEFAULT = 
1/(N*1000),
 -- N is number of vertices in the 
graph)
-grouping_col  TEXT -- Comma separated column names to group on
+grouping_col  TEXT, -- Comma separated column names to group on
-- (DEFAULT = NULL, no grouping)
+nodes_of_interest ARRAY OF INTEGER -- A comma separated list of 
vertices
+  or nodes for personalized page 
rank.
 """) + """
 
--- End diff --

Indent the left side less, and align the comments (--) to the right.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177892625
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,62 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
-
-def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
- out_table, damping_factor, max_iter, threshold, 
grouping_cols, **kwargs):
+# Validate against the given set of nodes for Personalized Page Rank
+if nodes_of_interest:
+nodes_of_interest_count = len(nodes_of_interest)
+vertices_count = plpy.execute("""
+   SELECT count(DISTINCT({vertex_id})) AS cnt FROM 
{vertex_table}
+   WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest})
+   """.format(**locals()))[0]["cnt"]
+# Check to see if the given set of nodes exist in vertex table
+if vertices_count != len(nodes_of_interest):
+plpy.error("PageRank: Invalid value for {0}, must be a subset 
of the vertex_table".format(
+nodes_of_interest))
+# Validate given set of nodes against each user group.
+# If all the given nodes are not present in the user group
+# then throw an error.
+if grouping_cols_list:
+missing_user_grps = ''
+grp_by_column = get_table_qualified_col_str(
+edge_table, grouping_cols_list)
+grps_without_nodes = plpy.execute("""
+   SELECT {grp_by_column} FROM {edge_table}
+   WHERE src = ANY(ARRAY{nodes_of_interest}) group by 
{grp_by_column}
+   having count(DISTINCT(src)) != {nodes_of_interest_count}
+   """.format(**locals()))
+for row in range(grps_without_nodes.nrows()):
+missing_user_grps += 
str(grps_without_nodes[row]['user_id'])
+if row < grps_without_nodes.nrows() - 1:
+missing_user_grps += ' ,'
+if grps_without_nodes.nrows() > 0:
+plpy.error("Nodes for Personalized Page Rank are missing 
from these groups: {0} ".format(
+missing_user_grps))
+
--- End diff --

Here some similar things are tested twice - when `if nodes_of_interest`, 
there is a `count` operation in line 73 and one test in line 77 (this 
covers the no-grouping case). Then when `if grouping_cols_list`, another 
`count` and compare happen in line 90 per group. There might be a way to 
simplify the logic so that for grouping we don't need to do it twice. 
Besides, if the above query really slows down performance a lot, I would 
consider doing this more simply: instead of listing the groups that are 
missing special nodes, just emit a warning (optional, depending on how 
expensive the above query is).


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177916983
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 
0.1,
 ) FROM pagerank_gr_out WHERE user_id=2;
 
 
+-- Tests for Personalized Page Rank
+
+-- Test without grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_out;
+DROP TABLE IF EXISTS pagerank_ppr_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
 'id',-- Vertex id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ NULL, -- Grouping column
+'{1,3}'); -- Personalized Nodes
+
+
+-- View the PageRank of all vertices, sorted by their scores.
+SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124,
+'PageRank: Scores do not sum up to 1.'
+) FROM pagerank_ppr_out;
+
+
+-- Test with grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_grp_out;
+DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
+ 'id',-- Vertex id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_grp_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ 'user_id', -- Grouping column
+'{1,3}'); -- Personalized Nodes
+
+SELECT assert(count(*) = 14, 'Tuple count for Pagerank out table != 14') 
FROM pagerank_ppr_grp_out;
--- End diff --

Can we do a similar assertion here per group?
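
The suggested per-group assertion amounts to checking that the PageRank scores within each group sum to roughly 1. A small Python sketch of that check (the real test stays in SQL; names here are illustrative):

```python
from collections import defaultdict

def check_pagerank_sums(rows, tol=0.1):
    """rows: (group, pagerank) pairs. Return, per group, whether the
    scores sum to ~1 within `tol`, mirroring the ungrouped assertion
    in the test above. (Editor's sketch, not the actual SQL test.)"""
    sums = defaultdict(float)
    for grp, pr in rows:
        sums[grp] += pr
    return {g: abs(s - 1.0) < tol for g, s in sums.items()}

print(check_pagerank_sums([(1, 0.6), (1, 0.41), (2, 0.2)]))
# -> {1: True, 2: False}
```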


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177917620
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id;
 (2 rows)
 
 
+-# Example of Personalized Page Rank with Nodes {2,4}
+
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertex id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimited string of edge arguments
+   'pagerank_out',   -- Output table of PageRank 
+NULL,-- Default damping factor (0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold 
+NULL,-- No Grouping 
+   '{2,4}'); -- Personalized Nodes
--- End diff --

Great


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177915929
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -66,7 +66,12 @@ SELECT pagerank(
 'id',-- Vertex id column
  '"EDGE"',  -- "EDGE" table
  'src=src, dest=dest', -- "EDGE" args
- 'pagerank_out'); -- Output table of PageRank
+ 'pagerank_out',-- Output table of PageRank
+  NULL, -- Default damping factor (0.85)
+  NULL, -- Default max iters (100)
+  NULL, -- Default Threshold 
+  NULL, -- No Grouping 
+ NULL); -- Personalized Nodes
--- End diff --

In this case, we can remove the last 5 NULLs since they are all optional.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177893734
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -122,12 +158,13 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 grouping_where_clause = ''
 group_by_clause = ''
 random_prob = ''
+ppr_join_clause = ''
 
 edge_temp_table = unique_string(desp='temp_edge')
 grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
 "DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, dest))
+grouping_cols_comma, dest))
--- End diff --

Maybe align the indentation with the line above, or move the line above back 
to match the current indentation.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177917195
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
+  ppr_join_clause, vertex_id,
+  edge_temp_table, 
vertex_table, cur_distribution,
+  personalized_nodes)
+total_ppr_nodes = ppr_params[0]
+random_jump_prob_ppr = ppr_params[1]
+ppr_join_clause = ppr_params[2]
+
 random_probability = (1.0 - damping_factor) / n_vertices
+if total_ppr_nodes > 0:
+random_jump_prob = random_jump_prob_ppr
+else:
+random_jump_prob = random_probability
--- End diff --

Got it.


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175957289
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
+m4_changequote(`')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+"""
+This class is responsible for executing the main logic of mini batch
+preprocessing, which packs multiple rows of selected columns from the
+source table into one row based on the buffer size
+"""
+def __init__(self, schema_madlib, source_table, output_table,
+  dependent_varname, independent_varname, buffer_size, 
**kwargs):
+self.schema_madlib = schema_madlib
+self.source_table = source_table
+self.output_table = output_table
+self.dependent_varname = dependent_varname
+self.independent_varname = independent_varname
+self.buffer_size = buffer_size
+
+self.module_name = "minibatch_preprocessor"
+self.output_standardization_table = add_postfix(self.output_table,
+   "_standardization")
+self.output_summary_table = add_postfix(self.output_table, 
"_summary")
+self._validate_minibatch_preprocessor_params()
+
+def minibatch_preprocessor(self):
+# Get array expressions for both dep and indep variables from the
+# MiniBatchQueryFormatter class
+dependent_var_dbtype = get_expr_type(self.dependent_varname,
+ self.source_table)
+qry_formatter = MiniBatchQueryFormatter(self.source_table)
+dep_var_array_str, dep_var_classes_str = qry_formatter.\
+get_dep_var_array_and_classes(self.dependent_varname,
+  dependent_var_dbtype)
+indep_var_array_str = qry_formatter.get_indep_var_array_str(
+  self.independent_varname)
+
+standardizer = MiniBatchStandardizer(self.schema_madlib,
+ self.source_table,
+ dep_var_array_str,
+ indep_var_array_str,
+ 
self.output_standardization_table)
+standardize_query = standardizer.get_query_for_standardizing()
+
+num_rows_processed, num_missing_rows_skipped = self.\
+
_get_skipped_rows_processed_count(
+dep_var_array_str,
+indep_var_array_str)
+calculated_buffer_size = MiniBatchBufferSizeCalculator.\
+

[GitHub] madlib issue #240: MLP: Fix step size initialization based on learning rate ...

2018-03-20 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/240
  
We manually tested the bug fix by printing the step_size and step_size_init 
for each iteration and then called mlp with all the possible policies.
step_size was updated as expected.  +1 for the changes


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175663431
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
+  ppr_join_clause, vertex_id,
+  edge_temp_table, 
vertex_table, cur_distribution,
+  personalized_nodes)
+total_ppr_nodes = ppr_params[0]
+random_jump_prob_ppr = ppr_params[1]
+ppr_join_clause = ppr_params[2]
+
 random_probability = (1.0 - damping_factor) / n_vertices
+if total_ppr_nodes > 0:
+random_jump_prob = random_jump_prob_ppr
+else:
+random_jump_prob = random_probability
--- End diff --

Can move (1.0 - damping_factor) / n_vertices here since random_probability 
is not used anywhere else.
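
The probability assignment being discussed can be sketched as follows (a simplified Python illustration of the logic; the PR implements it as a SQL CASE expression):

```python
def random_jump_prob(damping_factor, n_vertices, nodes_of_interest=None):
    """Return a per-vertex random-jump probability function.

    Standard PageRank spreads (1 - d) / N uniformly over all N vertices;
    personalized PageRank assigns (1 - d) only to the nodes of interest
    and 0 elsewhere, as the CASE expression in the diff above does.
    (Editor's sketch of the logic, not the PR's SQL.)"""
    if nodes_of_interest:
        targets = set(nodes_of_interest)
        jump = 1.0 - damping_factor          # full (1 - d) per target node
        return lambda v: jump if v in targets else 0.0
    uniform = (1.0 - damping_factor) / n_vertices
    return lambda v: uniform

p = random_jump_prob(0.85, 4, nodes_of_interest=[2, 4])
print(p(2), p(3))
```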


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175664342
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -84,7 +89,8 @@ SELECT pagerank(
  NULL,
  NULL,
  NULL,
- 'user_id');
+ 'user_id',
+ NULL);
 
 -- View the PageRank of all vertices, sorted by their scores.
 SELECT assert(relative_error(SUM(pagerank), 1) < 0.1,
--- End diff --

We may need at least two test cases with nodes_of_interest not null and 
with/without grouping.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175627510
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function prepares the join clause and the condition used to
+ calculate Personalized Page Rank. It returns the total number of
+ user-provided nodes of interest, a join clause, and a clause to be
+ added to the existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
+
+if nodes_of_interest:
+total_ppr_nodes = len(nodes_of_interest)
+init_value_ppr_nodes = 1.0 / total_ppr_nodes
+# Create a Temp table that holds the Inital probabilities for the
+# user provided nodes
+plpy.execute("""
+CREATE TEMP TABLE {personalized_nodes} AS
+SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION 
as pagerank
+FROM {vertex_table} where {vertex_id} =  
ANY(ARRAY{nodes_of_interest})
+{cur_distribution}
+""".format(**locals()))
+ppr_join_clause = """ LEFT  JOIN {personalized_nodes} on
+{personalized_nodes}.{vertex_id} = 
{edge_temp_table}.dest""".format(**locals())
+prob_value = 1.0 - damping_factor
+
+# In case of PPR, Assign the Random jump probability to the 
nodes_of_interest only.
+# For rest of the nodes, Random jump probability  will be zero.
+random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = 
ANY(ARRAY{nodes_of_interest})
+THEN {prob_value}
+ELSE 0
+END """.format(**locals())
+return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause)
+
+
 def pagerank_help(schema_madlib, message, **kwargs):
--- End diff --

We need the new parameter explained in the helper function too.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175665727
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no 
grouping is used and
 a single model is generated for all data.
 @note Expressions are not currently supported for 'grouping_cols'.
 
+ nodes_of_interest (optional) 
--- End diff --

@fmcquillan99 Do we also need additional explanation of personalized 
pagerank somewhere in the user doc? 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175631615
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id;
 (2 rows)
 
 
+-# Example of Personalized Page Rank with Nodes {2,4}
+
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertix id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimited string of edge arguments
+   'pagerank_out',   -- Output table of PageRank 
+NULL,-- Default damping factor (0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold 
+NULL,-- No Grouping 
+   '{2,4}'); -- Personalized Nodes
--- End diff --

Another valid input format for the personalized nodes array is 'ARRAY[2,4]'; 
this should be mentioned in an example or the user doc later.


---


[GitHub] madlib pull request #245: Reduce Install Check run time

2018-03-19 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/245

Reduce Install Check run time

To reduce the total run time of install check, we looked at the top 5 
modules that take the longest and modified their install check test cases. 
See each commit for details.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib reduce_IC_run_time

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #245


commit bfb3ef87302d80412c72f82194007f088e828974
Author: Jingyi Mei <jmei@...>
Date:   2018-03-16T00:07:44Z

PCA: Refactor IC for pca_project to reduce run time

To reduce the run time for pca_project, we removed calls to
pca_sparse_train and pca_train. Instead, we directly insert data in the
sql file.
Also, we renamed some output table names for simplicity.

Co-authored-by: Nikhil Kak <n...@pivotal.io>

commit a744b396cd7d21a43f7ac1bfede963a1bae318d9
Author: Jingyi Mei <jmei@...>
Date:   2018-03-16T00:45:23Z

Decision Tree: Modify IC to reduce run time

1. We use a smaller array dataset as input to one of the test cases, which
reduced the decision tree IC run time by 2/3.
2. Reduce n_folds from 5 to 3 in one test case

Co-authored-by: Nikhil Kak <n...@pivotal.io>

commit f61bf0e88a72506e299060f3e20f062d58338ec6
Author: Jingyi Mei <jmei@...>
Date:   2018-03-16T18:41:15Z

Random Forest: Clean up install check

This commit reorders test cases and adds comments. Besides, it removes
unnecessary casting in queries.

commit 74c650ba3ef56206e86f461c00f61cb1c70fb78a
Author: Jingyi Mei <jmei@...>
Date:   2018-03-16T18:44:26Z

Random Forest: Reduce install check time

This commit changes num of trees from 100 to 10 in two test cases, so
that the total run time of IC will be reduced.

Co-authored-by: Nikhil Kak <n...@pivotal.io>

commit c5b88d55a5a9eb25c5dc9ed01f546ab31c4b8f5e
Author: Jngyi Mei and Nikhil Kak <jmei+nkak@...>
Date:   2018-03-19T18:31:58Z

Elastic Net: Remove cross validation test

To reduce the Install Check run time for Elastic Net, we removed the
test case with cross validation.

Co-authored-by: Jingyi Mei <j...@pivotal.io>
Co-authored-by: Nikhil Kak <n...@pivotal.io>




---


[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

2018-02-13 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/234

Create lower case column name in encode_categorical_variables()

JIRA:MADLIB-1202
The previous madlib.encode_categorical_variables() function generates
column names with some capital characters, including:
1. when you specify top_values, there will be a column name with suffix
__MISC__
2. when you set encode_nulls as True, there will be a column name with
suffix __NULL
3. when the original column is boolean type, there will be column names
with suffixes _True and _False

In the above cases, users have to use double quoting to query, which is
not convenient.

This commit addresses this; all three scenarios will now generate
column names in lower case.
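
The renaming rule the fix applies can be sketched as follows (`encoded_col_name` is a hypothetical helper for illustration, not the actual MADlib implementation):

```python
def encoded_col_name(col, value):
    """Build a one-hot encoded column name in lower case, covering the
    three cases the fix describes: __MISC__, __NULL and boolean
    True/False suffixes. (Editor's sketch, not MADlib code.)"""
    return "{0}_{1}".format(col, value).lower()

print(encoded_col_name("is_member", True))   # -> is_member_true
print(encoded_col_name("city", "__MISC__"))  # -> city___misc__
```

With the names lower-cased, users no longer need double-quoted identifiers in their queries.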

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jingyimei/madlib 
encode_categorial_column_name_change

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #234


commit 4d78f9425ffb76089d30bbe85cb3e07f1050268a
Author: Jingyi Mei <jmei@...>
Date:   2018-02-14T00:11:39Z

Create lower case column name in encode_categorical_variables()

JIRA:MADLIB-1202
The previous madlib.encode_categorical_variables() function generates
column names with some capital characters, including:
1. when you specify top_values, there will be a column name with suffix
__MISC__
2. when you set encode_nulls as True, there will be a column name with
suffix __NULL
3. when the original column is boolean type, there will be column names
with suffixes _True and _False

In the above cases, users have to use double quoting to query, which is
not convenient.

This commit addresses this; all three scenarios will now generate
column names in lower case.




---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167717958
  
--- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
@@ -74,175 +81,231 @@ tasks related to text.
 Flag to indicate if a vocabulary is to be created. If TRUE, an 
additional
 output table is created containing the vocabulary of all words, with 
an id
 assigned to each word. The table is called 
output_table_vocabulary
-(suffix added to the output_table name) and contains the
+(i.e., suffix added to the output_table name) and contains the
 following columns:
-- \c wordid: An id assignment for each word
-- \c word: The word/term
+- \c wordid: An id for each word.
--- End diff --

We can mention that it is in alphabetical order.


---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167715245
  
--- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in ---
@@ -74,175 +81,231 @@ tasks related to text.
 Flag to indicate if a vocabulary is to be created. If TRUE, an 
additional
 output table is created containing the vocabulary of all words, with 
an id
 assigned to each word. The table is called 
output_table_vocabulary
-(suffix added to the output_table name) and contains the
+(i.e., suffix added to the output_table name) and contains the
 following columns:
-- \c wordid: An id assignment for each word
-- \c word: The word/term
+- \c wordid: An id for each word.
+- \c word: The word/term corresponding to the id.
 
 
 
 @anchor examples
 @par Examples
 
--# Prepare datasets with some example documents
+-# First we create a document table with one document per row:
 
 DROP TABLE IF EXISTS documents;
-CREATE TABLE documents(docid INTEGER, doc_contents TEXT);
+CREATE TABLE documents(docid INT4, contents TEXT);
 INSERT INTO documents VALUES
-(1, 'I like to eat broccoli and banana. I ate a banana and spinach 
smoothie for breakfast.'),
-(2, 'Chinchillas and kittens are cute.'),
-(3, 'My sister adopted two kittens yesterday'),
-(4, 'Look at this cute hamster munching on a piece of broccoli');
+(0, 'I like to eat broccoli and bananas. I ate a banana and spinach 
smoothie for breakfast.'),
+(1, 'Chinchillas and kittens are cute.'),
+(2, 'My sister adopted two kittens yesterday.'),
+(3, 'Look at this cute hamster munching on a piece of broccoli.');
 
-
--# Add a new column containing the words (lower-cased) in a text array
+You can apply stemming, stop word removal and tokenization at this point 
+in order to prepare the documents for text processing. 
+Depending upon your database version, various tools are 
+available. Databases based on more recent versions of 
+PostgreSQL may do something like:
+
+SELECT tsvector_to_array(to_tsvector('english',contents)) from documents;
+
+
+tsvector_to_array 
++--
+ {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach}
+ {chinchilla,cute,kitten}
+ {adopt,kitten,sister,two,yesterday}
+ {broccoli,cute,hamster,look,munch,piec}
+(4 rows)
+
+In this example, we assume a database based on an older 
+version of PostgreSQL and just perform basic punctuation 
+removal and tokenization. The array of words is added as 
+a new column to the documents table:
 
 ALTER TABLE documents ADD COLUMN words TEXT[];
-UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), 
E'[s+.]');
+UPDATE documents SET words = 
+regexp_split_to_array(lower(
+regexp_replace(contents, E'[,.;\\']','', 'g')
+), E'[s+]');
+\\x on   
+SELECT * FROM documents ORDER BY docid;
+
+
+-[ RECORD 1 
]
+docid| 0
+contents | I like to eat broccoli and bananas. I ate a banana and spinach 
smoothie for breakfast.
+words| 
{i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast}
+-[ RECORD 2 
]
+docid| 1
+contents | Chinchillas and kittens are cute.
+words| {chinchillas,and,kittens,are,cute}
+-[ RECORD 3 
]
+docid| 2
+contents | My sister adopted two kittens yesterday.
+words| {my,sister,adopted,two,kittens,yesterday}
+-[ RECORD 4 
]
+docid| 3
+contents | Look at this cute hamster munching on a piece of broccoli.
+words| {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli}
 
 
--# Compute the frequency of each word in each document
+-# Compute the frequency of each word in each document:
 
-DROP TABLE IF EXISTS documents_tf;
-SELECT madlib.term_frequency('documents', 'docid', 'words', 
'documents_tf');
-SELECT * FROM documents_tf order by docid;
+DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
+SELECT madlib.term_frequency('documents',-- input table
+ 'docid',-- document id
--- End diff --

document id column
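
The basic punctuation removal and tokenization shown in the diff can be mirrored in Python (editor's sketch; the document itself uses SQL `regexp_replace` / `regexp_split_to_array`):

```python
import re

def tokenize(contents):
    """Strip basic punctuation, lower-case, and split on whitespace,
    mirroring the SQL preparation step in the diff above. (Sketch only.)"""
    cleaned = re.sub(r"[,.;']", "", contents).lower()
    return [w for w in re.split(r"\s+", cleaned) if w]

print(tokenize("Chinchillas and kittens are cute."))
# -> ['chinchillas', 'and', 'kittens', 'are', 'cute']
```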


---


[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167708065
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
 \b Arguments
 
 data_table
-TEXT. The name of the table storing the training dataset. Each row 
is
+TEXT. Name of the table storing the training dataset. Each row is
 in the form docid, wordid, count where \c docid, \c 
wordid, and \c count
-are non-negative integers.
-
+are non-negative integers.  
 The \c docid column refers to the document ID, the \c wordid column is 
the
 word ID (the index of a word in the vocabulary), and \c count is the
-number of occurrences of the word in the document.
-
-Please note that column names for \c docid, \c wordid, and \c count 
are currently fixed, so you must use these
-exact names in the data_table.
+number of occurrences of the word in the document. Please note:
+
+- \c wordid must be
+contiguous integers going from 0 to \c voc_size - \c 1.
+- column names for \c docid, \c wordid, and \c count are currently 
fixed, 
+so you must use these exact names in the data_table.  
+
+The function Term Frequency
+can be used to generate vocabulary in the required format from raw documents.
+
 
 model_table
-TEXT. The name of the table storing the learned models. This table 
has one row and the following columns.
+TEXT. This is an output table generated by LDA which contains the 
learned model. 
+It has one row with the following columns:
 
 
 voc_size
-INTEGER. Size of the vocabulary. Note that the \c 
wordid should be continous integers starting from 0 to \c voc_size  \c 
1.  A data validation routine is called to validate the dataset.
+INTEGER. Size of the vocabulary. As mentioned above 
for the input 
+table, \c wordid consists of contiguous integers going 
+from 0 to \c voc_size - \c 1.  
+
 
 
 topic_num
 INTEGER. Number of topics.
 
 
 alpha
-DOUBLE PRECISION. Dirichlet parameter for the per-doc 
topic multinomial (e.g. 50/topic_num).
+DOUBLE PRECISION. Dirichlet prior for the per-document 
+topic multinomial.
 
 
 beta
-DOUBLE PRECISION. Dirichlet parameter for the 
per-topic word multinomial (e.g. 0.01).
+DOUBLE PRECISION. Dirichlet prior for the per-topic 
+word multinomial.
 
 
 model
-BIGINT[].
+BIGINT[]. The encoded model description (not human 
readable).
 
 
 
 output_data_table
-TEXT. The name of the table to store the output data. It has the 
following columns:
+TEXT. The name of the table generated by LDA that stores 
+the output data. It has the following columns:
 
 
 docid
-INTEGER.
+INTEGER. Document id from input 'data_table'.
 
 
 wordcount
-INTEGER.
+INTEGER. Count of number of words in the document, 
+including repeats. For example, if a word appears 3 times 
+in the document, it is counted 3 times.
 
 
 words
-INTEGER[].
+INTEGER[]. Array of \c wordid in the document, not
+including repeats.  For example, if a word appears 3 times 
+in the document, it appears only once in the \c words 
array.
 
 
 counts
-INTEGER[].
+INTEGER[]. Frequency of occurrence of a word in the 
document,
+indexed the same as the \c words array above.  For 
example, if the
+2nd element of the \c counts array is 4, it means that the 
word
+in the 2nd element of the \c words array occurs 4 times in 
the
+document.
 
 
 topic_count
-INTEGER[].
+INTEGER[]. Array of the count of words in the document
+that correspond to each topic.
--- End diff --

maybe mention that array indexing starts at topic 0
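
The words/counts semantics described in the diff can be illustrated with a short sketch (editor's illustration of the documented output columns, not MADlib code):

```python
from collections import Counter

def doc_stats(word_ids):
    """Derive the wordcount / words / counts columns described for the
    LDA output_data_table above: `wordcount` counts repeats, `words`
    lists each wordid once, and `counts` is aligned with `words`."""
    freq = Counter(word_ids)
    words = sorted(freq)                # each wordid appears only once
    counts = [freq[w] for w in words]   # indexed the same as `words`
    return {"wordcount": len(word_ids), "words": words, "counts": counts}

print(doc_stats([7, 3, 7, 7]))
# -> {'wordcount': 4, 'words': [3, 7], 'counts': [1, 3]}
```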

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167708360
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
 \b Arguments
 
 data_table
-TEXT. The name of the table storing the training dataset. Each row 
is
+TEXT. Name of the table storing the training dataset. Each row is
 in the form docid, wordid, count where \c docid, \c 
wordid, and \c count
-are non-negative integers.
-
+are non-negative integers.  
 The \c docid column refers to the document ID, the \c wordid column is 
the
 word ID (the index of a word in the vocabulary), and \c count is the
-number of occurrences of the word in the document.
-
-Please note that column names for \c docid, \c wordid, and \c count 
are currently fixed, so you must use these
-exact names in the data_table.
+number of occurrences of the word in the document. Please note:
+
+- \c wordid must be
+contiguous integers going from 0 to \c voc_size - \c 1.
+- column names for \c docid, \c wordid, and \c count are currently 
fixed, 
+so you must use these exact names in the data_table.  
+
+The function Term Frequency
+can be used to generate vocabulary in the required format from raw documents.
+
 
 model_table
-TEXT. The name of the table storing the learned models. This table 
has one row and the following columns.
+TEXT. This is an output table generated by LDA which contains the 
learned model. 
+It has one row with the following columns:
 
 
 voc_size
-INTEGER. Size of the vocabulary. Note that the \c 
wordid should be continous integers starting from 0 to \c voc_size  \c 
1.  A data validation routine is called to validate the dataset.
+INTEGER. Size of the vocabulary. As mentioned above 
for the input 
+table, \c wordid consists of contiguous integers going 
+from 0 to \c voc_size - \c 1.  
+
 
 
 topic_num
 INTEGER. Number of topics.
 
 
 alpha
-DOUBLE PRECISION. Dirichlet parameter for the per-doc 
topic multinomial (e.g. 50/topic_num).
+DOUBLE PRECISION. Dirichlet prior for the per-document 
+topic multinomial.
 
 
 beta
-DOUBLE PRECISION. Dirichlet parameter for the 
per-topic word multinomial (e.g. 0.01).
+DOUBLE PRECISION. Dirichlet prior for the per-topic 
+word multinomial.
 
 
 model
-BIGINT[].
+BIGINT[]. The encoded model description (not human 
readable).
 
 
 
 output_data_table
-TEXT. The name of the table to store the output data. It has the 
following columns:
+TEXT. The name of the table generated by LDA that stores 
+the output data. It has the following columns:
 
 
 docid
-INTEGER.
+INTEGER. Document id from input 'data_table'.
 
 
 wordcount
-INTEGER.
+INTEGER. Count of number of words in the document, 
+including repeats. For example, if a word appears 3 times 
+in the document, it is counted 3 times.
 
 
 words
-INTEGER[].
+INTEGER[]. Array of \c wordid in the document, not
+including repeats.  For example, if a word appears 3 times 
+in the document, it appears only once in the \c words 
array.
 
 
 counts
-INTEGER[].
+INTEGER[]. Frequency of occurance of a word in the 
document,
+indexed the same as the \c words array above.  For 
example, if the
+2nd element of the \c counts array is 4, it means that the 
word
+in the 2nd element of the \c words array occurs 4 times in 
the
+document.
 
 
 topic_count
-INTEGER[].
+INTEGER[]. Array of the count of words in the document
+that correspond to each topic

[GitHub] madlib pull request #232: Multiple LDA improvements and fixes

2018-02-12 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/232#discussion_r167709835
  
--- Diff: src/ports/postgres/modules/lda/lda.sql_in ---
@@ -182,324 +105,789 @@ lda_train( data_table,
 \b Arguments
 
 data_table
-TEXT. The name of the table storing the training dataset. Each row 
is
+TEXT. Name of the table storing the training dataset. Each row is
 in the form docid, wordid, count where \c docid, \c 
wordid, and \c count
-are non-negative integers.
-
+are non-negative integers.  
 The \c docid column refers to the document ID, the \c wordid column is 
the
 word ID (the index of a word in the vocabulary), and \c count is the
-number of occurrences of the word in the document.
-
-Please note that column names for \c docid, \c wordid, and \c count 
are currently fixed, so you must use these
-exact names in the data_table.
+number of occurrences of the word in the document. Please note:
+
+- \c wordid must be 
+contiguous integers going from 0 to \c voc_size - \c 1.
+- column names for \c docid, \c wordid, and \c count are currently 
fixed, 
+so you must use these exact names in the data_table.  
+
+The function Term 
Frequency
+can be used to generate vocabulary in the required format from raw 
documents.
+
 
 model_table
-TEXT. The name of the table storing the learned models. This table 
has one row and the following columns.
+TEXT. This is an output table generated by LDA which contains the 
learned model. 
+It has one row with the following columns:
 
 
 voc_size
-INTEGER. Size of the vocabulary. Note that the \c 
wordid should be continous integers starting from 0 to \c voc_size  \c 
1.  A data validation routine is called to validate the dataset.
+INTEGER. Size of the vocabulary. As mentioned above 
for the input 
+table, \c wordid consists of contiguous integers going 
+from 0 to \c voc_size - \c 1.  
+
 
 
 topic_num
 INTEGER. Number of topics.
 
 
 alpha
-DOUBLE PRECISION. Dirichlet parameter for the per-doc 
topic multinomial (e.g. 50/topic_num).
+DOUBLE PRECISION. Dirichlet prior for the per-document 
+topic multinomial.
 
 
 beta
-DOUBLE PRECISION. Dirichlet parameter for the 
per-topic word multinomial (e.g. 0.01).
+DOUBLE PRECISION. Dirichlet prior for the per-topic 
+word multinomial.
 
 
 model
-BIGINT[].
+BIGINT[]. The encoded model description (not human 
readable).
 
 
 
 output_data_table
-TEXT. The name of the table to store the output data. It has the 
following columns:
+TEXT. The name of the table generated by LDA that stores 
+the output data. It has the following columns:
 
 
 docid
-INTEGER.
+INTEGER. Document id from input 'data_table'.
 
 
 wordcount
-INTEGER.
+INTEGER. Count of number of words in the document, 
+including repeats. For example, if a word appears 3 times 
+in the document, it is counted 3 times.
 
 
 words
-INTEGER[].
+INTEGER[]. Array of \c wordid in the document, not
+including repeats.  For example, if a word appears 3 times 
+in the document, it appears only once in the \c words 
array.
 
 
 counts
-INTEGER[].
+INTEGER[]. Frequency of occurrence of a word in the 
document,
+indexed the same as the \c words array above.  For 
example, if the
+2nd element of the \c counts array is 4, it means that the 
word
+in the 2nd element of the \c words array occurs 4 times in 
the
+document.
 
 
 topic_count
-INTEGER[].
+INTEGER[]. Array of the count of words in the document
+that correspond to each topic
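For readers of this doc patch: the relationship among \c wordcount, \c words, and \c counts described above can be sketched in a few lines of Python. This is an illustration of the documented semantics only, not MADlib's implementation.

```python
from collections import Counter

def summarize_document(word_ids):
    """Derive the output_data_table fields described above from a
    tokenized document given as a list of wordids (repeats allowed)."""
    freq = Counter(word_ids)           # wordid -> number of occurrences
    words = sorted(freq)               # each wordid listed once
    counts = [freq[w] for w in words]  # aligned with `words`
    return {"wordcount": len(word_ids),  # total words, counting repeats
            "words": words,
            "counts": counts}

# A document where wordid 2 appears three times:
summarize_document([0, 2, 2, 5, 2, 0])
# {'wordcount': 6, 'words': [0, 2, 5], 'counts': [2, 3, 1]}
```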

[GitHub] madlib issue #222: minor update to summary() user docs

2018-01-03 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/222
  
Jenkins OK to test


---


[GitHub] madlib pull request #211: Change madlib gppkg version string

2017-12-07 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/211#discussion_r155683583
  
--- Diff: cmake/LinuxUtils.cmake ---
@@ -9,3 +9,14 @@ macro(rh_version OUT_VERSION)
 set(${OUT_VERSION} "${OUT_VERSION}-NOTFOUND")
 endif(EXISTS "/etc/redhat-release")
 endmacro(rh_version)
+
--- End diff --

Yes, I can use a regex to grep RH_MAJOR_VERSION from RH_VERSION after 
running `rh_version(RH_VERSION)`.


---


[GitHub] madlib issue #211: Change madlib gppkg version string

2017-12-07 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/211
  
@fmcquillan99 Yes, the version string is saved in the Version.yml file. When 
we make the release, we will change the version string from 1.13_dev to 1.13, 
and cmake will read it directly from the yml file.


---


[GitHub] madlib pull request #211: Change madlib gppkg version string

2017-12-07 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/211

Change madlib gppkg  version string

This commit changes the naming convention for the madlib gppkg.
After renaming, the madlib gppkg file name will look like:
madlib-1.13_dev-gp5-rhel6-x86_64.gppkg
madlib-1.12_dev-gp4.3orca-rhel5-x86_64.gppkg

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jingyimei/madlib gppkg_rename

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #211


commit 192e7208e2cab2845642eb0995e277382c2af250
Author: Jingyi Mei <j...@pivotal.io>
Date:   2017-12-06T04:13:34Z

Change madlib gppkg  version string

This commit changes the naming convention for the madlib gppkg.
After renaming, the madlib gppkg file name will look like:
madlib-1.13_dev-gp5-rhel6-x86_64.gppkg
madlib-1.12_dev-gp4.3orca-rhel5-x86_64.gppkg




---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151063100
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, 
edge_table, edge_params,
 
 return None
 
+def validate_params_for_centrality_measures(schema_madlib, func_name,
--- End diff --

+1


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151064800
  
--- Diff: src/ports/postgres/modules/graph/hits.py_in ---
@@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 FROM {1}
 """.format(vertex_id, vertex_table))[0]["cnt"]
 
-# Assign default threshold value based on number of nodes in the 
graph.
 if threshold is None:
-threshold = 1.0 / (n_vertices * 1000)
+threshold = 
get_default_threshold_for_centrality_measures(n_vertices)
 
+# table/column names used when grouping_cols is set.
 edge_temp_table = unique_string(desp='temp_edge')
+grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
-"DISTRIBUTED BY ({0})".format(dest))
-plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table))
+"DISTRIBUTED BY ({0}{1})".format(
+grouping_cols_comma, dest))
+drop_tables([edge_temp_table])
 plpy.execute("""
 CREATE TEMP TABLE {edge_temp_table} AS
 SELECT * FROM {edge_table}
 {distribution}
 """.format(**locals()))
 
-# GPDB and HAWQ have distributed by clauses to help them with 
indexing.
-# For Postgres we add the index explicitly.
-if is_platform_pg():
-plpy.execute("CREATE INDEX ON {0}({1})".format(
-edge_temp_table, dest))
+
##
+# Set initial authority_norm and hub_norm as 1, so that later the 
final
+# norm should be positive number
+authority_init_value = 1.0
+hub_init_value = 1.0
+
+subquery1 = unique_string(desp='subquery1')
+
+distinct_grp_table = ''
+select_grouping_cols_comma = ''
+select_subquery1_grouping_cols_comma = ''
--- End diff --

I understood the incentive to rename those variables to make them more 
meaningful; however, this doesn't seem to simplify or reduce the code, and it 
lengthens many lines. When I read the code I still find it hard to keep track 
of the variables and to understand the query. In this case I feel a shorter 
variable name is actually easier to read.

---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151061922
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list):
 return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals())
  for i in grp_list])
 
-
-def _grp_from_table(tbl, grp_list):
-"""
-Helper function for selecting grouping columns of a table
+def get_filtered_cols_subquery_str(include_from_table, exclude_from_table,
--- End diff --

What is our rationale for defining this function with a name that does not 
start with an underscore (i.e., not marked for weak internal use), while the 
previous one is a weak internal use?


---


[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

2017-10-24 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/191#discussion_r146700914
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -160,20 +164,23 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id, label_column_n
 ) {x_temp_table}
 ) {y_temp_table}
 WHERE {y_temp_table}.r <= {k_val}
-""".format(**locals()))
-plpy.execute(
-"""
+""".format(**locals())
+plpy.execute(sql1)
+
+sql2 = """
 CREATE TABLE {output_table} AS
 SELECT {test_id_temp} AS id, {test_column_name} ,
 CASE WHEN {output_neighbors}
-THEN array_agg(knn_temp.train_id)
+THEN array_agg(knn_temp.train_id
+ORDER BY knn_temp.dist ASC)
--- End diff --

This can be written on one line.
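The point of the `ORDER BY knn_temp.dist ASC` fix above is that the aggregated neighbor ids come out sorted by ascending distance. A minimal Python sketch of the intended behavior (illustrative only; the actual module does this in SQL):

```python
def k_nearest_ids(train, test_point, k):
    """Return ids of the k nearest training points, ordered by
    ascending squared Euclidean distance (mirrors ORDER BY dist ASC)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(vec, test_point)), tid)
                   for tid, vec in train.items())
    return [tid for _, tid in dists[:k]]

train = {1: [0.0, 0.0], 2: [5.0, 5.0], 3: [1.0, 1.0]}
k_nearest_ids(train, [0.2, 0.2], 2)  # [1, 3]: nearest id first
```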


---


[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

2017-10-24 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/191#discussion_r146700815
  
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -96,19 +95,19 @@ in a column of type DOUBLE PRECISION[].
 
 point_id
TEXT. Name of the column in 'point_source' containing source data 
ids.
-The ids are of type INTEGER with no duplicates. They do not need to be 
contiguous. 
-This parameter must be used if the list of nearest neighbors are to be 
output, i.e., 
+The ids are of type INTEGER with no duplicates. They do not need to be 
contiguous.
+This parameter must be used if the list of nearest neighbors are to be 
output, i.e.,
 if the parameter 'output_neighbors' below is TRUE or if 
'label_column_name' is NULL.
 
 label_column_name
 TEXT. Name of the column with labels/values of training data points.
-If Boolean, integer or text types will run knn classification, else if 
-double precision values will run knn regression.  
-If you set this to NULL will return neighbors only without doing 
classification or regression.
+If Boolean, integer or text types will run KNN classification, else if
+double precision values will run KNN regression.
+If you set this to NULL, will return the set of neighbors only without
+actually doing classification or regression.
--- End diff --

When I read this, I had some trouble understanding the sentence. Can we 
modify it to read: "If you set this to NULL, it will only return the set 
of neighbors without actually doing classification or regression"?


---


[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering

2017-10-24 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/191#discussion_r146701746
  
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -218,56 +215,57 @@ INSERT INTO knn_test_data VALUES
 
 -#  Run KNN for classification:
 
-DROP TABLE IF EXISTS madlib_knn_result_classification;
-SELECT * FROM madlib.knn( 
+DROP TABLE IF EXISTS knn_result_classification;
+SELECT * FROM madlib.knn(
 'knn_train_data',  -- Table of training data
 'data',-- Col name of training data
-'id',  -- Col Name of id in train data 
+'id',  -- Col Name of id in train data
 'label',   -- Training labels
 'knn_test_data',   -- Table of test data
 'data',-- Col name of test data
-'id',  -- Col name of id in test data 
-'madlib_knn_result_classification',  -- Output table
- 3 -- Number of nearest neighbours
+'id',  -- Col name of id in test data
+'knn_result_classification',  -- Output table
+ 3,-- Number of nearest neighbors
  True  -- True if you want to show 
Nearest-Neighbors, False otherwise
--- End diff --

We can say "True if you want to show Nearest-Neighbors by id" to make it 
clearer


---


[GitHub] madlib pull request #186: Add error message for checking postgres install co...

2017-09-19 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/186#discussion_r139798284
  
--- Diff: src/madpack/madpack.py ---
@@ -512,7 +512,9 @@ def _plpy_check(py_min_ver):
 try:
 _internal_run_query("CREATE LANGUAGE plpythonu;", True)
 except:
-_error('Cannot create language plpythonu. Stopping 
installation...', False)
+_error("""Cannot create language plpythonu. Please check if you
+have configured and installed postgres with `--with-python`
--- End diff --

Sure, that should be more general.


---


[GitHub] madlib pull request #186: Add error message for checking postgres install co...

2017-09-19 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/186

Add error message for checking postgres install configuration

MADlib needs to be installed on a Postgres built with the Python extension. If
Postgres is not configured with the `--with-python` option, madpack
installation will fail because it cannot find the $libdir/plpython2 directory.
This commit adds some instructions for when this failure happens.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jingyimei/madlib python_error_message

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/186.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #186


commit f123105aa543953a12757eb228d33e10f6979b9f
Author: Jingyi Mei <j...@pivotal.io>
Date:   2017-09-19T19:01:02Z

Add error message for checking postgres install configuration

MADlib needs to be installed on a Postgres built with Python configured. If
Postgres is not configured with the `--with-python` option, madpack
installation will fail because it cannot find the $libdir/plpython2 directory.
This commit adds some instructions for when this failure happens.




---


[GitHub] madlib issue #178: Featur: HITS

2017-09-13 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/178
  
Jenkins ok to test


---




[GitHub] madlib pull request #178: Featur: Hit

2017-08-31 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/178

Featur: Hit

JIRA:MADLIB-1124

Introduces a new module that computes the HITS scores of all nodes in a
directed graph.
Implements the HITS algorithm with normalization
(https://en.wikipedia.org/wiki/HITS_algorithm)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jingyimei/incubator-madlib hits

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/178.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #178


commit 144f65b003843e1f9578cb1befdc9338096c80b0
Author: Jingyi <j...@pivotal.io>
Date:   2017-08-25T22:06:39Z

Feature: HITS

JIRA:MADLIB-1124

Introduces a new module that computes the HITS scores of all nodes in a
directed graph.
Implements the HITS algorithm with normalization
(https://en.wikipedia.org/wiki/HITS_algorithm)
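For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of HITS with per-iteration normalization, following the Wikipedia description linked above. This sketch uses L2 normalization and is illustrative only, not the module's SQL implementation.

```python
import math

def hits_scores(edges, n_iter=50):
    """HITS on a directed edge list, normalizing scores each pass."""
    nodes = {v for edge in edges for v in edge}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(n_iter):
        # Authority: sum hub scores of nodes linking in, then normalize.
        auth = {v: sum(hub[s] for s, d in edges if d == v) for v in nodes}
        norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        auth = {v: x / norm for v, x in auth.items()}
        # Hub: sum authority scores of nodes linked to, then normalize.
        hub = {v: sum(auth[d] for s, d in edges if s == v) for v in nodes}
        norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        hub = {v: x / norm for v, x in hub.items()}
    return auth, hub

auth, hub = hits_scores([(0, 1), (0, 2), (1, 2)])
# Node 2 has the highest authority score; node 0 the highest hub score.
```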




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib issue #175: Fix example py_in file

2017-08-21 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/incubator-madlib/pull/175
  
@orhankislal 


---