[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/343 @hpandeycodeit Can you please rebase this branch from master so that we can do a full test? Thanks! ---
[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/343 Jenkins Ok to test ---
[GitHub] madlib issue #343: Linear Regression: Support for JSON and special character...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/343 @hpandeycodeit It seems a couple of dev-check test cases failed in the Jenkins build. Can you please look into them and fix them? Thanks! ---
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/337#discussion_r232721653 --- Diff: src/madpack/diff_udo.sql --- @@ -0,0 +1,81 @@ +-- +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at + +-- http://www.apache.org/licenses/LICENSE-2.0 + +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +-- + +SET client_min_messages to ERROR; +\x on + +CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", '') +$$ LANGUAGE plpythonu; + +CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", 'schema_madlib.') +$$ LANGUAGE plpythonu; + + +CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text, + type_filter text) +RETURNS VOID AS +$$ +import plpy + +plpy.execute(""" +create table {table_name} AS +SELECT * +FROM ( +SELECT n.nspname AS "Schema", + o.oprname AS name, + filter_schema(o.oprcode::text, '{schema_name}') AS oprcode, + alter_schema(pg_catalog.format_type(o.oprleft, NULL), '{schema_name}') AS oprleft, + alter_schema(pg_catalog.format_type(o.oprright, NULL), '{schema_name}') AS oprright, + alter_schema(pg_catalog.format_type(o.oprresult, NULL), 
'{schema_name}') AS rettype +FROM pg_catalog.pg_operator o +LEFT JOIN pg_catalog.pg_namespace n ON n.oid = o.oprnamespace +WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$' --- End diff -- Got it. Thanks for the explanation! ---
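The plpythonu helpers quoted in the diff above can be sketched in plain Python (an assumed equivalent for illustration, not MADlib's actual test harness): they strip or rewrite the schema qualifier in a catalog string such as `madlib.svec_plus`.

```python
# Plain-Python sketch of the plpythonu helpers in diff_udo.sql above.
# They rewrite schema-qualified catalog strings so that operator
# definitions from two installs can be diffed schema-independently.
def filter_schema(argstr, schema_name):
    # Drop the "schema." prefix entirely (e.g. 'madlib.svec_plus' -> 'svec_plus')
    if argstr is None:
        return "NULL"
    return argstr.replace(schema_name + ".", '')

def alter_schema(argstr, schema_name):
    # Replace the concrete schema with the placeholder 'schema_madlib.'
    if argstr is None:
        return "NULL"
    return argstr.replace(schema_name + ".", 'schema_madlib.')

assert filter_schema('madlib.svec_plus', 'madlib') == 'svec_plus'
assert alter_schema('madlib.svec', 'madlib') == 'schema_madlib.svec'
assert filter_schema(None, 'madlib') == 'NULL'
```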
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/337#discussion_r231972206 --- Diff: src/madpack/diff_udo.sql --- @@ -0,0 +1,81 @@ +-- +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at + +-- http://www.apache.org/licenses/LICENSE-2.0 + +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +-- + +SET client_min_messages to ERROR; +\x on + +CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", '') +$$ LANGUAGE plpythonu; + +CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", 'schema_madlib.') +$$ LANGUAGE plpythonu; + + +CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text, + type_filter text) +RETURNS VOID AS +$$ +import plpy + +plpy.execute(""" +create table {table_name} AS +SELECT * +FROM ( +SELECT n.nspname AS "Schema", + o.oprname AS name, + filter_schema(o.oprcode::text, '{schema_name}') AS oprcode, + alter_schema(pg_catalog.format_type(o.oprleft, NULL), '{schema_name}') AS oprleft, + alter_schema(pg_catalog.format_type(o.oprright, NULL), '{schema_name}') AS oprright, + alter_schema(pg_catalog.format_type(o.oprresult, NULL), 
'{schema_name}') AS rettype +FROM pg_catalog.pg_operator o +LEFT JOIN pg_catalog.pg_namespace n ON n.oid = o.oprnamespace +WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$' --- End diff -- Can you please explain this WHERE clause a bit? I am a bit confused about the `OPERATOR` part. ---
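For context on the question above: `OPERATOR(pg_catalog.~)` is PostgreSQL's syntax for schema-qualifying an operator, here pinning the regex-match operator `~` to `pg_catalog` so a user-defined `~` cannot shadow it, and the pattern `^({schema_name})$` anchors the match to the whole namespace name. A plain-Python illustration of that anchored matching (not MADlib code):

```python
# Sketch of the semantics of: n.nspname OPERATOR(pg_catalog.~) '^(name)$'
# The ^...$ anchors force a whole-string match on the namespace name,
# so 'madlib_old' does not match when looking for 'madlib'.
import re

def nspname_matches(nspname, schema_name):
    pattern = r'^({0})$'.format(re.escape(schema_name))
    return re.search(pattern, nspname) is not None

assert nspname_matches('madlib', 'madlib')
assert not nspname_matches('madlib_old', 'madlib')  # anchored: no partial match
assert not nspname_matches('my_madlib', 'madlib')
```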
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/337#discussion_r231953942 --- Diff: src/madpack/create_changelist.py --- @@ -237,6 +325,13 @@ print "Something went wrong! The changelist might be wrong/corrupted." raise finally: -os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt " - "/tmp/madlib_tmp_udt.txt /tmp/madlib_tmp_cl.yaml " --- End diff -- What is this madlib_tmp_cl.yaml file here? Why don't we remove it now? ---
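As an aside on the cleanup being discussed, a hypothetical alternative (not the existing create_changelist.py code) to shelling out with `os.system("rm -f ...")` is to remove the known temp files directly, ignoring any that were never created:

```python
# Hypothetical sketch: remove a list of temp files without shelling out.
# Files that were never created are silently skipped, mirroring rm -f.
import os

def cleanup(paths):
    for path in paths:
        try:
            os.remove(path)
        except OSError:
            pass  # file does not exist; nothing to clean up

cleanup(['/tmp/madlib_tmp_nm.txt', '/tmp/madlib_tmp_udf.txt'])
```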
[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/332#discussion_r228394441 --- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 --- @@ -0,0 +1,108 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +FROM ubuntu:16.04 + +### Get necessary libraries to add postgresql apt repository +RUN apt-get update && apt-get install -y lsb-core software-properties-common wget + +### Add postgresql apt repository +RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main" && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - + +### Have to update after getting new repository +RUN apt-get update + +### Get postgres10 and postgres specific add-ons +RUN apt-get install -y postgresql-10 \ + postgresql-client-10 \ + postgresql-plpython-10 \ + postgresql-server-dev-10 \ + libpq-dev \ + build-essential \ + openssl \ + libssl-dev \ + libboost-all-dev \ + m4 \ + vim \ + pgxnclient \ + flex \ + bison \ + graphviz + +### Reset pg_hba.conf file to allow no password prompt +### Docker file doesn't support heardoc, like cat > /etc/postgresql/10/main/pg_hba.conf<<-EOF, +### and this echo and \n\ are workaround to write the file +RUN echo " \n\ +# Database administrative login by Unix domain socket \n\ +local all all trust \n\ + +# TYPE DATABASEUSERADDRESS METHOD \n\ + +# "local" is for Unix domain socket connections only \n\ +local all all trust \n\ +# IPv4 local connections: \n\ +hostall all 127.0.0.1/32trust \n\ +# IPv6 local connections: \n\ +hostall all ::1/128 trust \n\ +" > /etc/postgresql/10/main/pg_hba.conf + +### We need to set nproc to unlimited to be able to run scripts as +### the user 'postgres'. This is actually useful when we try to setup +### and start a Postgres server. 
+RUN echo " * soft nproc unlimited " > /etc/security/limits.d/postgres-limits.conf + + +### Always start postgres server when login +RUN echo "service postgresql start" >> ~/.bashrc + +### Build custom CMake with SSQL support +RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \ + tar -zxvf cmake-3.6.1.tar.gz && \ + cd cmake-3.6.1 && \ + sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 -DCMAKE_USE_OPENSSL=ON/g' bootstrap && \ + ./configure && \ + make -j2 && \ + make install && \ + cd .. + +### Install doxygen-1.8.13: +RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz && \ --- End diff -- `make doc` is optional when people want to install madlib, and that's why I removed the line `make doc` before. But I think it is OK to bake doxygen in the image and people can run `make doc` when needed. ---
[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/332#discussion_r228394202 --- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 --- @@ -0,0 +1,108 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +FROM ubuntu:16.04 + +### Get necessary libraries to add postgresql apt repository +RUN apt-get update && apt-get install -y lsb-core software-properties-common wget + +### Add postgresql apt repository +RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main" && wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - + +### Have to update after getting new repository +RUN apt-get update + +### Get postgres10 and postgres specific add-ons +RUN apt-get install -y postgresql-10 \ + postgresql-client-10 \ + postgresql-plpython-10 \ + postgresql-server-dev-10 \ + libpq-dev \ + build-essential \ + openssl \ + libssl-dev \ + libboost-all-dev \ + m4 \ + vim \ + pgxnclient \ + flex \ + bison \ + graphviz + +### Reset pg_hba.conf file to allow no password prompt +### Docker file doesn't support heardoc, like cat > /etc/postgresql/10/main/pg_hba.conf<<-EOF, +### and this echo and \n\ are workaround to write the file +RUN echo " \n\ +# Database administrative login by Unix domain socket \n\ +local all all trust \n\ + +# TYPE DATABASEUSERADDRESS METHOD \n\ + +# "local" is for Unix domain socket connections only \n\ +local all all trust \n\ +# IPv4 local connections: \n\ +hostall all 127.0.0.1/32trust \n\ +# IPv6 local connections: \n\ +hostall all ::1/128 trust \n\ +" > /etc/postgresql/10/main/pg_hba.conf + +### We need to set nproc to unlimited to be able to run scripts as +### the user 'postgres'. This is actually useful when we try to setup +### and start a Postgres server. 
+RUN echo " * soft nproc unlimited " > /etc/security/limits.d/postgres-limits.conf + + +### Always start postgres server when login +RUN echo "service postgresql start" >> ~/.bashrc + +### Build custom CMake with SSQL support +RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \ + tar -zxvf cmake-3.6.1.tar.gz && \ + cd cmake-3.6.1 && \ + sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 -DCMAKE_USE_OPENSSL=ON/g' bootstrap && \ + ./configure && \ + make -j2 && \ + make install && \ + cd .. + +### Install doxygen-1.8.13: +RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz && \ + tar xf doxygen-1.8.13.src.tar.gz && \ + cd doxygen-1.8.13 && \ + mkdir build && \ + cd build && \ + cmake -G "Unix Makefiles" .. && \ + make && \ + make install + +### Optional: install LaTex +### uncomment the following 'RUN apt-get' line to bake LaTex into the image +### Note: if you run the following line, please tag the image as +### madlib/postgres_10:LaTex, and don't tag it as latest +# RUN apt-get install -y texlive-full --- End diff -- This is for making design doc (pdf). LaTex is pretty big and thus we make different images so people can pull it when needed. ---
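Regarding the pg_hba.conf workaround quoted in the diff (this Dockerfile syntax has no heredoc support): a sketch of an alternative, using `printf '%s\n'` with one argument per line, which avoids the trailing-backslash `\n\` escapes that the `echo` workaround needs. The file path here is a temp file for illustration, not the real `/etc/postgresql/10/main/pg_hba.conf`.

```shell
# Sketch: write a multi-line pg_hba.conf-style file with printf,
# one quoted argument per output line (no \n\ continuations needed).
conf="$(mktemp)"
printf '%s\n' \
  '# "local" is for Unix domain socket connections only' \
  'local   all   all             trust' \
  '# IPv4 local connections:' \
  'host    all   all   127.0.0.1/32   trust' \
  '# IPv6 local connections:' \
  'host    all   all   ::1/128        trust' \
  > "$conf"
cat "$conf"
```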
[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/332 Update Dockerfile to use ubuntu 16.04 This commit adds a new dockerfile to bake postgres 10.5 on ubuntu 16.04. Also updates docker_start.sh and README to pull the new docker image instead of the old one (Postgres9.6 on Ubuntu 8.9). You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib update_dockerfile Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/332.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #332 commit e745949b270bb633d9d02ca328363a1929478ea0 Author: Jingyi Mei Date: 2018-10-19T14:43:07Z Update dockerfile to use ubuntu 16.04 This commit adds a new dockerfile to bake postgres 10.5 on ubuntu 16.04. Also updates docker_start.sh and README to pull the new docker image instead of the old one (Postgres9.6 on Ubuntu 8.9). ---
[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/326#discussion_r221714858 --- Diff: src/ports/postgres/modules/pmml/test/pmml.ic.sql_in --- @@ -0,0 +1,119 @@ +/* --- *//** + * + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ * + *//* --- */ +DROP TABLE IF EXISTS abalone CASCADE; + +CREATE TABLE abalone ( +id integer, +sex text, +length double precision, +diameter double precision, +height double precision, +whole double precision, +shucked double precision, +viscera double precision, +shell double precision, +rings integer +); + +INSERT INTO abalone VALUES +(3151, 'F', 0.655027, 0.505004, 0.165008, 1.36699, 0.583519, 0.351479, 0.396019, 10), +(2026, 'F', 0.550044, 0.469973, 0.149994, 0.920485, 0.381005, 0.243494, 0.267516, 10), +(3751, 'I', 0.434998, 0.375, 0.110001, 0.41548, 0.170012, 0.0759981, 0.14499, 8), +(720, 'I', 0.149994, 0.16, 0.0250014, 0.0149994, 0.00449966, 0.0048, 0.0050001, 2), +(1635, 'F', 0.574956, 0.469973, 0.154999, 1.1161, 0.509008, 0.237989, 0.340024, 10), +(2648, 'I', 0.5, 0.390013, 0.125, 0.582963, 0.293984, 0.132006, 0.160504, 8), +(1796, 'F', 0.57996, 0.429993, 0.170012, 1.47998, 0.65347, 0.32401, 0.41548, 10), +(209, 'F', 0.525022, 0.41498, 0.170012, 0.832518, 0.275523, 0.168511, 0.309998, 13), +(1451, 'I', 0.455016, 0.33502, 0.135009, 0.501001, 0.274021, 0.0995051, 0.106497, 7), +(1108, 'I', 0.510009, 0.380004, 0.115005, 0.515458, 0.214997, 0.113504, 0.166009, 8), +(3675, 'F', 0.594973, 0.450011, 0.165008, 1.08096, 0.489991, 0.252502, 0.279026, 12), +(2108, 'F', 0.675044, 0.550044, 0.179993, 1.688499989, 0.562055, 0.370496, 0.599978, 15), +(3312, 'F', 0.479982, 0.380004, 0.135009, 0.507006, 0.191504, 0.13651, 0.154999, 12), +(882, 'M', 0.655027, 0.520018, 0.165008, 1.40948, 0.585965, 0.290981, 0.405027, 9), +(3402, 'M', 0.479982, 0.395018, 0.149994, 0.681495, 0.214496, 0.140514, 0.2495, 18), +(829, 'I', 0.409976, 0.325011, 0.16, 0.394017, 0.20799, 0.0655027, 0.105997, 6), +(1305, 'M', 0.535031, 0.434998, 0.149994, 0.716971, 0.347476, 0.14449, 0.194006, 9), +(3613, 'M', 0.599978, 0.46002, 0.179993, 1.1399, 0.422987, 0.257507, 0.364991, 10), +(1068, 'I', 0.340024, 0.265013, 0.0800017, 0.201512, 0.0899967, 0
[GitHub] madlib pull request #326: Install/Dev check: Add new test cases for some mod...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/326#discussion_r221712133 --- Diff: methods/stemmer/src/pg_gp/test/porter_stemmer.sql_in --- @@ -0,0 +1,38 @@ +/* --- *//** + * + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + * + *//* --- */ + +CREATE TABLE token_tbl ( id integer, --- End diff -- Noticed the install check and dev check test cases are exactly the same as each other. Is it better to differentiate them? ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/325 @iyerr3
> I'm getting confused here, so maybe an example would help.
>
> My understanding of @orhankislal 's comment was that say `page_rank` and `bfs` both use the same table called "vertex" then we will have to drop this table at the beginning of each of the test files. This may then collide with a user-created "vertex" table - hence this commit.
> My suggestions to solve the problem:
>
> * Use a function-specific name (`"pagerank_vertex"`) instead of a generic name (`"vertex"`) with no drop by in the beginning.

This doesn't seem to solve the potential conflict with a user-created table. Besides, in some IC/DC files we drop tables in between and create tables with the same name again; in this case the second drop/creation will fail if the user has already created it.

> * Create a single `"vertex"` table (with no drop by statement) and use that for all graph functions

This may cause some assertions based on the number of returned records to fail, as I mentioned in the previous comment, and it also requires effort to audit and combine test data.

> * Schema-qualify the data table with the install-check schema name.

(I am not sure I totally understand this.) The install-check schema is based on the module, not the algorithm; for example, pagerank, hits, and sssp use the same IC schema, and there will be issues when two SQL files drop/create the same table.

> Please note I'm not suggesting pooling **all** install-check data into a single file (at least not as a solution to this problem).

I understood your point. What I was trying to say was that multiple algorithms in the same module would share the same data-creation file. ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/325 @iyerr3 The install-check schema is by module, and we may have multiple files for one module. For your proposed solution, what do you expect in the single source of table creation? Do you mean auditing all the table creation, combining all the data into one big table, and having every SQL file use the same table? For example, for pagerank and hits in the `graph` module, would we create only one vertex and one edge table with all the special cases/data, and always use those two tables in all the SQL files? Besides, how do we deal with a special test case where we only need one row or a few rows in the table instead of a bigger table? Also, in some dev checks we assert that a specific SQL statement returns a specific number of rows, and those assertions will fail because of the change of input table. The other concern is that this solution requires a lot of auditing and changes in every SQL file. Developers would need to understand how to add a new row to the single table-creation file and how the SQL files under the same module work together. For example, if someone wants to add a test case to HITS under graph, instead of adding a new case in the hits SQL file, they need to go to another file to add the table, then go back to the hits SQL file to add the test case. That sounds a bit confusing and inconvenient to me. ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/325 > @orhankislal Since this is a really big change, could you please add more info in the commit on why this change is necessary and the best available solution? +1 ---
[GitHub] madlib pull request #323: Build: Add single quote while setting AppendOnly g...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/323 Build: Add single quote while setting AppendOnly guc Commit 3db98babe3326fb5e2cd16d0639a2bef264f4b04 added a context manager for setting appendonly to false for all MADlib modules. The commit was missing quotes around the `gp_default_storage_options` GUC value, because of which the SET command always failed. This commit adds the quotes while setting the GUC. Co-authored-by: Nikhil Kak You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib test_ao_table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/323.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #323 commit ebf528f413fc94d9fdeb40045011cb1b374bf12b Author: Jingyi Mei Date: 2018-09-24T18:23:55Z Build: Add single quote while setting AppendOnly guc Commit #3db98babe3326fb5e2cd16d0639a2bef264f4b04 added a context manager for setting appendonly to false for all MADlib modules. The commit was missing quotes around the `gp_default_storage_options` GUC value, because of which the SET command always failed. This commit adds the quotes while setting the GUC. Co-authored-by: Nikhil Kak commit b4aabdd1ef0deec4ec89e7acb8dd0c51692a97c9 Author: Jingyi Mei Date: 2018-09-24T18:48:51Z Build: Remove primary key constraint in IC/DC Some of the install-check/dev-check tests were setting primary key constraints. If the user sets the GUC gp_default_storage_options='appendonly=true', IC/DC will fail during table creation because append-only tables don't support primary keys. This commit removes these constraints since they are unnecessary for those test cases. Co-authored-by: Nikhil Kak ---
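The quoting bug described above can be illustrated with a small sketch (illustrative names, not MADlib's actual code): because the GUC value itself contains `=`, the generated `SET` command is invalid unless the value is wrapped in single quotes.

```python
# Sketch of the fix: wrap the GUC value in single quotes when building
# the SET command. Without the quotes, a value like appendonly=false
# makes the statement invalid and the SET always fails.
def build_set_guc(guc_name, guc_value):
    # Single quotes around the value are required for values containing '='
    return "SET {0}='{1}';".format(guc_name, guc_value)

cmd = build_set_guc('gp_default_storage_options', 'appendonly=false')
print(cmd)  # SET gp_default_storage_options='appendonly=false';
```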
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/301#discussion_r205911065 --- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in --- @@ -273,7 +274,7 @@ SELECT tree_train('dt_golf'::text, -- source table 'train_output'::text,-- output model table 'id'::text, -- id column 'temperature::double precision'::text, -- response - 'humidity, windy, "Cont_features"'::text, -- features + 'humidity, windy2, "Cont_features"'::text, -- features --- End diff -- We can leave a comment here explaining why we use windy2, whose values are all 'false'. ---
[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/301#discussion_r205910952 --- Diff: src/ports/postgres/modules/recursive_partitioning/test/decision_tree.sql_in --- @@ -282,13 +283,16 @@ SELECT tree_train('dt_golf'::text, -- source table 6::integer,-- min split 2::integer,-- min bucket 3::integer,-- number of bins per continuous variable - 'cp=0.01, n_folds=2' -- cost-complexity pruning parameter + 'cp=0.01, n_folds=2', -- cost-complexity pruning parameter + 'null_as_category=True' ); SELECT _print_decision_tree(tree) from train_output; SELECT tree_display('train_output', False); SELECT impurity_var_importance FROM train_output; +SELECT cat_levels_in_text, cat_n_levels, impurity_var_importance FROM train_output; --- End diff -- Is it better to assert cat_n_levels == {1} instead of just querying? ---
[GitHub] madlib pull request #298: misc 1.15 user doc updates
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/298#discussion_r205301880 --- Diff: doc/mainpage.dox.in --- @@ -100,13 +86,14 @@ complete matrix stored as a distributed table. @defgroup grp_matrix Matrix Operations @defgroup grp_matrix_factorization Matrix Factorization -@brief Matrix Factorization methods including Singular Value Decomposition and Low-rank Matrix Factorization +Linear algebra methods that factorize a matrix --- End diff -- Is it possible to keep two copies here, one with @brief and the other without, so that they show up in both places on the webpage? ---
[GitHub] madlib issue #298: misc 1.15 user doc updates
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/298 Jenkins OK to test ---
[GitHub] madlib pull request #:
Github user jingyimei commented on the pull request: https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29832374 In src/madpack/madpack.py on line 973: Makes sense. ---
[GitHub] madlib pull request #:
Github user jingyimei commented on the pull request: https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29823247 In src/madpack/madpack.py on line 973: Instead of 'versions do not match', it would be clearer to mention that the schema doesn't exist or MADlib is not installed. ---
[GitHub] madlib issue #294: Pagerank: Remove duplicate entries from grouping output
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/294 The current master has another issue. When you run:
```
DROP TABLE IF EXISTS vertex, "EDGE";
CREATE TABLE vertex( id INTEGER );
CREATE TABLE "EDGE"(
    src INTEGER,
    dest INTEGER,
    user_id INTEGER
);
INSERT INTO vertex VALUES (0), (1), (2);
INSERT INTO "EDGE" VALUES (0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 1, 1), (0, 1, 2);
DROP TABLE IF EXISTS pagerank_ppr_grp_out;
DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
SELECT pagerank(
    'vertex',               -- Vertex table
    'id',                   -- Vertex id column
    '"EDGE"',               -- "EDGE" table
    'src=src, dest=dest',   -- "EDGE" args
    'pagerank_ppr_grp_out', -- Output table of PageRank
    NULL,                   -- Default damping factor (0.85)
    NULL,                   -- Default max iters (100)
    NULL,                   -- Default Threshold
    'user_id');
```
you will get the following result:
```
madlib=# select * from pagerank_ppr_grp_out order by user_id, id;
 user_id | id |     pagerank
---------+----+-------------------
       1 |  0 |              0.05
       1 |  0 |              0.05
       1 |  1 | 0.614906399170753
       1 |  2 | 0.614906399170753
       2 |  0 |             0.075
       2 |  1 |           0.13875
(6 rows)
```
where for user_id=1 the pagerank scores don't sum to 1, as they should. This PR actually fixes this issue and gives the right numbers. However, dev check didn't have a case to catch this issue before; I suggest adding this corner case to dev check to guard future changes. ---
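The invariant the comment relies on can be sketched as a dev-check-style assertion in plain Python (a power-iteration sketch, not MADlib's implementation): within one group, PageRank scores should sum to 1. The graph below is the user_id=1 subgraph from the example.

```python
# Plain power-iteration PageRank; assumes every node has at least one
# out-edge (true for the user_id=1 subgraph: 0->1, 0->2, 1->2, 2->1).
def pagerank(nodes, edges, d=0.85, iters=100):
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_deg = {v: sum(1 for s, _ in edges if s == v) for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            incoming = sum(rank[s] / out_deg[s] for s, t in edges if t == v)
            new[v] = (1.0 - d) / n + d * incoming
        rank = new
    return rank

r = pagerank([0, 1, 2], [(0, 1), (0, 2), (1, 2), (2, 1)])
assert abs(sum(r.values()) - 1.0) < 1e-6  # scores in a group must sum to 1
assert abs(r[0] - 0.05) < 1e-9            # node 0 has no in-links: (1-d)/n
```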
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201489373 --- Diff: src/modules/recursive_partitioning/DT_impl.hpp --- @@ -1512,6 +1512,9 @@ DecisionTree::computeVariableImportance( uint64_t maj_count = getMajorityCount(node_index); uint64_t min_count = node_count - maj_count; +if (min_count == 0) { --- End diff -- We can leave a comment here so we can address it later. ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201501719 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -1291,38 +1300,64 @@ def _create_group_table( schema_madlib, output_table_name, oob_error_table, importance_table, cat_features_info_table, grp_key_to_grp_cols, grouping_cols, tree_terminated): -""" Ceate the group table for random forest""" +""" Create the group table for random forest""" + +cat_var_importance_str = '' +con_var_importance_str = '' +impurity_var_importance_str = '' +left_join_importance_table_str = '' +join_impurity_table_str = '' + +if importance_table: +impurity_var_importance_table_name = unique_string(desp='impurity') +plpy.execute(""" +CREATE TEMP TABLE {impurity_var_importance_table_name} AS +SELECT +gid, +{schema_madlib}.array_avg(impurity_var_importance, False) AS impurity_var_importance +FROM {output_table_name} +GROUP BY gid +""".format(**locals())) + +cat_var_importance_str = ", cat_var_importance AS oob_cat_var_importance," +con_var_importance_str = "con_var_importance AS oob_con_var_importance," +impurity_var_importance_str = "impurity_var_importance" +left_join_importance_table_str = """LEFT OUTER JOIN {importance_table} +USING (gid)""".format(importance_table=importance_table) +join_impurity_table_str = """JOIN {impurity_var_importance_table_name} USING (gid)""".format(impurity_var_importance_table_name=impurity_var_importance_table_name) + grouping_cols_str = ('' if grouping_cols is None else grouping_cols + ",") group_table_name = add_postfix(output_table_name, "_group") + sql_create_group_table = """ CREATE TABLE {group_table_name} AS SELECT gid, {grouping_cols_str} -grp_finished as success, +grp_finished AS success, cat_n_levels, cat_levels_in_text, -oob_error, -cat_var_importance, -con_var_importance +oob_error +{cat_var_importance_str} +{con_var_importance_str} +{impurity_var_importance_str} FROM {oob_error_table} 
JOIN {grp_key_to_grp_cols} USING (gid) JOIN ( SELECT -unnest($1) as grp_key, -unnest($2) as grp_finished +unnest($1) AS grp_key, +unnest($2) AS grp_finished ) tree_terminated USING (grp_key) JOIN {cat_features_info_table} USING (gid) -LEFT OUTER JOIN -{importance_table} -USING (gid) +{left_join_importance_table_str} --- End diff -- We can name it oob_importance_table_* to explicitly distinguish it from impurity importance table ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/289#discussion_r201462356 --- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.py_in --- @@ -616,23 +628,20 @@ def forest_train( _calculate_oob_error(schema_madlib, oob_prediction_table, oob_error_table, id_col_name, is_classification) - -importance_table = unique_string() -sql_create_empty_imp_tbl = """ -CREATE TEMP TABLE {importance_table} -( -gid integer, -cat_var_importance float8[], -con_var_importance float8[] -); -""".format(**locals()) - plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl) -plpy.execute(sql_create_empty_imp_tbl) - -# we populate the importance_table only if variable importance is to be -# calculated, otherwise we use an empty table which will be used later -# for an outer join. +importance_table = '' if importance: +importance_table = unique_string() +sql_create_empty_imp_tbl = """ +CREATE TEMP TABLE {importance_table} +( +gid integer, +cat_var_importance float8[], +con_var_importance float8[] +); +""".format(**locals()) + plpy.notice("sql_create_empty_imp_tbl:\n"+sql_create_empty_imp_tbl) --- End diff -- Do we need this notice? ---
[GitHub] madlib pull request #285: Madpack: fix install/reinstall not giving proper e...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/285 Madpack: fix install/reinstall not giving proper error message Previously, uninstalling or reinstalling on a database that does not have MADlib installed failed as expected; however, the info messages did not mention this failure, and madpack kept going to prepare database objects for installation. This commit fixes the issue: if there is no MADlib installed in the database, madpack reports that nothing was found to uninstall/reinstall and stops. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bug_reinstall Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/285.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #285 commit 677e76d202f620b71159367f1a1fd3aaab5513fc Author: Jingyi Mei Date: 2018-07-02T00:49:24Z Madpack: fix install/reinstall not giving proper error message Previously, uninstalling or reinstalling on a database that does not have MADlib installed failed as expected; however, the info messages did not mention this failure, and madpack kept going to prepare database objects for installation. This commit fixes the issue: if there is no MADlib installed in the database, madpack reports that nothing was found to uninstall/reinstall and stops. ---
[GitHub] madlib pull request #273: Minibatch Preprocessing: fix dependent var with sp...
Github user jingyimei closed the pull request at: https://github.com/apache/madlib/pull/273 ---
[GitHub] madlib issue #273: Minibatch Preprocessing: fix dependent var with special c...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/273 Will handle those in another PR, closing this one. ---
[GitHub] madlib issue #274: Handling special characters in MLP and Encode Categorical...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/274 Will handle those in another PR, closing this one. ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei closed the pull request at: https://github.com/apache/madlib/pull/274 ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193525716 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- Agree ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193503140 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- This sounds a bit more obscure, but there will always be some corner cases we can't cover with this approach. ---
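The approach discussed in the diff above can be sketched as a standalone helper (a hypothetical simplification of `py_list_to_sql_string`; the real function lives in `utilities.py_in` and handles more cases):

```python
def py_list_to_sql_string(array, array_type, long_format=False):
    """Render a Python list as a SQL array expression.

    For character array types, elements are joined with the unlikely
    delimiter ',,,' and re-split server-side via string_to_array, so
    embedded commas, quotes, or '$' in elements survive intact.
    """
    if not array:
        return "'{{ }}'::{0}".format(array_type)
    if long_format:
        # Non-character types: a plain ARRAY[...] constructor suffices.
        return "ARRAY[ {0} ]::{1}".format(','.join(map(str, array)), array_type)
    # Character types: dollar-quote the payload and split on ',,,'.
    return "string_to_array($${0}$$, ',,,')::{1}".format(
        ',,,'.join(map(str, array)), array_type)
```

For example, `py_list_to_sql_string(["M,M", "M$M"], "text[]")` produces `string_to_array($$M,M,,,M$M$$, ',,,')::text[]`, which Postgres splits back into the two original elements despite the embedded comma.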
[GitHub] madlib pull request #273: Minibatch Preprocessing: fix dependent var with sp...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/273 Minibatch Preprocessing: fix dependent var with special character JIRA:MADLIB-1237 Previously, minibatch preprocessing would error out when the specified dependent variable has special characters within its values. We fixed this in two places: 1. in the query with the WHERE condition, we use $$ to quote the string instead of ' ' when testing string equality. 2. in the query creating an array column, instead of using SELECT '{ele'with*special_char, 'M,M', 'M$M'}'::text[], we call SELECT string_to_array(''ele'with*special_char', 'M"M', 'M$M'', ',')::text[] Install check test cases are also updated. Co-Authored-by: Jingyi Mei <j...@pivotal.io> You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bug_minibatch_preprocessor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/273.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #273 commit 3a55cafd7fad1e87996b4584f7e0431529da7a52 Author: Jingyi Mei <jmei@...> Date: 2018-05-23T23:29:54Z Minibatch Preprocessing: fix dependent var with special character JIRA:MADLIB-1237 Previously, minibatch preprocessing would error out when the specified dependent variable has special characters within its values. We fixed this in two places: 1. in the query with the WHERE condition, we use $$ to quote the string instead of ' ' when testing string equality. 2. in the query creating an array column, instead of using SELECT '{ele'with*special_char, 'M,M', 'M$M'}'::text[], we call SELECT string_to_array(''ele'with*special_char', 'M"M', 'M$M'', ',')::text[] Install check test cases are also updated. Co-Authored-by: Jingyi Mei <j...@pivotal.io> ---
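Fix (1) above can be illustrated with a small hypothetical helper (not the actual PR code, which edits the preprocessor queries in place): dollar quoting lets a literal containing single quotes pass through without escaping, with the caveat that the value itself must not contain the `$$` sequence.

```python
def dollar_quoted_equality(column, value):
    """Build a WHERE predicate comparing a column to a literal that may
    contain single quotes or other special characters, using Postgres
    dollar quoting ($$...$$) instead of single-quoted literals.
    Hypothetical illustration; assumes value does not itself contain $$.
    """
    return "{0} = $${1}$$".format(column, value)
```

For example, `dollar_quoted_equality("dep_var", "ele'with*special_char")` yields `dep_var = $$ele'with*special_char$$`, which is valid SQL, whereas naive single quoting would break on the embedded `'`.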
[GitHub] madlib issue #265: Release: Add v1.14 release notes
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/265 LGTM ---
[GitHub] madlib pull request #265: Release: Add v1.14 release notes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/265#discussion_r182843994 --- Diff: RELEASE_NOTES --- @@ -9,6 +9,56 @@ commit history located at https://github.com/apache/madlib/commits/master. Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB. â- +MADlib v1.14: + +Release Date: 2018-April-28 + +New features: +* New module - Balanced datasets: A sampling module to balance classification +datasets by resampling using various techniques including undersampling, +oversampling, uniform sampling or user-defined proportion sampling +(MADLIB-1168) +* Mini-batch: Added a mini-batch optimizer for MLP and a preprocessor function +necessary to create batches from the data (MADLIB-1200, MADLIB-1206) +* k-NN: Added weighted averaging/voting by distance (MADLIB-1181) +* Summary: Added additional stats: number of positive, negative, zero values and +95% confidence intervals for the mean (MADLIB-1167) +* Encode categorical: Updated to produce lower-case column names when possible +(MADLIB-1202) +* MLP: Added support for already one-hot encoded categorical dependent variable +in a classification task (MADLIB-1222) +* Pagerank: Added option for personalized vertices that allows higher weightage +for a subset of vertices which will have a higher jump probability as +compared to other vertices and a random surfer is more likely to +jump to these personalization vertices (MADLIB-1084) + +Bug fixes: +- Fixed issue with invalid calls of construct_array that led to problems +in Postgresql 10 (MADLIB-1185) +- Added newline between file concatenation during PGXN install (MADLIB-1194) +- Fixed upgrade issues in knn (MADLIB-1197) +- Added fix to ensure RF variable importance are always non-negative +- Fixed inconsistency in LDA output and improved usability +(MADLIB-1160, MADLIB-1201) +- Fixed MLP and RF predict for models trained in earlier versions to +ensure misisng optional parameters are given 
appropriate default values +(MADLIB-1207) +- Fixed a scenario in DT where no features exist due categorical columns +with single level being dropped led to the database crashing +- Fixed step size initialization in MLP based on learning rate policy +(MADLIB-1212) +- Fixed PCA issue that leads to failure when grouping column is a TEXT type +(MADLIB-1215) +- Fixed cat levels output in DT when grouping is enabled (MADLIB-1218) +- Fixed and simplified initialization of model coefficients in MLP +- Removed source table dependency for predicting regression models in MLP +(MADLIB-1223) +- Print loss of first iteration in MLP (MADLIB-1228) + --- End diff -- We should mention MADLIB-1209 Neural net related bug fix. ---
[GitHub] madlib pull request #256: Minibatch Preprocessing: change default buffer siz...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/256 Minibatch Preprocessing: change default buffer size formula for grouping This commit changes the previous calculation formula for the default buffer size. Previously, we used num_rows_processed/num_of_segments to estimate the data distribution in each segment. To adjust this to a grouping scenario, we use avg_num_rows_processed/num_of_segments to estimate the data distribution when there is more than one group of data. The other code changes follow from this change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/minibatch-preprocessing-default_buffer_size Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #256 commit b33b00504ef46c7084b60f9c54d8f6797660542b Author: Jingyi Mei <jmei@...> Date: 2018-04-04T00:50:57Z Minibatch Preprocessing: change default buffer size formula to fit grouping This commit changes the previous calculation formula for the default buffer size. Previously, we used num_rows_processed/num_of_segments to estimate the data distribution in each segment. To adjust this to a grouping scenario, we use avg_num_rows_processed/num_of_segments to estimate the data distribution when there is more than one group of data. The other code changes follow from this change. ---
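The revised estimate described in the commit message can be sketched as follows (a hypothetical standalone version; the real computation lives in the minibatch preprocessor module and accounts for additional factors):

```python
import math

def default_buffer_size(rows_per_group, num_segments):
    """Estimate the default minibatch buffer size under grouping.

    With more than one group, the average number of rows processed per
    group, divided by the number of segments, approximates how many
    rows of a given group land on each segment.
    """
    avg_num_rows_processed = sum(rows_per_group) / float(len(rows_per_group))
    return int(math.ceil(avg_num_rows_processed / num_segments))
```

With a single group this degenerates to the old `num_rows_processed / num_of_segments` formula, which is why the change is backward compatible for ungrouped data.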
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177916814 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, ) FROM pagerank_gr_out WHERE user_id=2; +-- Tests for Personalized Page Rank + +-- Test without grouping + +DROP TABLE IF EXISTS pagerank_ppr_out; +DROP TABLE IF EXISTS pagerank_ppr_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- Grouping column +'{1,3}'); -- Personlized Nodes + + +-- View the PageRank of all vertices, sorted by their scores. +SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124, --- End diff -- Is this 0.00124 based on current test result? Can we make it smaller? ---
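The install-check assertion pattern quoted above reduces to the following (assumed semantics of the `relative_error` helper; the tolerance 0.00124 is the value under discussion in the review comment):

```python
def relative_error(value, expected):
    # Relative deviation of an aggregate from its expected value,
    # matching the assumed semantics of the install-check helper.
    return abs(value - expected) / abs(expected)

# PageRank scores over all vertices should sum to (approximately) 1;
# the install-check above accepts a relative error below 0.00124.
scores = [0.4, 0.35, 0.2, 0.0495]
assert relative_error(sum(scores), 1) < 0.00124
```

Tightening the tolerance, as the comment suggests, simply means the scores must sum closer to 1 before the test passes.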
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177899442 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( +**locals()) +random_prob_grp = 1.0 - damping_factor +init_prob_grp = 1.0 / len(nodes_of_interest) +else: +random_prob_grp = """{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION + """.format(**locals()) +init_prob_grp = """1/COUNT(__vertices__)::DOUBLE PRECISION""".format( +**locals()) + plpy.execute("DROP TABLE IF EXISTS {0}".format(vertices_per_group)) plpy.execute("""CREATE TEMP TABLE {vertices_per_group} AS SELECT {distinct_grp_table}.*, -1/COUNT(__vertices__)::DOUBLE PRECISION AS {init_pr}, -{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION -AS {random_prob} +{init_prob_grp} AS {init_pr}, +{random_prob_grp} as {random_prob} FROM {distinct_grp_table} INNER JOIN ( SELECT {grouping_cols}, {src} AS __vertices__ FROM {edge_temp_table} UNION SELECT {grouping_cols}, {dest} FROM {edge_temp_table} ){subq} -ON {grouping_where_clause} +ON {grouping_where_clause} {where_clause_ppr} --- End diff -- put {where_clause_ppr} in a new line ---
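The diff above sets `init_prob_grp = 1.0 / len(nodes_of_interest)` and `random_prob_grp = 1.0 - damping_factor` for the personalized case. Those probabilities correspond to the standard personalized-PageRank iteration, sketched here as a minimal in-memory version (not the module's SQL implementation):

```python
def personalized_pagerank(edges, nodes_of_interest, damping=0.85, iters=100):
    """Power iteration for personalized PageRank on a dict-of-lists graph.

    Mirrors the probabilities in the diff above: the initial rank is
    1/len(nodes_of_interest) on each personalization node, and the
    random-jump mass (1 - damping) is redistributed only over those
    nodes rather than over all vertices.
    """
    vertices = set(edges)
    for dests in edges.values():
        vertices.update(dests)
    ppr_nodes = set(nodes_of_interest)
    init = 1.0 / len(ppr_nodes)
    rank = {v: (init if v in ppr_nodes else 0.0) for v in vertices}
    jump = (1.0 - damping) / len(ppr_nodes)
    for _ in range(iters):
        nxt = {v: (jump if v in ppr_nodes else 0.0) for v in vertices}
        for src, dests in edges.items():
            if dests:
                # Each out-neighbor receives an equal share of the
                # damped rank of src.
                share = damping * rank[src] / len(dests)
                for d in dests:
                    nxt[d] += share
        rank = nxt
    return rank
```

Because random jumps land only on the personalization nodes, vertices near those nodes accumulate more rank, while total mass still sums to 1 (assuming every vertex has at least one outgoing edge).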
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177912288 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +615,55 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) --- End diff -- This "personalized_nodes" table is never created beforehand ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177897977 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( +**locals()) +random_prob_grp = 1.0 - damping_factor +init_prob_grp = 1.0 / len(nodes_of_interest) --- End diff -- len(nodes_of_interest) == total_ppr_nodes ? so that we don't need to run O(n) again ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177910146 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( --- End diff -- After consulting with QP, `__vertices__ = ANY(ARRAY{nodes_of_interest})` works exactly the same as `__vertices__ in (nodes_of_interest)`, which may look simpler. Besides, since we use this condition in multiple places, I am wondering if a join clause is faster: we could create a temp table that stores the special node ids and join it with the vertex table by vertex id. QP suggested trying both and seeing which one runs faster. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177851780 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,62 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". 
format(damping_factor)) - -def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, - out_table, damping_factor, max_iter, threshold, grouping_cols, **kwargs): +# Validate against the givin set of nodes for Personalized Page Rank +if nodes_of_interest: +nodes_of_interest_count = len(nodes_of_interest) +vertices_count = plpy.execute(""" + SELECT count(DISTINCT({vertex_id})) AS cnt FROM {vertex_table} + WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest}) + """.format(**locals()))[0]["cnt"] +# Check to see if the given set of nodes exist in vertex table +if vertices_count != len(nodes_of_interest): +plpy.error("PageRank: Invalid value for {0}, must be a subset of the vertex_table".format( --- End diff -- This query tests several invalid scenarios, including duplicate nodes in nodes_of_interest, in the error msg maybe we can say "Invalid value for {0}, must be a subset of the vertex_table without duplicate nodes". ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177894976 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: --- End diff -- `if nodes_of_interest:` or `if total_ppr_nodes > 0:` ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177915601 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -647,6 +778,26 @@ SELECT * FROM pagerank_out ORDER BY user_id, pagerank DESC; -- View the summary table to find the number of iterations required for -- convergence for each group. SELECT * FROM pagerank_out_summary; + +-- Compute the Personalized PageRank: +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertix id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimted string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping --- End diff -- move those NULLs one space left ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177914251 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +186,37 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, --- End diff -- Is it better to check `if nodes_of_interest` before calling get_query_params_for_ppr instead of checking it in get_query_params_for_ppr? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177914961 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -551,14 +680,16 @@ def pagerank_help(schema_madlib, message, **kwargs): message.lower() in ("usage", "help", "?"): help_string = "Get from method below" help_string = get_graph_usage(schema_madlib, 'PageRank', -"""out_table TEXT, -- Name of the output table for PageRank + """out_table TEXT, -- Name of the output table for PageRank damping_factor DOUBLE PRECISION, -- Damping factor in random surfer model -- (DEFAULT = 0.85) max_iter INTEGER, -- Maximum iteration number (DEFAULT = 100) threshold DOUBLE PRECISION, -- Stopping criteria (DEFAULT = 1/(N*1000), -- N is number of vertices in the graph) -grouping_col TEXT -- Comma separated column names to group on +grouping_col TEXT, -- Comma separated column names to group on -- (DEFAULT = NULL, no grouping) +nodes_of_interest ARRAY OF INTEGER -- A comma seperated list of vertices + or nodes for personalized page rank. """) + """ --- End diff -- indent left side, and indent comment(--) right ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177892625 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,62 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". 
format(damping_factor)) - -def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, - out_table, damping_factor, max_iter, threshold, grouping_cols, **kwargs): +# Validate against the givin set of nodes for Personalized Page Rank +if nodes_of_interest: +nodes_of_interest_count = len(nodes_of_interest) +vertices_count = plpy.execute(""" + SELECT count(DISTINCT({vertex_id})) AS cnt FROM {vertex_table} + WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest}) + """.format(**locals()))[0]["cnt"] +# Check to see if the given set of nodes exist in vertex table +if vertices_count != len(nodes_of_interest): +plpy.error("PageRank: Invalid value for {0}, must be a subset of the vertex_table".format( +nodes_of_interest)) +# Validate given set of nodes against each user group. +# If all the given nodes are not present in the user group +# then throw an error. +if grouping_cols_list: +missing_user_grps = '' +grp_by_column = get_table_qualified_col_str( +edge_table, grouping_cols_list) +grps_without_nodes = plpy.execute(""" + SELECT {grp_by_column} FROM {edge_table} + WHERE src = ANY(ARRAY{nodes_of_interest}) group by {grp_by_column} + having count(DISTINCT(src)) != {nodes_of_interest_count} + """.format(**locals())) +for row in range(grps_without_nodes.nrows()): +missing_user_grps += str(grps_without_nodes[row]['user_id']) +if row < grps_without_nodes.nrows() - 1: +missing_user_grps += ' ,' +if grps_without_nodes.nrows() > 0: +plpy.error("Nodes for Personalizaed Page Rank are missing from these groups: {0} ".format( +missing_user_grps)) + --- End diff -- Here some similar things are test twice - when `if nodes_of_interest`, there is a `count` operation in line 73 and in line 77 there is one test(this is for without grouping). Then when `if grouping_cols_list`, another `count` and `compare` happen in line 90 per group. There might be a way to simplify the logic here so that for grouping, we don't need to do it twice. 
Besides, if the above query really slows down performance, I would consider something simpler: instead of listing the groups that are missing special nodes, just emit a warning (optional, depending on how expensive the query is). ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177916983 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, ) FROM pagerank_gr_out WHERE user_id=2; +-- Tests for Personalized Page Rank + +-- Test without grouping + +DROP TABLE IF EXISTS pagerank_ppr_out; +DROP TABLE IF EXISTS pagerank_ppr_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- Grouping column +'{1,3}'); -- Personlized Nodes + + +-- View the PageRank of all vertices, sorted by their scores. +SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124, +'PageRank: Scores do not sum up to 1.' +) FROM pagerank_ppr_out; + + +-- Test with grouping + +DROP TABLE IF EXISTS pagerank_ppr_grp_out; +DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_grp_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + 'user_id', -- Grouping column +'{1,3}'); -- Personlized Nodes + +SELECT assert(count(*) = 14, 'Tuple count for Pagerank out table != 14') FROM pagerank_ppr_grp_out; --- End diff -- can we do similar assertion here by group? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177917620 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id; (2 rows) +-# Example of Personalized Page Rank with Nodes {2,4} + +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertix id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimted string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping + '{2,4}'); -- Personlized Nodes --- End diff -- Great ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177915929 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -66,7 +66,12 @@ SELECT pagerank( 'id',-- Vertix id column '"EDGE"', -- "EDGE" table 'src=src, dest=dest', -- "EDGE" args - 'pagerank_out'); -- Output table of PageRank + 'pagerank_out',-- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- No Grouping + NULL); -- Personlized Nodes --- End diff -- In this case, we can remove the last 5 NULLs since they are all optional. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177893734 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -122,12 +158,13 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, grouping_where_clause = '' group_by_clause = '' random_prob = '' +ppr_join_clause = '' edge_temp_table = unique_string(desp='temp_edge') grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else "DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, dest)) +grouping_cols_comma, dest)) --- End diff -- maybe indent with the above line, or move the above line backwards to the current place ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177917195 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, + edge_temp_table, vertex_table, cur_distribution, + personalized_nodes) +total_ppr_nodes = ppr_params[0] +random_jump_prob_ppr = ppr_params[1] +ppr_join_clause = ppr_params[2] + random_probability = (1.0 - damping_factor) / n_vertices +if total_ppr_nodes > 0: +random_jump_prob = random_jump_prob_ppr +else: +random_jump_prob = random_probability --- End diff -- Got it. ---
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175957289 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in --- @@ -0,0 +1,559 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ + +""" +@file minibatch_preprocessing.py_in + +""" +from math import ceil +import plpy + +from utilities import add_postfix +from utilities import _assert +from utilities import get_seg_number +from utilities import is_platform_pg +from utilities import is_psql_numeric_type +from utilities import is_string_formatted_as_array_expression +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from utilities import _string_to_array +from utilities import validate_module_input_params +from mean_std_dev_calculator import MeanStdDevCalculator +from validate_args import get_expr_type +from validate_args import output_tbl_valid +from validate_args import _tbl_dimension_rownum + +m4_changequote(`') + +# These are readonly variables, do not modify +MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname" +MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname" + +class MiniBatchPreProcessor: +""" +This class is responsible for executing the main logic of mini batch +preprocessing, which packs multiple rows of selected columns from the +source table into one row based on the buffer size +""" +def __init__(self, schema_madlib, source_table, output_table, + dependent_varname, independent_varname, buffer_size, **kwargs): +self.schema_madlib = schema_madlib +self.source_table = source_table +self.output_table = output_table +self.dependent_varname = dependent_varname +self.independent_varname = independent_varname +self.buffer_size = buffer_size + +self.module_name = "minibatch_preprocessor" +self.output_standardization_table = add_postfix(self.output_table, + "_standardization") +self.output_summary_table = add_postfix(self.output_table, "_summary") +self._validate_minibatch_preprocessor_params() + +def minibatch_preprocessor(self): +# Get array expressions for both dep and indep variables from the +# MiniBatchQueryFormatter class +dependent_var_dbtype = get_expr_type(self.dependent_varname, + self.source_table) +qry_formatter = 
MiniBatchQueryFormatter(self.source_table) +dep_var_array_str, dep_var_classes_str = qry_formatter.\ +get_dep_var_array_and_classes(self.dependent_varname, + dependent_var_dbtype) +indep_var_array_str = qry_formatter.get_indep_var_array_str( + self.independent_varname) + +standardizer = MiniBatchStandardizer(self.schema_madlib, + self.source_table, + dep_var_array_str, + indep_var_array_str, + self.output_standardization_table) +standardize_query = standardizer.get_query_for_standardizing() + +num_rows_processed, num_missing_rows_skipped = self.\ + _get_skipped_rows_processed_count( +dep_var_array_str, +indep_var_array_str) +calculated_buffer_size = MiniBatchBufferSizeCalculator.\ +
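The core idea of the preprocessor described above — packing multiple source rows into one row based on the buffer size — can be sketched in plain Python. This is an illustrative helper, not the MADlib implementation; the function name and row layout are assumptions for the example:

```python
from math import ceil

def pack_minibatches(rows, buffer_size):
    """Pack rows (each a (dependent, independent) pair) into buffers of at
    most `buffer_size` rows, mirroring the preprocessor's row packing."""
    n_buffers = int(ceil(len(rows) / float(buffer_size)))
    batches = []
    for i in range(n_buffers):
        chunk = rows[i * buffer_size:(i + 1) * buffer_size]
        dep = [r[0] for r in chunk]    # packed dependent_varname column
        indep = [r[1] for r in chunk]  # packed independent_varname column
        batches.append((dep, indep))
    return batches

rows = [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (0, [5.0, 6.0])]
batches = pack_minibatches(rows, buffer_size=2)
```

The last buffer simply holds whatever rows remain, so it may be smaller than `buffer_size`.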
[GitHub] madlib issue #240: MLP: Fix step size initialization based on learning rate ...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/240 We manually tested the bug fix by printing the step_size and step_size_init for each iteration and then calling mlp with all the possible policies. step_size was updated as expected. +1 for the changes ---
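For context, the fix concerns deriving the current step size from step_size_init under each learning-rate policy rather than compounding an in-place value. A hedged sketch of typical decay policies follows; the policy names and formulas are illustrative of common schedules, not a copy of MADlib's exact code:

```python
def step_size_at(policy, step_size_init, iteration,
                 gamma=0.1, power=0.5, cycle=10):
    """Illustrative learning-rate schedules; each one derives the current
    step size from step_size_init and the iteration counter."""
    if policy == 'constant':
        return step_size_init
    if policy == 'exp':    # exponential decay
        return step_size_init * gamma ** iteration
    if policy == 'inv':    # inverse (polynomial) decay
        return step_size_init * (iteration + 1) ** (-power)
    if policy == 'step':   # piecewise-constant decay every `cycle` iterations
        return step_size_init * gamma ** (iteration // cycle)
    raise ValueError("unknown policy: %s" % policy)
```

Printing `step_size_at(policy, init, i)` for each iteration, as the reviewers did, makes it easy to confirm the schedule behaves as expected for every policy.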
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175663431 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, + edge_temp_table, vertex_table, cur_distribution, + personalized_nodes) +total_ppr_nodes = ppr_params[0] +random_jump_prob_ppr = ppr_params[1] +ppr_join_clause = ppr_params[2] + random_probability = (1.0 - damping_factor) / n_vertices +if total_ppr_nodes > 0: +random_jump_prob = random_jump_prob_ppr +else: +random_jump_prob = random_probability --- End diff -- Can move (1.0 - damping_factor) / n_vertices here since random_probability is not used anywhere else. ---
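The suggestion above amounts to inlining an expression at its only use site. A minimal Python sketch of the before/after shape (the helper function is hypothetical, for illustration only):

```python
# Before: an intermediate variable, used exactly once.
#   random_probability = (1.0 - damping_factor) / n_vertices
#   random_jump_prob = random_jump_prob_ppr if total_ppr_nodes > 0 \
#                      else random_probability

# After: the expression is inlined at its single use site.
def choose_random_jump_prob(total_ppr_nodes, random_jump_prob_ppr,
                            damping_factor, n_vertices):
    if total_ppr_nodes > 0:
        return random_jump_prob_ppr
    return (1.0 - damping_factor) / n_vertices
```

Inlining removes a name the reader would otherwise have to track, at no cost when the value has no other consumers.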
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175664342 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -84,7 +89,8 @@ SELECT pagerank( NULL, NULL, NULL, - 'user_id'); + 'user_id', + NULL); -- View the PageRank of all vertices, sorted by their scores. SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, --- End diff -- We may need at least two test cases with nodes_of_interest not null and with/without grouping. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175627510 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. 
+ + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' + +if nodes_of_interest: +total_ppr_nodes = len(nodes_of_interest) +init_value_ppr_nodes = 1.0 / total_ppr_nodes +# Create a Temp table that holds the Inital probabilities for the +# user provided nodes +plpy.execute(""" +CREATE TEMP TABLE {personalized_nodes} AS +SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank +FROM {vertex_table} where {vertex_id} = ANY(ARRAY{nodes_of_interest}) +{cur_distribution} +""".format(**locals())) +ppr_join_clause = """ LEFT JOIN {personalized_nodes} on +{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals()) +prob_value = 1.0 - damping_factor + +# In case of PPR, Assign the Random jump probability to the nodes_of_interest only. +# For rest of the nodes, Random jump probability will be zero. +random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest}) +THEN {prob_value} +ELSE 0 +END """.format(**locals()) +return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause) + + def pagerank_help(schema_madlib, message, **kwargs): --- End diff -- We need the new parameter explanation in helper function too ---
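The arithmetic inside the helper above can be mirrored without the SQL plumbing: each node of interest starts with uniform probability 1/|S|, and the random-jump mass (1 - damping_factor) is assigned only to the nodes of interest, with zero for every other node. The following is a pure-Python sketch of that logic, not the MADlib function itself:

```python
def ppr_params(nodes_of_interest, damping_factor):
    """Mirror of the helper's arithmetic: uniform initial probability over
    the nodes of interest, and the (1 - d) random-jump mass concentrated
    on those nodes only."""
    total = len(nodes_of_interest)
    init_value = 1.0 / total                     # initial pagerank per node
    jump_prob = {v: 1.0 - damping_factor         # CASE ... THEN prob_value
                 for v in nodes_of_interest}     # (all other nodes get 0)
    return total, init_value, jump_prob

total, init_value, jump = ppr_params([2, 4], damping_factor=0.85)
```

With the default damping factor of 0.85, each of the two nodes of interest starts at 0.5 and receives a random-jump probability of 0.15, while every other vertex receives 0.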
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175665727 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data. @note Expressions are not currently supported for 'grouping_cols'. + nodes_of_interest (optional) --- End diff -- @fmcquillan99 Do we also need additional explanation of personalized pagerank somewhere in the user doc? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175631615 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id; (2 rows) +-# Example of Personalized Page Rank with Nodes {2,4} + +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertex id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimited string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping + '{2,4}'); -- Personalized Nodes --- End diff -- Another valid input for the personalized nodes array is 'ARRAY[2,4]'; this should be mentioned in an example or the user doc later. ---
[GitHub] madlib pull request #245: Reduce Install Check run time
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/245 Reduce Install Check run time To reduce the total run time of install check, we looked at the top 5 modules that take the longest and modified install check test cases. See each commit for details. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib reduce_IC_run_time Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/245.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #245 commit bfb3ef87302d80412c72f82194007f088e828974 Author: Jingyi Mei <jmei@...> Date: 2018-03-16T00:07:44Z PCA: Refactor IC for pca_project to reduce run time To reduce the run time for pca_project, we removed calls to pca_sparse_train and pca_train. Instead, we directly insert data in the sql file. Also, we renamed some output table names for simplicity. Co-authored-by: Nikhil Kak <n...@pivotal.io> commit a744b396cd7d21a43f7ac1bfede963a1bae318d9 Author: Jingyi Mei <jmei@...> Date: 2018-03-16T00:45:23Z Decision Tree: Modify IC to reduce run time 1. We use a smaller array dataset as input to one of the test cases, which reduced the decision tree IC run time by 2/3. 2. Reduce n_folds from 5 to 3 in one test case Co-authored-by: Nikhil Kak <n...@pivotal.io> commit f61bf0e88a72506e299060f3e20f062d58338ec6 Author: Jingyi Mei <jmei@...> Date: 2018-03-16T18:41:15Z Random Forest: Clean up install check This commit reorders test cases and adds comments. It also removes unnecessary casting in queries. commit 74c650ba3ef56206e86f461c00f61cb1c70fb78a Author: Jingyi Mei <jmei@...> Date: 2018-03-16T18:44:26Z Random Forest: Reduce install check time This commit changes num of trees from 100 to 10 in two test cases, so that the total run time of IC will be reduced. 
Co-authored-by: Nikhil Kak <n...@pivotal.io> commit c5b88d55a5a9eb25c5dc9ed01f546ab31c4b8f5e Author: Jngyi Mei and Nikhil Kak <jmei+nkak@...> Date: 2018-03-19T18:31:58Z Elastic Net: Remove cross validation test To reduce the Install Check run time for Elastic Net, we removed the test case with cross validation. Co-authored-by: Jingyi Mei <j...@pivotal.io> Co-authored-by: Nikhil Kak <n...@pivotal.io> ---
[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/234 Create lower case column name in encode_categorical_variables() JIRA:MADLIB-1202 The previous madlib.encode_categorical_variables() function generates column names with some capital characters, including: 1. when you specify top_values, there will be a column name with the suffix __MISC__ 2. when you set encode_nulls as True, there will be a column name with the suffix __NULL 3. when the original column is boolean type, there will be column names with the suffixes _True and _False In the above cases, users have to use double quoting to query, which is not convenient. This commit addresses this, and all three scenarios will generate column names in lower case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jingyimei/madlib encode_categorial_column_name_change Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/234.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #234 commit 4d78f9425ffb76089d30bbe85cb3e07f1050268a Author: Jingyi Mei <jmei@...> Date: 2018-02-14T00:11:39Z Create lower case column name in encode_categorical_variables() JIRA:MADLIB-1202 The previous madlib.encode_categorical_variables() function generates column names with some capital characters, including: 1. when you specify top_values, there will be a column name with the suffix __MISC__ 2. when you set encode_nulls as True, there will be a column name with the suffix __NULL 3. when the original column is boolean type, there will be column names with the suffixes _True and _False In the above cases, users have to use double quoting to query, which is not convenient. This commit addresses this, and all three scenarios will generate column names in lower case. ---
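The reason the fix matters: PostgreSQL folds unquoted identifiers to lower case, so any generated column name containing capitals forces users to double-quote it. The change boils down to lower-casing the generated suffix. A minimal sketch of the idea (the helper name and name format are hypothetical, not MADlib's actual code):

```python
def encode_value_column(col, value):
    """Generated column name for a categorical value, lower-cased so users
    can reference it without double quotes.
    e.g. ('is_member', True) -> 'is_member_true', not 'is_member_True'."""
    return "{0}_{1}".format(col, value).lower()
```

With this, a column generated for a boolean value or a `__MISC__` bucket can be queried as a plain unquoted identifier.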
[GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/232#discussion_r167717958 --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in --- @@ -74,175 +81,231 @@ tasks related to text. Flag to indicate if a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called output_table_vocabulary -(suffix added to the output_table name) and contains the +(i.e., suffix added to the output_table name) and contains the following columns: -- \c wordid: An id assignment for each word -- \c word: The word/term +- \c wordid: An id for each word. --- End diff -- We can mention that it is in alphabetical order. ---
[GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/232#discussion_r167715245 --- Diff: src/ports/postgres/modules/utilities/text_utilities.sql_in --- @@ -74,175 +81,231 @@ tasks related to text. Flag to indicate if a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called output_table_vocabulary -(suffix added to the output_table name) and contains the +(i.e., suffix added to the output_table name) and contains the following columns: -- \c wordid: An id assignment for each word -- \c word: The word/term +- \c wordid: An id for each word. +- \c word: The word/term corresponding to the id. @anchor examples @par Examples --# Prepare datasets with some example documents +-# First we create a document table with one document per row: DROP TABLE IF EXISTS documents; -CREATE TABLE documents(docid INTEGER, doc_contents TEXT); +CREATE TABLE documents(docid INT4, contents TEXT); INSERT INTO documents VALUES -(1, 'I like to eat broccoli and banana. I ate a banana and spinach smoothie for breakfast.'), -(2, 'Chinchillas and kittens are cute.'), -(3, 'My sister adopted two kittens yesterday'), -(4, 'Look at this cute hamster munching on a piece of broccoli'); +(0, 'I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.'), +(1, 'Chinchillas and kittens are cute.'), +(2, 'My sister adopted two kittens yesterday.'), +(3, 'Look at this cute hamster munching on a piece of broccoli.'); - --# Add a new column containing the words (lower-cased) in a text array +You can apply stemming, stop word removal and tokenization at this point +in order to prepare the documents for text processing. +Depending upon your database version, various tools are +available. 
Databases based on more recent versions of +PostgreSQL may do something like: + +SELECT tsvector_to_array(to_tsvector('english',contents)) from documents; + + +tsvector_to_array ++-- + {ate,banana,breakfast,broccoli,eat,like,smoothi,spinach} + {chinchilla,cute,kitten} + {adopt,kitten,sister,two,yesterday} + {broccoli,cute,hamster,look,munch,piec} +(4 rows) + +In this example, we assume a database based on an older +version of PostgreSQL and just perform basic punctuation +removal and tokenization. The array of words is added as +a new column to the documents table: ALTER TABLE documents ADD COLUMN words TEXT[]; -UPDATE documents SET words = regexp_split_to_array(lower(doc_contents), E'[s+.]'); +UPDATE documents SET words = +regexp_split_to_array(lower( +regexp_replace(contents, E'[,.;\\']','', 'g') +), E'[s+]'); +\\x on +SELECT * FROM documents ORDER BY docid; + + +-[ RECORD 1 ] +docid| 0 +contents | I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast. +words| {i,like,to,eat,broccoli,and,bananas,i,ate,a,banana,and,spinach,smoothie,for,breakfast} +-[ RECORD 2 ] +docid| 1 +contents | Chinchillas and kittens are cute. +words| {chinchillas,and,kittens,are,cute} +-[ RECORD 3 ] +docid| 2 +contents | My sister adopted two kittens yesterday. +words| {my,sister,adopted,two,kittens,yesterday} +-[ RECORD 4 ] +docid| 3 +contents | Look at this cute hamster munching on a piece of broccoli. +words| {look,at,this,cute,hamster,munching,on,a,piece,of,broccoli} --# Compute the frequency of each word in each document +-# Compute the frequency of each word in each document: -DROP TABLE IF EXISTS documents_tf; -SELECT madlib.term_frequency('documents', 'docid', 'words', 'documents_tf'); -SELECT * FROM documents_tf order by docid; +DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary; +SELECT madlib.term_frequency('documents',-- input table + 'docid',-- document id --- End diff -- document id column ---
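The basic punctuation-removal-and-tokenization pipeline in the SQL example above has a direct Python analogue, which is handy for sanity-checking the expected `words` arrays outside the database (a sketch, not part of MADlib):

```python
import re

def tokenize(contents):
    """Lower-case, strip basic punctuation, and split on whitespace,
    analogous to the regexp_replace / regexp_split_to_array pipeline
    in the example."""
    cleaned = re.sub(r"[,.;']", "", contents.lower())
    return cleaned.split()

words = tokenize('Chinchillas and kittens are cute.')
```

As in the SQL example, this does no stemming or stop-word removal; a production pipeline would typically add both (e.g. via `to_tsvector` on newer PostgreSQL, as shown above).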
[GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/232#discussion_r167708065 --- Diff: src/ports/postgres/modules/lda/lda.sql_in --- @@ -182,324 +105,789 @@ lda_train( data_table, \b Arguments data_table -TEXT. The name of the table storing the training dataset. Each row is +TEXT. Name of the table storing the training dataset. Each row is in the form docid, wordid, count where \c docid, \c wordid, and \c count -are non-negative integers. - +are non-negative integers. The \c docid column refers to the document ID, the \c wordid column is the word ID (the index of a word in the vocabulary), and \c count is the -number of occurrences of the word in the document. - -Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these -exact names in the data_table. +number of occurrences of the word in the document. Please note: + +- \c wordid must be +contiguous integers going from from 0 to \c voc_size \c 1. +- column names for \c docid, \c wordid, and \c count are currently fixed, +so you must use these exact names in the data_table. + +The function Term Frequency +can be used to generate vocabulary in the required format from raw documents. + model_table -TEXT. The name of the table storing the learned models. This table has one row and the following columns. +TEXT. This is an output table generated by LDA which contains the learned model. +It has one row with the following columns: voc_size -INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size \c 1. A data validation routine is called to validate the dataset. +INTEGER. Size of the vocabulary. As mentioned above for the input +table, \c wordid consists of contiguous integers going +from 0 to \c voc_size \c 1. + topic_num INTEGER. Number of topics. alpha -DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num). 
+DOUBLE PRECISION. Dirichlet prior for the per-document +topic multinomial. beta -DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01). +DOUBLE PRECISION. Dirichlet prior for the per-topic +word multinomial. model -BIGINT[]. +BIGINT[]. The encoded model description (not human readable). output_data_table -TEXT. The name of the table to store the output data. It has the following columns: +TEXT. The name of the table generated by LDA that stores +the output data. It has the following columns: docid -INTEGER. +INTEGER. Document id from input 'data_table'. wordcount -INTEGER. +INTEGER. Count of number of words in the document, +including repeats. For example, if a word appears 3 times +in the document, it is counted 3 times. words -INTEGER[]. +INTEGER[]. Array of \c wordid in the document, not +including repeats. For example, if a word appears 3 times +in the document, it appears only once in the \c words array. counts -INTEGER[]. +INTEGER[]. Frequency of occurance of a word in the document, +indexed the same as the \c words array above. For example, if the +2nd element of the \c counts array is 4, it means that the word +in the 2nd element of the \c words array occurs 4 times in the +document. topic_count -INTEGER[]. +INTEGER[]. Array of the count of words in the document +that correspond to each topic. --- End diff -- maybe mention array index corresponds to 0
[GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/232#discussion_r167708360 --- Diff: src/ports/postgres/modules/lda/lda.sql_in --- @@ -182,324 +105,789 @@ lda_train( data_table, \b Arguments data_table -TEXT. The name of the table storing the training dataset. Each row is +TEXT. Name of the table storing the training dataset. Each row is in the form docid, wordid, count where \c docid, \c wordid, and \c count -are non-negative integers. - +are non-negative integers. The \c docid column refers to the document ID, the \c wordid column is the word ID (the index of a word in the vocabulary), and \c count is the -number of occurrences of the word in the document. - -Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these -exact names in the data_table. +number of occurrences of the word in the document. Please note: + +- \c wordid must be +contiguous integers going from from 0 to \c voc_size \c 1. +- column names for \c docid, \c wordid, and \c count are currently fixed, +so you must use these exact names in the data_table. + +The function Term Frequency +can be used to generate vocabulary in the required format from raw documents. + model_table -TEXT. The name of the table storing the learned models. This table has one row and the following columns. +TEXT. This is an output table generated by LDA which contains the learned model. +It has one row with the following columns: voc_size -INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size \c 1. A data validation routine is called to validate the dataset. +INTEGER. Size of the vocabulary. As mentioned above for the input +table, \c wordid consists of contiguous integers going +from 0 to \c voc_size \c 1. + topic_num INTEGER. Number of topics. alpha -DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num). 
+DOUBLE PRECISION. Dirichlet prior for the per-document +topic multinomial. beta -DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01). +DOUBLE PRECISION. Dirichlet prior for the per-topic +word multinomial. model -BIGINT[]. +BIGINT[]. The encoded model description (not human readable). output_data_table -TEXT. The name of the table to store the output data. It has the following columns: +TEXT. The name of the table generated by LDA that stores +the output data. It has the following columns: docid -INTEGER. +INTEGER. Document id from input 'data_table'. wordcount -INTEGER. +INTEGER. Count of number of words in the document, +including repeats. For example, if a word appears 3 times +in the document, it is counted 3 times. words -INTEGER[]. +INTEGER[]. Array of \c wordid in the document, not +including repeats. For example, if a word appears 3 times +in the document, it appears only once in the \c words array. counts -INTEGER[]. +INTEGER[]. Frequency of occurance of a word in the document, +indexed the same as the \c words array above. For example, if the +2nd element of the \c counts array is 4, it means that the word +in the 2nd element of the \c words array occurs 4 times in the +document. topic_count -INTEGER[]. +INTEGER[]. Array of the count of words in the document +that correspond to each topic
[GitHub] madlib pull request #232: Multiple LDA improvements and fixes
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/232#discussion_r167709835 --- Diff: src/ports/postgres/modules/lda/lda.sql_in --- @@ -182,324 +105,789 @@ lda_train( data_table, \b Arguments data_table -TEXT. The name of the table storing the training dataset. Each row is +TEXT. Name of the table storing the training dataset. Each row is in the form docid, wordid, count where \c docid, \c wordid, and \c count -are non-negative integers. - +are non-negative integers. The \c docid column refers to the document ID, the \c wordid column is the word ID (the index of a word in the vocabulary), and \c count is the -number of occurrences of the word in the document. - -Please note that column names for \c docid, \c wordid, and \c count are currently fixed, so you must use these -exact names in the data_table. +number of occurrences of the word in the document. Please note: + +- \c wordid must be +contiguous integers going from from 0 to \c voc_size \c 1. +- column names for \c docid, \c wordid, and \c count are currently fixed, +so you must use these exact names in the data_table. + +The function Term Frequency +can be used to generate vocabulary in the required format from raw documents. + model_table -TEXT. The name of the table storing the learned models. This table has one row and the following columns. +TEXT. This is an output table generated by LDA which contains the learned model. +It has one row with the following columns: voc_size -INTEGER. Size of the vocabulary. Note that the \c wordid should be continous integers starting from 0 to \c voc_size \c 1. A data validation routine is called to validate the dataset. +INTEGER. Size of the vocabulary. As mentioned above for the input +table, \c wordid consists of contiguous integers going +from 0 to \c voc_size \c 1. + topic_num INTEGER. Number of topics. alpha -DOUBLE PRECISION. Dirichlet parameter for the per-doc topic multinomial (e.g. 50/topic_num). 
+DOUBLE PRECISION. Dirichlet prior for the per-document +topic multinomial. beta -DOUBLE PRECISION. Dirichlet parameter for the per-topic word multinomial (e.g. 0.01). +DOUBLE PRECISION. Dirichlet prior for the per-topic +word multinomial. model -BIGINT[]. +BIGINT[]. The encoded model description (not human readable). output_data_table -TEXT. The name of the table to store the output data. It has the following columns: +TEXT. The name of the table generated by LDA that stores +the output data. It has the following columns: docid -INTEGER. +INTEGER. Document id from input 'data_table'. wordcount -INTEGER. +INTEGER. Count of number of words in the document, +including repeats. For example, if a word appears 3 times +in the document, it is counted 3 times. words -INTEGER[]. +INTEGER[]. Array of \c wordid in the document, not +including repeats. For example, if a word appears 3 times +in the document, it appears only once in the \c words array. counts -INTEGER[]. +INTEGER[]. Frequency of occurance of a word in the document, +indexed the same as the \c words array above. For example, if the +2nd element of the \c counts array is 4, it means that the word +in the 2nd element of the \c words array occurs 4 times in the +document. topic_count -INTEGER[]. +INTEGER[]. Array of the count of words in the document +that correspond to each topic
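The relationship among the `wordcount`, `words`, and `counts` columns documented above — total words including repeats, distinct word ids, and per-id frequencies indexed in lockstep — can be illustrated with a small Python sketch (illustrative only; the sort order of `words` is an assumption for the example):

```python
from collections import Counter

def doc_output_row(word_ids):
    """Build the (wordcount, words, counts) triple described in the docs
    from a document's raw list of word ids: `words` holds each distinct
    wordid once, counts[i] is the frequency of words[i], and wordcount
    counts every occurrence, including repeats."""
    freq = Counter(word_ids)
    words = sorted(freq)                 # distinct word ids
    counts = [freq[w] for w in words]    # aligned frequencies
    return len(word_ids), words, counts

# A document where wordid 3 appears three times:
wordcount, words, counts = doc_output_row([3, 0, 3, 2, 3])
```

So a word appearing 3 times contributes 3 to `wordcount`, appears once in `words`, and yields a 3 at the matching index of `counts`, exactly as the column descriptions state.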
[GitHub] madlib issue #222: minor update to summary() user docs
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/222 Jenkins OK to test ---
[GitHub] madlib pull request #211: Change madlib gppkg version string
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/211#discussion_r155683583 --- Diff: cmake/LinuxUtils.cmake --- @@ -9,3 +9,14 @@ macro(rh_version OUT_VERSION) set(${OUT_VERSION} "${OUT_VERSION}-NOTFOUND") endif(EXISTS "/etc/redhat-release") endmacro(rh_version) + --- End diff -- Yes, I can use a regex to grep RH_MAJOR_VERSION from RH_VERSION after running `rh_version(RH_VERSION)`. ---
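The regex approach mentioned — extracting the major version out of the full Red Hat version string — is trivial to express; here it is sketched in Python (the CMake version would use `string(REGEX MATCH ...)` on `RH_VERSION`):

```python
import re

def major_version(version_string):
    """Pull the leading major-version digits out of a version string
    such as '6.7' or '7.2.1511'."""
    m = re.match(r'(\d+)', version_string)
    return m.group(1) if m else None
```

Matching only the leading digit run keeps the result stable regardless of how many dotted components the distribution reports.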
[GitHub] madlib issue #211: Change madlib gppkg version string
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/211 @fmcquillan99 Yes, the version string is saved in the Version.yml file, and when we make the release we will change the version string from 1.13_dev to 1.13; cmake will get it directly from the yml file. ---
[GitHub] madlib pull request #211: Change madlib gppkg version string
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/211 Change madlib gppkg version string This commit changes the naming convention for madlib gppkg, after renaming, the format of madlib gppkg will look like: madlib-1.13_dev-gp5-rhel6-x86_64.gppkg madlib-1.12_dev-gp4.3orca-rhel5-x86_64.gppkg You can merge this pull request into a Git repository by running: $ git pull https://github.com/jingyimei/madlib gppkg_rename Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/211.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #211 commit 192e7208e2cab2845642eb0995e277382c2af250 Author: Jingyi Mei <j...@pivotal.io> Date: 2017-12-06T04:13:34Z Change madlib gppkg version string This commit changes the naming convention for madlib gppkg, after renaming, the format of madlib gppkg will look like: madlib-1.13_dev-gp5-rhel6-x86_64.gppkg madlib-1.12_dev-gp4.3orca-rhel5-x86_64.gppkg ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151063100 --- Diff: src/ports/postgres/modules/graph/graph_utils.py_in --- @@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, return None +def validate_params_for_centrality_measures(schema_madlib, func_name, --- End diff -- +1 ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151064800 --- Diff: src/ports/postgres/modules/graph/hits.py_in --- @@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, FROM {1} """.format(vertex_id, vertex_table))[0]["cnt"] -# Assign default threshold value based on number of nodes in the graph. if threshold is None: -threshold = 1.0 / (n_vertices * 1000) +threshold = get_default_threshold_for_centrality_measures(n_vertices) +# table/column names used when grouping_cols is set. edge_temp_table = unique_string(desp='temp_edge') +grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else -"DISTRIBUTED BY ({0})".format(dest)) -plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table)) +"DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, dest)) +drop_tables([edge_temp_table]) plpy.execute(""" CREATE TEMP TABLE {edge_temp_table} AS SELECT * FROM {edge_table} {distribution} """.format(**locals())) -# GPDB and HAWQ have distributed by clauses to help them with indexing. -# For Postgres we add the index explicitly. -if is_platform_pg(): -plpy.execute("CREATE INDEX ON {0}({1})".format( -edge_temp_table, dest)) + ## +# Set initial authority_norm and hub_norm as 1, so that later the final +norm should be positive number +authority_init_value = 1.0 +hub_init_value = 1.0 + +subquery1 = unique_string(desp='subquery1') + +distinct_grp_table = '' +select_grouping_cols_comma = '' +select_subquery1_grouping_cols_comma = '' --- End diff -- I understand the motivation to rename those variables to make them more meaningful; however, this doesn't seem to simplify or reduce the code, and it makes many lines longer. When I read the code I still find it hard to keep track of the variables and to follow the query. In this case I feel a shorter variable name is actually easier to read. ---
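One hunk in this diff replaces the inline default threshold with a helper. Based solely on the removed line, the helper presumably wraps the same formula (a sketch; the actual body in MADlib may differ):

```python
def get_default_threshold_for_centrality_measures(n_vertices):
    # Mirrors the pre-refactor inline default shown in the diff:
    #   threshold = 1.0 / (n_vertices * 1000)
    # i.e. the convergence threshold tightens as the graph grows.
    return 1.0 / (n_vertices * 1000)
```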
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151061922 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list): return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals()) for i in grp_list]) - -def _grp_from_table(tbl, grp_list): -""" -Helper function for selecting grouping columns of a table +def get_filtered_cols_subquery_str(include_from_table, exclude_from_table, --- End diff -- What is the rationale for defining this function name without a leading underscore (which marks weak internal use), while the previous one was marked as weak internal use? ---
[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/191#discussion_r146700914 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -160,20 +164,23 @@ def knn(schema_madlib, point_source, point_column_name, point_id, label_column_n ) {x_temp_table} ) {y_temp_table} WHERE {y_temp_table}.r <= {k_val} -""".format(**locals())) -plpy.execute( -""" +""".format(**locals()) +plpy.execute(sql1) + +sql2 = """ CREATE TABLE {output_table} AS SELECT {test_id_temp} AS id, {test_column_name} , CASE WHEN {output_neighbors} -THEN array_agg(knn_temp.train_id) +THEN array_agg(knn_temp.train_id +ORDER BY knn_temp.dist ASC) --- End diff -- Can be in one line ---
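The fix under review adds `ORDER BY knn_temp.dist` inside `array_agg` so that each test point's neighbor ids come back sorted by distance rather than in arbitrary order. The effect can be sketched in plain Python (names are illustrative, not MADlib's):

```python
from collections import defaultdict

def aggregate_neighbors(rows):
    # rows: (test_id, train_id, dist) triples.
    # Mimics array_agg(train_id ORDER BY dist) grouped by test_id.
    groups = defaultdict(list)
    for test_id, train_id, dist in rows:
        groups[test_id].append((dist, train_id))
    return {tid: [train_id for _, train_id in sorted(pairs)]
            for tid, pairs in groups.items()}
```

Without the ORDER BY, the first element of the output array is not guaranteed to be the nearest neighbor, which is what the PR corrects.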
[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/191#discussion_r146700815 --- Diff: src/ports/postgres/modules/knn/knn.sql_in --- @@ -96,19 +95,19 @@ in a column of type DOUBLE PRECISION[]. point_id TEXT. Name of the column in 'point_source' containing source data ids. -The ids are of type INTEGER with no duplicates. They do not need to be contiguous. -This parameter must be used if the list of nearest neighbors are to be output, i.e., +The ids are of type INTEGER with no duplicates. They do not need to be contiguous. +This parameter must be used if the list of nearest neighbors are to be output, i.e., if the parameter 'output_neighbors' below is TRUE or if 'label_column_name' is NULL. label_column_name TEXT. Name of the column with labels/values of training data points. -If Boolean, integer or text types will run knn classification, else if -double precision values will run knn regression. -If you set this to NULL will return neighbors only without doing classification or regression. +If Boolean, integer or text types will run KNN classification, else if +double precision values will run KNN regression. +If you set this to NULL, will return the set of neighbors only without +actually doing classification or regression. --- End diff -- When I read this, I had some trouble understanding the sentence. Can we rephrase it as "If you set this to NULL, it will only return the set of neighbors without actually doing classification or regression"? ---
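The documented behavior amounts to a dispatch on the label column's type: boolean/integer/text labels trigger classification, double precision labels trigger regression, and a NULL label column returns neighbors only. A hedged sketch of that rule (the real check happens in SQL against the catalog types, so these string names are illustrative):

```python
def knn_mode(label_column_type):
    # Illustrative dispatch mirroring the documented rule in knn.sql_in.
    if label_column_type is None:
        return "neighbors_only"
    if label_column_type in ("boolean", "integer", "text"):
        return "classification"
    if label_column_type == "double precision":
        return "regression"
    raise ValueError("unsupported label type: %s" % label_column_type)
```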
[GitHub] madlib pull request #191: KNN: Fix optional parameters and ordering
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/191#discussion_r146701746 --- Diff: src/ports/postgres/modules/knn/knn.sql_in --- @@ -218,56 +215,57 @@ INSERT INTO knn_test_data VALUES -# Run KNN for classification: -DROP TABLE IF EXISTS madlib_knn_result_classification; -SELECT * FROM madlib.knn( +DROP TABLE IF EXISTS knn_result_classification; +SELECT * FROM madlib.knn( 'knn_train_data', -- Table of training data 'data',-- Col name of training data -'id', -- Col Name of id in train data +'id', -- Col Name of id in train data 'label', -- Training labels 'knn_test_data', -- Table of test data 'data',-- Col name of test data -'id', -- Col name of id in test data -'madlib_knn_result_classification', -- Output table - 3 -- Number of nearest neighbours +'id', -- Col name of id in test data +'knn_result_classification', -- Output table + 3,-- Number of nearest neighbors True -- True if you want to show Nearest-Neighbors, False otherwise --- End diff -- We can say "True if you want to show Nearest-Neighbors by id" to make it clearer ---
[GitHub] madlib pull request #186: Add error message for checking postgres install co...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/186#discussion_r139798284 --- Diff: src/madpack/madpack.py --- @@ -512,7 +512,9 @@ def _plpy_check(py_min_ver): try: _internal_run_query("CREATE LANGUAGE plpythonu;", True) except: -_error('Cannot create language plpythonu. Stopping installation...', False) +_error("""Cannot create language plpythonu. Please check if you +have configured and installed postgres with `--with-python` --- End diff -- Sure, that should be more general. ---
[GitHub] madlib pull request #186: Add error message for checking postgres install co...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/186 Add error message for checking postgres install configuration MADlib needs to be installed on a postgres with the Python extension. If postgres is not configured with the `--with-python` option, madpack installation will fail because it cannot find the $libdir/plpython2 directory. This commit adds some instructions for when this failure happens. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jingyimei/madlib python_error_message Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/186.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #186 commit f123105aa543953a12757eb228d33e10f6979b9f Author: Jingyi Mei <j...@pivotal.io> Date: 2017-09-19T19:01:02Z Add error message for checking postgres install configuration MADlib needs to be installed on a postgres with Python configured. If postgres is not configured with the `--with-python` option, madpack installation will fail because it cannot find the $libdir/plpython2 directory. This commit adds some instructions for when this failure happens. ---
[GitHub] madlib issue #178: Feature: HITS
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/178 Jenkins ok to test ---
[GitHub] madlib pull request #178: Feature: HITS
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/178 Feature: HITS JIRA: MADLIB-1124 Introduces a new module that computes the HITS scores of all nodes in a directed graph. Implements the HITS algorithm with normalization (https://en.wikipedia.org/wiki/HITS_algorithm) You can merge this pull request into a Git repository by running: $ git pull https://github.com/jingyimei/incubator-madlib hits Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #178 commit 144f65b003843e1f9578cb1befdc9338096c80b0 Author: Jingyi <j...@pivotal.io> Date: 2017-08-25T22:06:39Z Feature: HITS JIRA: MADLIB-1124 Introduces a new module that computes the HITS scores of all nodes in a directed graph. Implements the HITS algorithm with normalization (https://en.wikipedia.org/wiki/HITS_algorithm) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
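For readers unfamiliar with the algorithm the PR implements, here is a minimal in-memory version of HITS with normalization (a sketch only: MADlib computes this in SQL and additionally supports grouping and a vertex-count-based default threshold):

```python
def hits(edges, n_iter=100, tol=1e-6):
    """Minimal HITS on a directed edge list, with L2 normalization
    after each authority and hub update, as in the standard algorithm."""
    nodes = sorted({v for e in edges for v in e})
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(n_iter):
        # Authority of a node: sum of hub scores of nodes linking to it.
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sum(x * x for x in new_auth.values()) ** 0.5 or 1.0
        auth = {v: x / norm for v, x in new_auth.items()}
        # Hub of a node: sum of authority scores of nodes it links to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sum(x * x for x in new_hub.values()) ** 0.5 or 1.0
        new_hub = {v: x / norm for v, x in new_hub.items()}
        if max(abs(new_hub[v] - hub[v]) for v in nodes) < tol:
            hub = new_hub
            break
        hub = new_hub
    return auth, hub

# Tiny example: 0 -> 1, 0 -> 2, 1 -> 2, so node 2 is the strongest
# authority and node 0 the strongest hub.
auth, hub = hits([(0, 1), (0, 2), (1, 2)])
```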
[GitHub] incubator-madlib issue #175: Fix example py_in file
Github user jingyimei commented on the issue: https://github.com/apache/incubator-madlib/pull/175 @orhankislal ---