[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-04-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/244


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177916814
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1,
 ) FROM pagerank_gr_out WHERE user_id=2;
 
 
+-- Tests for Personalized Page Rank
+
+-- Test without grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_out;
+DROP TABLE IF EXISTS pagerank_ppr_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
+ 'id',-- Vertix id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ NULL, -- Grouping column
+'{1,3}'); -- Personlized Nodes
+
+
+-- View the PageRank of all vertices, sorted by their scores.
+SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124,
--- End diff --

Is this 0.00124 threshold based on the current test result? Can we make it smaller?


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177899442
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format(
+**locals())
+random_prob_grp = 1.0 - damping_factor
+init_prob_grp = 1.0 / len(nodes_of_interest)
+else:
+random_prob_grp  = """{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION
+ """.format(**locals())
+init_prob_grp  =  """1/COUNT(__vertices__)::DOUBLE PRECISION""".format(
+**locals())
+
 plpy.execute("DROP TABLE IF EXISTS {0}".format(vertices_per_group))
 plpy.execute("""CREATE TEMP TABLE {vertices_per_group} AS
 SELECT {distinct_grp_table}.*,
-1/COUNT(__vertices__)::DOUBLE PRECISION AS {init_pr},
-{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION
-AS {random_prob}
+{init_prob_grp} AS {init_pr},
+{random_prob_grp} as {random_prob}
 FROM {distinct_grp_table} INNER JOIN (
 SELECT {grouping_cols}, {src} AS __vertices__
 FROM {edge_temp_table}
 UNION
 SELECT {grouping_cols}, {dest} FROM {edge_temp_table}
 ){subq}
-ON {grouping_where_clause}
+ON {grouping_where_clause} {where_clause_ppr}
--- End diff --

Put {where_clause_ppr} on a new line.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177912288
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +615,55 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, personalized_nodes))
--- End diff --

This "personalized_nodes" table is not created before this DROP.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177897977
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format(
+**locals())
+random_prob_grp = 1.0 - damping_factor
+init_prob_grp = 1.0 / len(nodes_of_interest)
--- End diff --

Isn't `len(nodes_of_interest)` equal to `total_ppr_nodes`? Reusing `total_ppr_nodes` here would avoid computing the count again.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177910146
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
+where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format(
--- End diff --

After consulting with QP: `__vertices__ = ANY(ARRAY{nodes_of_interest})` works exactly the same as `__vertices__ IN (nodes_of_interest)`, and the latter may look simpler.

Besides, since we use this condition in multiple places, I am wondering whether a join would be faster: we create a temp table that stores the special node ids and join it with the vertex table on vertex id. QP suggested trying both to see which one runs faster.
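
The three alternatives discussed in this comment can be put side by side. This is only a sketch of the SQL fragments involved, built in plain Python: the helper names, the `ppr_nodes` table name, and the integer-only node ids are assumptions for illustration and are not taken from the pull request.

```python
def any_array_clause(column, node_ids):
    """Fragment in the style used in the diff: col = ANY(ARRAY[...])."""
    ids = ", ".join(str(int(n)) for n in node_ids)
    return "{0} = ANY(ARRAY[{1}])".format(column, ids)


def in_list_clause(column, node_ids):
    """Equivalent fragment written with IN (...)."""
    ids = ", ".join(str(int(n)) for n in node_ids)
    return "{0} IN ({1})".format(column, ids)


def join_clause(column, nodes_table, nodes_column):
    """Join against a (hypothetical) temp table holding the node ids."""
    return "INNER JOIN {t} ON {t}.{c} = {col}".format(
        t=nodes_table, c=nodes_column, col=column)


print(any_array_clause("__vertices__", [1, 3]))
print(in_list_clause("__vertices__", [1, 3]))
print(join_clause("__vertices__", "ppr_nodes", "id"))
```

Which form is fastest depends on the planner and on the size of the node list, so timing each variant on representative data, as QP suggested, is the only reliable way to choose.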


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177851780
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,62 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
-
-def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
- out_table, damping_factor, max_iter, threshold, 
grouping_cols, **kwargs):
+# Validate against the givin set of nodes for Personalized Page Rank
+if nodes_of_interest:
+nodes_of_interest_count = len(nodes_of_interest)
+vertices_count = plpy.execute("""
+   SELECT count(DISTINCT({vertex_id})) AS cnt FROM 
{vertex_table}
+   WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest})
+   """.format(**locals()))[0]["cnt"]
+# Check to see if the given set of nodes exist in vertex table
+if vertices_count != len(nodes_of_interest):
+plpy.error("PageRank: Invalid value for {0}, must be a subset 
of the vertex_table".format(
--- End diff --

This query tests several invalid scenarios, including duplicate nodes in nodes_of_interest. Maybe the error message could say: "Invalid value for {0}, must be a subset of the vertex_table without duplicate nodes".
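
To see why the single count comparison covers both cases, here is a minimal in-memory sketch. It assumes the vertex ids are available as a Python collection instead of a table, and the function name and return values are hypothetical.

```python
def check_nodes_of_interest(vertex_ids, nodes_of_interest):
    """Mimic count(DISTINCT id) ... WHERE id = ANY(nodes) vs. len(nodes).

    Returns None when the input is valid, otherwise a short reason.
    """
    # Number of distinct vertices that match any requested node.
    distinct_matches = len(set(vertex_ids) & set(nodes_of_interest))
    if distinct_matches == len(nodes_of_interest):
        return None
    # The counts differ: either the list repeats a node or a node is unknown.
    if len(set(nodes_of_interest)) != len(nodes_of_interest):
        return "duplicate"
    return "missing"


vertices = range(7)
print(check_nodes_of_interest(vertices, [1, 3]))   # valid input
print(check_nodes_of_interest(vertices, [1, 1]))   # repeated node
print(check_nodes_of_interest(vertices, [1, 99]))  # node not in the table
```

A repeated node and an unknown node both make the distinct count smaller than the list length, which is why the error message should mention duplicates as well.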


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177894976
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 distinct_grp_table, grouping_cols_list)
 # Find number of vertices in each group, this is the normalizing factor
 # for computing the random_prob
+where_clause_ppr = ''
+if nodes_of_interest > 0:
--- End diff --

`if nodes_of_interest:`  or `if total_ppr_nodes > 0:` 
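
The reason for the suggestion is that comparing a list with an integer does not test whether the list has elements. The sketch below runs under Python 3, where that comparison raises `TypeError`; under Python 2, as far as I recall, the comparison does not raise but falls back to a cross-type ordering that ignores the list's length. The helper names are illustrative only.

```python
def has_personalization(nodes_of_interest):
    """Truthiness check: False for None and for an empty list."""
    return bool(nodes_of_interest)


def compare_list_to_zero(nodes_of_interest):
    """Show what `nodes_of_interest > 0` does for a list in Python 3."""
    try:
        return nodes_of_interest > 0
    except TypeError:
        return "TypeError"


print(has_personalization([1, 3]))
print(has_personalization([]))
print(has_personalization(None))
print(compare_list_to_zero([1, 3]))
```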


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177915601
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -647,6 +778,26 @@ SELECT * FROM pagerank_out ORDER BY user_id, pagerank DESC;
 -- View the summary table to find the number of iterations required for
 -- convergence for each group.
 SELECT * FROM pagerank_out_summary;
+
+-- Compute the Personalized PageRank:
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertix id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimted string of 
edge arguments
+   'pagerank_out',   -- Output table of PageRank
+NULL,-- Default damping factor 
(0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold
+NULL,-- No Grouping
--- End diff --

Move those NULLs one space to the left.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177914251
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +186,37 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
--- End diff --

Would it be better to check `if nodes_of_interest` before calling get_query_params_for_ppr, instead of checking it inside get_query_params_for_ppr?


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177914961
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -551,14 +680,16 @@ def pagerank_help(schema_madlib, message, **kwargs):
 message.lower() in ("usage", "help", "?"):
 help_string = "Get from method below"
 help_string = get_graph_usage(schema_madlib, 'PageRank',
-"""out_table TEXT, -- Name of the output table for PageRank
+  """out_table TEXT, -- Name of 
the output table for PageRank
 damping_factor DOUBLE PRECISION, -- Damping factor in random surfer 
model
  -- (DEFAULT = 0.85)
 max_iter  INTEGER, -- Maximum iteration number (DEFAULT = 100)
 threshold DOUBLE PRECISION, -- Stopping criteria (DEFAULT = 
1/(N*1000),
 -- N is number of vertices in the 
graph)
-grouping_col  TEXT -- Comma separated column names to group on
+grouping_col  TEXT, -- Comma separated column names to group on
-- (DEFAULT = NULL, no grouping)
+nodes_of_interest ARRAY OF INTEGER -- A comma seperated list of 
vertices
+  or nodes for personalized page 
rank.
 """) + """
 
--- End diff --

Indent the left side, and move the comment markers (--) to the right.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177892625
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,62 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
-
-def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
- out_table, damping_factor, max_iter, threshold, 
grouping_cols, **kwargs):
+# Validate against the givin set of nodes for Personalized Page Rank
+if nodes_of_interest:
+nodes_of_interest_count = len(nodes_of_interest)
+vertices_count = plpy.execute("""
+   SELECT count(DISTINCT({vertex_id})) AS cnt FROM 
{vertex_table}
+   WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest})
+   """.format(**locals()))[0]["cnt"]
+# Check to see if the given set of nodes exist in vertex table
+if vertices_count != len(nodes_of_interest):
+plpy.error("PageRank: Invalid value for {0}, must be a subset 
of the vertex_table".format(
+nodes_of_interest))
+# Validate given set of nodes against each user group.
+# If all the given nodes are not present in the user group
+# then throw an error.
+if grouping_cols_list:
+missing_user_grps = ''
+grp_by_column = get_table_qualified_col_str(
+edge_table, grouping_cols_list)
+grps_without_nodes = plpy.execute("""
+   SELECT {grp_by_column} FROM {edge_table}
+   WHERE src = ANY(ARRAY{nodes_of_interest}) group by 
{grp_by_column}
+   having count(DISTINCT(src)) != {nodes_of_interest_count}
+   """.format(**locals()))
+for row in range(grps_without_nodes.nrows()):
+missing_user_grps += 
str(grps_without_nodes[row]['user_id'])
+if row < grps_without_nodes.nrows() - 1:
+missing_user_grps += ' ,'
+if grps_without_nodes.nrows() > 0:
+plpy.error("Nodes for Personalizaed Page Rank are missing 
from these groups: {0} ".format(
+missing_user_grps))
+
--- End diff --

Some similar checks are done twice here. When `if nodes_of_interest` holds, there is a `count` operation in line 73 and a comparison in line 77 (this is the case without grouping). Then, when `if grouping_cols_list` holds, another `count` and compare happen in line 90, per group. There might be a way to simplify the logic so that, with grouping, we don't need to do it twice. Besides, if the above query really slows down performance a lot, I would consider simplifying it: instead of listing the groups that are missing special nodes, just emit a warning (optional, depending on how expensive the above query is).
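
One way to picture a single-pass version of the per-group check is sketched below in plain Python. It assumes the edge table is available as `(group, src, dest)` tuples and, like the query in the diff, only looks at the `src` column; the function name is hypothetical and this is not the code in the pull request.

```python
from collections import defaultdict


def groups_missing_nodes(edges, nodes_of_interest):
    """Return the groups in which some node of interest never occurs as src."""
    wanted = set(nodes_of_interest)
    seen = defaultdict(set)
    for group, src, _dest in edges:
        found = seen[group]  # registers the group even if nothing matches
        if src in wanted:
            found.add(src)
    return sorted(g for g, found in seen.items() if found != wanted)


edges = [(1, 1, 2), (1, 3, 2), (2, 1, 3), (2, 2, 1)]
print(groups_missing_nodes(edges, [1, 3]))
print(groups_missing_nodes(edges, [1]))
```

Because every group is registered on first sight, this sketch also reports groups in which none of the requested nodes appear, a case worth checking separately in any SQL formulation that filters rows before grouping.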


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177916983
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1,
 ) FROM pagerank_gr_out WHERE user_id=2;
 
 
+-- Tests for Personalized Page Rank
+
+-- Test without grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_out;
+DROP TABLE IF EXISTS pagerank_ppr_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
+ 'id',-- Vertix id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ NULL, -- Grouping column
+'{1,3}'); -- Personlized Nodes
+
+
+-- View the PageRank of all vertices, sorted by their scores.
+SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124,
+'PageRank: Scores do not sum up to 1.'
+) FROM pagerank_ppr_out;
+
+
+-- Test with grouping 
+
+DROP TABLE IF EXISTS pagerank_ppr_grp_out;
+DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
+SELECT pagerank(
+ 'vertex',-- Vertex table
+ 'id',-- Vertix id column
+ '"EDGE"',  -- "EDGE" table
+ 'src=src, dest=dest', -- "EDGE" args
+ 'pagerank_ppr_grp_out', -- Output table of PageRank
+ NULL,  -- Default damping factor (0.85)
+ NULL,  -- Default max iters (100)
+ NULL,  -- Default Threshold 
+ 'user_id', -- Grouping column
+'{1,3}'); -- Personlized Nodes
+
+SELECT assert(count(*) = 14, 'Tuple count for Pagerank out table != 14') FROM pagerank_ppr_grp_out;
--- End diff --

Can we do a similar assertion here per group?
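
The per-group form of the sum-to-one assertion can be pictured as follows. This is a minimal sketch in plain Python, assuming the output rows are available as `(user_id, pagerank)` pairs; the function name and the use of an absolute tolerance are illustrative, and the actual test would express the same idea in SQL with a GROUP BY.

```python
from collections import defaultdict


def groups_not_summing_to_one(rows, tolerance):
    """Return the groups whose PageRank scores do not sum to 1 within tolerance."""
    totals = defaultdict(float)
    for group, score in rows:
        totals[group] += score
    return sorted(g for g, total in totals.items()
                  if abs(total - 1.0) >= tolerance)


rows = [(1, 0.5), (1, 0.5), (2, 0.7), (2, 0.2)]
print(groups_not_summing_to_one(rows, 0.00124))
```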


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177917620
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id;
 (2 rows)
 
 
+-# Example of Personalized Page Rank with Nodes {2,4}
+
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertix id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimted string of 
edge arguments
+   'pagerank_out',   -- Output table of PageRank 
+NULL,-- Default damping factor 
(0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold 
+NULL,-- No Grouping 
+   '{2,4}'); -- Personlized Nodes
--- End diff --

Great


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177915929
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -66,7 +66,12 @@ SELECT pagerank(
  'id',-- Vertix id column
  '"EDGE"',  -- "EDGE" table
  'src=src, dest=dest', -- "EDGE" args
- 'pagerank_out'); -- Output table of PageRank
+ 'pagerank_out',-- Output table of PageRank
+  NULL, -- Default damping factor (0.85)
+  NULL, -- Default max iters (100)
+  NULL, -- Default Threshold 
+  NULL, -- No Grouping 
+ NULL); -- Personlized Nodes
--- End diff --

In this case, we can remove the last 5 NULLs since they are all optional.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177893734
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -122,12 +158,13 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 grouping_where_clause = ''
 group_by_clause = ''
 random_prob = ''
+ppr_join_clause = ''
 
 edge_temp_table = unique_string(desp='temp_edge')
 grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
 "DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, dest))
+grouping_cols_comma, dest))
--- End diff --

Maybe align this with the line above, or move the line above back to match this indentation.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-28 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r177917195
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
+  ppr_join_clause, vertex_id,
+  edge_temp_table, 
vertex_table, cur_distribution,
+  personalized_nodes)
+total_ppr_nodes = ppr_params[0]
+random_jump_prob_ppr = ppr_params[1]
+ppr_join_clause = ppr_params[2]
+
 random_probability = (1.0 - damping_factor) / n_vertices
+if total_ppr_nodes > 0:
+random_jump_prob = random_jump_prob_ppr
+else:
+random_jump_prob = random_probability
--- End diff --

Got it.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952795
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to 
Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join 
Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
--- End diff --

renamed this variable to ppr_random_prob_clause


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952633
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to 
Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join 
Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
+
+if nodes_of_interest:
+total_ppr_nodes = len(nodes_of_interest)
+init_value_ppr_nodes = 1.0 / total_ppr_nodes
+# Create a Temp table that holds the Inital probabilities for the
+# user provided nodes
+plpy.execute("""
+CREATE TEMP TABLE {personalized_nodes} AS
+SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION 
as pagerank
+FROM {vertex_table} where {vertex_id} =  
ANY(ARRAY{nodes_of_interest})
+{cur_distribution}
+""".format(**locals()))
+ppr_join_clause = """ LEFT  JOIN {personalized_nodes} on
+{personalized_nodes}.{vertex_id} = 
{edge_temp_table}.dest""".format(**locals())
+prob_value = 1.0 - damping_factor
+
+# In case of PPR, Assign the Random jump probability to the 
nodes_of_interest only.
+# For rest of the nodes, Random jump probability  will be zero.
+random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = 
ANY(ARRAY{nodes_of_interest})
+THEN {prob_value}
+ELSE 0
+END """.format(**locals())
+return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause)
+
+
 def pagerank_help(schema_madlib, message, **kwargs):
--- End diff --

Added the explanation and example in the helper function
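
For readers who want the idea behind the quoted helper without the SQL, here is a small in-memory sketch of Personalized PageRank. It uses the textbook formulation, in which the random-jump mass `1 - damping_factor` is split evenly across the nodes of interest and is zero elsewhere; the SQL in the pull request may normalize differently. The function name is hypothetical, and the sketch assumes every vertex has at least one outgoing edge.

```python
def personalized_pagerank(vertices, edges, nodes_of_interest,
                          damping_factor=0.85, max_iter=100):
    """Power iteration with random jumps restricted to nodes_of_interest."""
    personalized = set(nodes_of_interest)
    # Jump probability per personalized node; zero for all other nodes.
    jump = (1.0 - damping_factor) / len(personalized)
    out_degree = {v: 0 for v in vertices}
    for src, _dest in edges:
        out_degree[src] += 1
    # Initial mass sits only on the personalized nodes, 1/k each.
    rank = {v: (1.0 / len(personalized) if v in personalized else 0.0)
            for v in vertices}
    for _ in range(max_iter):
        new_rank = {v: (jump if v in personalized else 0.0) for v in vertices}
        for src, dest in edges:
            new_rank[dest] += damping_factor * rank[src] / out_degree[src]
        rank = new_rank
    return rank


scores = personalized_pagerank([0, 1, 2], [(0, 1), (1, 2), (2, 0)], [0])
print(scores)
```

On this three-node cycle the scores sum to 1 and decrease with distance from the personalized node, which matches the intent of the sum-to-one assertions in the tests.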


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952712
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,40 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
+if nodes_of_interest:
+vertices = plpy.execute("""
--- End diff --

Changed vertices to vertices_count as discussed. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175663431
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
+  ppr_join_clause, vertex_id,
+  edge_temp_table, 
vertex_table, cur_distribution,
+  personalized_nodes)
+total_ppr_nodes = ppr_params[0]
+random_jump_prob_ppr = ppr_params[1]
+ppr_join_clause = ppr_params[2]
+
 random_probability = (1.0 - damping_factor) / n_vertices
+if total_ppr_nodes > 0:
+random_jump_prob = random_jump_prob_ppr
+else:
+random_jump_prob = random_probability
--- End diff --

We can move `(1.0 - damping_factor) / n_vertices` here, since `random_probability` is not used anywhere else.
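
The suggested change amounts to computing the default probability only in the branch that needs it. A minimal sketch, with a hypothetical wrapper function around the logic from the diff:

```python
def choose_random_jump_prob(total_ppr_nodes, random_jump_prob_ppr,
                            damping_factor, n_vertices):
    """Pick the PPR clause if nodes were given, else the uniform probability."""
    if total_ppr_nodes > 0:
        return random_jump_prob_ppr
    # Computed inline; no separate random_probability variable is needed.
    return (1.0 - damping_factor) / n_vertices


print(choose_random_jump_prob(0, "", 0.85, 10))
print(choose_random_jump_prob(2, "CASE ... END", 0.85, 10))
```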


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175664342
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -84,7 +89,8 @@ SELECT pagerank(
  NULL,
  NULL,
  NULL,
- 'user_id');
+ 'user_id',
+ NULL);
 
 -- View the PageRank of all vertices, sorted by their scores.
 SELECT assert(relative_error(SUM(pagerank), 1) < 0.1,
--- End diff --

We may need at least two test cases with a non-NULL nodes_of_interest: one with grouping and one without.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175627510
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
+
+if nodes_of_interest:
+total_ppr_nodes = len(nodes_of_interest)
+init_value_ppr_nodes = 1.0 / total_ppr_nodes
+# Create a Temp table that holds the Inital probabilities for the
+# user provided nodes
+plpy.execute("""
+CREATE TEMP TABLE {personalized_nodes} AS
+SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank
+FROM {vertex_table} where {vertex_id} =  ANY(ARRAY{nodes_of_interest})
+{cur_distribution}
+""".format(**locals()))
+ppr_join_clause = """ LEFT  JOIN {personalized_nodes} on
+{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals())
+prob_value = 1.0 - damping_factor
+
+# In case of PPR, Assign the Random jump probability to the nodes_of_interest only.
+# For rest of the nodes, Random jump probability  will be zero.
+random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest})
+THEN {prob_value}
+ELSE 0
+END """.format(**locals())
+return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause)
+
+
 def pagerank_help(schema_madlib, message, **kwargs):
--- End diff --

The help function also needs an explanation of the new parameter.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175665727
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no grouping is used and
 a single model is generated for all data.
 @note Expressions are not currently supported for 'grouping_cols'.
 
+ nodes_of_interest (optional) 
--- End diff --

@fmcquillan99 Do we also need an additional explanation of personalized
PageRank somewhere in the user docs?


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175631615
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id;
 (2 rows)
 
 
+-# Example of Personalized Page Rank with Nodes {2,4}
+
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertix id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimted string of edge arguments
+   'pagerank_out',   -- Output table of PageRank 
+NULL,-- Default damping factor (0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold 
+NULL,-- No Grouping 
+   '{2,4}'); -- Personlized Nodes
--- End diff --

Another valid input for the personalized nodes array is 'ARRAY[2,4]'; it
should be mentioned in an example or in the user doc later.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-16 Thread hpandeycodeit
GitHub user hpandeycodeit opened a pull request:

https://github.com/apache/madlib/pull/244

Changes for Personalized Page Rank : Jira:1084

Jira : 1084
This PR contains changes for Personalized Page Rank.

-  Added an extra parameter, nodes_of_interest, to the main pagerank function.
-  Added a new function, get_query_params_for_ppr, in pagerank.py_in to
calculate the random jump probability based on the user-provided input nodes.
-  Added a condition: when user-provided nodes are present, Personalized Page
Rank is executed; otherwise regular Page Rank runs.
-  Added an example in pagerank.sql_in.
-  Added the extra parameter nodes_of_interest to the calling functions in
pagerank.sql_in.
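
For readers unfamiliar with the idea, the difference between regular and personalized Page Rank can be sketched in plain Python. This is a textbook power-iteration formulation under stated assumptions, not the MADlib implementation: the function name personalized_pagerank, its signature, and the handling of dangling nodes are all hypothetical, and the exact constants used in the patch may differ.

```python
def personalized_pagerank(vertices, edges, nodes_of_interest=None,
                          damping_factor=0.85, max_iter=100,
                          threshold=1e-10):
    """Illustrative power iteration (assumed formulation, not MADlib's).

    Random jumps land only on nodes_of_interest when given;
    otherwise they land uniformly on all vertices (regular PageRank).
    """
    vertices = list(vertices)
    seeds = set(nodes_of_interest) if nodes_of_interest else set(vertices)

    # Teleport distribution: uniform over the seeds, zero elsewhere.
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0)
                for v in vertices}

    out_links = {v: [] for v in vertices}
    for src, dest in edges:
        out_links[src].append(dest)

    rank = dict(teleport)
    for _ in range(max_iter):
        # Rank held by vertices without outgoing edges is sent back
        # through the teleport distribution (an assumption of this sketch).
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        base = 1.0 - damping_factor + damping_factor * dangling
        new_rank = {v: base * teleport[v] for v in vertices}
        for v in vertices:
            for dest in out_links[v]:
                new_rank[dest] += damping_factor * rank[v] / len(out_links[v])
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < threshold:
            break
    return rank
```

In this sketch, a vertex that is not a node of interest and has no incoming edges ends up with a score of zero, whereas regular Page Rank would still give it the uniform random jump share.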

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hpandeycodeit/incubator-madlib graph_1084

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #244


commit ed1e364db205f104379270529d3eff694a589651
Author: hpandeycodeit 
Date:   2018-03-16T22:15:51Z

Changes for Personalized Page Rank : Jira:1084




---