[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/244 ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177916814 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, ) FROM pagerank_gr_out WHERE user_id=2; +-- Tests for Personalized Page Rank + +-- Test without grouping + +DROP TABLE IF EXISTS pagerank_ppr_out; +DROP TABLE IF EXISTS pagerank_ppr_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- Grouping column +'{1,3}'); -- Personlized Nodes + + +-- View the PageRank of all vertices, sorted by their scores. +SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124, --- End diff -- Is this 0.00124 based on current test result? Can we make it smaller? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177899442 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( +**locals()) +random_prob_grp = 1.0 - damping_factor +init_prob_grp = 1.0 / len(nodes_of_interest) +else: +random_prob_grp = """{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION + """.format(**locals()) +init_prob_grp = """1/COUNT(__vertices__)::DOUBLE PRECISION""".format( +**locals()) + plpy.execute("DROP TABLE IF EXISTS {0}".format(vertices_per_group)) plpy.execute("""CREATE TEMP TABLE {vertices_per_group} AS SELECT {distinct_grp_table}.*, -1/COUNT(__vertices__)::DOUBLE PRECISION AS {init_pr}, -{rand_damp}/COUNT(__vertices__)::DOUBLE PRECISION -AS {random_prob} +{init_prob_grp} AS {init_pr}, +{random_prob_grp} as {random_prob} FROM {distinct_grp_table} INNER JOIN ( SELECT {grouping_cols}, {src} AS __vertices__ FROM {edge_temp_table} UNION SELECT {grouping_cols}, {dest} FROM {edge_temp_table} ){subq} -ON {grouping_where_clause} +ON {grouping_where_clause} {where_clause_ppr} --- End diff -- put {where_clause_ppr} in a new line ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177912288 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +615,55 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) --- End diff -- This "personalized_nodes" table doesn't get created before it is dropped here. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177897977 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( +**locals()) +random_prob_grp = 1.0 - damping_factor +init_prob_grp = 1.0 / len(nodes_of_interest) --- End diff -- Isn't `len(nodes_of_interest)` the same as `total_ppr_nodes`? Reusing it would avoid computing the length again. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177910146 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: +where_clause_ppr = """where __vertices__ = ANY(ARRAY{nodes_of_interest})""".format( --- End diff -- After consulting with QP: `__vertices__ = ANY(ARRAY{nodes_of_interest})` works exactly the same as `__vertices__ in (nodes_of_interest)`, and the latter may look simpler. Besides, since we use this condition in multiple places, I am wondering whether a join would be faster: we create a temp table that stores the special node ids and join it with the vertex table on vertex id. QP suggested trying both to see which one runs faster. ---
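To make the two spellings concrete, here is a minimal sketch that builds both filter clauses from a Python list. The helper name `ppr_filter_clauses` is hypothetical and not part of the PR; the PR itself formats the list straight into `ARRAY{nodes_of_interest}`, which appears to rely on Python's list repr. Which form (or a temp-table join) runs faster is not something this sketch can answer.

```python
def ppr_filter_clauses(column, nodes_of_interest):
    """Build the ANY(ARRAY[...]) and IN (...) forms of the same filter.

    Hypothetical helper for illustration only.
    """
    if not nodes_of_interest:
        # An empty list would yield invalid SQL such as "IN ()".
        raise ValueError("nodes_of_interest must not be empty")
    # Coerce to int so that only numeric ids reach the formatted SQL.
    ids = [int(n) for n in nodes_of_interest]
    csv = ", ".join(str(i) for i in ids)
    any_clause = "{0} = ANY(ARRAY[{1}])".format(column, csv)
    in_clause = "{0} IN ({1})".format(column, csv)
    return any_clause, in_clause


any_clause, in_clause = ppr_filter_clauses("__vertices__", [1, 3])
print(any_clause)
print(in_clause)
```

Keeping the clause in one helper would also address the "used in multiple places" concern, whichever form is chosen.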
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177851780 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,62 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". 
format(damping_factor)) - -def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, - out_table, damping_factor, max_iter, threshold, grouping_cols, **kwargs): +# Validate against the givin set of nodes for Personalized Page Rank +if nodes_of_interest: +nodes_of_interest_count = len(nodes_of_interest) +vertices_count = plpy.execute(""" + SELECT count(DISTINCT({vertex_id})) AS cnt FROM {vertex_table} + WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest}) + """.format(**locals()))[0]["cnt"] +# Check to see if the given set of nodes exist in vertex table +if vertices_count != len(nodes_of_interest): +plpy.error("PageRank: Invalid value for {0}, must be a subset of the vertex_table".format( --- End diff -- This query tests several invalid scenarios, including duplicate nodes in nodes_of_interest, in the error msg maybe we can say "Invalid value for {0}, must be a subset of the vertex_table without duplicate nodes". ---
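The `count(DISTINCT ...)` comparison comes up short both when a node is duplicated and when a node is absent from the vertex table, which is why one error message has to cover both cases. A minimal sketch that tells the two apart in plain Python, assuming the vertex ids are already available as a list; `check_nodes_of_interest` is a hypothetical name, not a function in the PR:

```python
def check_nodes_of_interest(nodes_of_interest, vertex_ids):
    """Return (duplicates, missing) for the requested nodes.

    duplicates: ids listed more than once in nodes_of_interest
    missing:    ids that do not occur in vertex_ids
    """
    seen = set()
    duplicates = set()
    for node in nodes_of_interest:
        if node in seen:
            duplicates.add(node)
        seen.add(node)
    missing = seen - set(vertex_ids)
    return sorted(duplicates), sorted(missing)


print(check_nodes_of_interest([1, 3, 3], [1, 2, 3]))
print(check_nodes_of_interest([1, 9], [1, 2, 3]))
```

With the two lists separated, the error message could name the actual problem instead of describing both possibilities.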
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177894976 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -211,19 +261,30 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, distinct_grp_table, grouping_cols_list) # Find number of vertices in each group, this is the normalizing factor # for computing the random_prob +where_clause_ppr = '' +if nodes_of_interest > 0: --- End diff -- `if nodes_of_interest:` or `if total_ppr_nodes > 0:` ---
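A minimal sketch of why the suggested check matters, assuming `nodes_of_interest` arrives as a Python list or `None`. Under Python 2, `[] > 0` evaluates to True because a list always compares greater than an int, so the original condition would take the personalized branch even for an empty list; under Python 3 the same comparison raises `TypeError`. The helper names below are hypothetical.

```python
def has_ppr_nodes(nodes_of_interest):
    """True only for a non-empty collection of node ids."""
    # Truthiness handles both None and the empty list.
    return bool(nodes_of_interest)


def init_prob(nodes_of_interest, n_vertices):
    """Initial probability per node, personalized or uniform."""
    # Compute the length once and reuse it for the branch and the division.
    total_ppr_nodes = len(nodes_of_interest) if nodes_of_interest else 0
    if total_ppr_nodes > 0:
        return 1.0 / total_ppr_nodes
    return 1.0 / n_vertices


print(has_ppr_nodes([1, 3]), has_ppr_nodes([]), has_ppr_nodes(None))
print(init_prob([1, 3], 7), init_prob(None, 4))
```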
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177915601 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -647,6 +778,26 @@ SELECT * FROM pagerank_out ORDER BY user_id, pagerank DESC; -- View the summary table to find the number of iterations required for -- convergence for each group. SELECT * FROM pagerank_out_summary; + +-- Compute the Personalized PageRank: +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertix id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimted string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping --- End diff -- move those NULLs one space left ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177914251 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +186,37 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, --- End diff -- Is it better to check `if nodes_of_interest` before calling get_query_params_for_ppr instead of checking it in get_query_params_for_ppr? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177914961 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -551,14 +680,16 @@ def pagerank_help(schema_madlib, message, **kwargs): message.lower() in ("usage", "help", "?"): help_string = "Get from method below" help_string = get_graph_usage(schema_madlib, 'PageRank', -"""out_table TEXT, -- Name of the output table for PageRank + """out_table TEXT, -- Name of the output table for PageRank damping_factor DOUBLE PRECISION, -- Damping factor in random surfer model -- (DEFAULT = 0.85) max_iter INTEGER, -- Maximum iteration number (DEFAULT = 100) threshold DOUBLE PRECISION, -- Stopping criteria (DEFAULT = 1/(N*1000), -- N is number of vertices in the graph) -grouping_col TEXT -- Comma separated column names to group on +grouping_col TEXT, -- Comma separated column names to group on -- (DEFAULT = NULL, no grouping) +nodes_of_interest ARRAY OF INTEGER -- A comma seperated list of vertices + or nodes for personalized page rank. """) + """ --- End diff -- indent left side, and indent comment(--) right ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177892625 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,62 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". 
format(damping_factor)) - -def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, - out_table, damping_factor, max_iter, threshold, grouping_cols, **kwargs): +# Validate against the givin set of nodes for Personalized Page Rank +if nodes_of_interest: +nodes_of_interest_count = len(nodes_of_interest) +vertices_count = plpy.execute(""" + SELECT count(DISTINCT({vertex_id})) AS cnt FROM {vertex_table} + WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest}) + """.format(**locals()))[0]["cnt"] +# Check to see if the given set of nodes exist in vertex table +if vertices_count != len(nodes_of_interest): +plpy.error("PageRank: Invalid value for {0}, must be a subset of the vertex_table".format( +nodes_of_interest)) +# Validate given set of nodes against each user group. +# If all the given nodes are not present in the user group +# then throw an error. +if grouping_cols_list: +missing_user_grps = '' +grp_by_column = get_table_qualified_col_str( +edge_table, grouping_cols_list) +grps_without_nodes = plpy.execute(""" + SELECT {grp_by_column} FROM {edge_table} + WHERE src = ANY(ARRAY{nodes_of_interest}) group by {grp_by_column} + having count(DISTINCT(src)) != {nodes_of_interest_count} + """.format(**locals())) +for row in range(grps_without_nodes.nrows()): +missing_user_grps += str(grps_without_nodes[row]['user_id']) +if row < grps_without_nodes.nrows() - 1: +missing_user_grps += ' ,' +if grps_without_nodes.nrows() > 0: +plpy.error("Nodes for Personalizaed Page Rank are missing from these groups: {0} ".format( +missing_user_grps)) + --- End diff -- Here similar things are tested twice: under `if nodes_of_interest`, there is a `count` operation in line 73 and a comparison in line 77 (this is for the case without grouping). Then under `if grouping_cols_list`, another `count` and comparison happen in line 90, per group. There might be a way to simplify the logic here so that, with grouping, we don't need to do it twice.
Besides, if the above query really slows down performance, I would consider simplifying it: instead of listing the groups that are missing the special nodes, just emit a warning (optional, depending on how expensive the query is). ---
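One way to read the "don't do it twice" suggestion is a single pass that treats the ungrouped case as one group. The sketch below works on in-memory edge tuples rather than SQL, and `groups_missing_nodes` is a hypothetical name. It also counts a node as present when it appears as either `src` or `dest`, whereas the query in the diff filters on `src` only; that difference is an assumption of this sketch, not something the PR does.

```python
from collections import defaultdict


def groups_missing_nodes(edges, nodes_of_interest):
    """Map each group to the nodes of interest it lacks.

    edges: iterable of (group, src, dest); use one constant group
    value (for example None) when there is no grouping column.
    Groups that contain every requested node are omitted.
    """
    wanted = set(nodes_of_interest)
    present = defaultdict(set)
    for group, src, dest in edges:
        # Touching present[group] registers the group even if no
        # requested node occurs on this edge.
        present[group].update(wanted.intersection((src, dest)))
    return {
        group: sorted(wanted - found)
        for group, found in present.items()
        if wanted - found
    }


edges = [(1, 1, 2), (1, 2, 3), (2, 1, 2)]
print(groups_missing_nodes(edges, [1, 3]))
```

An empty result means validation passes; a non-empty one carries enough detail for either an error or the cheaper warning mentioned above.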
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177916983 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -95,6 +101,49 @@ SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, ) FROM pagerank_gr_out WHERE user_id=2; +-- Tests for Personalized Page Rank + +-- Test without grouping + +DROP TABLE IF EXISTS pagerank_ppr_out; +DROP TABLE IF EXISTS pagerank_ppr_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- Grouping column +'{1,3}'); -- Personlized Nodes + + +-- View the PageRank of all vertices, sorted by their scores. +SELECT assert(relative_error(SUM(pagerank), 1) < 0.00124, +'PageRank: Scores do not sum up to 1.' +) FROM pagerank_ppr_out; + + +-- Test with grouping + +DROP TABLE IF EXISTS pagerank_ppr_grp_out; +DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary; +SELECT pagerank( + 'vertex',-- Vertex table + 'id',-- Vertix id column + '"EDGE"', -- "EDGE" table + 'src=src, dest=dest', -- "EDGE" args + 'pagerank_ppr_grp_out', -- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + 'user_id', -- Grouping column +'{1,3}'); -- Personlized Nodes + +SELECT assert(count(*) = 14, 'Tuple count for Pagerank out table != 14') FROM pagerank_ppr_grp_out; --- End diff -- can we do similar assertion here by group? ---
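A per-group version of the sum assertion could look like the following sketch, written in Python over (group, pagerank) rows rather than in the SQL of the test file. It assumes each group's scores are expected to sum to 1, as the ungrouped assertion does. `relative_error` here is a local stand-in named after the SQL helper used in the test, and `groups_off_target` is a hypothetical name.

```python
from collections import defaultdict


def relative_error(approx, value):
    # Local stand-in; assumed to behave like the SQL helper of the same name.
    return abs(approx - value) / abs(value)


def groups_off_target(rows, tolerance, target=1.0):
    """Return the groups whose summed pagerank misses the target.

    rows: iterable of (group, pagerank) pairs.
    """
    sums = defaultdict(float)
    for group, pagerank in rows:
        sums[group] += pagerank
    return sorted(
        group for group, total in sums.items()
        if relative_error(total, target) >= tolerance
    )


rows = [(1, 0.5), (1, 0.5), (2, 0.25), (2, 0.7)]
print(groups_off_target(rows, 0.01))
```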
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177917620 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id; (2 rows) +-# Example of Personalized Page Rank with Nodes {2,4} + +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertix id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimted string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping + '{2,4}'); -- Personlized Nodes --- End diff -- Great ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177915929 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -66,7 +66,12 @@ SELECT pagerank( 'id',-- Vertix id column '"EDGE"', -- "EDGE" table 'src=src, dest=dest', -- "EDGE" args - 'pagerank_out'); -- Output table of PageRank + 'pagerank_out',-- Output table of PageRank + NULL, -- Default damping factor (0.85) + NULL, -- Default max iters (100) + NULL, -- Default Threshold + NULL, -- No Grouping + NULL); -- Personlized Nodes --- End diff -- In this case, we can remove the last 5 NULLs since they are all optional. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177893734 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -122,12 +158,13 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, grouping_where_clause = '' group_by_clause = '' random_prob = '' +ppr_join_clause = '' edge_temp_table = unique_string(desp='temp_edge') grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else "DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, dest)) +grouping_cols_comma, dest)) --- End diff -- maybe indent with the above line, or move the above line backwards to the current place ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r177917195 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, + edge_temp_table, vertex_table, cur_distribution, + personalized_nodes) +total_ppr_nodes = ppr_params[0] +random_jump_prob_ppr = ppr_params[1] +ppr_join_clause = ppr_params[2] + random_probability = (1.0 - damping_factor) / n_vertices +if total_ppr_nodes > 0: +random_jump_prob = random_jump_prob_ppr +else: +random_jump_prob = random_probability --- End diff -- Got it. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952795 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. + + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' --- End diff -- renamed this variable to ppr_random_prob_clause ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952633 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. 
+ + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' + +if nodes_of_interest: +total_ppr_nodes = len(nodes_of_interest) +init_value_ppr_nodes = 1.0 / total_ppr_nodes +# Create a Temp table that holds the Inital probabilities for the +# user provided nodes +plpy.execute(""" +CREATE TEMP TABLE {personalized_nodes} AS +SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank +FROM {vertex_table} where {vertex_id} = ANY(ARRAY{nodes_of_interest}) +{cur_distribution} +""".format(**locals())) +ppr_join_clause = """ LEFT JOIN {personalized_nodes} on +{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals()) +prob_value = 1.0 - damping_factor + +# In case of PPR, Assign the Random jump probability to the nodes_of_interest only. +# For rest of the nodes, Random jump probability will be zero. +random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest}) +THEN {prob_value} +ELSE 0 +END """.format(**locals()) +return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause) + + def pagerank_help(schema_madlib, message, **kwargs): --- End diff -- Added the explanation and example in the helper function ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952712 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,40 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". format(damping_factor)) +if nodes_of_interest: +vertices = plpy.execute(""" --- End diff -- Changed vertices to vertices_count as discussed. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175663431 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, + edge_temp_table, vertex_table, cur_distribution, + personalized_nodes) +total_ppr_nodes = ppr_params[0] +random_jump_prob_ppr = ppr_params[1] +ppr_join_clause = ppr_params[2] + random_probability = (1.0 - damping_factor) / n_vertices +if total_ppr_nodes > 0: +random_jump_prob = random_jump_prob_ppr +else: +random_jump_prob = random_probability --- End diff -- Can move (1.0 - damping_factor) / n_vertices here since random_probability is not used anywhere else. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175664342 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -84,7 +89,8 @@ SELECT pagerank( NULL, NULL, NULL, - 'user_id'); + 'user_id', + NULL); -- View the PageRank of all vertices, sorted by their scores. SELECT assert(relative_error(SUM(pagerank), 1) < 0.1, --- End diff -- We may need at least two test cases with nodes_of_interest not null and with/without grouping. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175627510 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. 
+ + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' + +if nodes_of_interest: +total_ppr_nodes = len(nodes_of_interest) +init_value_ppr_nodes = 1.0 / total_ppr_nodes +# Create a Temp table that holds the Inital probabilities for the +# user provided nodes +plpy.execute(""" +CREATE TEMP TABLE {personalized_nodes} AS +SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank +FROM {vertex_table} where {vertex_id} = ANY(ARRAY{nodes_of_interest}) +{cur_distribution} +""".format(**locals())) +ppr_join_clause = """ LEFT JOIN {personalized_nodes} on +{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals()) +prob_value = 1.0 - damping_factor + +# In case of PPR, Assign the Random jump probability to the nodes_of_interest only. +# For rest of the nodes, Random jump probability will be zero. +random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest}) +THEN {prob_value} +ELSE 0 +END """.format(**locals()) +return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause) + + def pagerank_help(schema_madlib, message, **kwargs): --- End diff -- We need the new parameter explanation in helper function too ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175665727 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data. @note Expressions are not currently supported for 'grouping_cols'. + nodes_of_interest (optional) --- End diff -- @fmcquillan99 Do we also need additional explanation of personalized pagerank somewhere in the user doc? ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175631615 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id; (2 rows) +-# Example of Personalized Page Rank with Nodes {2,4} + +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertix id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma delimted string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping + '{2,4}'); -- Personlized Nodes --- End diff -- Another valid input for personalized nodes array is 'ARRAY[2,4]', should be mentioned in some example or user doc later. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
GitHub user hpandeycodeit opened a pull request:

    https://github.com/apache/madlib/pull/244

    Changes for Personalized Page Rank : Jira:1084

    Jira : 1084

    This PR contains changes for Personalized Page Rank.

    - Added an extra parameter, nodes_of_interest, to the main pagerank function.
    - Added a new function, get_query_params_for_ppr, in pagerank.py_in to calculate the random jump probability based on the user-provided input nodes.
    - Added a condition: when user-provided nodes are present, Personalized Page Rank is executed; otherwise regular Page Rank runs.
    - Added an example in pagerank.sql_in.
    - The extra parameter nodes_of_interest is also added to the calling functions in pagerank.sql_in.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hpandeycodeit/incubator-madlib graph_1084

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #244

commit ed1e364db205f104379270529d3eff694a589651
Author: hpandeycodeit
Date:   2018-03-16T22:15:51Z

    Changes for Personalized Page Rank : Jira:1084

---
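For readers unfamiliar with the technique, here is a minimal sketch of textbook personalized PageRank by power iteration. It is not the SQL implementation in this PR: the function name, the in-memory edge list, the uniform teleport over the nodes of interest, and the handling of vertices without outgoing edges are all assumptions made for illustration. With `nodes_of_interest` left as `None` it reduces to regular PageRank.

```python
def personalized_pagerank(edges, vertices, nodes_of_interest=None,
                          damping_factor=0.85, max_iter=100, threshold=1e-9):
    """Power iteration; random jumps land only on the nodes of interest."""
    vertices = list(vertices)
    targets = list(nodes_of_interest) if nodes_of_interest else vertices
    # Teleport distribution: uniform over the targets, zero elsewhere.
    teleport = {v: 0.0 for v in vertices}
    for v in targets:
        teleport[v] = 1.0 / len(targets)
    out_edges = {v: [] for v in vertices}
    for src, dest in edges:
        out_edges[src].append(dest)
    rank = dict(teleport)  # start from the teleport distribution
    for _ in range(max_iter):
        new_rank = {v: (1.0 - damping_factor) * teleport[v] for v in vertices}
        for v in vertices:
            if out_edges[v]:
                share = damping_factor * rank[v] / len(out_edges[v])
                for dest in out_edges[v]:
                    new_rank[dest] += share
            else:
                # No outgoing edges: hand the mass back to the targets
                # so that the scores keep summing to 1.
                for t in targets:
                    new_rank[t] += damping_factor * rank[v] * teleport[t]
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < threshold:
            break
    return rank


edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
ppr = personalized_pagerank(edges, range(4), nodes_of_interest=[1])
regular = personalized_pagerank(edges, range(4))
print(ppr[1] > regular[1])
```

In this small graph the node of interest ends up with a higher score than under regular PageRank, which is the intended effect of concentrating the random jumps on it.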