[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/239 ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173254469 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the +classes (mainhue), with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- Grouping by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. Further, `blue=8` in the example below will +be split between the four groups, resulting in two blue rows for each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=8', -- Specified class value size. Rest of the values are outputed as is. +NULL, -- Output table size +'zone'-- Group by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 2 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 1 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 3 | 3 | Austria |3 |1 | 84 | 8 | 4 | 2 |
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173254181 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the +classes (mainhue), with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- Grouping by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. Further, `blue=8` in the example below will +be split between the four groups, resulting in two blue rows for each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=8', -- Specified class value size. Rest of the values are outputed as is. +NULL, -- Output table size +'zone'-- Group by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 2 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 1 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 3 | 3 | Austria |3 |1 | 84 | 8 | 4 | 2 | red
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173239594 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the +classes (mainhue), with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- Grouping by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. Further, `blue=8` in the example below will +be split between the four groups, resulting in two blue rows for each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=8', -- Specified class value size. Rest of the values are outputed as is. +NULL, -- Output table size +'zone'-- Group by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 2 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 1 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 3 | 3 | Austria |3 |1 | 84 | 8 | 4 | 2 |
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173238804 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the +classes (mainhue), with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- Grouping by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. Further, `blue=8` in the example below will +be split between the four groups, resulting in two blue rows for each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=8', -- Specified class value size. Rest of the values are outputed as is. +NULL, -- Output table size +'zone'-- Group by zone +); +SELECT * FROM output_table ORDER BY zone, mainhue; --- End diff -- For some reason, doxygen builds the above as ``` DROP TABLE IF EXISTS output_table; SELECT madlib.balance_sample( 'flags, -- Source table 'output_table', -- Output table 'mainhue',-- Class column 'blue=8', -- Specified class value size. Rest of the values are outputed as is. NULL, -- Output table size 'zone'-- Group by zone ); SELECT * FROM output_table ORDER BY zone, mainhue; ``` whe
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173080726 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, output_table, class_col, parsed_class_sizes = extract_keyvalue_params(class_sizes, allow_duplicates=False, lower_case_names=False) +distinct_levels = collate_plpy_result( +plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ". + format(class_col, source_table)))['levels'] if not parsed_class_sizes: -sampling_strategy_str = _validate_and_get_sampling_strategy(class_sizes, -output_table_size) +sampling_strategy_str = _validate_and_get_sampling_strategy( +class_sizes, output_table_size) else: sampling_strategy_str = None try: for each_level, each_class_size in parsed_class_sizes.items(): -_assert(each_level in actual_level_counts, +_assert(each_level in distinct_levels, "Sample: Invalid class value specified ({0})". - format(each_level)) +format(each_level)) each_class_size = int(each_class_size) _assert(each_class_size >= 1, "Sample: Class size has to be greater than zero") parsed_class_sizes[each_level] = each_class_size - -except TypeError: +except ValueError: plpy.error("Sample: Invalid value for class_sizes ({0})". format(class_sizes)) # Get the number of rows to be sampled for each class level, based on # the input table, class_sizes, and output_table_size params. This also # includes info about the resulting sampling strategy, i.e., one of # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level. -target_class_sizes = _get_target_level_counts(sampling_strategy_str, - parsed_class_sizes, - actual_level_counts, - output_table_size) - -undersample_level_dict, oversample_level_dict, nosample_level_dict = \ -_get_sampling_strategy_specific_dict(target_class_sizes) - -# Get subqueries for each sampling strategy, so that they can be used -# together in one big query. - -# Subquery that will be used to get rows as is for those class levels -# that need no sampling. -nosample_subquery = _get_nosample_subquery( -new_source_table, class_col, nosample_level_dict.keys()) -# Subquery that will be used to sample those class levels that -# have to be oversampled. -oversample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, oversample_level_dict) -# Subquery that will be used to sample those class levels that -# have to be undersampled. Undersampling supports both with and without -# replacement, so fetch the appropriate subquery. -if with_replacement: -undersample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) -else: -undersample_subquery = _get_without_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) - -# Merge the three subqueries using a UNION ALL clause. -union_all_subquery = ' UNION ALL '.join( -['({0})'.format(subquery) - for subquery in [undersample_subquery, oversample_subquery, nosample_subquery] - if subquery]) - -final_query = """ -CREATE TABLE {output_table} AS -SELECT row_number() OVER() AS {new_col_name}, * +grp_col_str, grp_cols = get_grouping_col_str( +schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col], +source_table, grouping_cols) +actual_grp_level_cou
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r173080410 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -58,28 +60,64 @@ NOSAMPLE = 'nosample' NEW_ID_COLUMN = '__madlib_id__' NULL_IDENTIFIER = '__madlib_null_id__' -def _get_level_frequency_distribution(source_table, class_col): -""" Returns a dict containing the number of rows associated with each class + +def _get_level_frequency_distribution(source_table, class_col, + grp_by_cols=None): +""" Count the number of rows for each class, partitioned by the grp_by_cols + +Returns a dict containing the number of rows associated with each class level. Each class level count is converted to a string using ::text. None is a valid key in this dict, capturing NULL value in the database. """ +if grp_by_cols and grp_by_cols.lower() != 'null': +is_grouping = True +grp_by_cols_comma = grp_by_cols + ', ' +array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as group_values, " +else: +is_grouping = False +grp_by_cols_comma = array_grp_by_cols_comma = "" + +# In below query, the inner query groups the data using grp_by_cols + classes +# and obtains the count for each combination. The outer query then groups +# again by the grp_by_cols to collect the classes and counts in an array. query_result = plpy.execute(""" -SELECT {class_col}::text AS classes, - count(*) AS class_count -FROM {source_table} -GROUP BY {class_col} - """.format(**locals())) +SELECT +-- For each group get the classes and their rows counts +{grp_identifier} as group_values, +array_agg(classes) as classes, +array_agg(class_count) as class_count +FROM( +-- for each group and class combination present in source table +-- get the count of rows for that combination +SELECT +{array_grp_by_cols_comma} +({class_col})::TEXT AS classes, +count(*) AS class_count +FROM {source_table} +GROUP BY {grp_by_cols_comma} ({class_col}) +) q +GROUP BY {grp_identifier} --- End diff -- Thanks for checking this! I've removed the constant group by clause to avoid this. ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172954235 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, output_table, class_col, parsed_class_sizes = extract_keyvalue_params(class_sizes, allow_duplicates=False, lower_case_names=False) +distinct_levels = collate_plpy_result( +plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ". + format(class_col, source_table)))['levels'] if not parsed_class_sizes: -sampling_strategy_str = _validate_and_get_sampling_strategy(class_sizes, -output_table_size) +sampling_strategy_str = _validate_and_get_sampling_strategy( +class_sizes, output_table_size) else: sampling_strategy_str = None try: for each_level, each_class_size in parsed_class_sizes.items(): -_assert(each_level in actual_level_counts, +_assert(each_level in distinct_levels, "Sample: Invalid class value specified ({0})". - format(each_level)) +format(each_level)) each_class_size = int(each_class_size) _assert(each_class_size >= 1, "Sample: Class size has to be greater than zero") parsed_class_sizes[each_level] = each_class_size - -except TypeError: +except ValueError: plpy.error("Sample: Invalid value for class_sizes ({0})". format(class_sizes)) # Get the number of rows to be sampled for each class level, based on # the input table, class_sizes, and output_table_size params. This also # includes info about the resulting sampling strategy, i.e., one of # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level. -target_class_sizes = _get_target_level_counts(sampling_strategy_str, - parsed_class_sizes, - actual_level_counts, - output_table_size) - -undersample_level_dict, oversample_level_dict, nosample_level_dict = \ -_get_sampling_strategy_specific_dict(target_class_sizes) - -# Get subqueries for each sampling strategy, so that they can be used -# together in one big query. - -# Subquery that will be used to get rows as is for those class levels -# that need no sampling. -nosample_subquery = _get_nosample_subquery( -new_source_table, class_col, nosample_level_dict.keys()) -# Subquery that will be used to sample those class levels that -# have to be oversampled. -oversample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, oversample_level_dict) -# Subquery that will be used to sample those class levels that -# have to be undersampled. Undersampling supports both with and without -# replacement, so fetch the appropriate subquery. -if with_replacement: -undersample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) -else: -undersample_subquery = _get_without_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) - -# Merge the three subqueries using a UNION ALL clause. -union_all_subquery = ' UNION ALL '.join( -['({0})'.format(subquery) - for subquery in [undersample_subquery, oversample_subquery, nosample_subquery] - if subquery]) - -final_query = """ -CREATE TABLE {output_table} AS -SELECT row_number() OVER() AS {new_col_name}, * +grp_col_str, grp_cols = get_grouping_col_str( +schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col], +source_table, grouping_cols) +actual_grp_level_
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172953687 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, output_table, class_col, parsed_class_sizes = extract_keyvalue_params(class_sizes, allow_duplicates=False, lower_case_names=False) +distinct_levels = collate_plpy_result( +plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ". + format(class_col, source_table)))['levels'] if not parsed_class_sizes: -sampling_strategy_str = _validate_and_get_sampling_strategy(class_sizes, -output_table_size) +sampling_strategy_str = _validate_and_get_sampling_strategy( +class_sizes, output_table_size) else: sampling_strategy_str = None try: for each_level, each_class_size in parsed_class_sizes.items(): -_assert(each_level in actual_level_counts, +_assert(each_level in distinct_levels, "Sample: Invalid class value specified ({0})". - format(each_level)) +format(each_level)) each_class_size = int(each_class_size) _assert(each_class_size >= 1, "Sample: Class size has to be greater than zero") parsed_class_sizes[each_level] = each_class_size - -except TypeError: +except ValueError: plpy.error("Sample: Invalid value for class_sizes ({0})". format(class_sizes)) # Get the number of rows to be sampled for each class level, based on # the input table, class_sizes, and output_table_size params. This also # includes info about the resulting sampling strategy, i.e., one of # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level. -target_class_sizes = _get_target_level_counts(sampling_strategy_str, - parsed_class_sizes, - actual_level_counts, - output_table_size) - -undersample_level_dict, oversample_level_dict, nosample_level_dict = \ -_get_sampling_strategy_specific_dict(target_class_sizes) - -# Get subqueries for each sampling strategy, so that they can be used -# together in one big query. - -# Subquery that will be used to get rows as is for those class levels -# that need no sampling. -nosample_subquery = _get_nosample_subquery( -new_source_table, class_col, nosample_level_dict.keys()) -# Subquery that will be used to sample those class levels that -# have to be oversampled. -oversample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, oversample_level_dict) -# Subquery that will be used to sample those class levels that -# have to be undersampled. Undersampling supports both with and without -# replacement, so fetch the appropriate subquery. -if with_replacement: -undersample_subquery = _get_with_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) -else: -undersample_subquery = _get_without_replacement_subquery( -schema_madlib, new_source_table, source_table_columns, class_col, -actual_level_counts, undersample_level_dict) - -# Merge the three subqueries using a UNION ALL clause. -union_all_subquery = ' UNION ALL '.join( -['({0})'.format(subquery) - for subquery in [undersample_subquery, oversample_subquery, nosample_subquery] - if subquery]) - -final_query = """ -CREATE TABLE {output_table} AS -SELECT row_number() OVER() AS {new_col_name}, * +grp_col_str, grp_cols = get_grouping_col_str( +schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col], +source_table, grouping_cols) +actual_grp_level_
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172958587 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -58,28 +60,64 @@ NOSAMPLE = 'nosample' NEW_ID_COLUMN = '__madlib_id__' NULL_IDENTIFIER = '__madlib_null_id__' -def _get_level_frequency_distribution(source_table, class_col): -""" Returns a dict containing the number of rows associated with each class + +def _get_level_frequency_distribution(source_table, class_col, + grp_by_cols=None): +""" Count the number of rows for each class, partitioned by the grp_by_cols + +Returns a dict containing the number of rows associated with each class level. Each class level count is converted to a string using ::text. None is a valid key in this dict, capturing NULL value in the database. """ +if grp_by_cols and grp_by_cols.lower() != 'null': +is_grouping = True +grp_by_cols_comma = grp_by_cols + ', ' +array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as group_values, " +else: +is_grouping = False +grp_by_cols_comma = array_grp_by_cols_comma = "" + +# In below query, the inner query groups the data using grp_by_cols + classes +# and obtains the count for each combination. The outer query then groups +# again by the grp_by_cols to collect the classes and counts in an array. query_result = plpy.execute(""" -SELECT {class_col}::text AS classes, - count(*) AS class_count -FROM {source_table} -GROUP BY {class_col} - """.format(**locals())) +SELECT +-- For each group get the classes and their rows counts +{grp_identifier} as group_values, +array_agg(classes) as classes, +array_agg(class_count) as class_count +FROM( +-- for each group and class combination present in source table +-- get the count of rows for that combination +SELECT +{array_grp_by_cols_comma} +({class_col})::TEXT AS classes, +count(*) AS class_count +FROM {source_table} +GROUP BY {grp_by_cols_comma} ({class_col}) +) q +GROUP BY {grp_identifier} --- End diff -- In Greenplum 5.x, the above query fails on install check, with the following error: ``` SELECT balance_sample('"TEST_s"', 'out_sr2', 'gr1', 'undersample ', NULL, NULL, TRUE, TRUE); psql:/tmp/madlib.2N5sjK/sample/test/balance_sample.sql_in.tmp:111: ERROR: plpy.SPIError: non-integer constant in GROUP BY LINE 17: GROUP BY true ^ QUERY: SELECT -- For each group get the classes and their rows counts true as group_values, array_agg(classes) as classes, array_agg(class_count) as class_count FROM( -- for each group and class combination present in source table -- get the count of rows for that combination SELECT (gr1)::TEXT AS classes, count(*) AS class_count FROM "TEST_s" GROUP BY (gr1) ) q GROUP BY true CONTEXT: Traceback (most recent call last): PL/Python function "balance_sample", line 23, in return balance_sample.balance_sample(**globals()) PL/Python function "balance_sample", line 575, in balance_sample PL/Python function "balance_sample", line 100, in _get_level_frequency_distribution PL/Python function "balance_sample" ``` ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172949439 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=2', -- Specified class value size. Rest of the values are sampled as is. +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 1 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 6 | 14 | Norway |3 |1 | 324 | 4 | 6 | 3 | red + 5 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 4 | 7 |
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172943311 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=2', -- Specified class value size. Rest of the values are sampled as is. +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 1 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 6 | 14 | Norway |3 |1 | 324 | 4 | 6 | 3 | red + 5 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 4 | 7 | Denmar
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172941914 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) --- End diff -- Right. We're always using ceil instead of round and hence will provide more than the expected rows in cases where division is not exact. ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172922334 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=2', -- Specified class value size. Rest of the values are sampled as is. +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 1 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 6 | 14 | Norway |3 |1 | 324 | 4 | 6 | 3 | red + 5 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 4 | 7 |
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172921714 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=2', -- Specified class value size. Rest of the values are sampled as is. --- End diff -- Do you mean âRest of the values are left as is.â since they are not actually sampled. ? ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172921328 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) --- End diff -- For zone 4 there are 12 output rows sampled from 9 input rows. I am assuming this is just rounding due to small numbers in the example? i.e., could have been 8 rows or 12 rows in the output table for uniform sampling for 4 groups (blue, green, red, white) but we went with 12? Which is fine, just confirming. ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172920935 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); +SELECT * FROM output_table ORDER BY zone, mainhue; + + + __madlib_id__ | id |name | landmass | zone | area | population | language | colours | mainhue +---++-+--+--+--++--+-+- + 6 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 5 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 8 | 17 | Sweden |3 |1 | 450 | 8 | 6 | 2 | blue + 7 | 8 | Greece |3 |1 | 132 | 10 | 6 | 2 | blue + 2 | 7 | Denmark |3 |1 | 43 | 5 | 6 | 2 | red + 1 | 6 | China |5 |1 | 9561 | 1008 | 7 | 2 | red + 4 | 12 | Luxembourg |3 |1 |3 | 0 | 4 | 3 | red + 3 | 18 | Switzerland |3 |1 | 41 | 6 | 4 | 2 | red + 1 | 2 | Australia |6 |2 | 7690 | 15 | 1 | 3 | blue + 1 | 1 | Argentina |2 |3 | 2777 | 28 | 2 | 2 | blue + 2 | 4 | Brazil |2 |3 | 8512 |119 | 6 | 4 | green + 6 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 5 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue + 4 | 9 | Guatemala |1 |4 | 109 | 8 | 2 | 2 | blue +12 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +10 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green +11 | 13 | Mexico |1 |4 | 1973 | 77 | 2 | 4 | green + 1 | 19 | UK |3 |4 | 245 | 56 | 1 | 3 | red + 3 | 5 | Canada |1 |4 | 9976 | 24 | 1 | 2 | red + 2 | 15 | Portugal|3 |4 | 92 | 10 | 6 | 5 | red + 8 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 7 | 20 | USA |1 |4 | 9363 |231 | 1 | 3 | white + 9 | 10 | Ireland |3 |4 | 70 | 3 | 1 | 3 | white +(23 rows) + + +-# Grouping can be used with class size specification as well. Note below that +'blue=' is the only valid class value since 'blue' is the only class +value that is present in each group. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +'blue=2', -- Specified class value size. Rest of the values are sampled as is. +NULL, -- Output table size +'zone'-- No grouping +); --- End diff -- I think you mean "Group by zone" or something like that, not "No grouping" . ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172920825 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name; (25 rows) +-# To perform the balance sampling for independent groups, use the 'grouping_cols' +parameter. Note below that each group (zone) has a different count of the classes (mainhue), +with some groups not containing some class values. + +DROP TABLE IF EXISTS output_table; +SELECT madlib.balance_sample( +'flags', -- Source table +'output_table', -- Output table +'mainhue',-- Class column +NULL, -- Uniform +NULL, -- Output table size +'zone'-- No grouping +); --- End diff -- I think you mean "Group by zone" or something like that, not "No grouping" . ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/239#discussion_r172920581 --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in --- @@ -149,8 +149,10 @@ non-stratified, that is, the whole table is treated as a single group. @note -Current implementation does not support grouping_cols. It -will be added in an upcoming release. +The 'output_table_size' and the 'class_sizes' are defined for the whole table. +When grouping is used, these parameters are split evenly for each group. +Further, if a specific class value is specified in the 'class_sizes' parameter, +that particular class value should be present in each group. --- End diff -- Please change/add: "...that particular class value should be present in each group." to "...that particular class value should be present in each group. If not, an error will be thrown." ---
[GitHub] madlib pull request #239: Balance Sample: Add support for grouping
GitHub user iyerr3 opened a pull request: https://github.com/apache/madlib/pull/239 Balance Sample: Add support for grouping JIRA: MADLIB-1168 This commit adds grouping support for balanced sampling. Grouping is implemented as a loop over the existing logic, with the sampling for each group run independently. Closes #239 You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/balanced-datasets-grouping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/239.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #239 commit 6c5fcfb375eaf7dc68e1ede4aca2a47b8e55309b Author: Rahul Iyer Date: 2018-02-24T02:45:32Z Clean code + conform to PEP8 commit a5a0c1e2c851a923b9eb550d42dfc594b4635c64 Author: Rahul Iyer Date: 2018-02-26T23:01:34Z Add a Collate plpy results function commit 8e8eca2960207ca0317ded68608c660b8d4ddb55 Author: Rahul Iyer Date: 2018-03-02T00:44:54Z Add grouping in get_level_frequency_distribution commit cad4a5be732f89504ff62f4d9e68367d174fc322 Author: Rahul Iyer Date: 2018-03-07T07:07:00Z Ensure subqueries are filtering groups and using right count commit 39dd6f436bb9b8d505be5204226dcc3053b1b4df Author: Rahul Iyer Date: 2018-03-07T07:07:14Z Update install check to include grouping commit d61ff28290dad27ead0f1c68d740a8ccb79f4aec Author: Rahul Iyer Date: 2018-03-07T07:07:27Z Update documentation with grouping examples ---