[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/239


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-08 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173254469
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the
+classes (mainhue), with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group. Further, `blue=8` in the example 
below will
+be split between the four groups, resulting in two blue rows for each 
group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=8', -- Specified class value size. Rest of the values 
are outputed as is.
+NULL, -- Output table size
+'zone'-- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 2 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 1 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 3 |  3 | Austria |3 |1 |   84 |  8 |  
  4 |   2 | 

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-08 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173254181
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the
+classes (mainhue), with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group. Further, `blue=8` in the example 
below will
+be split between the four groups, resulting in two blue rows for each 
group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=8', -- Specified class value size. Rest of the values 
are outputed as is.
+NULL, -- Output table size
+'zone'-- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 2 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 1 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 3 |  3 | Austria |3 |1 |   84 |  8 |  
  4 |   2 | red

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-08 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173239594
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the
+classes (mainhue), with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group. Further, `blue=8` in the example 
below will
+be split between the four groups, resulting in two blue rows for each 
group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=8', -- Specified class value size. Rest of the values 
are outputed as is.
+NULL, -- Output table size
+'zone'-- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 2 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 1 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 3 |  3 | Austria |3 |1 |   84 |  8 |  
  4 |   2 | 

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-08 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173238804
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +545,95 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the
+classes (mainhue), with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group. Further, `blue=8` in the example 
below will
+be split between the four groups, resulting in two blue rows for each 
group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=8', -- Specified class value size. Rest of the values 
are outputed as is.
+NULL, -- Output table size
+'zone'-- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
--- End diff --

For some reason, doxygen builds the above as
```
DROP TABLE IF EXISTS output_table;
SELECT madlib.balance_sample(
'flags,  -- Source table
'output_table',   -- Output table
'mainhue',-- Class column
'blue=8', -- Specified class value size. Rest of the values are 
outputed as is.
NULL, -- Output table size
'zone'-- Group by zone
);
SELECT * FROM output_table ORDER BY zone, mainhue;
```
whe

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173080726
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, 
output_table, class_col,
 parsed_class_sizes = extract_keyvalue_params(class_sizes,
  
allow_duplicates=False,
  
lower_case_names=False)
+distinct_levels = collate_plpy_result(
+plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ".
+ format(class_col, source_table)))['levels']
 if not parsed_class_sizes:
-sampling_strategy_str = 
_validate_and_get_sampling_strategy(class_sizes,
-output_table_size)
+sampling_strategy_str = _validate_and_get_sampling_strategy(
+class_sizes, output_table_size)
 else:
 sampling_strategy_str = None
 try:
 for each_level, each_class_size in 
parsed_class_sizes.items():
-_assert(each_level in actual_level_counts,
+_assert(each_level in distinct_levels,
 "Sample: Invalid class value specified ({0})".
-   format(each_level))
+format(each_level))
 each_class_size = int(each_class_size)
 _assert(each_class_size >= 1,
 "Sample: Class size has to be greater than 
zero")
 parsed_class_sizes[each_level] = each_class_size
-
-except TypeError:
+except ValueError:
 plpy.error("Sample: Invalid value for class_sizes ({0})".
format(class_sizes))
 
 # Get the number of rows to be sampled for each class level, based 
on
 # the input table, class_sizes, and output_table_size params. This 
also
 # includes info about the resulting sampling strategy, i.e., one of
 # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level.
-target_class_sizes = 
_get_target_level_counts(sampling_strategy_str,
-  parsed_class_sizes,
-  actual_level_counts,
-  output_table_size)
-
-undersample_level_dict, oversample_level_dict, nosample_level_dict 
= \
-_get_sampling_strategy_specific_dict(target_class_sizes)
-
-# Get subqueries for each sampling strategy, so that they can be 
used
-# together in one big query.
-
-# Subquery that will be used to get rows as is for those class 
levels
-# that need no sampling.
-nosample_subquery = _get_nosample_subquery(
-new_source_table, class_col, nosample_level_dict.keys())
-# Subquery that will be used to sample those class levels that
-# have to be oversampled.
-oversample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, oversample_level_dict)
-# Subquery that will be used to sample those class levels that
-# have to be undersampled. Undersampling supports both with and 
without
-# replacement, so fetch the appropriate subquery.
-if with_replacement:
-undersample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-else:
-undersample_subquery = _get_without_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-
-# Merge the three subqueries using a UNION ALL clause.
-union_all_subquery = ' UNION ALL '.join(
-['({0})'.format(subquery)
- for subquery in [undersample_subquery, oversample_subquery, 
nosample_subquery]
- if subquery])
-
-final_query = """
-CREATE TABLE {output_table} AS
-SELECT row_number() OVER() AS {new_col_name}, *
+grp_col_str, grp_cols = get_grouping_col_str(
+schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col],
+source_table, grouping_cols)
+actual_grp_level_cou

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r173080410
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -58,28 +60,64 @@ NOSAMPLE = 'nosample'
 NEW_ID_COLUMN = '__madlib_id__'
 NULL_IDENTIFIER = '__madlib_null_id__'
 
-def _get_level_frequency_distribution(source_table, class_col):
-""" Returns a dict containing the number of rows associated with each 
class
+
+def _get_level_frequency_distribution(source_table, class_col,
+  grp_by_cols=None):
+""" Count the number of rows for each class, partitioned by the 
grp_by_cols
+
+Returns a dict containing the number of rows associated with each 
class
 level. Each class level count is converted to a string using 
::text.
 None is a valid key in this dict, capturing NULL value in the 
database.
 """
+if grp_by_cols and grp_by_cols.lower() != 'null':
+is_grouping = True
+grp_by_cols_comma = grp_by_cols + ', '
+array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as 
group_values, "
+else:
+is_grouping = False
+grp_by_cols_comma = array_grp_by_cols_comma = ""
+
+# In below query, the inner query groups the data using grp_by_cols + 
classes
+# and obtains the count for each combination. The outer query then 
groups
+# again by the grp_by_cols to collect the classes and counts in an 
array.
 query_result = plpy.execute("""
-SELECT {class_col}::text AS classes,
-   count(*) AS class_count
-FROM {source_table}
-GROUP BY {class_col}
- """.format(**locals()))
+SELECT
+-- For each group get the classes and their rows counts
+{grp_identifier} as group_values,
+array_agg(classes) as classes,
+array_agg(class_count) as class_count
+FROM(
+-- for each group and class combination present in source table
+-- get the count of rows for that combination
+SELECT
+{array_grp_by_cols_comma}
+({class_col})::TEXT AS classes,
+count(*) AS class_count
+FROM {source_table}
+GROUP BY {grp_by_cols_comma} ({class_col})
+) q
+GROUP BY {grp_identifier}
--- End diff --

Thanks for checking this! I've removed the constant group by clause to 
avoid this. 


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172954235
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, 
output_table, class_col,
 parsed_class_sizes = extract_keyvalue_params(class_sizes,
  
allow_duplicates=False,
  
lower_case_names=False)
+distinct_levels = collate_plpy_result(
+plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ".
+ format(class_col, source_table)))['levels']
 if not parsed_class_sizes:
-sampling_strategy_str = 
_validate_and_get_sampling_strategy(class_sizes,
-output_table_size)
+sampling_strategy_str = _validate_and_get_sampling_strategy(
+class_sizes, output_table_size)
 else:
 sampling_strategy_str = None
 try:
 for each_level, each_class_size in 
parsed_class_sizes.items():
-_assert(each_level in actual_level_counts,
+_assert(each_level in distinct_levels,
 "Sample: Invalid class value specified ({0})".
-   format(each_level))
+format(each_level))
 each_class_size = int(each_class_size)
 _assert(each_class_size >= 1,
 "Sample: Class size has to be greater than 
zero")
 parsed_class_sizes[each_level] = each_class_size
-
-except TypeError:
+except ValueError:
 plpy.error("Sample: Invalid value for class_sizes ({0})".
format(class_sizes))
 
 # Get the number of rows to be sampled for each class level, based 
on
 # the input table, class_sizes, and output_table_size params. This 
also
 # includes info about the resulting sampling strategy, i.e., one of
 # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level.
-target_class_sizes = 
_get_target_level_counts(sampling_strategy_str,
-  parsed_class_sizes,
-  actual_level_counts,
-  output_table_size)
-
-undersample_level_dict, oversample_level_dict, nosample_level_dict 
= \
-_get_sampling_strategy_specific_dict(target_class_sizes)
-
-# Get subqueries for each sampling strategy, so that they can be 
used
-# together in one big query.
-
-# Subquery that will be used to get rows as is for those class 
levels
-# that need no sampling.
-nosample_subquery = _get_nosample_subquery(
-new_source_table, class_col, nosample_level_dict.keys())
-# Subquery that will be used to sample those class levels that
-# have to be oversampled.
-oversample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, oversample_level_dict)
-# Subquery that will be used to sample those class levels that
-# have to be undersampled. Undersampling supports both with and 
without
-# replacement, so fetch the appropriate subquery.
-if with_replacement:
-undersample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-else:
-undersample_subquery = _get_without_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-
-# Merge the three subqueries using a UNION ALL clause.
-union_all_subquery = ' UNION ALL '.join(
-['({0})'.format(subquery)
- for subquery in [undersample_subquery, oversample_subquery, 
nosample_subquery]
- if subquery])
-
-final_query = """
-CREATE TABLE {output_table} AS
-SELECT row_number() OVER() AS {new_col_name}, *
+grp_col_str, grp_cols = get_grouping_col_str(
+schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col],
+source_table, grouping_cols)
+actual_grp_level_

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172953687
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -468,81 +544,107 @@ def balance_sample(schema_madlib, source_table, 
output_table, class_col,
 parsed_class_sizes = extract_keyvalue_params(class_sizes,
  
allow_duplicates=False,
  
lower_case_names=False)
+distinct_levels = collate_plpy_result(
+plpy.execute("SELECT DISTINCT ({0})::TEXT as levels FROM {1} ".
+ format(class_col, source_table)))['levels']
 if not parsed_class_sizes:
-sampling_strategy_str = 
_validate_and_get_sampling_strategy(class_sizes,
-output_table_size)
+sampling_strategy_str = _validate_and_get_sampling_strategy(
+class_sizes, output_table_size)
 else:
 sampling_strategy_str = None
 try:
 for each_level, each_class_size in 
parsed_class_sizes.items():
-_assert(each_level in actual_level_counts,
+_assert(each_level in distinct_levels,
 "Sample: Invalid class value specified ({0})".
-   format(each_level))
+format(each_level))
 each_class_size = int(each_class_size)
 _assert(each_class_size >= 1,
 "Sample: Class size has to be greater than 
zero")
 parsed_class_sizes[each_level] = each_class_size
-
-except TypeError:
+except ValueError:
 plpy.error("Sample: Invalid value for class_sizes ({0})".
format(class_sizes))
 
 # Get the number of rows to be sampled for each class level, based 
on
 # the input table, class_sizes, and output_table_size params. This 
also
 # includes info about the resulting sampling strategy, i.e., one of
 # UNDERSAMPLE, OVERSAMPLE, or NOSAMPLE for each level.
-target_class_sizes = 
_get_target_level_counts(sampling_strategy_str,
-  parsed_class_sizes,
-  actual_level_counts,
-  output_table_size)
-
-undersample_level_dict, oversample_level_dict, nosample_level_dict 
= \
-_get_sampling_strategy_specific_dict(target_class_sizes)
-
-# Get subqueries for each sampling strategy, so that they can be 
used
-# together in one big query.
-
-# Subquery that will be used to get rows as is for those class 
levels
-# that need no sampling.
-nosample_subquery = _get_nosample_subquery(
-new_source_table, class_col, nosample_level_dict.keys())
-# Subquery that will be used to sample those class levels that
-# have to be oversampled.
-oversample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, oversample_level_dict)
-# Subquery that will be used to sample those class levels that
-# have to be undersampled. Undersampling supports both with and 
without
-# replacement, so fetch the appropriate subquery.
-if with_replacement:
-undersample_subquery = _get_with_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-else:
-undersample_subquery = _get_without_replacement_subquery(
-schema_madlib, new_source_table, source_table_columns, 
class_col,
-actual_level_counts, undersample_level_dict)
-
-# Merge the three subqueries using a UNION ALL clause.
-union_all_subquery = ' UNION ALL '.join(
-['({0})'.format(subquery)
- for subquery in [undersample_subquery, oversample_subquery, 
nosample_subquery]
- if subquery])
-
-final_query = """
-CREATE TABLE {output_table} AS
-SELECT row_number() OVER() AS {new_col_name}, *
+grp_col_str, grp_cols = get_grouping_col_str(
+schema_madlib, 'Balance sample', [NEW_ID_COLUMN, class_col],
+source_table, grouping_cols)
+actual_grp_level_

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172958587
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -58,28 +60,64 @@ NOSAMPLE = 'nosample'
 NEW_ID_COLUMN = '__madlib_id__'
 NULL_IDENTIFIER = '__madlib_null_id__'
 
-def _get_level_frequency_distribution(source_table, class_col):
-""" Returns a dict containing the number of rows associated with each 
class
+
+def _get_level_frequency_distribution(source_table, class_col,
+  grp_by_cols=None):
+""" Count the number of rows for each class, partitioned by the 
grp_by_cols
+
+Returns a dict containing the number of rows associated with each 
class
 level. Each class level count is converted to a string using 
::text.
 None is a valid key in this dict, capturing NULL value in the 
database.
 """
+if grp_by_cols and grp_by_cols.lower() != 'null':
+is_grouping = True
+grp_by_cols_comma = grp_by_cols + ', '
+array_grp_by_cols_comma = "array[{0}]".format(grp_by_cols) + " as 
group_values, "
+else:
+is_grouping = False
+grp_by_cols_comma = array_grp_by_cols_comma = ""
+
+# In below query, the inner query groups the data using grp_by_cols + 
classes
+# and obtains the count for each combination. The outer query then 
groups
+# again by the grp_by_cols to collect the classes and counts in an 
array.
 query_result = plpy.execute("""
-SELECT {class_col}::text AS classes,
-   count(*) AS class_count
-FROM {source_table}
-GROUP BY {class_col}
- """.format(**locals()))
+SELECT
+-- For each group get the classes and their rows counts
+{grp_identifier} as group_values,
+array_agg(classes) as classes,
+array_agg(class_count) as class_count
+FROM(
+-- for each group and class combination present in source table
+-- get the count of rows for that combination
+SELECT
+{array_grp_by_cols_comma}
+({class_col})::TEXT AS classes,
+count(*) AS class_count
+FROM {source_table}
+GROUP BY {grp_by_cols_comma} ({class_col})
+) q
+GROUP BY {grp_identifier}
--- End diff --

In Greenplum 5.x, the above query fails on install check, with the 
following error:
```
SELECT balance_sample('"TEST_s"', 'out_sr2', 'gr1', 'undersample ', NULL, 
NULL, TRUE, TRUE);
psql:/tmp/madlib.2N5sjK/sample/test/balance_sample.sql_in.tmp:111: ERROR:  
plpy.SPIError: non-integer constant in GROUP BY
LINE 17: GROUP BY true
  ^
QUERY:
SELECT
-- For each group get the classes and their rows counts
true as group_values,
array_agg(classes) as classes,
array_agg(class_count) as class_count
FROM(
-- for each group and class combination present in source table
-- get the count of rows for that combination
SELECT

(gr1)::TEXT AS classes,
count(*) AS class_count
FROM "TEST_s"
GROUP BY  (gr1)
) q
GROUP BY true

CONTEXT:  Traceback (most recent call last):
  PL/Python function "balance_sample", line 23, in 
return balance_sample.balance_sample(**globals())
  PL/Python function "balance_sample", line 575, in balance_sample
  PL/Python function "balance_sample", line 100, in 
_get_level_frequency_distribution
PL/Python function "balance_sample"
```


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172949439
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=2', -- Specified class value size. Rest of the values 
are sampled as is.
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 1 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 6 | 14 | Norway  |3 |1 |  324 |  4 |  
  6 |   3 | red
+ 5 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 4 |  7 | 

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172943311
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=2', -- Specified class value size. Rest of the values 
are sampled as is.
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 1 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 6 | 14 | Norway  |3 |1 |  324 |  4 |  
  6 |   3 | red
+ 5 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 4 |  7 | Denmar

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172941914
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
--- End diff --

Right. We're always using ceil instead of round and hence will provide more 
than the expected rows in cases where division is not exact.  


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172922334
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=2', -- Specified class value size. Rest of the values 
are sampled as is.
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 1 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 6 | 14 | Norway  |3 |1 |  324 |  4 |  
  6 |   3 | red
+ 5 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 4 |  7 | 

[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172921714
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=2', -- Specified class value size. Rest of the values 
are sampled as is.
--- End diff --

Do you mean 
“Rest of the values are left as is.”
since they are not actually sampled. ?


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172921328
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
--- End diff --

For zone 4 there are 12 output rows sampled from 9 input rows.  I am 
assuming this is just rounding due to small numbers in the example?  i.e., 
could have been 8 rows or 12 rows in the output table for uniform sampling for 
4 groups (blue, green, red, white) but we went with 12? 

Which is fine, just confirming.



---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172920935
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+
+
+ __madlib_id__ | id |name | landmass | zone | area | population | 
language | colours | mainhue

+---++-+--+--+--++--+-+-
+ 6 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 5 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 8 | 17 | Sweden  |3 |1 |  450 |  8 |  
  6 |   2 | blue
+ 7 |  8 | Greece  |3 |1 |  132 | 10 |  
  6 |   2 | blue
+ 2 |  7 | Denmark |3 |1 |   43 |  5 |  
  6 |   2 | red
+ 1 |  6 | China   |5 |1 | 9561 |   1008 |  
  7 |   2 | red
+ 4 | 12 | Luxembourg  |3 |1 |3 |  0 |  
  4 |   3 | red
+ 3 | 18 | Switzerland |3 |1 |   41 |  6 |  
  4 |   2 | red
+ 1 |  2 | Australia   |6 |2 | 7690 | 15 |  
  1 |   3 | blue
+ 1 |  1 | Argentina   |2 |3 | 2777 | 28 |  
  2 |   2 | blue
+ 2 |  4 | Brazil  |2 |3 | 8512 |119 |  
  6 |   4 | green
+ 6 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 5 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+ 4 |  9 | Guatemala   |1 |4 |  109 |  8 |  
  2 |   2 | blue
+12 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+10 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+11 | 13 | Mexico  |1 |4 | 1973 | 77 |  
  2 |   4 | green
+ 1 | 19 | UK  |3 |4 |  245 | 56 |  
  1 |   3 | red
+ 3 |  5 | Canada  |1 |4 | 9976 | 24 |  
  1 |   2 | red
+ 2 | 15 | Portugal|3 |4 |   92 | 10 |  
  6 |   5 | red
+ 8 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 7 | 20 | USA |1 |4 | 9363 |231 |  
  1 |   3 | white
+ 9 | 10 | Ireland |3 |4 |   70 |  3 |  
  1 |   3 | white
+(23 rows)
+
+
+-# Grouping can be used with class size specification as well. Note below 
that
+'blue=' is the only valid class value since 'blue' is the only 
class
+value that is present in each group.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+'blue=2', -- Specified class value size. Rest of the values 
are sampled as is.
+NULL, -- Output table size
+'zone'-- No grouping
+);
--- End diff --

I think you mean "Group by zone" or something like that, not "No grouping" .


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172920825
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -543,6 +544,90 @@ SELECT * FROM output_table ORDER BY mainhue, name;
 (25 rows)
 
 
+-# To perform the balance sampling for independent groups, use the 
'grouping_cols'
+parameter. Note below that each group (zone) has a different count of the 
classes (mainhue),
+with some groups not containing some class values.
+
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+'flags',  -- Source table
+'output_table',   -- Output table
+'mainhue',-- Class column
+NULL, -- Uniform
+NULL, -- Output table size
+'zone'-- No grouping
+);
--- End diff --

I think you mean "Group by zone" or something like that, not "No grouping" .


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-07 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/239#discussion_r172920581
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -149,8 +149,10 @@ non-stratified, that is, the whole table is treated as 
a
 single group.
 
 @note
-Current implementation does not support grouping_cols.  It
-will be added in an upcoming release.
+The 'output_table_size' and the 'class_sizes' are defined for the whole 
table.
+When grouping is used, these parameters are split evenly for each group.
+Further, if a specific class value is specified in the 'class_sizes' 
parameter,
+that particular class value should be present in each group.
 
--- End diff --

Please change/add:
"...that particular class value should be present in each group."
to 
"...that particular class value should be present in each group.  If not, 
an error will be thrown."


---


[GitHub] madlib pull request #239: Balance Sample: Add support for grouping

2018-03-06 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/madlib/pull/239

Balance Sample: Add support for grouping

JIRA: MADLIB-1168

This commit adds grouping support for balanced sampling. 
Grouping is implemented as a loop over the existing logic, 
with the sampling for each group run independently. 

Closes #239

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib 
feature/balanced-datasets-grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/239.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #239


commit 6c5fcfb375eaf7dc68e1ede4aca2a47b8e55309b
Author: Rahul Iyer 
Date:   2018-02-24T02:45:32Z

Clean code + conform to PEP8

commit a5a0c1e2c851a923b9eb550d42dfc594b4635c64
Author: Rahul Iyer 
Date:   2018-02-26T23:01:34Z

Add a Collate plpy results function

commit 8e8eca2960207ca0317ded68608c660b8d4ddb55
Author: Rahul Iyer 
Date:   2018-03-02T00:44:54Z

Add grouping in get_level_frequency_distribution

commit cad4a5be732f89504ff62f4d9e68367d174fc322
Author: Rahul Iyer 
Date:   2018-03-07T07:07:00Z

Ensure subqueries are filtering groups and using right count

commit 39dd6f436bb9b8d505be5204226dcc3053b1b4df
Author: Rahul Iyer 
Date:   2018-03-07T07:07:14Z

Update install check to include grouping

commit d61ff28290dad27ead0f1c68d740a8ccb79f4aec
Author: Rahul Iyer 
Date:   2018-03-07T07:07:27Z

Update documentation with grouping examples




---