[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/315 ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215464110 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- @fmcquillan99 yes, I was looking at the alternatives. @hpandeycodeit I can take care of this. I will push a commit to this PR (you may have to merge it, since I am not a co-author on your branch). ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user fmcquillan99 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215462116 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- Yes I agree that this error message is too vague. So in retrospect I suggest we do check that a valid array type is generated by the expression. Sorry if I mislead earlier @hpandeycodeit @njayaram2 do you have an existing function that we can use for this check? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215149356 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') --- End diff -- Added these functions back. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215149235 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +p_col_name, +t_col_name) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM +( +SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source} +) train, +( +SELECT {test_id} ,{test_column_name} as t_col_name from {test_source} +) test --- End diff -- Done. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215148966 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- As per the Jira and our discussion, it is okay to fail at distance function if the point_column_name and test_column_name params are not arrays. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r215148640 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) --- End diff -- Added it back. Had to do a minor change in array_col_has_no_null() for the expressions. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485090 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- `point_column_name` and `test_column_name` params must be an array as this if check suggests. If it's not an array it fails further down when the distance function (such as `squared_dist_norm2`) is called. I don't think the function `is_var_valid()` checks for these being arrays. You may have to check them after the new asserts, using a new helper function (`is_col_array()` cannot be used as is for expressions, and `is_var_valid()` does not check for an array, but just the validity of the expression) ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487281 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") --- End diff -- Can we change the error message to indicate invalid column name or expression? How about `"kNN error: {0} is an invalid column name or expression for point_column_name param".format(point_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214484071 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') --- End diff -- `point_id` and `test_id` params are not validated anymore. This should still be done, since the new asserts only check for `point_column_name` and `test_column_name` being valid expressions. This validation is required to catch invalid values such as `NULL`, `''`, invalid or non-existing column name etc. for `point_id` and `test_id` params. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485621 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) --- End diff -- I think we will need to keep these asserts too. They seem to be checking for a different validation. If the helper function `array_col_has_no_null()` does not work for expressions, please go ahead and change that, or create a new helper function to handle the scenario (both expressions and column names). ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487426 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") + +_assert(is_var_valid(test_source, test_column_name), +"kNN error: Invalid test_column_name expression") --- End diff -- Similar to the above comment, can we change the error message to: `"kNN error: {0} is an invalid column name or expression for test_column_name param".format(test_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214482318 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +p_col_name, +t_col_name) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM +( +SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source} +) train, +( +SELECT {test_id} ,{test_column_name} as t_col_name from {test_source} +) test --- End diff -- Can you please use variables with unique strings for `train`, `test`, `p_col_name` and `t_col_name`. If the train or test table is named any of those, the query would fail I guess. While you are at it, could you also do the same for other variables in this query: `train_id`, `r`, `dist_inverse` and others I may have missed listing out? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214415305 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -122,5 +122,13 @@ select knn('knn_train_data_reg','data','id','label','knn_test_data','data','id', select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression; +drop table if exists madlib_knn_result_classification; +select knn('knn_train_data','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_classification',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True); +select assert(array_agg(prediction::numeric order by id)='{1,1,0,1,0,0}', 'Wrong output in classification') from madlib_knn_result_classification; + +drop table if exists madlib_knn_result_regression; +select knn('knn_train_data_reg','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_regression',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True); +select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression; + --- End diff -- Added this test as well along with other minor changes as suggested. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214415227 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -290,11 +303,11 @@ def knn(schema_madlib, point_source, point_column_name, point_id, ON knn_temp.{test_id_temp} = knn_test.{test_id} {view_join} GROUP BY knn_temp.{test_id_temp}, - knn_test.{test_column_name} +{test_column_name} {view_grp_by} """ plpy.execute(sql.format(**locals())) -plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table)) +# plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table)) --- End diff -- I pushed by mistake. I have reverted the changes. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214414914 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +train.{point_col_name_temp}, +test.{test_col_name_temp}) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM + {point_source_temp_table} as train, + {test_source_temp_table} as test --- End diff -- Replaced the temp tables with subqueries. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213806554 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(point_column_name, "KNN error: Invalid point_column_name expression") + +_assert(test_column_name, "KNN error: Invalid test_column_name expression") --- End diff -- `KNN` in the error message is different from `kNN` used in other error messages (capital `K`). Please keep it consistent as `kNN`. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213791885 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(point_column_name, "KNN error: Invalid point_column_name expression") + +_assert(test_column_name, "KNN error: Invalid test_column_name expression") --- End diff -- Since the original asserts are removed, this results in the function call not exiting gracefully when we have incorrect param values. You may have to use function `is_var_valid()` in `validate_args.py_in` to validate `point_column_name` and `test_column_name`. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r213792935 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +train.{point_col_name_temp}, +test.{test_col_name_temp}) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM + {point_source_temp_table} as train, + {test_source_temp_table} as test --- End diff -- Please use subqueries, instead of tables: ``` (select {point_id} , {point_column_name} as {point_col_name_temp} , {label_column_name} from {point_source}) train, (select {test_id}, {test_column_name} as {test_col_name_temp} from {test_source}) test ``` ---