[GitHub] madlib issue #317: Fixed trailing whitespace in many sql_in files
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/317 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/667/ ---
[GitHub] madlib issue #316: Build: Disable AppendOnly if available
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/316 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/666/ ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485090 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) --- End diff -- `point_column_name` and `test_column_name` params must be an array as this if check suggests. If it's not an array it fails further down when the distance function (such as `squared_dist_norm2`) is called. I don't think the function `is_var_valid()` checks for these being arrays. You may have to check them after the new asserts, using a new helper function (`is_col_array()` cannot be used as is for expressions, and `is_var_valid()` does not check for an array, but just the validity of the expression) ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487281 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") --- End diff -- Can we change the error message to indicate invalid column name or expression? How about `"kNN error: {0} is an invalid column name or expression for point_column_name param".format(point_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214484071 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') --- End diff -- `point_id` and `test_id` params are not validated anymore. This should still be done, since the new asserts only check for `point_column_name` and `test_column_name` being valid expressions. This validation is required to catch invalid values such as `NULL`, `''`, invalid or non-existing column name etc. for `point_id` and `test_id` params. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214485621 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) --- End diff -- I think we will need to keep these asserts too. They seem to be checking for a different validation. If the helper function `array_col_has_no_null()` does not work for expressions, please go ahead and change that, or create a new helper function to handle the scenario (both expressions and column names). ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214487426 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, point_id, if label_column_name and label_column_name.strip(): cols_in_tbl_valid(point_source, [label_column_name], 'kNN') -cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN') -cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN') - -if not is_col_array(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table is not" - " an array.".format(point_column_name)) -if not is_col_array(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table is not" - " an array.".format(test_column_name)) - -if not array_col_has_no_null(point_source, point_column_name): -plpy.error("kNN Error: Feature column '{0}' in train table has some" - " NULL values.".format(point_column_name)) -if not array_col_has_no_null(test_source, test_column_name): -plpy.error("kNN Error: Feature column '{0}' in test table has some" - " NULL values.".format(test_column_name)) + +_assert(is_var_valid(point_source, point_column_name), +"kNN error: Invalid point_column_name expression") + +_assert(is_var_valid(test_source, test_column_name), +"kNN error: Invalid test_column_name expression") --- End diff -- Similar to the above comment, can we change the error message to: `"kNN error: {0} is an invalid column name or expression for test_column_name param".format(test_column_name)`? ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214482318 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +p_col_name, +t_col_name) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM +( +SELECT {point_id} , {point_column_name} as p_col_name , {label_column_name} from {point_source} +) train, +( +SELECT {test_id} ,{test_column_name} as t_col_name from {test_source} +) test --- End diff -- Can you please use variables with unique strings for `train`, `test`, `p_col_name` and `t_col_name`. If the train or test table is named any of those, the query would fail I guess. While you are at it, could you also do the same for other variables in this query: `train_id`, `r`, `dist_inverse` and others I may have missed listing out? ---
[GitHub] madlib issue #316: Build: Disable AppendOnly if available
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/316 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/665/ ---
[GitHub] madlib issue #315: JIRA:1060 - Modified KNN to accept expressions in point_c...
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/315 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/664/ ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214415305 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -122,5 +122,13 @@ select knn('knn_train_data_reg','data','id','label','knn_test_data','data','id', select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression; +drop table if exists madlib_knn_result_classification; +select knn('knn_train_data','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_classification',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True); +select assert(array_agg(prediction::numeric order by id)='{1,1,0,1,0,0}', 'Wrong output in classification') from madlib_knn_result_classification; + +drop table if exists madlib_knn_result_regression; +select knn('knn_train_data_reg','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_regression',3,False,'MADLIB_SCHEMA.squared_dist_norm2', True); +select assert(array_agg(prediction::numeric order by id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from madlib_knn_result_regression; + --- End diff -- Added this test as well along with other minor changes as suggested. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214415227 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -290,11 +303,11 @@ def knn(schema_madlib, point_source, point_column_name, point_id, ON knn_temp.{test_id_temp} = knn_test.{test_id} {view_join} GROUP BY knn_temp.{test_id_temp}, - knn_test.{test_column_name} +{test_column_name} {view_grp_by} """ plpy.execute(sql.format(**locals())) -plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table)) +# plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table)) --- End diff -- I pushed by mistake. I have reverted the changes. ---
[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/315#discussion_r214414914 --- Diff: src/ports/postgres/modules/knn/knn.py_in --- @@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, point_column_name, point_id, SELECT test.{test_id} AS {test_id_temp}, train.{point_id} as train_id, {fn_dist}( -train.{point_column_name}, -test.{test_column_name}) +train.{point_col_name_temp}, +test.{test_col_name_temp}) AS dist {label_out} -FROM {point_source} AS train, - {test_source} AS test +FROM + {point_source_temp_table} as train, + {test_source_temp_table} as test --- End diff -- Replaced the temp tables with subqueries. ---
[GitHub] madlib issue #316: Build: Disable AppendOnly if available
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/316 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/663/ ---