[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/315


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215464110
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

@fmcquillan99 yes, I was looking at the alternatives. @hpandeycodeit I can 
take care of this. I will push a commit to this PR (you may have to merge it, 
since I am not a co-author on your branch).


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215462116
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

Yes I agree that this error message is too vague.  So in retrospect I 
suggest we do check that a valid array type is generated by the expression.   
Sorry if I mislead earlier @hpandeycodeit 

@njayaram2 do you have an existing function that we can use for this check?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215149356
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
--- End diff --

Added these functions back. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215149235
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+p_col_name,
+t_col_name)
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+(
+SELECT {point_id} , {point_column_name} as 
p_col_name , {label_column_name} from {point_source}
+) train,
+(
+SELECT {test_id} ,{test_column_name} as 
t_col_name from {test_source}
+) test
--- End diff --

Done. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215148966
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

As per the Jira and our discussion, it is okay to fail at distance function 
if the point_column_name and test_column_name params are not arrays. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-09-05 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r215148640
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
--- End diff --

Added it back. Had to do a minor change in array_col_has_no_null() for the 
expressions. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214485090
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
--- End diff --

`point_column_name` and `test_column_name` params must be an array as this 
if check suggests. If it's not an array it fails further down when the distance 
function (such as `squared_dist_norm2`) is called.

I don't think the function `is_var_valid()` checks for these being arrays. 
You may have to check them after the new asserts, using a new helper function 
(`is_col_array()` cannot be used as is for expressions, and `is_var_valid()` 
does not check for an array, but just the validity of the expression)


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214487281
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(is_var_valid(point_source, point_column_name),
+"kNN error: Invalid point_column_name expression")
--- End diff --

Can we change the error message to indicate invalid column name or 
expression? How about 
`"kNN error: {0} is an invalid column name or expression for 
point_column_name param".format(point_column_name)`?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214484071
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
--- End diff --

`point_id` and `test_id` params are not validated anymore. This should 
still be done, since the new asserts only check for `point_column_name` and 
`test_column_name` being valid expressions. This validation is required to 
catch invalid values such as `NULL`, `''`, invalid or non-existing column name 
etc. for `point_id` and `test_id` params.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214485621
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
--- End diff --

I think we will need to keep these asserts too. They seem to be checking 
for a different validation. If the helper function `array_col_has_no_null()` 
does not work for expressions, please go ahead and change that, or create a new 
helper function to handle the scenario (both expressions and column names).


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214487426
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,12 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(is_var_valid(point_source, point_column_name),
+"kNN error: Invalid point_column_name expression")
+
+_assert(is_var_valid(test_source, test_column_name),
+"kNN error: Invalid test_column_name expression")
--- End diff --

Similar to the above comment, can we change the error message to:
`"kNN error: {0} is an invalid column name or expression for 
test_column_name param".format(test_column_name)`?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214482318
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +260,17 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+p_col_name,
+t_col_name)
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+(
+SELECT {point_id} , {point_column_name} as 
p_col_name , {label_column_name} from {point_source}
+) train,
+(
+SELECT {test_id} ,{test_column_name} as 
t_col_name from {test_source}
+) test
--- End diff --

Can you please use variables with unique strings for `train`, `test`, 
`p_col_name` and `t_col_name`. If the train or test table is named any of 
those, the query would fail I guess.
While you are at it, could you also do the same for other variables in this 
query: `train_id`, `r`, `dist_inverse` and others I may have missed listing out?


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214415305
  
--- Diff: src/ports/postgres/modules/knn/test/knn.sql_in ---
@@ -122,5 +122,13 @@ select 
knn('knn_train_data_reg','data','id','label','knn_test_data','data','id',
 select assert(array_agg(prediction::numeric order by 
id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from 
madlib_knn_result_regression;
 
 
+drop table if exists madlib_knn_result_classification;
+select 
knn('knn_train_data','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_classification',3,False,'MADLIB_SCHEMA.squared_dist_norm2',
 True);
+select assert(array_agg(prediction::numeric order by id)='{1,1,0,1,0,0}', 
'Wrong output in classification') from madlib_knn_result_classification;
+
+drop table if exists madlib_knn_result_regression;
+select 
knn('knn_train_data_reg','data[1:1]||data[2:2]','id','label','knn_test_data','data[1:1]||data[2:2]','id','madlib_knn_result_regression',3,False,'MADLIB_SCHEMA.squared_dist_norm2',
 True);
+select assert(array_agg(prediction::numeric order by 
id)='{1,1,0.0408728591876018,1,0,0}', 'Wrong output in regression') from 
madlib_knn_result_regression;
+
--- End diff --

Added this test as well along with other minor changes as suggested. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214415227
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -290,11 +303,11 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 ON knn_temp.{test_id_temp} = knn_test.{test_id}
 {view_join}
 GROUP BY knn_temp.{test_id_temp},
- knn_test.{test_column_name}
+{test_column_name}
  {view_grp_by}
 """
 plpy.execute(sql.format(**locals()))
-plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
+# plpy.execute("DROP TABLE IF EXISTS {0}".format(interim_table))
--- End diff --

I pushed by mistake. I have reverted the changes. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-31 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r214414914
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+train.{point_col_name_temp},
+test.{test_col_name_temp})
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+ {point_source_temp_table} as train,
+ {test_source_temp_table} as test
--- End diff --

Replaced the temp tables with subqueries. 


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213806554
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(point_column_name, "KNN error: Invalid point_column_name 
expression")
+
+_assert(test_column_name, "KNN error: Invalid test_column_name 
expression")
--- End diff --

`KNN` in the error message is different from `kNN` used in other error 
messages (capital `K`). Please keep it consistent as `kNN`.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213791885
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -53,22 +55,10 @@ def knn_validate_src(schema_madlib, point_source, 
point_column_name, point_id,
 
 if label_column_name and label_column_name.strip():
 cols_in_tbl_valid(point_source, [label_column_name], 'kNN')
-cols_in_tbl_valid(point_source, (point_column_name, point_id), 'kNN')
-cols_in_tbl_valid(test_source, (test_column_name, test_id), 'kNN')
-
-if not is_col_array(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table is not"
-   " an array.".format(point_column_name))
-if not is_col_array(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table is not"
-   " an array.".format(test_column_name))
-
-if not array_col_has_no_null(point_source, point_column_name):
-plpy.error("kNN Error: Feature column '{0}' in train table has 
some"
-   " NULL values.".format(point_column_name))
-if not array_col_has_no_null(test_source, test_column_name):
-plpy.error("kNN Error: Feature column '{0}' in test table has some"
-   " NULL values.".format(test_column_name))
+
+_assert(point_column_name, "KNN error: Invalid point_column_name 
expression")
+
+_assert(test_column_name, "KNN error: Invalid test_column_name 
expression")
--- End diff --

Since the original asserts are removed, this results in the function call 
not exiting gracefully when we have incorrect param values. You may have to use 
function `is_var_valid()` in `validate_args.py_in` to validate 
`point_column_name` and `test_column_name`.


---


[GitHub] madlib pull request #315: JIRA:1060 - Modified KNN to accept expressions in ...

2018-08-29 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/315#discussion_r213792935
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -264,12 +275,14 @@ def knn(schema_madlib, point_source, 
point_column_name, point_id,
 SELECT test.{test_id} AS {test_id_temp},
 train.{point_id} as train_id,
 {fn_dist}(
-train.{point_column_name},
-test.{test_column_name})
+train.{point_col_name_temp},
+test.{test_col_name_temp})
 AS dist
 {label_out}
-FROM {point_source} AS train,
- {test_source} AS test
+FROM
+ {point_source_temp_table} as train,
+ {test_source_temp_table} as test
--- End diff --

Please use subqueries, instead of tables:
```
(select {point_id} , {point_column_name} as {point_col_name_temp} , 
{label_column_name} from {point_source}) train,
(select {test_id}, {test_column_name} as {test_col_name_temp} from 
{test_source}) test
```


---