[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-11-07 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r1016098840


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##
@@ -21,11 +21,44 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidator;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param  The class type of this instance.
  */
 public interface VectorAssemblerParams
-extends HasInputCols, HasOutputCol, HasHandleInvalid {}
+extends HasInputCols, HasOutputCol, HasHandleInvalid {
+Param INPUT_SIZES =
+new IntArrayParam(
+"inputSizes",
+"Sizes of the input elements to be assembled.",
+null,
+sizesValidator());
+
+default Integer[] getInputSizes() {
+return get(INPUT_SIZES);
+}
+
+default T setInputSizes(Integer... value) {
+return set(INPUT_SIZES, value);
+}
+
+// Checks the inputSizes parameter.
+static ParamValidator sizesValidator() {
+return inputSizes -> {
+if (inputSizes == null) {
+return false;
+}
+for (Integer ele : inputSizes) {
+if (ele < 0) {

Review Comment:
   If size of element equals to 0, return false as `return inputSizes.length != 
0;` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-31 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r1010020986


##
flink-ml-lib/src/test/java/org/apache/flink/ml/feature/VectorAssemblerTest.java:
##
@@ -45,13 +45,16 @@
 

Review Comment:
   I have refine code as spark's functions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-31 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r1009193152


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##
@@ -21,11 +21,44 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidator;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param  The class type of this instance.
  */
 public interface VectorAssemblerParams
-extends HasInputCols, HasOutputCol, HasHandleInvalid {}
+extends HasInputCols, HasOutputCol, HasHandleInvalid {
+Param INPUT_SIZES =
+new IntArrayParam(
+"inputSizes",
+"Sizes of the input elements to be assembled.",
+null,
+sizesValidator());
+
+default Integer[] getInputSizes() {
+return get(INPUT_SIZES);
+}
+
+default T setInputSizes(Integer... value) {
+return set(INPUT_SIZES, value);
+}
+
+// Checks the inputSizes parameter.
+static ParamValidator sizesValidator() {
+return inputSizes -> {
+if (inputSizes == null) {
+return false;

Review Comment:
   We have discuss this case already. If record-1 and record-2 has different 
sizes, then we can't decide which one is used. For spark stores the sizes in 
meta info, but flink-ml has no meta info.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-11 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r992921852


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,10 +47,15 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. The operator deals with null values or records with 
wrong sizes according to
+ * the strategy specified by the {@link HasHandleInvalid} parameter as follows:
  *
- * The `keep` option of {@link HasHandleInvalid} means that we output bad 
rows with output column
- * set to null.
+ * The `keep` option means that we do the assembling action without 
checking the vector size.

Review Comment:
   OK, I will fix this ambiguity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-11 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r992921852


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,10 +47,15 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. The operator deals with null values or records with 
wrong sizes according to
+ * the strategy specified by the {@link HasHandleInvalid} parameter as follows:
  *
- * The `keep` option of {@link HasHandleInvalid} means that we output bad 
rows with output column
- * set to null.
+ * The `keep` option means that we do the assembling action without 
checking the vector size.

Review Comment:
   assembling [1,2], [3,4,5] with sizes 2,2 may get [1, 2, 3, 4] (trim to fit 
size)
   assembling [1,2], [3,4,5] with sizes 2,2 may get [1, 2, NaN, NaN]
   assembling [1,2], null with sizes 2,2 may get [1, 2, NaN, NaN]
   assembling [1,2], null with sizes 2,2 may get [1, 2, 0, 0]
   
   These situations will not occur. These vectors do not exist, at least not in 
the flink-ml code.
   
   assembling [1,2], [3,4] with sizes 2,3 may get [1, 2, 3, 4]
   assembling [1,2], [3,4] with sizes 2,3 may get [1, 2, 3, 4, 0] (padding with 
zeros)
   
   The assembling of two vectors is splicing, the result size is the sum of the 
vector sizes. This is common sense. It should be [1, 2, 3, 4, 0]. There is no 
ambiguity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-11 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r992921852


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,10 +47,15 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. The operator deals with null values or records with 
wrong sizes according to
+ * the strategy specified by the {@link HasHandleInvalid} parameter as follows:
  *
- * The `keep` option of {@link HasHandleInvalid} means that we output bad 
rows with output column
- * set to null.
+ * The `keep` option means that we do the assembling action without 
checking the vector size.

Review Comment:
   assembling [1,2], [3,4,5] with sizes 2,2 may get [1, 2, 3, 4] (trim to fit 
size)
   assembling [1,2], [3,4,5] with sizes 2,2 may get [1, 2, NaN, NaN]
   assembling [1,2], null with sizes 2,2 may get [1, 2, NaN, NaN]
   assembling [1,2], null with sizes 2,2 may get [1, 2, 0, 0]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-10 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r991765238


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,10 +47,15 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. If the element is null or has the wrong size, we will 
process this case with
+ * {@link HasHandleInvalid} parameter as follows:
  *
- * The `keep` option of {@link HasHandleInvalid} means that we output bad 
rows with output column
- * set to null.
+ * The `keep` option means that we do not check the vector size, and keep 
all rows.

Review Comment:
   The output value will have a different vector size with the normal ones. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-09 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990880285


##
docs/content/docs/operators/feature/vectorassembler.md:
##
@@ -44,11 +46,12 @@ Types of input columns must be either vector or numerical 
value.
 
 ### Parameters
 
-| Key   | Default| Type | Required | Description   
 |
-|---||--|--||
-| inputCols | `null` | String[] | yes  | Input column names.   
 |
-| outputCol | `"output"` | String   | no   | Output column name.   
 |
-| handleInvalid | `"error"`  | String   | no   | Strategy to handle 
invalid entries. Supported values: 'error', 'skip', 'keep'. |
+| Key | Default| Type  | Required | Description
|
+|-||---|--||
+| inputCols   | `null` | String[]  | yes  | Input column names.
|
+| outputCol   | `"output"` | String| no   | Output column name.
|
+| inputSizes  | `null` | Integer[] | yes  | Sizes of the 
assembling elements.  |

Review Comment:
   I think the integer is much better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990730701


##
docs/content/docs/operators/feature/vectorassembler.md:
##
@@ -27,8 +27,10 @@ under the License.
 

Review Comment:
   After discussing offline, we need not add this parameter to these two 
algorithms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990730701


##
docs/content/docs/operators/feature/vectorassembler.md:
##
@@ -27,8 +27,10 @@ under the License.
 

Review Comment:
   Discuss offline, we need not add this parameter to these two algorithms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990730701


##
docs/content/docs/operators/feature/vectorassembler.md:
##
@@ -27,8 +27,10 @@ under the License.
 

Review Comment:
   Discuss with some people, we need not add this parameter to these two 
algorithms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990719982


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,7 +47,9 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. If the size of the element is not equal to 
sizes[columnIdx], it will throw an
+ * IllegalArgumentException.

Review Comment:
   If the job is a streaming job, I think passing the null record is more 
proper.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990719982


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -47,7 +47,9 @@
 
 /**
  * A Transformer which combines a given list of input columns into a vector 
column. Types of input
- * columns must be either vector or numerical value.
+ * columns must be either vector or numerical types. The elements assembled in 
the same column must
+ * have the same size. If the size of the element is not equal to 
sizes[columnIdx], it will throw an
+ * IllegalArgumentException.

Review Comment:
   If the job is a streaming job, I think passing this record is more proper.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-10-08 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r990719844


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##
@@ -21,11 +21,31 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
+
+import java.util.Arrays;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param  The class type of this instance.
  */
 public interface VectorAssemblerParams
-extends HasInputCols, HasOutputCol, HasHandleInvalid {}
+extends HasInputCols, HasOutputCol, HasHandleInvalid {
+Param SIZES =
+new IntArrayParam(
+"sizeArray",
+"Sizes of the assembling elements.",
+null,
+ParamValidators.notNull());
+
+default int[] getSizes() {
+return Arrays.stream(get(SIZES)).mapToInt(Integer::intValue).toArray();
+}
+
+default T setSizes(Integer... value) {

Review Comment:
   this `nit` is not suited.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-09-23 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r978481060


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssembler.java:
##
@@ -74,38 +74,65 @@ public Table[] transform(Table... inputs) {
 DataStream output =
 tEnv.toDataStream(inputs[0])
 .flatMap(
-new AssemblerFunc(getInputCols(), 
getHandleInvalid()),
+new AssemblerFunction(
+getInputCols(), getHandleInvalid(), 
getSizes()),
 outputTypeInfo);
 Table outputTable = tEnv.fromDataStream(output);
 return new Table[] {outputTable};
 }
 
-private static class AssemblerFunc implements FlatMapFunction {
+private static class AssemblerFunction implements FlatMapFunction {
 private final String[] inputCols;
 private final String handleInvalid;
+private final int[] sizeArray;
 
-public AssemblerFunc(String[] inputCols, String handleInvalid) {
+public AssemblerFunction(String[] inputCols, String handleInvalid, 
int[] sizeArray) {
 this.inputCols = inputCols;
 this.handleInvalid = handleInvalid;
+this.sizeArray = sizeArray;
 }
 
 @Override
 public void flatMap(Row value, Collector out) {
 int nnz = 0;
 int vectorSize = 0;
 try {
-for (String inputCol : inputCols) {
+for (int i = 0; i < inputCols.length; ++i) {
+String inputCol = inputCols[i];
 Object object = value.getField(inputCol);
 Preconditions.checkNotNull(object, "Input column value 
should not be null.");
 if (object instanceof Number) {
+Preconditions.checkArgument(

Review Comment:
   I don't know which record has the error size, then I must check the sizes 
for every record. 
   When the sizes are error, the code also can run OK, that's why we need to 
check every record.   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-09-23 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r978475159


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##
@@ -21,11 +21,31 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
+
+import java.util.Arrays;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param  The class type of this instance.
  */
 public interface VectorAssemblerParams
-extends HasInputCols, HasOutputCol, HasHandleInvalid {}
+extends HasInputCols, HasOutputCol, HasHandleInvalid {
+Param SIZES =

Review Comment:
   Splits is an array, splitsArray is a `double[][]`.  VectorSizeHint has no 
parameter like splitsArray.  
   I think Sizes is much better than `vectorSizeArray` or `elementSizeArray`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink-ml] weibozhao commented on a diff in pull request #156: [FLINK-29323] Refine Transformer for VectorAssembler

2022-09-23 Thread GitBox


weibozhao commented on code in PR #156:
URL: https://github.com/apache/flink-ml/pull/156#discussion_r978475159


##
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/vectorassembler/VectorAssemblerParams.java:
##
@@ -21,11 +21,31 @@
 import org.apache.flink.ml.common.param.HasHandleInvalid;
 import org.apache.flink.ml.common.param.HasInputCols;
 import org.apache.flink.ml.common.param.HasOutputCol;
+import org.apache.flink.ml.param.IntArrayParam;
+import org.apache.flink.ml.param.Param;
+import org.apache.flink.ml.param.ParamValidators;
+
+import java.util.Arrays;
 
 /**
  * Params of {@link VectorAssembler}.
  *
  * @param  The class type of this instance.
  */
 public interface VectorAssemblerParams
-extends HasInputCols, HasOutputCol, HasHandleInvalid {}
+extends HasInputCols, HasOutputCol, HasHandleInvalid {
+Param SIZES =

Review Comment:
   Splits is an array, splitsArray is a `double[][]`.  VectorSizeHint has no 
parameter like splitsArray.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org