zhengruifeng created SPARK-23805:
------------------------------------

             Summary: support vector-size validation and inference
                 Key: SPARK-23805
                 URL: https://issues.apache.org/jira/browse/SPARK-23805
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.0
            Reporter: zhengruifeng


I think it may be meaningful to unify the usage of \{{AttributeGroup}} and 
support vector-size validation and inference across algorithms.
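
For context, here is a minimal sketch of how the existing \{{AttributeGroup}} 
API exposes a vector size stored in column metadata (the helper name 
\{{getVectorSize}} is hypothetical, not an existing API):

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructType

// Read the vector size recorded in a column's ML metadata.
// AttributeGroup.size is -1 when no size was ever attached to the schema.
def getVectorSize(schema: StructType, colName: String): Option[Int] = {
  val group = AttributeGroup.fromStructField(schema(colName))
  if (group.size >= 0) Some(group.size) else None
}
{code}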

My thoughts are:
 * In \{{transformSchema}}, validate the input vector-size if possible: if the 
input vector-size can be obtained from the schema, check it against the 
expected size (see the sketch after this list).
 ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will require 
the input vector-size to be at least 4.
 ** Suppose a \{{PCAModel}} trained on vectors of length 10: 
\{{transformSchema}} will require the input vector-size to be exactly 10.
 * In \{{transformSchema}}, infer the output vector-size if possible.
 ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will return a 
schema with output vector-size=4.
 ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will return 
a schema with output vector-size=4.
 * In \{{transform}}, infer the output vector-size if possible.
 * In \{{fit}}, obtain the input vector-size from the schema if possible. This 
can help eliminate the redundant \{{first}} jobs that exist only to determine 
the number of features (sketched below).
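
To make the proposal concrete, here is a sketch of what such a 
\{{transformSchema}} could look like for a \{{PCAModel}}-like transformer (the 
method name, \{{trainedInputSize}}, and the parameter list are illustrative 
assumptions, not a final API):

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructType

// Hypothetical transformSchema for a model trained on vectors of a known
// length that always produces output vectors of length k.
def transformSchemaSketch(
    schema: StructType,
    inputCol: String,
    outputCol: String,
    trainedInputSize: Int,
    k: Int): StructType = {
  // Validation: if the schema already carries the input vector size, check it.
  val inputSize = AttributeGroup.fromStructField(schema(inputCol)).size
  if (inputSize >= 0) {
    require(inputSize == trainedInputSize,
      s"Input vector size $inputSize does not match training size $trainedInputSize.")
  }
  // Inference: attach the known output size to the output column's metadata.
  val outputGroup = new AttributeGroup(outputCol, k)
  StructType(schema.fields :+ outputGroup.toStructField())
}
{code}

On the \{{fit}} side, the same metadata lookup can replace the extra job (again 
a sketch; the \{{numFeatures}} helper is hypothetical):

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Dataset

// Prefer the size recorded in the schema; only fall back to triggering a job
// (the redundant "first") when the metadata is absent.
def numFeatures(dataset: Dataset[_], inputCol: String): Int = {
  val fromSchema = AttributeGroup.fromStructField(dataset.schema(inputCol)).size
  if (fromSchema >= 0) fromSchema
  else dataset.select(inputCol).head().getAs[Vector](0).size
}
{code}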

 

The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate the 
idea. Since the validation and inference logic is quite algorithm-specific, we 
may need to separate the task into several small subtasks.

What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]

 


