[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419745#comment-16419745 ] V Luong commented on SPARK-2: - [~bago.amirbekian] thank you, that is indeed a good solution available in Spark 2.3. I'm using that successfully. Will close this issue now. > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379296#comment-16379296 ] Bago Amirbekian commented on SPARK-2: - [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358634#comment-16358634 ] V Luong commented on SPARK-2: - [~cloud_fan] alternatively, is there any way that VectorAssembler.transform(...) can get the "numAttributes" ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88)] metadata from somewhere else instead of materializing a row? Does the current need to materialize a row mean that some metadata is lacking somewhere? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358623#comment-16358623 ] Wenchen Fan commented on SPARK-2: - This is not a trivial change, we need to introduce an `AnyRow` operator that can eliminate unneeded sort(maybe more) operators. If we can get what we want from any row, does it mean we want something like a metadata? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358599#comment-16358599 ] V Luong commented on SPARK-2: - [~cloud_fan] there are many scenarios in which oldDF involves sorting in its plan, e.g. if certain feature columns are calculated using windowed functions. In general, it would be a pain to always make sure that oldDF doesn't involve sorting (e.g. by checkpointing to files) prior to VectorAssembler. Anyway, VectorAssembler metadata shouldn't strictly need the first row. > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358067#comment-16358067 ] Wenchen Fan commented on SPARK-2: - I'm a little confused. If we wanna get a random row, why we need to sort? Do we have a way to get the dataframe before the sort and call its `first`? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358048#comment-16358048 ] Liang-Chi Hsieh commented on SPARK-2: - Currently I think we don't have API in Dataset to just fetch an any row back. Is it reasonable to add a \{{def any(n: Int): Array[T]}} to Dataset? cc [~cloud_fan] > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-2 > URL: https://issues.apache.org/jira/browse/SPARK-2 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org