[jira] [Commented] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
[ https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672202#comment-16672202 ]

yogesh garg commented on SPARK-25901:
-------------------------------------

[~jiangxb1987] thanks for approving the PR, can we assign this issue to me and merge the PR?

> Barrier mode spawns a bunch of threads that get collected on gc
> ---------------------------------------------------------------
>
>                 Key: SPARK-25901
>                 URL: https://issues.apache.org/jira/browse/SPARK-25901
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: yogesh garg
>            Priority: Major
>         Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 2018-10-31 at 11.57.42 AM.png
>
> After a barrier job is terminated (successfully or interrupted), the
> accompanying thread created with `Timer` in `BarrierTaskContext` shows in a
> waiting state until gc is called. We should probably have just one thread to
> schedule all such tasks, since they just log every 60 seconds.
> Here's a screen shot of the threads growing with more tasks:
> !Screen Shot 2018-10-31 at 11.57.25 AM.png!
> Here's a screen shot of constant number of threads with more tasks:
> !Screen Shot 2018-10-31 at 11.57.42 AM.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
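The one-thread fix suggested in the description can be sketched as follows. This is an illustrative Python analogy with made-up names, not Spark's actual code (which is Scala and uses `java.util.Timer`): every barrier task registers its periodic "waiting" log message with one shared scheduler, so the thread count stays constant however many tasks run.

```python
import sched
import threading
import time

# Hypothetical sketch, not Spark's implementation: all barrier tasks share one
# scheduler (and therefore one servicing thread) instead of each task creating
# its own Timer thread that lingers until gc.
log_lines = []
scheduler = sched.scheduler(time.monotonic, time.sleep)

# Many tasks register their periodic log events up front...
for task_id in range(5):
    scheduler.enter(0.01, 1, log_lines.append,
                    argument=(f"task {task_id} waiting on barrier",))

# ...but a single thread services all of them.
worker = threading.Thread(target=scheduler.run)
worker.start()
worker.join()

assert len(log_lines) == 5
```

The JVM equivalent would be one shared `Timer` (or a single-threaded `ScheduledExecutorService`) owned by the executor rather than a `Timer` per `BarrierTaskContext`.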
[jira] [Updated] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
[ https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yogesh garg updated SPARK-25901: Description: After a barrier job is terminated (successfully or interrupted), the accompanying thread created with `Timer` in `BarrierTaskContext` shows in a waiting state until gc is called. We should probably have just one thread to schedule all such tasks, since they just log every 60 seconds. Here's a screen shot of the threads growing with more tasks: !Screen Shot 2018-10-31 at 11.57.25 AM.png! Here's a screen shot of constant number of threads with more tasks: !Screen Shot 2018-10-31 at 11.57.42 AM.png! was: After a barrier job is terminated (successfully or interrupted), the accompanying thread created with `Timer` in `BarrierTaskContext` shows in a waiting state until gc is called. We should probably have just one thread to schedule all such tasks, since they just log every 60 seconds. Here's a screen shot of the threads growing with more tasks: Here's a screen shot of constant number of threads with more tasks: > Barrier mode spawns a bunch of threads that get collected on gc > --- > > Key: SPARK-25901 > URL: https://issues.apache.org/jira/browse/SPARK-25901 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: yogesh garg >Priority: Major > Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot > 2018-10-31 at 11.57.42 AM.png > > > After a barrier job is terminated (successfully or interrupted), the > accompanying thread created with `Timer` in `BarrierTaskContext` shows in a > waiting state until gc is called. We should probably have just one thread to > schedule all such tasks, since they just log every 60 seconds. > Here's a screen shot of the threads growing with more tasks: > !Screen Shot 2018-10-31 at 11.57.25 AM.png! > Here's a screen shot of constant number of threads with more tasks: > !Screen Shot 2018-10-31 at 11.57.42 AM.png! 
[jira] [Comment Edited] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
[ https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670573#comment-16670573 ] yogesh garg edited comment on SPARK-25901 at 10/31/18 7:06 PM: --- I am working on this task in this PR: https://github.com/apache/spark/pull/22912 was (Author: yogeshgarg): I am working on this task. > Barrier mode spawns a bunch of threads that get collected on gc > --- > > Key: SPARK-25901 > URL: https://issues.apache.org/jira/browse/SPARK-25901 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: yogesh garg >Priority: Major > Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot > 2018-10-31 at 11.57.42 AM.png > > > After a barrier job is terminated (successfully or interrupted), the > accompanying thread created with `Timer` in `BarrierTaskContext` shows in a > waiting state until gc is called. We should probably have just one thread to > schedule all such tasks, since they just log every 60 seconds. > Here's a screen shot of the threads growing with more tasks: > Here's a screen shot of constant number of threads with more tasks: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
[ https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yogesh garg updated SPARK-25901: Attachment: Screen Shot 2018-10-31 at 11.57.25 AM.png Screen Shot 2018-10-31 at 11.57.42 AM.png > Barrier mode spawns a bunch of threads that get collected on gc > --- > > Key: SPARK-25901 > URL: https://issues.apache.org/jira/browse/SPARK-25901 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: yogesh garg >Priority: Major > Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot > 2018-10-31 at 11.57.42 AM.png > > > After a barrier job is terminated (successfully or interrupted), the > accompanying thread created with `Timer` in `BarrierTaskContext` shows in a > waiting state until gc is called. We should probably have just one thread to > schedule all such tasks, since they just log every 60 seconds. > Here's a screen shot of the threads growing with more tasks: > Here's a screen shot of constant number of threads with more tasks: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
yogesh garg created SPARK-25901: --- Summary: Barrier mode spawns a bunch of threads that get collected on gc Key: SPARK-25901 URL: https://issues.apache.org/jira/browse/SPARK-25901 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: yogesh garg Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 2018-10-31 at 11.57.42 AM.png After a barrier job is terminated (successfully or interrupted), the accompanying thread created with `Timer` in `BarrierTaskContext` shows in a waiting state until gc is called. We should probably have just one thread to schedule all such tasks, since they just log every 60 seconds. Here's a screen shot of the threads growing with more tasks: Here's a screen shot of constant number of threads with more tasks: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc
[ https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670573#comment-16670573 ] yogesh garg commented on SPARK-25901: - I am working on this task. > Barrier mode spawns a bunch of threads that get collected on gc > --- > > Key: SPARK-25901 > URL: https://issues.apache.org/jira/browse/SPARK-25901 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: yogesh garg >Priority: Major > Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot > 2018-10-31 at 11.57.42 AM.png > > > After a barrier job is terminated (successfully or interrupted), the > accompanying thread created with `Timer` in `BarrierTaskContext` shows in a > waiting state until gc is called. We should probably have just one thread to > schedule all such tasks, since they just log every 60 seconds. > Here's a screen shot of the threads growing with more tasks: > Here's a screen shot of constant number of threads with more tasks: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24115) improve instrumentation for spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457113#comment-16457113 ] yogesh garg commented on SPARK-24115: - I would like to work on this. > improve instrumentation for spark.ml.tuning > --- > > Key: SPARK-24115 > URL: https://issues.apache.org/jira/browse/SPARK-24115 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24115) improve instrumentation for spark.ml.tuning
yogesh garg created SPARK-24115: --- Summary: improve instrumentation for spark.ml.tuning Key: SPARK-24115 URL: https://issues.apache.org/jira/browse/SPARK-24115 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: yogesh garg -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24114) improve instrumentation for spark.ml.recommendation
yogesh garg created SPARK-24114: --- Summary: improve instrumentation for spark.ml.recommendation Key: SPARK-24114 URL: https://issues.apache.org/jira/browse/SPARK-24114 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: yogesh garg -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457110#comment-16457110 ] yogesh garg commented on SPARK-24114: - I would like to work on this. > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427543#comment-16427543 ] yogesh garg commented on SPARK-23871: - I hadn't started working on this yet. Feel free to take it. > add python api for VectorAssembler handleInvalid > > > Key: SPARK-23871 > URL: https://issues.apache.org/jira/browse/SPARK-23871 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23871) add python api for VectorAssembler handleInvalid
yogesh garg created SPARK-23871: --- Summary: add python api for VectorAssembler handleInvalid Key: SPARK-23871 URL: https://issues.apache.org/jira/browse/SPARK-23871 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: yogesh garg -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler
yogesh garg created SPARK-23870: --- Summary: Forward RFormula handleInvalid Param to VectorAssembler Key: SPARK-23870 URL: https://issues.apache.org/jira/browse/SPARK-23870 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: yogesh garg Fix For: 2.4.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174 ]

yogesh garg edited comment on SPARK-23690 at 3/19/18 7:04 PM:
--------------------------------------------------------------

In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We have to get the lengths of the vector columns that are involved in the assembly. Ideally this information is present in the `attributeGroup` of the column, but that might return `size == -1`, in which case we earlier used `d.select.first` to infer the size of these columns. This could raise an exception in the corner case that the first row itself has null values. We are abandoning the idea that we can get this information by finding a non-null row in each such column, because this approach has complicated logic, terrible run time (O(#columns) distributed queries), and fewer guarantees for any such data we might see in the future (even if we infer the size right now, there's no guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
 * if we get NullPointerException while iterating on the cells for sizes, throw an (early) error
 * if we get NoSuchElementError while looking for the first row, -give the rows 0 sizes and warn- throw an error about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
 * if we get NoSuchElementError, -warn- throw an error about incomplete metadata
 * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!

edit: In an offline talk with [~josephkb] we decided to throw errors instead of warning about any size inference failures.

was (Author: yogeshgarg):
In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We have to get the lengths of the vector columns that are involved in the assembly. Ideally this information is present in the `attributeGroup` of the column, but that might return `size == -1`, in which case we earlier used `d.select.first` to infer the size of these columns. This could raise an exception in the corner case that the first row itself has null values. We are abandoning the idea that we can get this information by finding a non-null row in each such column, because this approach has complicated logic, terrible run time (O(#columns) distributed queries), and fewer guarantees for any such data we might see in the future (even if we infer the size right now, there's no guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
 * if we get NullPointerException while iterating on the cells for sizes, throw an (early) error
 * if we get NoSuchElementError while looking for the first row, -give the rows 0 sizes and warn- throw an error about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
 * if we get NoSuchElementError, -warn- throw an error about incomplete metadata
 * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!

edit: In an offline talk with @jkbradley we decided to throw errors instead of warning about any size inference failures.
> VectorAssembler should have handleInvalid to handle columns with null values > > > Key: SPARK-23690 > URL: https://issues.apache.org/jira/browse/SPARK-23690 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > > VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as > an input and returns the assembled vector. It currently throws an error if it > sees a null value in any column. This behavior also affects `RFormula` that > uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
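The semantics settled on in the comment above can be condensed into a toy Python sketch (assumed function name `assemble`; not Spark's Scala implementation): `error` fails fast on a null, `skip` drops the offending row, and `keep` refuses to run unless the vector sizes are already known from metadata rather than inferred from the data.

```python
import math

def assemble(rows, handle_invalid="error", vector_sizes=None):
    # "keep" must emit fixed-size vectors even for null cells, so it needs the
    # sizes up front; inferring them from the data is exactly what the comment
    # above argues against.
    if handle_invalid == "keep" and vector_sizes is None:
        raise ValueError("vector sizes unknown; consider VectorSizeHint")
    out = []
    for row in rows:
        if any(v is None for v in row):
            if handle_invalid == "error":
                raise ValueError("Values to assemble cannot be null.")
            if handle_invalid == "skip":
                continue  # drop the row containing the null value
        out.append([math.nan if v is None else float(v) for v in row])
    return out

assert assemble([(1, 2), (None, 3)], handle_invalid="skip") == [[1.0, 2.0]]
```

The error message string mirrors the one VectorAssembler actually raises ("Values to assemble cannot be null."), quoted in the SPARK-23562 reproduction below; everything else here is illustrative.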
[jira] [Comment Edited] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174 ]

yogesh garg edited comment on SPARK-23690 at 3/19/18 7:03 PM:
--------------------------------------------------------------

In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We have to get the lengths of the vector columns that are involved in the assembly. Ideally this information is present in the `attributeGroup` of the column, but that might return `size == -1`, in which case we earlier used `d.select.first` to infer the size of these columns. This could raise an exception in the corner case that the first row itself has null values. We are abandoning the idea that we can get this information by finding a non-null row in each such column, because this approach has complicated logic, terrible run time (O(#columns) distributed queries), and fewer guarantees for any such data we might see in the future (even if we infer the size right now, there's no guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
 * if we get NullPointerException while iterating on the cells for sizes, throw an (early) error
 * if we get NoSuchElementError while looking for the first row, -give the rows 0 sizes and warn- throw an error about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
 * if we get NoSuchElementError, -warn- throw an error about incomplete metadata
 * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!

edit: In an offline talk with @jkbradley we decided to throw errors instead of warning about any size inference failures.

was (Author: yogeshgarg):
In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We have to get the lengths of the vector columns that are involved in the assembly. Ideally this information is present in the `attributeGroup` of the column, but that might return `size == -1`, in which case we earlier used `d.select.first` to infer the size of these columns. This could raise an exception in the corner case that the first row itself has null values. We are abandoning the idea that we can get this information by finding a non-null row in each such column, because this approach has complicated logic, terrible run time (O(#columns) distributed queries), and fewer guarantees for any such data we might see in the future (even if we infer the size right now, there's no guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
 * if we get NullPointerException while iterating on the cells for sizes, throw an (early) error
 * if we get NoSuchElementError while looking for the first row, give the rows 0 sizes and warn about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
 * if we get NoSuchElementError, warn about incomplete metadata
 * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!
> VectorAssembler should have handleInvalid to handle columns with null values > > > Key: SPARK-23690 > URL: https://issues.apache.org/jira/browse/SPARK-23690 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > > VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as > an input and returns the assembled vector. It currently throws an error if it > sees a null value in any column. This behavior also affects `RFormula` that > uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174 ]

yogesh garg commented on SPARK-23690:
-------------------------------------

In an offline discussion with [~mrbago], we discussed the following behavior for `handleInvalid`.

We have to get the lengths of the vector columns that are involved in the assembly. Ideally this information is present in the `attributeGroup` of the column, but that might return `size == -1`, in which case we earlier used `d.select.first` to infer the size of these columns. This could raise an exception in the corner case that the first row itself has null values. We are abandoning the idea that we can get this information by finding a non-null row in each such column, because this approach has complicated logic, terrible run time (O(#columns) distributed queries), and fewer guarantees for any such data we might see in the future (even if we infer the size right now, there's no guarantee we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
 * if we get NullPointerException while iterating on the cells for sizes, throw an (early) error
 * if we get NoSuchElementError while looking for the first row, give the rows 0 sizes and warn about incomplete metadata

2. *Skip*: Find the remaining lengths from `d.drop.first`
 * if we get NoSuchElementError, warn about incomplete metadata
 * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to infer sizes from the data, because even if we get the information from the current dataset, a future cut of the data is not guaranteed to be inferable. Thus, throw an error encouraging `VectorSizeHint`.

Please share thoughts and feedback on this!
> VectorAssembler should have handleInvalid to handle columns with null values > > > Key: SPARK-23690 > URL: https://issues.apache.org/jira/browse/SPARK-23690 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > > VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as > an input and returns the assembled vector. It currently throws an error if it > sees a null value in any column. This behavior also affects `RFormula` that > uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
yogesh garg created SPARK-23690: --- Summary: VectorAssembler should have handleInvalid to handle columns with null values Key: SPARK-23690 URL: https://issues.apache.org/jira/browse/SPARK-23690 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: yogesh garg VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as an input and returns the assembled vector. It currently throws an error if it sees a null value in any column. This behavior also affects `RFormula` that uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434 ]

yogesh garg edited comment on SPARK-23562 at 3/7/18 11:33 PM:
--------------------------------------------------------------

The error in question can be reproduced with the following code in scala

{code:scala}
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")
val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)
  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach { test }
{code}

{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task ** in stage ** failed ** times, most recent failure: Lost task ** in stage ** (TID **, **, executor **): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct) => vector)
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
{code}

was (Author: yogeshgarg):
The error in question can be reproduced with the following code in scala

{code:scala}
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")
val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)
  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach { test }
{code}

> RFormula handleInvalid should handle invalid values in non-string columns.
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23562
>                 URL: https://issues.apache.org/jira/browse/SPARK-23562
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to
> String fields. Numeric fields that are null will either cause the transformer
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with
> null values, but we should be able to at least support skip for these types.
[jira] [Commented] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434 ] yogesh garg commented on SPARK-23562: - Error in question can be reproduced with the following code in scala ``` val d1 = spark.createDataFrame(Seq( (1001, "a"), (1002, "b") )).toDF("id1", "c1") val seq: Seq[(java.lang.Long, String)] = (Seq( (20001, "x"), (20002, "y"), (null, null) )) val d2 = seq.toDF("id2", "c2") val dataset = d1.crossJoin(d2) d1.show() d2.show() dataset.show() def test(mode: String) = { val formula = new RFormula() .setFormula("c1 ~ id2") .setHandleInvalid(mode) val model = formula.fit(dataset) val output = model.transform(dataset) println(model) println(mode) output.select("features", "label").show(truncate=false) } List("skip", "keep", "error").foreach {test} ``` > RFormula handleInvalid should handle invalid values in non-string columns. > -- > > Key: SPARK-23562 > URL: https://issues.apache.org/jira/browse/SPARK-23562 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > Currently when handleInvalid is set to 'keep' or 'skip' this only applies to > String fields. Numeric fields that are null will either cause the transformer > to fail or might be null in the resulting label column. > I'm not sure what the semantics of keep might be for numeric columns with > null values, but we should be able to at least support skip for these types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434 ] yogesh garg edited comment on SPARK-23562 at 3/7/18 11:30 PM:
--
The error in question can be reproduced with the following Scala code:
{code:scala}
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")

// java.lang.Long is used so the id column is nullable; the literals need
// the L suffix to convert to java.lang.Long.
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
    .setFormula("c1 ~ id2")
    .setHandleInvalid(mode)
  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate = false)
}

List("skip", "keep", "error").foreach(test)
{code}

> RFormula handleInvalid should handle invalid values in non-string columns.
> --------------------------------------------------------------------------
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to
> String fields. Numeric fields that are null will either cause the transformer
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with
> null values, but we should be able to at least support skip for these types.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18630) PySpark ML memory leak
[ https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382886#comment-16382886 ] yogesh garg commented on SPARK-18630:
-
After some discussion, I think it makes sense to move just the `__del__` method to `JavaWrapper` and keep the `copy` method in `JavaParams`. The code also needs some testing.

> PySpark ML memory leak
> ----------------------
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Reporter: holdenk
> Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it
> would be good to follow up and address the potential memory leak for all
> items handled by the `JavaWrapper`, not just `JavaParams`.
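The pattern proposed in the comment above (hoist `__del__` from `JavaParams` up to `JavaWrapper` so every wrapper releases its JVM-side object on garbage collection) can be sketched outside Spark. This is an illustrative sketch, not actual PySpark source: `FakeGateway`, `JavaWrapperSketch`, and the object id are made-up stand-ins for the Py4J gateway and JVM handle.

```python
import gc


class FakeGateway:
    """Stand-in for a Py4J gateway; tracks which JVM objects are attached."""

    def __init__(self):
        self.attached = set()

    def attach(self, obj_id):
        self.attached.add(obj_id)

    def detach(self, obj_id):
        self.attached.discard(obj_id)


class JavaWrapperSketch:
    """Minimal analogue of a base wrapper that owns a JVM-side object."""

    def __init__(self, gateway, java_obj_id):
        self._gateway = gateway
        self._java_obj = java_obj_id
        gateway.attach(java_obj_id)

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper is
        # collected; if __del__ lived only on a subclass, instances of
        # other subclasses would leak their JVM objects.
        if self._java_obj is not None:
            self._gateway.detach(self._java_obj)
            self._java_obj = None


gateway = FakeGateway()
w = JavaWrapperSketch(gateway, "model@1a2b")
assert "model@1a2b" in gateway.attached

del w          # drops the last reference; CPython runs __del__ here
gc.collect()   # belt-and-braces for non-refcounting interpreters
assert "model@1a2b" not in gateway.attached
```

Defining `__del__` on the base class is what makes the cleanup apply to all wrapped objects, which is the point of moving it from `JavaParams` to `JavaWrapper`.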
[jira] [Commented] (SPARK-18630) PySpark ML memory leak
[ https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381244#comment-16381244 ] yogesh garg commented on SPARK-18630:
-
I would like to take this. If I understand correctly, moving the `__del__` and (deep) `copy` methods to `JavaWrapper` should address this potential issue. Is there a reason why we might not want to do a deep copy of the `JavaWrapper` class?

> PySpark ML memory leak
> ----------------------
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Reporter: holdenk
> Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it
> would be good to follow up and address the potential memory leak for all
> items handled by the `JavaWrapper`, not just `JavaParams`.
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379035#comment-16379035 ] yogesh garg commented on SPARK-22915:
-
Ah, it doesn't make sense for me to take it then. Thanks! Please go ahead.

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> ------------------------------------------------------
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z
[ https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378978#comment-16378978 ] yogesh garg commented on SPARK-22915:
-
I have started working on this and can raise a PR soon. Thanks for the help!

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> ------------------------------------------------------
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843