[jira] [Commented] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-11-01 Thread yogesh garg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672202#comment-16672202
 ] 

yogesh garg commented on SPARK-25901:
-

[~jiangxb1987] thanks for approving the PR. Could you assign this issue to me 
and merge the PR?

> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job terminates (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in 
> a waiting state until it is garbage collected. We should probably use a 
> single thread to schedule all such tasks, since they only log every 60 
> seconds.
> Here's a screenshot of the thread count growing with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.25 AM.png! 
> Here's a screenshot of a constant thread count with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.42 AM.png! 
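The single-thread fix described above can be sketched roughly as follows. This is an illustrative sketch only, not Spark's actual implementation: the class name `BarrierProgressLogger`, the method names, and the log message are all hypothetical; only the idea (one shared daemon scheduler thread instead of one `Timer` thread per barrier task) comes from the issue.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: all barrier tasks share one daemon scheduler thread,
// rather than each task creating its own java.util.Timer thread.
public class BarrierProgressLogger {
    private static final ScheduledExecutorService SCHEDULER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "barrier-progress-logger");
            t.setDaemon(true); // the shared thread never keeps the JVM alive
            return t;
        });

    // Each task registers a periodic log message; no per-task thread is created.
    public static ScheduledFuture<?> start(long taskAttemptId, long intervalSec) {
        return SCHEDULER.scheduleAtFixedRate(
            () -> System.out.println("Task " + taskAttemptId
                + " has been waiting under the global barrier"),
            intervalSec, intervalSec, TimeUnit.SECONDS);
    }
}
```

When the barrier call returns (or the task is interrupted), the task would cancel its registration with `future.cancel(false)`, so nothing lingers until a gc.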



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-10-31 Thread yogesh garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yogesh garg updated SPARK-25901:

Description: 
After a barrier job terminates (successfully or interrupted), the 
accompanying thread created with `Timer` in `BarrierTaskContext` remains in a 
waiting state until it is garbage collected. We should probably use a single 
thread to schedule all such tasks, since they only log every 60 seconds.

Here's a screenshot of the thread count growing with more tasks:
 !Screen Shot 2018-10-31 at 11.57.25 AM.png! 

Here's a screenshot of a constant thread count with more tasks:
 !Screen Shot 2018-10-31 at 11.57.42 AM.png! 

  was:
After a barrier job terminates (successfully or interrupted), the 
accompanying thread created with `Timer` in `BarrierTaskContext` remains in a 
waiting state until it is garbage collected. We should probably use a single 
thread to schedule all such tasks, since they only log every 60 seconds.

Here's a screenshot of the thread count growing with more tasks:

Here's a screenshot of a constant thread count with more tasks:


> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job terminates (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in 
> a waiting state until it is garbage collected. We should probably use a 
> single thread to schedule all such tasks, since they only log every 60 
> seconds.
> Here's a screenshot of the thread count growing with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.25 AM.png! 
> Here's a screenshot of a constant thread count with more tasks:
>  !Screen Shot 2018-10-31 at 11.57.42 AM.png! 






[jira] [Comment Edited] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-10-31 Thread yogesh garg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670573#comment-16670573
 ] 

yogesh garg edited comment on SPARK-25901 at 10/31/18 7:06 PM:
---

I am working on this task in this PR: https://github.com/apache/spark/pull/22912


was (Author: yogeshgarg):
I am working on this task.

> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job terminates (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in 
> a waiting state until it is garbage collected. We should probably use a 
> single thread to schedule all such tasks, since they only log every 60 
> seconds.
> Here's a screenshot of the thread count growing with more tasks:
> Here's a screenshot of a constant thread count with more tasks:






[jira] [Updated] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-10-31 Thread yogesh garg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yogesh garg updated SPARK-25901:

Attachment: Screen Shot 2018-10-31 at 11.57.25 AM.png
Screen Shot 2018-10-31 at 11.57.42 AM.png

> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job terminates (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in 
> a waiting state until it is garbage collected. We should probably use a 
> single thread to schedule all such tasks, since they only log every 60 
> seconds.
> Here's a screenshot of the thread count growing with more tasks:
> Here's a screenshot of a constant thread count with more tasks:






[jira] [Created] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-10-31 Thread yogesh garg (JIRA)
yogesh garg created SPARK-25901:
---

 Summary: Barrier mode spawns a bunch of threads that get collected 
on gc
 Key: SPARK-25901
 URL: https://issues.apache.org/jira/browse/SPARK-25901
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: yogesh garg
 Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
2018-10-31 at 11.57.42 AM.png

After a barrier job terminates (successfully or interrupted), the 
accompanying thread created with `Timer` in `BarrierTaskContext` remains in a 
waiting state until it is garbage collected. We should probably use a single 
thread to schedule all such tasks, since they only log every 60 seconds.

Here's a screenshot of the thread count growing with more tasks:

Here's a screenshot of a constant thread count with more tasks:






[jira] [Commented] (SPARK-25901) Barrier mode spawns a bunch of threads that get collected on gc

2018-10-31 Thread yogesh garg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670573#comment-16670573
 ] 

yogesh garg commented on SPARK-25901:
-

I am working on this task.

> Barrier mode spawns a bunch of threads that get collected on gc
> ---
>
> Key: SPARK-25901
> URL: https://issues.apache.org/jira/browse/SPARK-25901
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: yogesh garg
>Priority: Major
> Attachments: Screen Shot 2018-10-31 at 11.57.25 AM.png, Screen Shot 
> 2018-10-31 at 11.57.42 AM.png
>
>
> After a barrier job terminates (successfully or interrupted), the 
> accompanying thread created with `Timer` in `BarrierTaskContext` remains in 
> a waiting state until it is garbage collected. We should probably use a 
> single thread to schedule all such tasks, since they only log every 60 
> seconds.
> Here's a screenshot of the thread count growing with more tasks:
> Here's a screenshot of a constant thread count with more tasks:






[jira] [Commented] (SPARK-24115) improve instrumentation for spark.ml.tuning

2018-04-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457113#comment-16457113
 ] 

yogesh garg commented on SPARK-24115:
-

I would like to work on this.

> improve instrumentation for spark.ml.tuning
> ---
>
> Key: SPARK-24115
> URL: https://issues.apache.org/jira/browse/SPARK-24115
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Created] (SPARK-24115) improve instrumentation for spark.ml.tuning

2018-04-27 Thread yogesh garg (JIRA)
yogesh garg created SPARK-24115:
---

 Summary: improve instrumentation for spark.ml.tuning
 Key: SPARK-24115
 URL: https://issues.apache.org/jira/browse/SPARK-24115
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: yogesh garg









[jira] [Created] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-04-27 Thread yogesh garg (JIRA)
yogesh garg created SPARK-24114:
---

 Summary: improve instrumentation for spark.ml.recommendation
 Key: SPARK-24114
 URL: https://issues.apache.org/jira/browse/SPARK-24114
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: yogesh garg









[jira] [Commented] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-04-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457110#comment-16457110
 ] 

yogesh garg commented on SPARK-24114:
-

I would like to work on this.

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Commented] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-05 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427543#comment-16427543
 ] 

yogesh garg commented on SPARK-23871:
-

I hadn't started working on this yet. Feel free to take it.

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Minor
>







[jira] [Created] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-04 Thread yogesh garg (JIRA)
yogesh garg created SPARK-23871:
---

 Summary: add python api for VectorAssembler handleInvalid
 Key: SPARK-23871
 URL: https://issues.apache.org/jira/browse/SPARK-23871
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: yogesh garg









[jira] [Created] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler

2018-04-04 Thread yogesh garg (JIRA)
yogesh garg created SPARK-23870:
---

 Summary:  Forward RFormula handleInvalid Param to VectorAssembler
 Key: SPARK-23870
 URL: https://issues.apache.org/jira/browse/SPARK-23870
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: yogesh garg
 Fix For: 2.4.0









[jira] [Comment Edited] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values

2018-03-19 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174
 ] 

yogesh garg edited comment on SPARK-23690 at 3/19/18 7:04 PM:
--

In an offline discussion with [~mrbago], we discussed the following behavior 
for `handleInvalid`. We have to get the lengths of the vector columns involved 
in the assembly. Ideally this information is present in the `attributeGroup` 
of the column, but that might return `size == -1`, in which case we earlier 
used `d.select.first` to infer the size of these columns. This could raise an 
exception in the corner case that the first row itself has null values. We are 
abandoning the idea of getting this information by finding a non-null row in 
each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for any data we 
might see in the future (even if we infer the size now, there's no guarantee 
we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get NullPointerException while iterating on the cells for sizes, 
throw an (early) error
  * if we get NoSuchElementError while looking for the first row, -give the 
rows 0 sizes and warn- throw error about incomplete metadata

2. *Skip*: Find remaining lengths from `d.drop.first`
  * if we get NoSuchElementError, -warn- throw error about incomplete metadata
  * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to 
infer sizes from the data, because even if we get the information from the 
current dataset, a future cut of the data is not guaranteed to be inferable. 
Thus, throw an error encouraging `VectorSizeHint`

Please share thoughts and feedback on this!


edit: In an offline talk with [~josephkb] we decided to throw errors instead of 
warning about any size inference failures.


was (Author: yogeshgarg):
In an offline discussion with [~mrbago], we discussed the following behavior 
for `handleInvalid`. We have to get the lengths of the vector columns involved 
in the assembly. Ideally this information is present in the `attributeGroup` 
of the column, but that might return `size == -1`, in which case we earlier 
used `d.select.first` to infer the size of these columns. This could raise an 
exception in the corner case that the first row itself has null values. We are 
abandoning the idea of getting this information by finding a non-null row in 
each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for any data we 
might see in the future (even if we infer the size now, there's no guarantee 
we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get NullPointerException while iterating on the cells for sizes, 
throw an (early) error
  * if we get NoSuchElementError while looking for the first row, -give the 
rows 0 sizes and warn- throw error about incomplete metadata

2. *Skip*: Find remaining lengths from `d.drop.first`
  * if we get NoSuchElementError, -warn- throw error about incomplete metadata
  * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to 
infer sizes from the data, because even if we get the information from the 
current dataset, a future cut of the data is not guaranteed to be inferable. 
Thus, throw an error encouraging `VectorSizeHint`

Please share thoughts and feedback on this!


edit: In an offline talk with @jkbradley we decided to throw errors instead of 
warning about any size inference failures.

> VectorAssembler should have handleInvalid to handle columns with null values
> 
>
> Key: SPARK-23690
> URL: https://issues.apache.org/jira/browse/SPARK-23690
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>
> VectorAssembler only takes numeric columns (and vectors of numerics?) as 
> input and returns the assembled vector. It currently throws an error if it 
> sees a null value in any column. This behavior also affects `RFormula`, 
> which uses VectorAssembler to assemble numeric columns.






[jira] [Comment Edited] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values

2018-03-19 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174
 ] 

yogesh garg edited comment on SPARK-23690 at 3/19/18 7:03 PM:
--

In an offline discussion with [~mrbago], we discussed the following behavior 
for `handleInvalid`. We have to get the lengths of the vector columns involved 
in the assembly. Ideally this information is present in the `attributeGroup` 
of the column, but that might return `size == -1`, in which case we earlier 
used `d.select.first` to infer the size of these columns. This could raise an 
exception in the corner case that the first row itself has null values. We are 
abandoning the idea of getting this information by finding a non-null row in 
each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for any data we 
might see in the future (even if we infer the size now, there's no guarantee 
we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get NullPointerException while iterating on the cells for sizes, 
throw an (early) error
  * if we get NoSuchElementError while looking for the first row, -give the 
rows 0 sizes and warn- throw error about incomplete metadata

2. *Skip*: Find remaining lengths from `d.drop.first`
  * if we get NoSuchElementError, -warn- throw error about incomplete metadata
  * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to 
infer sizes from the data, because even if we get the information from the 
current dataset, a future cut of the data is not guaranteed to be inferable. 
Thus, throw an error encouraging `VectorSizeHint`

Please share thoughts and feedback on this!


edit: In an offline talk with @jkbradley we decided to throw errors instead of 
warning about any size inference failures.


was (Author: yogeshgarg):
In an offline discussion with [~mrbago], we discussed the following behavior 
for `handleInvalid`. We have to get the lengths of the vector columns involved 
in the assembly. Ideally this information is present in the `attributeGroup` 
of the column, but that might return `size == -1`, in which case we earlier 
used `d.select.first` to infer the size of these columns. This could raise an 
exception in the corner case that the first row itself has null values. We are 
abandoning the idea of getting this information by finding a non-null row in 
each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for any data we 
might see in the future (even if we infer the size now, there's no guarantee 
we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get NullPointerException while iterating on the cells for sizes, 
throw an (early) error
  * if we get NoSuchElementError while looking for the first row, give the rows 
0 sizes and warn about incomplete metadata

2. *Skip*: Find remaining lengths from `d.drop.first`
  * if we get NoSuchElementError, warn about incomplete metadata
  * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to 
infer sizes from the data, because even if we get the information from the 
current dataset, a future cut of the data is not guaranteed to be inferable. 
Thus, throw an error encouraging `VectorSizeHint`

Please share thoughts and feedback on this!

> VectorAssembler should have handleInvalid to handle columns with null values
> 
>
> Key: SPARK-23690
> URL: https://issues.apache.org/jira/browse/SPARK-23690
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>
> VectorAssembler only takes numeric columns (and vectors of numerics?) as 
> input and returns the assembled vector. It currently throws an error if it 
> sees a null value in any column. This behavior also affects `RFormula`, 
> which uses VectorAssembler to assemble numeric columns.






[jira] [Commented] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values

2018-03-19 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405174#comment-16405174
 ] 

yogesh garg commented on SPARK-23690:
-

In an offline discussion with [~mrbago], we discussed the following behavior 
for `handleInvalid`. We have to get the lengths of the vector columns involved 
in the assembly. Ideally this information is present in the `attributeGroup` 
of the column, but that might return `size == -1`, in which case we earlier 
used `d.select.first` to infer the size of these columns. This could raise an 
exception in the corner case that the first row itself has null values. We are 
abandoning the idea of getting this information by finding a non-null row in 
each such column, because that approach has complicated logic, terrible run 
time (O(#columns) distributed queries), and fewer guarantees for any data we 
might see in the future (even if we infer the size now, there's no guarantee 
we can do it later, leading to an unexpected error).

1. *Error*: Find the remaining lengths from `d.select.first`
  * if we get NullPointerException while iterating on the cells for sizes, 
throw an (early) error
  * if we get NoSuchElementError while looking for the first row, give the rows 
0 sizes and warn about incomplete metadata

2. *Skip*: Find remaining lengths from `d.drop.first`
  * if we get NoSuchElementError, warn about incomplete metadata
  * Note that we can't get NullPointerException in this case (yay!)

3. *Keep*: If any column does not have attribute sizes, it's dangerous to 
infer sizes from the data, because even if we get the information from the 
current dataset, a future cut of the data is not guaranteed to be inferable. 
Thus, throw an error encouraging `VectorSizeHint`

Please share thoughts and feedback on this!
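The three policies above can be condensed into a small decision sketch. Everything here is illustrative, not the actual Spark API: the class `SizeInference`, the `Mode` enum, and the exception choices are hypothetical stand-ins for the per-column size resolution being proposed.

```java
import java.util.NoSuchElementException;

// Hypothetical sketch of per-column size resolution for handleInvalid.
// metadataSize is the size from the column's attribute group, or -1 when the
// metadata is incomplete. firstRowSize stands in for the size observed in the
// first surviving row (null when that lookup fails).
public class SizeInference {
    enum Mode { ERROR, SKIP, KEEP }

    static int resolveSize(Mode mode, int metadataSize, Integer firstRowSize) {
        if (metadataSize >= 0) return metadataSize; // metadata wins when present
        switch (mode) {
            case ERROR: // infer from d.select(col).first()
                if (firstRowSize == null) // first row held a null value
                    throw new NullPointerException("null value in first row");
                return firstRowSize;
            case SKIP: // infer from the first row after dropping null rows
                if (firstRowSize == null) // no non-null row exists at all
                    throw new NoSuchElementException("incomplete metadata");
                return firstRowSize;
            default: // KEEP: never infer from data; demand explicit sizes
                throw new IllegalArgumentException(
                    "column size unknown; use VectorSizeHint before assembling");
        }
    }
}
```

This matches the final decision in the thread: every inference failure is an error rather than a warning, and *keep* refuses to infer at all.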

> VectorAssembler should have handleInvalid to handle columns with null values
> 
>
> Key: SPARK-23690
> URL: https://issues.apache.org/jira/browse/SPARK-23690
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>
> VectorAssembler only takes numeric columns (and vectors of numerics?) as 
> input and returns the assembled vector. It currently throws an error if it 
> sees a null value in any column. This behavior also affects `RFormula`, 
> which uses VectorAssembler to assemble numeric columns.






[jira] [Created] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values

2018-03-14 Thread yogesh garg (JIRA)
yogesh garg created SPARK-23690:
---

 Summary: VectorAssembler should have handleInvalid to handle 
columns with null values
 Key: SPARK-23690
 URL: https://issues.apache.org/jira/browse/SPARK-23690
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: yogesh garg


VectorAssembler only takes numeric columns (and vectors of numerics?) as 
input and returns the assembled vector. It currently throws an error if it 
sees a null value in any column. This behavior also affects `RFormula`, which 
uses VectorAssembler to assemble numeric columns.






[jira] [Comment Edited] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-03-07 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434
 ] 

yogesh garg edited comment on SPARK-23562 at 3/7/18 11:33 PM:
--

The error in question can be reproduced with the following Scala code:

{code:scala}
import org.apache.spark.ml.feature.RFormula
import spark.implicits._  // needed for seq.toDF below

val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")
// java.lang.Long is used so the column can hold nulls
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
.setFormula("c1 ~ id2")
.setHandleInvalid(mode)

  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate=false)
}

List("skip", "keep", "error").foreach {test}{code}



{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task ** in 
stage ** failed ** times, most recent failure: Lost task ** in stage ** (TID 
**, **, executor **): org.apache.spark.SparkException: Failed to execute user 
defined function($anonfun$3: (struct) 
=> vector)

Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.

{code}
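The failure comes from the assembly step rejecting nulls outright. A minimal stand-in for that check (the real `VectorAssembler` logic is more involved; the class and method names here are hypothetical) looks like:

```java
// Minimal stand-in for the null check behind the message
// "Values to assemble cannot be null." Not the actual Spark code.
public class MiniAssembler {
    static double[] assemble(Double... values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] == null) // any null cell aborts the whole row
                throw new RuntimeException("Values to assemble cannot be null.");
            out[i] = values[i];
        }
        return out;
    }
}
```

This is why `skip` must filter null rows before assembly ever runs, and why `keep` needs some convention (such as NaN) for representing a null numeric value in the output vector.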



was (Author: yogeshgarg):
The error in question can be reproduced with the following Scala code:

{code:scala}
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")
val seq: Seq[(java.lang.Long, String)] = (Seq(
  (20001, "x"),
  (20002, "y"),
  (null, null)
))
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
.setFormula("c1 ~ id2")
.setHandleInvalid(mode)

  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate=false)
}

List("skip", "keep", "error").foreach {test}{code}


> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.






[jira] [Commented] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-03-07 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434
 ] 

yogesh garg commented on SPARK-23562:
-

The error in question can be reproduced with the following Scala code:
```
val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")
val seq: Seq[(java.lang.Long, String)] = (Seq(
  (20001, "x"),
  (20002, "y"),
  (null, null)
))
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
.setFormula("c1 ~ id2")
.setHandleInvalid(mode)

  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate=false)
}

List("skip", "keep", "error").foreach {test}
```

> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.






[jira] [Comment Edited] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-03-07 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390434#comment-16390434
 ] 

yogesh garg edited comment on SPARK-23562 at 3/7/18 11:30 PM:
--

The error in question can be reproduced with the following Scala code:

{code:scala}
import spark.implicits._  // needed for seq.toDF outside the spark-shell

val d1 = spark.createDataFrame(Seq(
  (1001, "a"),
  (1002, "b")
)).toDF("id1", "c1")
// java.lang.Long (rather than Long) makes the column nullable
val seq: Seq[(java.lang.Long, String)] = Seq(
  (20001L, "x"),
  (20002L, "y"),
  (null, null)
)
val d2 = seq.toDF("id2", "c2")

val dataset = d1.crossJoin(d2)
d1.show()
d2.show()
dataset.show()

def test(mode: String) = {
  val formula = new RFormula()
.setFormula("c1 ~ id2")
.setHandleInvalid(mode)

  val model = formula.fit(dataset)
  val output = model.transform(dataset)
  println(model)
  println(mode)
  output.select("features", "label").show(truncate=false)
}

List("skip", "keep", "error").foreach {test}{code}




> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.






[jira] [Commented] (SPARK-18630) PySpark ML memory leak

2018-03-01 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382886#comment-16382886
 ] 

yogesh garg commented on SPARK-18630:
-

After some discussion, I think it makes sense to move just the __del__ method 
to JavaWrapper and keep the copy method in JavaParams. The code also needs some 
testing.

> PySpark ML memory leak
> --
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it 
> would be good to follow up and address the potential memory leak for all 
> items handled by the `JavaWrapper`, not just `JavaParams`.






[jira] [Commented] (SPARK-18630) PySpark ML memory leak

2018-02-28 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381244#comment-16381244
 ] 

yogesh garg commented on SPARK-18630:
-

I would like to take this. If I understand correctly, moving the `__del__` and 
(deep) `copy` methods to `JavaWrapper` should address this potential issue. Is 
there a reason why we might not want to do a deep copy of the `JavaWrapper` class?
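One reason for caution, shown with a hypothetical toy wrapper (the `Wrapper` class and the object-id field are illustrative, not PySpark's actual class): a deep copy duplicates the Python object, but both copies still refer to the same JVM-side object, so cleanup on one copy could invalidate the other.

```python
import copy

class Wrapper:
    """Hypothetical toy wrapper, not PySpark's actual class."""
    def __init__(self, obj_id):
        self._java_obj = obj_id  # pretend this identifies a JVM-side object

a = Wrapper("jvm-obj-1")
b = copy.deepcopy(a)
# The two Python objects are distinct...
print(a is b)                      # prints False
# ...but they name the same JVM-side object, so a __del__ that
# detaches the shared object via one copy would break the other.
print(a._java_obj == b._java_obj)  # prints True
```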

> PySpark ML memory leak
> --
>
> Key: SPARK-18630
> URL: https://issues.apache.org/jira/browse/SPARK-18630
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> After SPARK-18274 is fixed by https://github.com/apache/spark/pull/15843, it 
> would be good to follow up and address the potential memory leak for all 
> items handled by the `JavaWrapper`, not just `JavaParams`.






[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379035#comment-16379035
 ] 

yogesh garg commented on SPARK-22915:
-

Ah, doesn't make sense for me to take it then. Thanks! Please go ahead.

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Commented] (SPARK-22915) ML test for StructuredStreaming: spark.ml.feature, N-Z

2018-02-27 Thread yogesh garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378978#comment-16378978
 ] 

yogesh garg commented on SPARK-22915:
-

I have started working on this and can raise a PR soon. Thanks for the help!

> ML test for StructuredStreaming: spark.ml.feature, N-Z
> --
>
> Key: SPARK-22915
> URL: https://issues.apache.org/jira/browse/SPARK-22915
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> *For featurizers with names from N - Z*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843


