[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626417#comment-17626417 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~hyukjin.kwon] [~dongjoon] Please let me know if this Jira is still relevant. I have already created the PR, and it has already been reviewed by a couple of PMC members. Please help get it reviewed if the Jira is relevant; otherwise I'll close the PR.

> Place spark.files, spark.jars and spark.files under the current working
> directory on the driver in K8S cluster mode
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33782
>                 URL: https://issues.apache.org/jira/browse/SPARK-33782
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current
> working directory. This does not appear to be the case in Kubernetes
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615997#comment-17615997 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~hyukjin.kwon] Can you please help review the PR? It would be of great help.
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603426#comment-17603426 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~dongjoon] Please review the PR.
[jira] [Commented] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576793#comment-17576793 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Thanks for taking this. This is really helpful.

> Skip PVC cleanup when driver doesn't own PVCs
> ---------------------------------------------
>
>                 Key: SPARK-39965
>                 URL: https://issues.apache.org/jira/browse/SPARK-39965
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.3.0
>            Reporter: pralabhkumar
>            Assignee: pralabhkumar
>            Priority: Trivial
>
> Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288],
> functionality was added to delete PVCs when the Spark driver dies:
> [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]
>
> However, there are cases where Spark on K8s doesn't use PVCs and uses
> hostPath for storage:
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>
> In those cases:
> * It requests PVC deletion, which is not required.
> * It also tries to delete when the driver doesn't own the PVCs (i.e.
> spark.kubernetes.driver.ownPersistentVolumeClaim is false).
> * Moreover, in clusters where the Spark user doesn't have access to list
> or delete PVCs, it throws an exception:
>
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> GET at:
> [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1].
> Message: Forbidden! Configured service account doesn't have access. Service
> account may have been revoked. persistentvolumeclaims is forbidden: User
> "system:serviceaccount:dpi-dev:spark" cannot list resource
> "persistentvolumeclaims" in API group "" in the namespace "<>".
>
> *Solution*
> There should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination,
> or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked,
> before calling delete on the PVCs. If the user has not set up PVs, or the
> driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576516#comment-17576516 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Thanks for replying. We don't see an issue other than the exception in the logs (mentioned above). However, please note that prior to this fix we were not getting any exception in the logs. In scenarios where PVs are not being used by Spark (as in our case), why should we get the above exception in the logs? Currently there is no way to skip
{code:java}
Utils.tryLogNonFatalError {
  kubernetesClient
    .persistentVolumeClaims()
    .withLabel(SPARK_APP_ID_LABEL, applicationId())
    .delete()
}
{code}
IMHO, there should be a configuration that checks whether the driver owns the PVCs or whether Spark uses PVs at all. For example:
{code:java}
if (conf.get(KUBERNETES_DRIVER_OWN_PVC)) {
  Utils.tryLogNonFatalError {
    kubernetesClient
      .persistentVolumeClaims()
      .withLabel(SPARK_APP_ID_LABEL, applicationId())
      .delete()
  }
}
{code}
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575554#comment-17575554 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Please review.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

In those cases:
* It requests PVC deletion, which is not required.
* It also tries to delete when the driver doesn't own the PVCs (i.e. spark.kubernetes.driver.ownPersistentVolumeClaim is false).
* Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

*Solution*

There should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked, before calling delete on the PVCs. If the user has not set up PVs, or the driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

Even in those cases, it requests PVC deletion (which is not required). Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

Ideally there should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked, before calling delete on the PVCs. If the user has not set up PVs, or the driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574494#comment-17574494 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

Gentle ping [dongjoon-hyun|https://github.com/dongjoon-hyun].
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

Even in those cases, it requests PVC deletion (which is not required). Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

Ideally there should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, which should be checked before calling delete on the PVC. If the user has not set up PVs, there is no need to call the API and delete the PVC.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Component/s: (was: Spark Core)
[jira] [Created] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
pralabhkumar created SPARK-39965:
------------------------------------

             Summary: Spark on K8s deletes PVCs even though they're not being used
                 Key: SPARK-39965
                 URL: https://issues.apache.org/jira/browse/SPARK-39965
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Core
    Affects Versions: 3.3.0
            Reporter: pralabhkumar

In org.apache.spark.util, getConfiguredLocalDirs:
{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}
randomizeInPlace is not applied to conf.getenv("SPARK_LOCAL_DIRS").split(","), which is the branch used on K8s, so the shuffle locations are not randomized. IMHO this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
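The randomization requested above can be sketched with a Fisher-Yates in-place shuffle applied to the SPARK_LOCAL_DIRS branch, mirroring what the YARN branch already does (a local re-implementation for illustration only; `LocalDirsSketch` and `configuredLocalDirs` are hypothetical names, not Spark's code):

```scala
import scala.util.Random

// Local re-implementation of a Fisher-Yates in-place shuffle, used here to
// illustrate applying randomization to SPARK_LOCAL_DIRS the same way the
// YARN branch randomizes its local dirs.
object LocalDirsSketch {
  def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
    var i = arr.length - 1
    while (i > 0) {
      val j = rand.nextInt(i + 1) // pick from the not-yet-fixed prefix
      val tmp = arr(j); arr(j) = arr(i); arr(i) = tmp
      i -= 1
    }
    arr
  }

  // Mimics the K8s branch above, with the randomization step added.
  def configuredLocalDirs(sparkLocalDirs: String): Array[String] =
    randomizeInPlace(sparkLocalDirs.split(","))
}
```

The shuffle returns the same set of directories in a random order, so over many executors the shuffle files spread evenly across the configured disks instead of always favouring the first entry.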
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574314#comment-17574314 ] pralabhkumar commented on SPARK-33782: -- [~hyukjin.kwon] I would like to work on this. Please let me know if that's OK.

> Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> In YARN cluster mode, the passed files can be accessed from the current working directory. This does not appear to be the case in Kubernetes cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.
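The PEX workflow above hinges on a relative path (./myarchive.pex) resolving against the driver's working directory, which is where YARN localizes files passed via --files. A minimal sketch of that resolution, with an illustrative scratch directory standing in for the container's working directory:

```python
import os
import tempfile

# YARN cluster mode localizes --files into the container's working
# directory, so PYSPARK_PYTHON=./myarchive.pex resolves. Simulate that
# layout: create the file in a scratch dir and make it the cwd.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "myarchive.pex"), "w").close()
os.chdir(workdir)

# The relative path resolves only because the file sits in the cwd.
# This is the behavior the ticket asks K8s cluster mode to match.
print(os.path.exists("./myarchive.pex"))
```

If the file is not placed in the working directory (the K8s situation described here), the same relative path fails and the interpreter cannot be found.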
[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978 ] pralabhkumar edited comment on SPARK-39375 at 7/27/22 2:51 PM: --- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further sub-tasks so that other people can contribute. was (Author: pralabhkumar): This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further tasks so that other people can contribute. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations.
Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. 
From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978 ] pralabhkumar commented on SPARK-39375: -- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further tasks so that other people can contribute. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/].
(2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. 
> > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566430#comment-17566430 ] pralabhkumar commented on SPARK-39755: -- [~hyukjin.kwon] Please let me know whether the above suggestion is correct (we are facing an issue similar to the one described in SPARK-24992) when running Spark on K8s. I'll implement the same.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566164#comment-17566164 ] pralabhkumar commented on SPARK-39755: -- The same problem was seen on the YARN side, and the fix there was randomization (https://issues.apache.org/jira/browse/SPARK-24992). A similar problem is seen on K8s. Let me know if it's OK and I'll work on it.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

was:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

was:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566142#comment-17566142 ] pralabhkumar commented on SPARK-39755: -- [~dongjoon] Gentle ping.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Summary: SPARK_LOCAL_DIRS locations are not randomized in K8s (was: Spark-shuffle locations are not randomized in K8s)

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Summary: Spark-shuffle locations are not randomized in K8s (was: Spark-shuffle locations are not randomized in K8s )

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Commented] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565935#comment-17565935 ] pralabhkumar commented on SPARK-39755: -- [~hyukjin.kwon] Please comment on the same.

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data.

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Created] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
pralabhkumar created SPARK-39755: Summary: Spark-shuffle locations are not randomized in K8s Key: SPARK-39755 URL: https://issues.apache.org/jira/browse/SPARK-39755 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 3.3.0 Reporter: pralabhkumar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557987#comment-17557987 ] pralabhkumar commented on SPARK-38292: -- [~hyukjin.kwon] Please let me know if this is OK; I'll do the same.

> Support `na_filter` for pyspark.pandas.read_csv
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Haejoon Lee
> Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function. (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557537#comment-17557537 ] pralabhkumar commented on SPARK-38292: -- [~hyukjin.kwon] Thanks for the suggestion. After going through the code (DataFrameReader and the Univocity parser code), here is the analysis. Example: input A,,B with spark.read.option("nullValue", "A") results in null, null, B. The reason is: * the _parse_ method in org.apache.spark.sql.catalyst.csv.UnivocityParser * parses the string to A,A,B (settings.setNullValue in com.univocity.parsers.csv.CsvParser replaces the empty ,, field with A) * nullSafeDatum then checks if (datum == options.nullValue || datum == null) and returns null for both values, since datum == options.nullValue => null, null, B * I am not sure this is the expected output, since from the com.univocity.parsers.csv.CsvParser point of view the expected output should be "A,A,B" after setting .setNullValue("A") *Solution* For na_filter, what I am thinking is to extend the condition to if ((na_filter && datum == options.nullValue) || datum == null). Now if the input string is A,,B and the user has set na_filter to False, com.univocity.parsers.csv.CsvParser will return the value as-is, since setNullValue is (""). The (na_filter && datum == options.nullValue) condition then becomes false, and converter.apply(datum) leaves the value as-is. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
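The condition proposed in the comment above can be modeled in plain Python (hypothetical names; the real change would live in Scala, in UnivocityParser's nullSafeDatum):

```python
def null_safe_datum(datum, null_value, na_filter, convert):
    # Hypothetical model of the proposed check: treat a field as null only
    # when na_filter is enabled (or the parser itself returned null).
    if (na_filter and datum == null_value) or datum is None:
        return None
    return convert(datum)

# na_filter=True, nullValue="A": the parser output for "A,,B" is A,A,B, and
# both "A" fields then collapse to null -- the behavior described above.
print([null_safe_datum(d, "A", True, str) for d in ["A", "A", "B"]])  # [None, None, 'B']

# na_filter=False, nullValue="": values pass through unchanged.
print([null_safe_datum(d, "", False, str) for d in ["A", "", "B"]])   # ['A', '', 'B']
```

With the extra flag, disabling na_filter short-circuits the nullValue comparison, so the converter sees the raw field.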
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556851#comment-17556851 ] pralabhkumar commented on SPARK-38292: -- [~itholic] [~hyukjin.kwon] I would like to discuss the logic. The difference appears with na_filter = False when there are missing values, e.g. 22,,1980-09-26 33,,1980-09-26 pandas with na_filter = False reads the values as-is; however, Spark reads the missing value as null. This happens because of univocity-parsers, which reads a missing value as null. Proposed approach for na_filter: once the file is read in namespace.py via reader.csv(path), replace missing values with an empty string (df.fillna("")). We also need to change the datatype of the column to string (as pandas does). Please let me know if this is the correct direction and I'll create a PR. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
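The behavioral difference discussed in the comment above can be illustrated with a small stdlib-only sketch (not PySpark; `read_csv_model` is a hypothetical stand-in): with na_filter enabled, empty fields become null, while with na_filter disabled they stay as empty strings, which is what the proposed df.fillna("") with string dtype would emulate.

```python
import csv
import io

def read_csv_model(text, na_filter=True):
    # Toy model of the discussion above: na_filter=True maps empty fields
    # to None (what Spark/univocity-parsers effectively do today), while
    # na_filter=False keeps every field as the exact string that was read.
    rows = list(csv.reader(io.StringIO(text)))
    if na_filter:
        return [[v if v != "" else None for v in row] for row in rows]
    return rows

data = "22,,1980-09-26\n33,,1980-09-26\n"
print(read_csv_model(data))                   # middle column becomes None
print(read_csv_model(data, na_filter=False))  # middle column stays ''
```

Note that in the na_filter=False branch every column is necessarily string-typed, matching the dtype change mentioned above.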
[jira] [Commented] (SPARK-39399) proxy-user support not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556302#comment-17556302 ] pralabhkumar commented on SPARK-39399: -- Gentle ping [~hyukjin.kwon] [~dongjoon] > proxy-user support not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user on the > spark-submit command to the childArgs. The actual functionality of > authentication using the proxy user is not working in case of cluster deploy > mode for Spark on K8s. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. > Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ > --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \--conf > spark.kubernetes.file.upload.path=hdfs:///tmp \--conf > spark.kubernetes.container.image.pullPolicy=Always \ > --conf > spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/log4j/log4j.properties > \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > 
+ mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong >
[jira] [Commented] (SPARK-39399) proxy-user support not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554904#comment-17554904 ] pralabhkumar commented on SPARK-39399: -- ping [~hyukjin.kwon], please help us with this, or point us to someone who can take it forward. > proxy-user support not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user on the > spark-submit command to the childArgs. The actual functionality of > authentication using the proxy user is not working in case of cluster deploy > mode for Spark on K8s. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. 
> Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ > --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \--conf > spark.kubernetes.file.upload.path=hdfs:///tmp \--conf > spark.kubernetes.container.image.pullPolicy=Always \ > --conf > spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/log4j/log4j.properties > \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > 
org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private >
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554551#comment-17554551 ] pralabhkumar commented on SPARK-38292: -- [~itholic] I would like to work on this . > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
pralabhkumar created SPARK-39179: Summary: Improve the test coverage for pyspark/shuffle.py Key: SPARK-39179 URL: https://issues.apache.org/jira/browse/SPARK-39179 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: pralabhkumar Fix For: 3.4.0 Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
[ https://issues.apache.org/jira/browse/SPARK-39179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536555#comment-17536555 ] pralabhkumar commented on SPARK-39179: -- I am working on this . > Improve the test coverage for pyspark/shuffle.py > > > Key: SPARK-39179 > URL: https://issues.apache.org/jira/browse/SPARK-39179 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of shuffle.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
[ https://issues.apache.org/jira/browse/SPARK-39179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39179: - Description: Improve the test coverage of shuffle.py (was: Improve the test coverage of taskcontext.py) > Improve the test coverage for pyspark/shuffle.py > > > Key: SPARK-39179 > URL: https://issues.apache.org/jira/browse/SPARK-39179 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of shuffle.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533764#comment-17533764 ] pralabhkumar commented on SPARK-39102: -- Sure I'll work on this > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think its better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533251#comment-17533251 ] pralabhkumar commented on SPARK-39097: -- [~hyukjin.kwon] While analyzing the unit tests for TaskContext in test_taskcontext.py, I found that most of the test cases are already there. However, they do not show up in the coverage report, probably because the methods are called inside tasks (rdd.map(lambda x: TaskContext.get().stageId())). So, for example, the report at [https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/taskcontext.py] says stageId is not covered by any test, yet test_stage_id does test the stageId method: stage1 = rdd.map(lambda x: TaskContext.get().stageId()).take(1)[0] If I change the code as below and bring the TaskContext back to the driver, then the coverage report marks stageId as covered: rdd.map(lambda x: TaskContext.get()).take(1)[0].stageId() I can change the tests to this form to get the coverage; please let me know if this is correct. > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39097: - Description: Improve the test coverage of taskcontext.py (was: Improve the test coverage of rddsampler.py) > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532443#comment-17532443 ] pralabhkumar commented on SPARK-39102: -- ping [~hyukjin.kwon] > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think its better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
pralabhkumar created SPARK-39102: Summary: Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory() Key: SPARK-39102 URL: https://issues.apache.org/jira/browse/SPARK-39102 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.1, 3.2.0, 3.4.0 Reporter: pralabhkumar Hi There are several classes where Spark is using guava's Files.createTempDir() which have vulnerabilities. I think its better to move to java.nio.file.Files.createTempDirectory() for those classes. Classes Java8RDDAPISuite JavaAPISuite.java RPackageUtilsSuite StreamTestHelper TestShuffleDataContext ExternalBlockHandlerSuite -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
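For context on the vulnerability motivating the issue above (CVE-2020-8908: Guava's Files.createTempDir() creates the directory with permissions that may let other local users read it), java.nio.file.Files.createTempDirectory() creates the directory with owner-only access on POSIX systems. The same property can be demonstrated with Python's stdlib analogue, tempfile.mkdtemp, which likewise restricts the new directory to the creating user (a cross-language illustration, not Spark code):

```python
import os
import stat
import tempfile

# tempfile.mkdtemp, like java.nio.file.Files.createTempDirectory(), creates
# the directory readable, writable and searchable only by the creating
# user -- the guarantee Guava's deprecated Files.createTempDir() lacks.
path = tempfile.mkdtemp(prefix="spark-test-")
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o700 on POSIX systems
os.rmdir(path)
```

This is why swapping the factory method in the listed test suites closes the information-disclosure window without changing test behavior.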
[jira] [Commented] (SPARK-38262) Upgrade Google guava to version 30.0-jre
[ https://issues.apache.org/jira/browse/SPARK-38262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531526#comment-17531526 ] pralabhkumar commented on SPARK-38262: -- [~bjornjorgensen] Quick question: as part of this PR, Guava was not upgraded to version 30.0 because of issues on the Hive and Hadoop side. * So is there any plan to fix [CVE-2020-8908|https://nvd.nist.gov/vuln/detail/CVE-2020-8908]? * Does https://issues.apache.org/jira/browse/HADOOP-18036 affect any decision on the Spark side? > Upgrade Google guava to version 30.0-jre > > > Key: SPARK-38262 > URL: https://issues.apache.org/jira/browse/SPARK-38262 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > This is duplicated many times like in > [SPARK-32502|https://issues.apache.org/jira/browse/SPARK-32502] > Apache Spark is using com.google.guava:guava version 14.0.1 which has two > security issues. > [CVE-2018-10237|https://nvd.nist.gov/vuln/detail/CVE-2018-10237] > [CVE-2020-8908|https://nvd.nist.gov/vuln/detail/CVE-2020-8908] > We should upgrade to [version > 30.0|https://mvnrepository.com/artifact/com.google.guava/guava/30.0-jre] > I will add some links to what I have found about this issue > [HIVE-25617:fix bug introduced by > CVE-2020-8908|https://github.com/apache/hive/pull/2725] > [Upgrade Guava to 27|https://github.com/apache/druid/pull/10683] > [HIVE-21961: Upgrade Hadoop to 3.1.4, Guava to 27.0-jre and Jetty to > 9.4.20.v20190813|https://github.com/apache/hive/pull/1821] > [Shade Guava manually|https://github.com/apache/druid/issues/6942] > [[DISCUSS] Hadoop 3, dropping support for Hadoop > 2.x|https://lists.apache.org/thread/zmc389trnkh6x444so8mdb2h0x0noqq4] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
pralabhkumar created SPARK-39097: Summary: Improve the test coverage for pyspark/taskcontext.py Key: SPARK-39097 URL: https://issues.apache.org/jira/browse/SPARK-39097 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: pralabhkumar Fix For: 3.4.0 Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531481#comment-17531481 ] pralabhkumar commented on SPARK-39097: -- I am working on this. > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530083#comment-17530083 ] pralabhkumar commented on SPARK-25355: -- [~hyukjin.kwon] Can you please help us, or redirect us to someone who can, with the above two comments. > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39029) Improve the test coverage for pyspark/broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-39029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528269#comment-17528269 ] pralabhkumar commented on SPARK-39029: -- I am working on this . > Improve the test coverage for pyspark/broadcast.py > -- > > Key: SPARK-39029 > URL: https://issues.apache.org/jira/browse/SPARK-39029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39029) Improve the test coverage for pyspark/broadcast.py
pralabhkumar created SPARK-39029: Summary: Improve the test coverage for pyspark/broadcast.py Key: SPARK-39029 URL: https://issues.apache.org/jira/browse/SPARK-39029 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38492) Improve the test coverage for PySpark
[ https://issues.apache.org/jira/browse/SPARK-38492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523293#comment-17523293 ] pralabhkumar commented on SPARK-38492: -- on it . Thx > Improve the test coverage for PySpark > - > > Key: SPARK-38492 > URL: https://issues.apache.org/jira/browse/SPARK-38492 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module
pralabhkumar created SPARK-38874: Summary: CLONE - Improve the test coverage for pyspark/ml module Key: SPARK-38874 URL: https://issues.apache.org/jira/browse/SPARK-38874 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, ml module has 90% of test coverage. We could improve the test coverage by adding the missing tests for ml module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar closed SPARK-38871. This issue was created by mistake, hence closing it. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar resolved SPARK-38871. -- Resolution: Invalid > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521135#comment-17521135 ] pralabhkumar commented on SPARK-38879: -- I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521135#comment-17521135 ] pralabhkumar edited comment on SPARK-38879 at 4/12/22 1:07 PM: --- Please allow me to work on this was (Author: pralabhkumar): I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38879: - Description: Improve the test coverage of rddsampler.py (was: Improve the test coverage of statcounter.py ) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
pralabhkumar created SPARK-38879: Summary: Improve the test coverage for pyspark/rddsampler.py Key: SPARK-38879 URL: https://issues.apache.org/jira/browse/SPARK-38879 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521134#comment-17521134 ] pralabhkumar commented on SPARK-38871: -- Please close this one; it was cloned by mistake. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py
pralabhkumar created SPARK-38876: Summary: CLONE - Improve the test coverage for pyspark/*.py Key: SPARK-38876 URL: https://issues.apache.org/jira/browse/SPARK-38876 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, there are several Python scripts under pyspark/ directory. (e.g. rdd.py, util.py, serializers.py, ...) We could improve the test coverage by adding the missing tests for these scripts. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module
pralabhkumar created SPARK-38872: Summary: CLONE - Improve the test coverage for pyspark/pandas module Key: SPARK-38872 URL: https://issues.apache.org/jira/browse/SPARK-38872 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, pandas module (pandas API on Spark) has 94% of test coverage. We could improve the test coverage by adding the missing tests for pandas module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py
pralabhkumar created SPARK-38878: Summary: CLONE - Improve the test coverage for pyspark/statcounter.py Key: SPARK-38878 URL: https://issues.apache.org/jira/browse/SPARK-38878 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module
pralabhkumar created SPARK-38875: Summary: CLONE - Improve the test coverage for pyspark/sql module Key: SPARK-38875 URL: https://issues.apache.org/jira/browse/SPARK-38875 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, sql module has 90% of test coverage. We could improve the test coverage by adding the missing tests for sql module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py
pralabhkumar created SPARK-38877: Summary: CLONE - Improve the test coverage for pyspark/find_spark_home.py Key: SPARK-38877 URL: https://issues.apache.org/jira/browse/SPARK-38877 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
pralabhkumar created SPARK-38871: Summary: Improve the test coverage for PySpark/rddsampler.py Key: SPARK-38871 URL: https://issues.apache.org/jira/browse/SPARK-38871 Project: Spark Issue Type: Umbrella Components: PySpark, Tests Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, PySpark test coverage is around 91% according to codecov report: [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] Since there are still 9% missing tests, so I think it would be great to improve our test coverage. Of course we might not target to 100%, but as much as possible, to the level that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module
pralabhkumar created SPARK-38873: Summary: CLONE - Improve the test coverage for pyspark/mllib module Key: SPARK-38873 URL: https://issues.apache.org/jira/browse/SPARK-38873 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, mllib module has 88% of test coverage. We could improve the test coverage by adding the missing tests for mllib module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17520374#comment-17520374 ] pralabhkumar commented on SPARK-38854: -- [~gurwls223] I would like to work on this, so could you please set the Assignee back to "unassigned" for now? (It was filled in automatically when I cloned your task.) Thanks. > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Affects Version/s: 3.3.0 (was: 3.4.0) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Description: Improve the test coverage of statcounter.py (was: We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py)) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Fix Version/s: (was: 3.4.0) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > We should test when the environment variables are not set > (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
pralabhkumar created SPARK-38854: Summary: Improve the test coverage for pyspark/statcounter.py Key: SPARK-38854 URL: https://issues.apache.org/jira/browse/SPARK-38854 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38637) pyspark.pandas.config.OptionError: "No such option: 'mode.chained_assignment'
[ https://issues.apache.org/jira/browse/SPARK-38637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517393#comment-17517393 ] pralabhkumar commented on SPARK-38637: -- [~itholic] can I work on this > pyspark.pandas.config.OptionError: "No such option: 'mode.chained_assignment' > - > > Key: SPARK-38637 > URL: https://issues.apache.org/jira/browse/SPARK-38637 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I replaced import pandas as pd to import pyspark.pandas as pd in my code. > {code:java} > pd.set_option("mode.chained_assignment", None) {code} > The above command was working with pandas but this option is not available in > pyspark.pandas . > {code:java} > pyspark.pandas.config.OptionError: "No such option: > 'mode.chained_assignment'. Available options are [display.max_rows, > compute.max_rows, compute.shortcut_limit, compute.ops_on_diff_frames, > compute.default_index_type, compute.ordered_head, > plotting.max_rows, plotting.sample_ratio, plotting.backend]" {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
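As a user-side workaround until the option is supported, the set_option call can be guarded so code shared between pandas and pyspark.pandas still runs. A minimal sketch, assuming only that unsupported options raise an exception; the helper and the fake backend below are illustrative, not pyspark.pandas API:

```python
def try_set_option(set_option, key, value):
    # Apply the option if the backend recognises it; swallow the error
    # otherwise (pyspark.pandas raises OptionError for unknown keys).
    try:
        set_option(key, value)
        return True
    except Exception:
        return False

# Simulated backend that only knows one option, for demonstration.
def fake_set_option(key, value):
    if key != "display.max_rows":
        raise KeyError(f"No such option: {key!r}")

print(try_set_option(fake_set_option, "mode.chained_assignment", None))  # → False
print(try_set_option(fake_set_option, "display.max_rows", 10))           # → True
```

In real code, `set_option` would be `pyspark.pandas.set_option` or `pandas.set_option`; the wrapper lets the same script run against either backend.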
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478618#comment-17478618 ] pralabhkumar commented on SPARK-24432: -- [~dongjoon] One quick question: you mentioned that "The K8s dynamic allocation with storage migration between executors is already in `master` branch for Apache Spark 3.1.0." Could you please point me to the PR that implements this? It would be really helpful. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472056#comment-17472056 ] pralabhkumar commented on SPARK-37491: -- Let's take the example of pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas"). pser.asof([5, 20]) will give the output [NaN, 1], while ps.from_pandas(pser).asof([5, 20]) will give the output [NaN, 2]. *Explanation* This is the DataFrame created after applying the condition F.when(index_scol <= SF.lit(index).cast(index_type), ...), without applying the max aggregation:

+-----+------+-----------------+
|col_5|col_20|__index_level_0__|
+-----+------+-----------------+
|null |2.0   |10               |
|null |1.0   |20               |
|null |null  |30               |
|null |null  |40               |
+-----+------+-----------------+

Since we are taking max, the output comes out as 2. What we actually need is the last non-null value of each column, in increasing order of __index_level_0__. To implement that logic, I am planning to build the DataFrame below from the one above, using explode, partitioning and row_number over __index_level_0__:

+-----------------+----------+-----+----------+
|__index_level_0__|identifier|value|row_number|
+-----------------+----------+-----+----------+
|40               |col_5     |null |1         |
|30               |col_5     |null |2         |
|20               |col_5     |null |3         |
|10               |col_5     |null |4         |
|40               |col_20    |2    |1         |
|30               |col_20    |1    |2         |
|20               |col_20    |null |3         |
|10               |col_20    |null |4         |
+-----------------+----------+-----+----------+

Then filter on row_number = 1. There are other things to take care of, but the majority of the logic is this. Please let me know whether this is the correct direction (it is actually passing all the asof test cases, including the case described in this Jira). [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
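The pandas side of the example in this comment can be checked directly, with no Spark involved; a minimal sketch of the expected behaviour:

```python
import numpy as np
import pandas as pd

# Reproduce the example above: pandas' asof returns the last non-NaN
# value at or before each requested index position.
pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
result = pser.asof([5, 20])

# No index <= 5 exists, so the first answer is NaN; for 20 the last
# non-NaN value at or before index 20 is 1.0 — not the column max, 2.0,
# which is what the F.max-based implementation returns.
print(result.tolist())  # → [nan, 1.0]
```

This is the reference behaviour the proposed explode/row_number rewrite has to match.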
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17471314#comment-17471314 ] pralabhkumar commented on SPARK-37491: -- I am working on it . > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469471#comment-17469471 ] pralabhkumar commented on SPARK-37491: -- I would like to work on this . Basically the problem is in series.py , finding Max . cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469471#comment-17469471 ] pralabhkumar edited comment on SPARK-37491 at 1/5/22, 6:27 PM: --- I would like to work on this . Basically the problem is in series.py . We should not find max here. cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] was (Author: pralabhkumar): I would like to work on this . Basically the problem is in series.py , finding Max . cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446305#comment-17446305 ] pralabhkumar commented on SPARK-37188: -- [~hyukjin.kwon] Working on it . Thx > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445978#comment-17445978 ] pralabhkumar edited comment on SPARK-37188 at 11/18/21, 2:56 PM: - IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout(title=kwargs.get("title")) , title issue is not happening(provided user passes title). I think , we should pass all the arguments provided by user and expected by go.Layout. Similarly for go.Bar [~yikunkero] [~hyukjin.kwon] . Please let me know , if I my understanding is correct , I can create a PR for it . was (Author: pralabhkumar): IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout(title=kwargs.get("title")) , title issue is not happening(provided user passes title). I think , we should pass all the arguments provided by user and expected by go.Layout. Similarly for go.Bar [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445978#comment-17445978 ] pralabhkumar commented on SPARK-37188: -- IMHO, the issue is in the plot_histogram method of pyspark.pandas.plot.plotly: the keyword arguments (kwargs) passed by the user are not forwarded to plotly when creating the figure. So this issue is not limited to title; it can happen with other arguments such as "activeshape" and "font". Once I pass the user argument through go.Layout(title=kwargs.get("title")), the title issue no longer occurs (provided the user passes a title). I think we should forward all user-provided arguments that go.Layout expects, and similarly for go.Bar. [~yikunkero] Please let me know if my understanding is correct; if so, I can create a PR for it. > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
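The fix sketched in this comment amounts to forwarding user-supplied kwargs into the layout object instead of dropping them. A minimal, library-agnostic sketch of that forwarding pattern; `build_layout` and `LAYOUT_KEYS` are illustrative names, not the actual pyspark.pandas or plotly code:

```python
# Options the (hypothetical) layout object understands; in the real fix
# this set would be the keyword arguments accepted by go.Layout.
LAYOUT_KEYS = {"title", "font"}

def build_layout(**kwargs):
    # Forward only the layout-level options; everything else (e.g. bins)
    # is consumed elsewhere by the plotting code.
    return {k: v for k, v in kwargs.items() if k in LAYOUT_KEYS}

layout = build_layout(bins=20, title="US Counties -- FullVaxPer100")
print(layout)  # → {'title': 'US Counties -- FullVaxPer100'}
```

The same filtering would apply to trace-level options (the go.Bar case mentioned above).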
[jira] [Comment Edited] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445960#comment-17445960 ] pralabhkumar edited comment on SPARK-37189 at 11/18/21, 2:52 PM: - was (Author: pralabhkumar): IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout , title issue is not happening(provided user passes title). [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189 ] pralabhkumar deleted comment on SPARK-37189: -- was (Author: pralabhkumar): > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445960#comment-17445960 ] pralabhkumar commented on SPARK-37189: -- IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout , title issue is not happening(provided user passes title). [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444256#comment-17444256 ] pralabhkumar commented on SPARK-37181: -- [~yikunkero] [~chconnell] . I'll work on this and will create a PR > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] , [~yikunkero] Let me know , if my understanding is correct . If yes, then I can work on this h1. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] , [~yikunkero] Let me know , if I can work on this h1. > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However, from the user's point of view, if the user specifies latin-1 in pyspark.pandas, then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon] [~yikunkero] Let me know if I can work on this. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon]
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar commented on SPARK-37181: -- However, from the user's point of view, if the user specifies latin-1 in pyspark.pandas, then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon]
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443720#comment-17443720 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 10:32 AM: -- from pyspark import pandas as ps. The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("<>", encoding='ISO-8859-1') [~chconnell] was (Author: pralabhkumar): from pyspark import pandas as ps latin-1 encoding is same as ISO-8859-1. You can mentioned the same . ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding ='ISO-8859-1')
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443720#comment-17443720 ] pralabhkumar commented on SPARK-37181: -- from pyspark import pandas as ps. The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding='ISO-8859-1')
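The equivalence claimed above can be checked with Python's standard codecs registry, which already canonicalizes encoding aliases. A minimal sketch of how pyspark.pandas could normalize a user-supplied name before handing it to the JVM reader; the `normalize_encoding` helper is hypothetical, not part of Spark:

```python
import codecs

def normalize_encoding(name):
    """Map an encoding alias (e.g. 'latin-1') to its canonical codec name."""
    return codecs.lookup(name).name

# 'latin-1' and 'ISO-8859-1' resolve to the same canonical codec,
# while Windows-1252 is a genuinely different (if similar) encoding.
print(normalize_encoding("latin-1"))       # iso8859-1
print(normalize_encoding("ISO-8859-1"))    # iso8859-1
print(normalize_encoding("windows-1252"))  # cp1252
```

Normalizing through the codec registry would accept every standard alias, not just latin-1, without maintaining a hand-written mapping.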
[jira] [Commented] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432812#comment-17432812 ] pralabhkumar commented on SPARK-30537: -- Thanks [~hyukjin.kwon], working on this. > toPandas gets wrong dtypes when applied on empty DF when Arrow enabled > -- > > Key: SPARK-30537 > URL: https://issues.apache.org/jira/browse/SPARK-30537 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Same issue with SPARK-29188 persists when Arrow optimization is enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432805#comment-17432805 ] pralabhkumar commented on SPARK-30537: -- [~hyukjin.kwon] I would like to work on this; please let me know if I can take it.
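The wrong dtypes arise because, for an empty DataFrame, pandas has no data to infer column types from, so they must instead be derived from the Spark schema. A rough illustration of that idea, using an assumed and much-simplified type mapping rather than Spark's actual conversion table:

```python
# Hypothetical Spark-SQL-type -> pandas dtype mapping (illustrative subset).
_SPARK_TO_PANDAS = {
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "string": "object",
}

def empty_frame_dtypes(schema):
    """Derive pandas dtypes for an empty result from a Spark schema.

    schema: list of (column_name, spark_type_name) pairs.
    Unknown types fall back to 'object', mirroring pandas' default.
    """
    return {name: _SPARK_TO_PANDAS.get(spark_type, "object")
            for name, spark_type in schema}

print(empty_frame_dtypes([("id", "bigint"), ("flag", "boolean")]))
# {'id': 'int64', 'flag': 'bool'}
```

Applying such a mapping in the empty-partition branch of toPandas would keep dtypes consistent between the Arrow and non-Arrow paths.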
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17431140#comment-17431140 ] pralabhkumar commented on SPARK-32285: -- [~bryanc] Please review the PR. > Add PySpark support for nested timestamps with arrow > > > Key: SPARK-32285 > URL: https://issues.apache.org/jira/browse/SPARK-32285 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently with arrow optimizations, there is post-processing done in pandas > for timestamp columns to localize timezone. This is not done for nested > columns with timestamps such as StructType or ArrayType. > Adding support for this is needed for Apache Arrow 1.0.0 upgrade due to use > of structs with timestamps in groupedby key over a window. > As a simple first step, timestamps with 1 level nesting could be done first > and this will satisfy the immediate need. > NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing > with pyarrow.array.cast, which could be easier done than in pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430099#comment-17430099 ] pralabhkumar commented on SPARK-32161: -- [~hyukjin.kwon] Since the PR has been merged, please update the status of the Jira and assign it to me. > Hide JVM traceback for SparkUpgradeException > > > Key: SPARK-32161 > URL: https://issues.apache.org/jira/browse/SPARK-32161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added {{SparkUpgradeException}} which the JVM traceback is pretty useless. > See also https://github.com/apache/spark/pull/28736/files#r449184881 > It should better also whitelist and hide JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428185#comment-17428185 ] pralabhkumar commented on SPARK-32161: -- [~hyukjin.kwon] Please let me know if I can work on this. IMHO it is a change in the convert_exception method (sql/utils.py) to handle SparkUpgradeException.
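The whitelisting idea described here, suppressing the JVM traceback for known exception classes whose Python-side message is self-explanatory, can be sketched in plain Python. The class set and helper name below are illustrative and not the actual signature of convert_exception in PySpark:

```python
# Exception classes for which the JVM traceback adds no value (assumed list).
HIDE_JVM_TRACEBACK = {
    "org.apache.spark.SparkUpgradeException",
}

def render_exception(java_class, message, jvm_traceback):
    """Return the error text shown to the Python user.

    Whitelisted classes get only the message; everything else keeps
    the full JVM traceback for debugging.
    """
    if java_class in HIDE_JVM_TRACEBACK:
        return message  # traceback hidden: the message is enough
    return message + "\n" + jvm_traceback

out = render_exception("org.apache.spark.SparkUpgradeException",
                       "upgrade error", "at org.apache.spark...")
print(out)  # upgrade error
```

The real fix would thread this decision through PySpark's Py4J exception conversion so users of SparkUpgradeException see only the actionable message.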
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414232#comment-17414232 ] pralabhkumar commented on SPARK-32285: -- [~hyukjin.kwon] I just added the initial version for converting a Spark DataFrame to pandas for ArrayType(TimestampType) via Arrow. It is not the complete PR; I would like your early opinion. Please let me know if it is going in the right direction, and I'll complete the rest of the work.
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413679#comment-17413679 ] pralabhkumar commented on SPARK-32285: -- Thanks, I will share the PR shortly.
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413482#comment-17413482 ] pralabhkumar commented on SPARK-32285: -- [~hyukjin.kwon] [~emkornfi...@gmail.com] I would like to work on this; I have most of the logic ready for ArrayType(TimestampType). Please let me know if I can take it.
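Conceptually, the missing post-processing for nested timestamp columns is a recursive timezone localization: walk into arrays and structs and attach the session timezone to naive timestamps. A pure-Python sketch of that recursion, independent of Arrow and pandas, with all names assumed:

```python
from datetime import datetime, timezone

def localize_nested(value, tz=timezone.utc):
    """Recursively attach tz to naive datetimes inside lists/dicts.

    Lists model ArrayType, dicts model StructType; tz-aware datetimes
    and non-timestamp leaves pass through unchanged.
    """
    if isinstance(value, list):
        return [localize_nested(v, tz) for v in value]
    if isinstance(value, dict):
        return {k: localize_nested(v, tz) for k, v in value.items()}
    if isinstance(value, datetime) and value.tzinfo is None:
        return value.replace(tzinfo=tz)
    return value

row = {"events": [datetime(2021, 1, 1, 12, 0)]}
print(localize_nested(row)["events"][0].tzinfo)  # UTC
```

As the issue notes, with Arrow 1.0.0 the same effect might be achievable vectorized via pyarrow.Array.cast rather than by per-value recursion in pandas.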
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411859#comment-17411859 ] pralabhkumar commented on SPARK-36622: -- [~angerszhuuu] [~tgraves] Please review the PR > spark.history.kerberos.principal doesn't take value _HOST > - > > Key: SPARK-36622 > URL: https://issues.apache.org/jira/browse/SPARK-36622 > Project: Spark > Issue Type: Improvement > Components: Deploy, Security, Spark Core >Affects Versions: 3.0.1, 3.1.2, 3.2.0 >Reporter: pralabhkumar >Priority: Minor > > spark.history.kerberos.principal doesn't understand value _HOST. > It says failure to login for principal : spark/_HOST@realm . > It will be helpful to take _HOST value via config file and change it with > current hostname(similar to what Hive does) . This will also help to run SHS > on multiple machines without hardcoding principal hostname. > .spark.history.kerberos.principal > > It require minor change in HistoryServer.scala in initSecurity method . > > Please let me know , if this request make sense , I'll create the PR . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410578#comment-17410578 ] pralabhkumar commented on SPARK-36622: -- [~angerszhuuu] [~tgraves] [~hyukjin.kwon] I have created the PR. Please review.
[jira] [Updated] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-36622: - Affects Version/s: 3.2.0
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409617#comment-17409617 ] pralabhkumar commented on SPARK-36622: -- [~thejdeep] It's better to support _HOST; it has been common practice for HiveServer and similar projects. [~tgraves] Agreed. Please let me know if you are OK with this; I can create the PR.
[jira] [Created] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
pralabhkumar created SPARK-36622: Summary: spark.history.kerberos.principal doesn't take value _HOST Key: SPARK-36622 URL: https://issues.apache.org/jira/browse/SPARK-36622 Project: Spark Issue Type: Improvement Components: Deploy, Security, Spark Core Affects Versions: 3.1.2, 3.0.1 Reporter: pralabhkumar  spark.history.kerberos.principal doesn't understand the value _HOST; it fails with "failure to login for principal: spark/_HOST@realm". It would be helpful to take the _HOST value from the config file and replace it with the current hostname (similar to what Hive does). This would also make it possible to run the SHS on multiple machines without hardcoding the principal hostname in spark.history.kerberos.principal. It requires a minor change in the initSecurity method of HistoryServer.scala. Please let me know if this request makes sense; I'll create the PR.
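The proposed substitution mirrors what Hadoop's SecurityUtil.getServerPrincipal does: replace the _HOST token in the configured principal with the machine's canonical hostname at startup. A small Python sketch of the logic (the actual change would be in Scala, in HistoryServer's initSecurity; the helper name is hypothetical):

```python
import socket

def resolve_principal(principal, hostname=None):
    """Replace the _HOST token in a Kerberos principal with the real hostname.

    Defaults to this machine's FQDN, lowercased per Kerberos convention.
    """
    hostname = hostname or socket.getfqdn()
    return principal.replace("_HOST", hostname.lower())

print(resolve_principal("spark/_HOST@EXAMPLE.COM", hostname="shs1.example.com"))
# spark/shs1.example.com@EXAMPLE.COM
```

With this, the same spark.history.kerberos.principal value works unchanged on every machine hosting the History Server.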
[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236743#comment-17236743 ] pralabhkumar commented on SPARK-32924: -- [~rakson] [~hyukjin.kwon] Can I open a PR for this? > Web UI sort on duration is wrong > > > Key: SPARK-32924 > URL: https://issues.apache.org/jira/browse/SPARK-32924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: ui_sort.png > > > See attachment, 9 s(econds) is showing as larger than 8.1min -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
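The symptom in this issue is the classic pitfall of sorting formatted duration strings lexicographically instead of sorting the underlying millisecond values. A small demonstration (the formatting rules are assumed, not Spark's exact ones):

```python
def format_duration(ms):
    """Render milliseconds the way a UI might: '9 s' or '8.1 min'."""
    if ms < 60_000:
        return f"{ms / 1000:.0f} s"
    return f"{ms / 60_000:.1f} min"

durations_ms = [486_000, 9_000]  # 8.1 min and 9 s
by_string = sorted(format_duration(d) for d in durations_ms)
by_value = [format_duration(d) for d in sorted(durations_ms)]
print(by_string)  # ['8.1 min', '9 s']  -- wrong: string compare puts '8...' first
print(by_value)   # ['9 s', '8.1 min']  -- correct: sort on the raw milliseconds
```

The fix is for the table's sort key to be the raw numeric duration, with the human-readable string used only for display.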
[jira] [Commented] (SPARK-29103) CheckAnalysis for data source V2 ALTER TABLE ignores case sensitivity
[ https://issues.apache.org/jira/browse/SPARK-29103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934368#comment-16934368 ] pralabhkumar commented on SPARK-29103: -- [~joseph.torres] In the findNestedField method of StructType.scala, one can apply fieldNames.headOption.map(_.toLowerCase(Locale.ROOT)), or in CheckAnalysis.scala, under case alter: AlterTable, one can lowercase fieldName before passing it to table.schema.findNestedField(fieldName, includeCollections = true). Let me know if this approach is fine; I can create the PR for it. > CheckAnalysis for data source V2 ALTER TABLE ignores case sensitivity > - > > Key: SPARK-29103 > URL: https://issues.apache.org/jira/browse/SPARK-29103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jose Torres >Priority: Blocker > > For each column referenced, we run > ```val field = table.schema.findNestedField(fieldName, includeCollections = > true)``` > and fail analysis if the field isn't there. This check is always > case-sensitive on column names, even if the underlying catalog is case > insensitive, so it will sometimes throw on ALTER operations which the catalog > supports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
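The first option above, normalizing field names before comparison, can be illustrated with a case-insensitive nested-field lookup. The sketch below uses a plain dict standing in for StructType, and all names are hypothetical:

```python
def find_nested_field(schema, path, case_sensitive=False):
    """Walk a nested dict-of-dicts schema along `path`.

    When case_sensitive is False, names are compared lowercased,
    matching how a case-insensitive catalog resolves columns.
    """
    norm = (lambda s: s) if case_sensitive else str.lower
    node = schema
    for name in path:
        if not isinstance(node, dict):
            return None
        node = next((v for k, v in node.items() if norm(k) == norm(name)), None)
        if node is None:
            return None
    return node

schema = {"Point": {"X": "double", "Y": "double"}}
print(find_nested_field(schema, ["point", "x"]))                       # double
print(find_nested_field(schema, ["point", "x"], case_sensitive=True))  # None
```

Driving the `case_sensitive` flag from the analyzer's resolver configuration would make the ALTER TABLE check agree with the catalog's behavior.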
[jira] [Commented] (SPARK-25788) Elastic net penalties for GLMs
[ https://issues.apache.org/jira/browse/SPARK-25788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887858#comment-16887858 ] pralabhkumar commented on SPARK-25788: -- [~shahid] I can work on this; please let me know if that's OK. > Elastic net penalties for GLMs > --- > > Key: SPARK-25788 > URL: https://issues.apache.org/jira/browse/SPARK-25788 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.2 >Reporter: Christian Lorentzen >Priority: Major > > Currently, both LinearRegression and LogisticRegression support an elastic > net penality (setElasticNetParam), i.e. L1 and L2 penalties. This feature > could and should also be added to GeneralizedLinearRegression. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
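For reference, the elastic net penalty requested here combines L1 and L2 terms under a mixing parameter, in the same parameterization LinearRegression exposes via regParam and elasticNetParam. A sketch of the penalty computation, illustrative rather than Spark's actual optimizer code:

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """alpha * (rho * ||w||_1 + (1 - rho)/2 * ||w||_2^2).

    reg_param is alpha (overall strength); elastic_net_param is rho
    (the L1/L2 mix: 1.0 -> pure lasso, 0.0 -> pure ridge).
    """
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1
                        + 0.5 * (1.0 - elastic_net_param) * l2)

# w = [1, -2]: ||w||_1 = 3, ||w||_2^2 = 5
print(elastic_net_penalty([1.0, -2.0], reg_param=1.0, elastic_net_param=0.5))
# 2.75
```

Adding this term to GeneralizedLinearRegression's objective would bring it in line with the two existing estimators.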