[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626417#comment-17626417 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~hyukjin.kwon] [~dongjoon] Please let me know if this Jira is still relevant. I have already created the PR, and it has already been reviewed by a couple of PMC members. Please help get it reviewed if the Jira is relevant; otherwise I'll close the PR.

> Place spark.files, spark.jars and spark.files under the current working
> directory on the driver in K8S cluster mode
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33782
>                 URL: https://issues.apache.org/jira/browse/SPARK-33782
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current
> working directory. This does not appear to be the case in Kubernetes
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17615997#comment-17615997 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~hyukjin.kwon] Can you please help review the PR? It would be of great help.
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603426#comment-17603426 ]

pralabhkumar commented on SPARK-33782:
--------------------------------------

[~dongjoon] Please review the PR.
[jira] [Commented] (SPARK-39965) Skip PVC cleanup when driver doesn't own PVCs
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576793#comment-17576793 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Thanks for taking this. This is really helpful.

> Skip PVC cleanup when driver doesn't own PVCs
> ---------------------------------------------
>
>                 Key: SPARK-39965
>                 URL: https://issues.apache.org/jira/browse/SPARK-39965
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.3.0
>            Reporter: pralabhkumar
>            Assignee: pralabhkumar
>            Priority: Trivial
>
> Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288],
> functionality was added to delete PVCs when the Spark driver dies:
> [https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]
>
> However, there are cases where Spark on K8s doesn't use PVCs and uses
> hostPath for storage:
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>
> In those cases:
> * It requests PVC deletion, which is not required.
> * It also tries to delete when the driver doesn't own the PVCs (i.e.
> spark.kubernetes.driver.ownPersistentVolumeClaim is false).
> * Moreover, in clusters where the Spark user doesn't have access to list
> or delete PVCs, it throws an exception:
>
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> GET at:
> [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1].
> Message: Forbidden! Configured service account doesn't have access. Service
> account may have been revoked. persistentvolumeclaims is forbidden: User
> "system:serviceaccount:dpi-dev:spark" cannot list resource
> "persistentvolumeclaims" in API group "" in the namespace "<>".
>
> *Solution*
> There should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination,
> or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked,
> before calling delete on the PVCs. If the user has not set up PVs, or the
> driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576516#comment-17576516 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Thanks for replying. We don't see an issue other than the exception in the logs (mentioned above). However, please note that prior to this fix we were not getting any exception in the logs. In scenarios where PVs are not being used by Spark (as in our case), why should we get the above exception in the logs? Currently there is no way to skip
{code:java}
Utils.tryLogNonFatalError {
  kubernetesClient
    .persistentVolumeClaims()
    .withLabel(SPARK_APP_ID_LABEL, applicationId())
    .delete()
}
{code}
IMHO, there should be a configuration that checks whether the driver owns the PVCs or whether Spark uses PVs at all. For example:
{code:java}
if (conf.get(KUBERNETES_DRIVER_OWN_PVC)) {
  Utils.tryLogNonFatalError {
    kubernetesClient
      .persistentVolumeClaims()
      .withLabel(SPARK_APP_ID_LABEL, applicationId())
      .delete()
  }
}
{code}
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575554#comment-17575554 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

[~dongjoon] Please review.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

In those cases:
* It requests PVC deletion, which is not required.
* It also tries to delete when the driver doesn't own the PVCs (i.e. spark.kubernetes.driver.ownPersistentVolumeClaim is false).
* Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

*Solution*

There should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked, before calling delete on the PVCs. If the user has not set up PVs, or the driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

Even in those cases, it requests PVC deletion (which is not required). Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

Ideally there should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, or spark.kubernetes.driver.ownPersistentVolumeClaim should be checked, before calling delete on the PVCs. If the user has not set up PVs, or the driver doesn't own them, there is no need to call the API and delete the PVCs.
[jira] [Commented] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574494#comment-17574494 ]

pralabhkumar commented on SPARK-39965:
--------------------------------------

Gentle ping [dongjoon-hyun|https://github.com/dongjoon-hyun].
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Description:

Since Spark 3.2, as part of [https://github.com/apache/spark/pull/32288], functionality was added to delete PVCs when the Spark driver dies:

[https://github.com/apache/spark/blob/786a70e710369b195d7c117b33fe9983044014d6/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L144]

However, there are cases where Spark on K8s doesn't use PVCs and uses hostPath for storage:

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

Even in those cases, it requests PVC deletion (which is not required). Moreover, in clusters where the Spark user doesn't have access to list or delete PVCs, it throws an exception:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: [https://kubernetes.default.svc/api/v1/namespaces/<>/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1|https://kubernetes.default.svc/api/v1/namespaces/dpi-dev/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-332bd09284b3442f8a6a214fabcd6ab1]. Message: Forbidden! Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:dpi-dev:spark" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "<>".

Ideally there should be a configuration, spark.kubernetes.driver.pvc.deleteOnTermination, which should be checked before calling delete on the PVC. If the user has not set up PVs, there is no need to call the API and delete the PVC.
[jira] [Updated] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
[ https://issues.apache.org/jira/browse/SPARK-39965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pralabhkumar updated SPARK-39965:
---------------------------------
    Component/s: (was: Spark Core)
[jira] [Created] (SPARK-39965) Spark on K8s deletes PVCs even though they're not being used
pralabhkumar created SPARK-39965:
------------------------------------

             Summary: Spark on K8s deletes PVCs even though they're not being used
                 Key: SPARK-39965
                 URL: https://issues.apache.org/jira/browse/SPARK-39965
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Core
    Affects Versions: 3.3.0
            Reporter: pralabhkumar

In org.apache.spark.util, getConfiguredLocalDirs:
{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}
randomizeInPlace is not applied to conf.getenv("SPARK_LOCAL_DIRS").split(","), which is the branch used on K8s, so the shuffle locations are not randomized. IMHO this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
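The randomization requested above can be sketched with a Fisher-Yates in-place shuffle applied to the SPARK_LOCAL_DIRS branch, mirroring what the YARN branch already does (a local re-implementation for illustration only; `LocalDirsSketch` and `configuredLocalDirs` are hypothetical names, not Spark's code):

```scala
import scala.util.Random

// Local re-implementation of a Fisher-Yates in-place shuffle, used here to
// illustrate applying randomization to SPARK_LOCAL_DIRS the same way the
// YARN branch randomizes its local dirs.
object LocalDirsSketch {
  def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
    var i = arr.length - 1
    while (i > 0) {
      val j = rand.nextInt(i + 1) // pick from the not-yet-fixed prefix
      val tmp = arr(j); arr(j) = arr(i); arr(i) = tmp
      i -= 1
    }
    arr
  }

  // Mimics the K8s branch above, with the randomization step added.
  def configuredLocalDirs(sparkLocalDirs: String): Array[String] =
    randomizeInPlace(sparkLocalDirs.split(","))
}
```

The shuffle returns the same set of directories in a random order, so over many executors the shuffle files spread evenly across the configured disks instead of always favouring the first entry.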
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574314#comment-17574314 ] pralabhkumar commented on SPARK-33782: -- [~hyukjin.kwon] I would like to work on this. Please let me know if that's OK.

> Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> In YARN cluster mode, the passed files can be accessed from the current working directory. This does not appear to be the case in Kubernetes cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.
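The PEX workflow above hinges on a relative path (./myarchive.pex) resolving against the driver's working directory, which is where YARN localizes files passed via --files. A minimal sketch of that resolution, with an illustrative scratch directory standing in for the container's working directory:

```python
import os
import tempfile

# YARN cluster mode localizes --files into the container's working
# directory, so PYSPARK_PYTHON=./myarchive.pex resolves. Simulate that
# layout: create the file in a scratch dir and make it the cwd.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "myarchive.pex"), "w").close()
os.chdir(workdir)

# The relative path resolves only because the file sits in the cwd.
# This is the behavior the ticket asks K8s cluster mode to match.
print(os.path.exists("./myarchive.pex"))
```

If the file is not placed in the working directory (the K8s situation described here), the same relative path fails and the interpreter cannot be found.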
[jira] [Comment Edited] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978 ] pralabhkumar edited comment on SPARK-39375 at 7/27/22 2:51 PM: --- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further sub-tasks so that other people can contribute. was (Author: pralabhkumar): This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further tasks so that other people can contribute. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations.
Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/]. (2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. 
From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. > > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-39375) SPIP: Spark Connect - A client and server interface for Apache Spark.
[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571978#comment-17571978 ] pralabhkumar commented on SPARK-39375: -- This is a really good proposal and the need of the hour (specifically since Livy is dormant and Toree is also not very active). It will hugely help with Notebook-related use cases. Please let us know whether there is an ETA for the first version, or any plan to add further tasks so that other people can contribute. > SPIP: Spark Connect - A client and server interface for Apache Spark. > - > > Key: SPARK-39375 > URL: https://issues.apache.org/jira/browse/SPARK-39375 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Martin Grund >Priority: Major > Labels: SPIP > > Please find the full document for discussion here: [Spark Connect > SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj] > Below, we have just referenced the introduction. > h2. What are you trying to do? > While Spark is used extensively, it was designed nearly a decade ago, which, > in the age of serverless computing and ubiquitous programming language use, > poses a number of limitations. Most of the limitations stem from the tightly > coupled Spark driver architecture and fact that clusters are typically shared > across users: (1) {*}Lack of built-in remote connectivity{*}: the Spark > driver runs both the client application and scheduler, which results in a > heavyweight architecture that requires proximity to the cluster. There is no > built-in capability to remotely connect to a Spark cluster in languages > other than SQL and users therefore rely on external solutions such as the > inactive project [Apache Livy|https://livy.apache.org/].
(2) {*}Lack of rich > developer experience{*}: The current architecture and APIs do not cater for > interactive data exploration (as done with Notebooks), or allow for building > out rich developer experience common in modern code editors. (3) > {*}Stability{*}: with the current shared driver architecture, users causing > critical exceptions (e.g. OOM) bring the whole cluster down for all users. > (4) {*}Upgradability{*}: the current entangling of platform and client APIs > (e.g. first and third-party dependencies in the classpath) does not allow for > seamless upgrades between Spark versions (and with that, hinders new feature > adoption). > > We propose to overcome these challenges by building on the DataFrame API and > the underlying unresolved logical plans. The DataFrame API is widely used and > makes it very easy to iteratively express complex logic. We will introduce > {_}Spark Connect{_}, a remote option of the DataFrame API that separates the > client from the Spark server. With Spark Connect, Spark will become > decoupled, allowing for built-in remote connectivity: The decoupled client > SDK can be used to run interactive data exploration and connect to the server > for DataFrame operations. > > Spark Connect will benefit Spark developers in different ways: The decoupled > architecture will result in improved stability, as clients are separated from > the driver. From the Spark Connect client perspective, Spark will be (almost) > versionless, and thus enable seamless upgradability, as server APIs can > evolve without affecting the client API. The decoupled client-server > architecture can be leveraged to build close integrations with local > developer tooling. Finally, separating the client process from the Spark > server process will improve Spark’s overall security posture by avoiding the > tight coupling of the client inside the Spark runtime environment. 
> > Spark Connect will strengthen Spark’s position as the modern unified engine > for large-scale data analytics and expand applicability to use cases and > developers we could not reach with the current setup: Spark will become > ubiquitously usable as the DataFrame API can be used with (almost) any > programming language. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566430#comment-17566430 ] pralabhkumar commented on SPARK-39755: -- [~hyukjin.kwon] Please let me know whether the above suggestion is correct (we are facing an issue similar to the one described in SPARK-24992) when running Spark on K8s. I'll implement the same.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566164#comment-17566164 ] pralabhkumar commented on SPARK-39755: -- The same problem was seen on the YARN side, and the fix there was randomization (https://issues.apache.org/jira/browse/SPARK-24992). A similar problem is seen on K8s. Let me know if it's OK and I'll work on it.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

was:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(","). This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.

was:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data, as is done on the YARN side.
[jira] [Commented] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566142#comment-17566142 ] pralabhkumar commented on SPARK-39755: -- [~dongjoon] Gentle ping.

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Summary: SPARK_LOCAL_DIRS locations are not randomized in K8s (was: Spark-shuffle locations are not randomized in K8s)

> SPARK_LOCAL_DIRS locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Summary: Spark-shuffle locations are not randomized in K8s (was: Spark-shuffle locations are not randomized in K8s )

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Commented] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17565935#comment-17565935 ] pralabhkumar commented on SPARK-39755: -- [~hyukjin.kwon] Please comment on the same.

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Updated] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
[ https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39755: - Description:

In org.apache.spark.util getConfiguredLocalDirs

{code:java}
if (isRunningInYarnContainer(conf)) {
  // If we are in yarn mode, systems can have different disk layouts so we must set it
  // to what Yarn on this system said was available. Note this assumes that Yarn has
  // created the directories already, and that they are secured so that only the
  // user has access to them.
  randomizeInPlace(getYarnLocalDirs(conf).split(","))
} else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
  conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  conf.getenv("SPARK_LOCAL_DIRS").split(",")
}{code}

conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized. IMHO, this should be randomized so that all the directories have an equal chance of receiving data.

> Spark-shuffle locations are not randomized in K8s
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.3.0
> Reporter: pralabhkumar
> Priority: Minor
>
> In org.apache.spark.util getConfiguredLocalDirs
>
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we must set it
>   // to what Yarn on this system said was available. Note this assumes that Yarn has
>   // created the directories already, and that they are secured so that only the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> conf.getenv("SPARK_LOCAL_DIRS").split(",") is not passed through randomizeInPlace. This is the branch used in the case of K8s, so the shuffle locations are not randomized.
> IMHO, this should be randomized so that all the directories have an equal chance of receiving data.
[jira] [Created] (SPARK-39755) Spark-shuffle locations are not randomized in K8s
pralabhkumar created SPARK-39755: Summary: Spark-shuffle locations are not randomized in K8s Key: SPARK-39755 URL: https://issues.apache.org/jira/browse/SPARK-39755 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Core Affects Versions: 3.3.0 Reporter: pralabhkumar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557987#comment-17557987 ] pralabhkumar commented on SPARK-38292: -- [~hyukjin.kwon] Please let me know if this is OK; I'll do the same.

> Support `na_filter` for pyspark.pandas.read_csv
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Haejoon Lee
> Priority: Major
>
> pandas supports the `na_filter` parameter for the `read_csv` function. (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557537#comment-17557537 ] pralabhkumar commented on SPARK-38292: -- [~hyukjin.kwon] Thanks for the suggestion. After going through the code (DataFrameReader and the Univocity parser code), here is the analysis. Example: input A,,B with spark.read.option("nullValue", "A") results in null, null, B. The reason is: * the _parse_ method in org.apache.spark.sql.catalyst.csv.UnivocityParser * parses the string to A,A,B (settings.setNullValue in com.univocity.parsers.csv.CsvParser replaces the empty ,, field with A) * nullSafeDatum then checks if (datum == options.nullValue || datum == null) and returns null for both values, since datum == options.nullValue => null, null, B * I am not sure this is the expected output, since from the com.univocity.parsers.csv.CsvParser point of view the expected output should be "A,A,B" after setting .setNullValue("A") *Solution* For na_filter, what I am thinking is to extend the condition to if ((na_filter && datum == options.nullValue) || datum == null). Now if the input string is A,,B and the user has set na_filter to False, com.univocity.parsers.csv.CsvParser will return the value as-is, since setNullValue is (""). The (na_filter && datum == options.nullValue) condition then becomes false, and converter.apply(datum) leaves the value as-is. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
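The condition proposed in the comment above can be modeled in plain Python (hypothetical names; the real change would live in Scala, in UnivocityParser's nullSafeDatum):

```python
def null_safe_datum(datum, null_value, na_filter, convert):
    # Hypothetical model of the proposed check: treat a field as null only
    # when na_filter is enabled (or the parser itself returned null).
    if (na_filter and datum == null_value) or datum is None:
        return None
    return convert(datum)

# na_filter=True, nullValue="A": the parser output for "A,,B" is A,A,B, and
# both "A" fields then collapse to null -- the behavior described above.
print([null_safe_datum(d, "A", True, str) for d in ["A", "A", "B"]])  # [None, None, 'B']

# na_filter=False, nullValue="": values pass through unchanged.
print([null_safe_datum(d, "", False, str) for d in ["A", "", "B"]])   # ['A', '', 'B']
```

With the extra flag, disabling na_filter short-circuits the nullValue comparison, so the converter sees the raw field.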
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556851#comment-17556851 ] pralabhkumar commented on SPARK-38292: -- [~itholic] [~hyukjin.kwon] I would like to discuss the logic. The difference appears with na_filter = False when there are missing values, e.g. 22,,1980-09-26 33,,1980-09-26 pandas with na_filter = False reads the values as-is; however, Spark reads the missing value as null. This happens because of univocity-parsers, which reads a missing value as null. Proposed approach for na_filter: once the file is read in namespace.py via reader.csv(path), replace missing values with an empty string (df.fillna("")). We also need to change the datatype of the column to string (as pandas does). Please let me know if this is the correct direction and I'll create a PR. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
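The behavioral difference discussed in the comment above can be illustrated with a small stdlib-only sketch (not PySpark; `read_csv_model` is a hypothetical stand-in): with na_filter enabled, empty fields become null, while with na_filter disabled they stay as empty strings, which is what the proposed df.fillna("") with string dtype would emulate.

```python
import csv
import io

def read_csv_model(text, na_filter=True):
    # Toy model of the discussion above: na_filter=True maps empty fields
    # to None (what Spark/univocity-parsers effectively do today), while
    # na_filter=False keeps every field as the exact string that was read.
    rows = list(csv.reader(io.StringIO(text)))
    if na_filter:
        return [[v if v != "" else None for v in row] for row in rows]
    return rows

data = "22,,1980-09-26\n33,,1980-09-26\n"
print(read_csv_model(data))                   # middle column becomes None
print(read_csv_model(data, na_filter=False))  # middle column stays ''
```

Note that in the na_filter=False branch every column is necessarily string-typed, matching the dtype change mentioned above.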
[jira] [Commented] (SPARK-39399) proxy-user support not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556302#comment-17556302 ] pralabhkumar commented on SPARK-39399: -- Gentle ping [~hyukjin.kwon] [~dongjoon] > proxy-user support not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user on the > spark-submit command to the childArgs. The actual functionality of > authentication using the proxy user is not working in case of cluster deploy > mode for Spark on K8s. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. > Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ > --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \--conf > spark.kubernetes.file.upload.path=hdfs:///tmp \--conf > spark.kubernetes.container.image.pullPolicy=Always \ > --conf > spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/log4j/log4j.properties > \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > 
+ mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong >
[jira] [Commented] (SPARK-39399) proxy-user support not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554904#comment-17554904 ] pralabhkumar commented on SPARK-39399: -- ping [~hyukjin.kwon], please help us with this, or point us to someone who can take it forward. > proxy-user support not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user on the > spark-submit command to the childArgs. The actual functionality of > authentication using the proxy user is not working in case of cluster deploy > mode for Spark on K8s. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. 
> Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ > --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \--conf > spark.kubernetes.file.upload.path=hdfs:///tmp \--conf > spark.kubernetes.container.image.pullPolicy=Always \ > --conf > spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/log4j/log4j.properties > \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > 
org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private >
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554551#comment-17554551 ] pralabhkumar commented on SPARK-38292: -- [~itholic] I would like to work on this . > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas support `na_filter` parameter for `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
pralabhkumar created SPARK-39179: Summary: Improve the test coverage for pyspark/shuffle.py Key: SPARK-39179 URL: https://issues.apache.org/jira/browse/SPARK-39179 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: pralabhkumar Fix For: 3.4.0 Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
[ https://issues.apache.org/jira/browse/SPARK-39179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17536555#comment-17536555 ] pralabhkumar commented on SPARK-39179: -- I am working on this . > Improve the test coverage for pyspark/shuffle.py > > > Key: SPARK-39179 > URL: https://issues.apache.org/jira/browse/SPARK-39179 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of shuffle.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39179) Improve the test coverage for pyspark/shuffle.py
[ https://issues.apache.org/jira/browse/SPARK-39179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39179: - Description: Improve the test coverage of shuffle.py (was: Improve the test coverage of taskcontext.py) > Improve the test coverage for pyspark/shuffle.py > > > Key: SPARK-39179 > URL: https://issues.apache.org/jira/browse/SPARK-39179 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of shuffle.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533764#comment-17533764 ] pralabhkumar commented on SPARK-39102: -- Sure I'll work on this > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think its better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533251#comment-17533251 ] pralabhkumar commented on SPARK-39097: -- [~hyukjin.kwon] While analyzing the unit tests for TaskContext in test_taskcontext.py, I found that most of the test cases are already there. However, they do not show up in the coverage report, probably because the methods are called inside tasks (rdd.map(lambda x: TaskContext.get().stageId())). So, for example, the report at [https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/taskcontext.py] says stageId is not covered by any test, yet test_stage_id does test the stageId method: stage1 = rdd.map(lambda x: TaskContext.get().stageId()).take(1)[0] If I change the code as below and bring the TaskContext back to the driver, then the coverage report marks stageId as covered: rdd.map(lambda x: TaskContext.get()).take(1)[0].stageId() I can change the tests to this form to get the coverage; please let me know if this is correct. > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-39097: - Description: Improve the test coverage of taskcontext.py (was: Improve the test coverage of rddsampler.py) > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of taskcontext.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
[ https://issues.apache.org/jira/browse/SPARK-39102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532443#comment-17532443 ] pralabhkumar commented on SPARK-39102: -- ping [~hyukjin.kwon] > Replace the usage of guava's Files.createTempDir() with > java.nio.file.Files.createTempDirectory() > -- > > Key: SPARK-39102 > URL: https://issues.apache.org/jira/browse/SPARK-39102 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.4.0 >Reporter: pralabhkumar >Priority: Minor > > Hi > There are several classes where Spark is using guava's Files.createTempDir() > which have vulnerabilities. I think its better to move to > java.nio.file.Files.createTempDirectory() for those classes. > Classes > Java8RDDAPISuite > JavaAPISuite.java > RPackageUtilsSuite > StreamTestHelper > TestShuffleDataContext > ExternalBlockHandlerSuite > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39102) Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory()
pralabhkumar created SPARK-39102: Summary: Replace the usage of guava's Files.createTempDir() with java.nio.file.Files.createTempDirectory() Key: SPARK-39102 URL: https://issues.apache.org/jira/browse/SPARK-39102 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.1, 3.2.0, 3.4.0 Reporter: pralabhkumar Hi There are several classes where Spark is using guava's Files.createTempDir() which have vulnerabilities. I think its better to move to java.nio.file.Files.createTempDirectory() for those classes. Classes Java8RDDAPISuite JavaAPISuite.java RPackageUtilsSuite StreamTestHelper TestShuffleDataContext ExternalBlockHandlerSuite -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
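For context on the vulnerability motivating the issue above (CVE-2020-8908: Guava's Files.createTempDir() creates the directory with permissions that may let other local users read it), java.nio.file.Files.createTempDirectory() creates the directory with owner-only access on POSIX systems. The same property can be demonstrated with Python's stdlib analogue, tempfile.mkdtemp, which likewise restricts the new directory to the creating user (a cross-language illustration, not Spark code):

```python
import os
import stat
import tempfile

# tempfile.mkdtemp, like java.nio.file.Files.createTempDirectory(), creates
# the directory readable, writable and searchable only by the creating
# user -- the guarantee Guava's deprecated Files.createTempDir() lacks.
path = tempfile.mkdtemp(prefix="spark-test-")
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o700 on POSIX systems
os.rmdir(path)
```

This is why swapping the factory method in the listed test suites closes the information-disclosure window without changing test behavior.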
[jira] [Commented] (SPARK-38262) Upgrade Google guava to version 30.0-jre
[ https://issues.apache.org/jira/browse/SPARK-38262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531526#comment-17531526 ] pralabhkumar commented on SPARK-38262: -- [~bjornjorgensen] Quick question: as part of this PR, Guava was not upgraded to version 30.0 because of issues on the Hive and Hadoop side. * So is there any plan to fix [CVE-2020-8908|https://nvd.nist.gov/vuln/detail/CVE-2020-8908]? * Does https://issues.apache.org/jira/browse/HADOOP-18036 affect any decision on the Spark side? > Upgrade Google guava to version 30.0-jre > > > Key: SPARK-38262 > URL: https://issues.apache.org/jira/browse/SPARK-38262 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > This is duplicated many times like in > [SPARK-32502|https://issues.apache.org/jira/browse/SPARK-32502] > Apache Spark is using com.google.guava:guava version 14.0.1 which has two > security issues. > [CVE-2018-10237|https://nvd.nist.gov/vuln/detail/CVE-2018-10237] > [CVE-2020-8908|https://nvd.nist.gov/vuln/detail/CVE-2020-8908] > We should upgrade to [version > 30.0|https://mvnrepository.com/artifact/com.google.guava/guava/30.0-jre] > I will add some links to what I have found about this issue > [HIVE-25617:fix bug introduced by > CVE-2020-8908|https://github.com/apache/hive/pull/2725] > [Upgrade Guava to 27|https://github.com/apache/druid/pull/10683] > [HIVE-21961: Upgrade Hadoop to 3.1.4, Guava to 27.0-jre and Jetty to > 9.4.20.v20190813|https://github.com/apache/hive/pull/1821] > [Shade Guava manually|https://github.com/apache/druid/issues/6942] > [[DISCUSS] Hadoop 3, dropping support for Hadoop > 2.x|https://lists.apache.org/thread/zmc389trnkh6x444so8mdb2h0x0noqq4] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
pralabhkumar created SPARK-39097: Summary: Improve the test coverage for pyspark/taskcontext.py Key: SPARK-39097 URL: https://issues.apache.org/jira/browse/SPARK-39097 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: pralabhkumar Fix For: 3.4.0 Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39097) Improve the test coverage for pyspark/taskcontext.py
[ https://issues.apache.org/jira/browse/SPARK-39097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17531481#comment-17531481 ] pralabhkumar commented on SPARK-39097: -- I am working on this. > Improve the test coverage for pyspark/taskcontext.py > - > > Key: SPARK-39097 > URL: https://issues.apache.org/jira/browse/SPARK-39097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530083#comment-17530083 ] pralabhkumar commented on SPARK-25355: -- [~hyukjin.kwon] Can you please help us, or redirect us to someone who can, with the above two comments. > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39029) Improve the test coverage for pyspark/broadcast.py
[ https://issues.apache.org/jira/browse/SPARK-39029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528269#comment-17528269 ] pralabhkumar commented on SPARK-39029: -- I am working on this . > Improve the test coverage for pyspark/broadcast.py > -- > > Key: SPARK-39029 > URL: https://issues.apache.org/jira/browse/SPARK-39029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39029) Improve the test coverage for pyspark/broadcast.py
pralabhkumar created SPARK-39029: Summary: Improve the test coverage for pyspark/broadcast.py Key: SPARK-39029 URL: https://issues.apache.org/jira/browse/SPARK-39029 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38492) Improve the test coverage for PySpark
[ https://issues.apache.org/jira/browse/SPARK-38492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17523293#comment-17523293 ] pralabhkumar commented on SPARK-38492: -- on it . Thx > Improve the test coverage for PySpark > - > > Key: SPARK-38492 > URL: https://issues.apache.org/jira/browse/SPARK-38492 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module
pralabhkumar created SPARK-38874: Summary: CLONE - Improve the test coverage for pyspark/ml module Key: SPARK-38874 URL: https://issues.apache.org/jira/browse/SPARK-38874 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, ml module has 90% of test coverage. We could improve the test coverage by adding the missing tests for ml module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar closed SPARK-38871. This issue was created by mistake, hence closing it. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar resolved SPARK-38871. -- Resolution: Invalid > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521135#comment-17521135 ] pralabhkumar commented on SPARK-38879: -- I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521135#comment-17521135 ] pralabhkumar edited comment on SPARK-38879 at 4/12/22 1:07 PM: --- Please allow me to work on this was (Author: pralabhkumar): I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38879: - Description: Improve the test coverage of rddsampler.py (was: Improve the test coverage of statcounter.py ) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
pralabhkumar created SPARK-38879: Summary: Improve the test coverage for pyspark/rddsampler.py Key: SPARK-38879 URL: https://issues.apache.org/jira/browse/SPARK-38879 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521134#comment-17521134 ] pralabhkumar commented on SPARK-38871: -- Please close this one; it was cloned by mistake. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since there are still 9% missing tests, so I think it would be great to > improve our test coverage. > Of course we might not target to 100%, but as much as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py
pralabhkumar created SPARK-38876: Summary: CLONE - Improve the test coverage for pyspark/*.py Key: SPARK-38876 URL: https://issues.apache.org/jira/browse/SPARK-38876 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, there are several Python scripts under pyspark/ directory. (e.g. rdd.py, util.py, serializers.py, ...) We could improve the test coverage by adding the missing tests for these scripts. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module
pralabhkumar created SPARK-38872: Summary: CLONE - Improve the test coverage for pyspark/pandas module Key: SPARK-38872 URL: https://issues.apache.org/jira/browse/SPARK-38872 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, pandas module (pandas API on Spark) has 94% of test coverage. We could improve the test coverage by adding the missing tests for pandas module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py
pralabhkumar created SPARK-38878: Summary: CLONE - Improve the test coverage for pyspark/statcounter.py Key: SPARK-38878 URL: https://issues.apache.org/jira/browse/SPARK-38878 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module
pralabhkumar created SPARK-38875: Summary: CLONE - Improve the test coverage for pyspark/sql module Key: SPARK-38875 URL: https://issues.apache.org/jira/browse/SPARK-38875 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, sql module has 90% of test coverage. We could improve the test coverage by adding the missing tests for sql module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py
pralabhkumar created SPARK-38877: Summary: CLONE - Improve the test coverage for pyspark/find_spark_home.py Key: SPARK-38877 URL: https://issues.apache.org/jira/browse/SPARK-38877 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
pralabhkumar created SPARK-38871: Summary: Improve the test coverage for PySpark/rddsampler.py Key: SPARK-38871 URL: https://issues.apache.org/jira/browse/SPARK-38871 Project: Spark Issue Type: Umbrella Components: PySpark, Tests Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, PySpark test coverage is around 91% according to codecov report: [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] Since there are still 9% missing tests, so I think it would be great to improve our test coverage. Of course we might not target to 100%, but as much as possible, to the level that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module
pralabhkumar created SPARK-38873: Summary: CLONE - Improve the test coverage for pyspark/mllib module Key: SPARK-38873 URL: https://issues.apache.org/jira/browse/SPARK-38873 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, mllib module has 88% of test coverage. We could improve the test coverage by adding the missing tests for mllib module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17520374#comment-17520374 ] pralabhkumar commented on SPARK-38854: -- [~gurwls223] I would like to work on this, so could you please set the Assignee back to "unassigned" for now? (It was filled in automatically when I cloned your task.) Thanks. > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Affects Version/s: 3.3.0 (was: 3.4.0) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Description: Improve the test coverage of statcounter.py (was: We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py)) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38854: - Fix Version/s: (was: 3.4.0) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > > We should test when the environment variables are not set > (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
pralabhkumar created SPARK-38854: Summary: Improve the test coverage for pyspark/statcounter.py Key: SPARK-38854 URL: https://issues.apache.org/jira/browse/SPARK-38854 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38637) pyspark.pandas.config.OptionError: "No such option: 'mode.chained_assignment'
[ https://issues.apache.org/jira/browse/SPARK-38637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517393#comment-17517393 ] pralabhkumar commented on SPARK-38637: -- [~itholic] can I work on this > pyspark.pandas.config.OptionError: "No such option: 'mode.chained_assignment' > - > > Key: SPARK-38637 > URL: https://issues.apache.org/jira/browse/SPARK-38637 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Prakhar Sandhu >Priority: Major > > I replaced import pandas as pd to import pyspark.pandas as pd in my code. > {code:java} > pd.set_option("mode.chained_assignment", None) {code} > The above command was working with pandas but this option is not available in > pyspark.pandas . > {code:java} > pyspark.pandas.config.OptionError: "No such option: > 'mode.chained_assignment'. Available options are [display.max_rows, > compute.max_rows, compute.shortcut_limit, compute.ops_on_diff_frames, > compute.default_index_type, compute.ordered_head, > plotting.max_rows, plotting.sample_ratio, plotting.backend]" {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
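As a user-side workaround until the option is supported, the set_option call can be guarded so code shared between pandas and pyspark.pandas still runs. A minimal sketch, assuming only that unsupported options raise an exception; the helper and the fake backend below are illustrative, not pyspark.pandas API:

```python
def try_set_option(set_option, key, value):
    # Apply the option if the backend recognises it; swallow the error
    # otherwise (pyspark.pandas raises OptionError for unknown keys).
    try:
        set_option(key, value)
        return True
    except Exception:
        return False

# Simulated backend that only knows one option, for demonstration.
def fake_set_option(key, value):
    if key != "display.max_rows":
        raise KeyError(f"No such option: {key!r}")

print(try_set_option(fake_set_option, "mode.chained_assignment", None))  # → False
print(try_set_option(fake_set_option, "display.max_rows", 10))           # → True
```

In real code, `set_option` would be `pyspark.pandas.set_option` or `pandas.set_option`; the wrapper lets the same script run against either backend.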
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478618#comment-17478618 ] pralabhkumar commented on SPARK-24432: -- [~dongjoon] One quick question: you mentioned that "The K8s dynamic allocation with storage migration between executors is already in `master` branch for Apache Spark 3.1.0." Could you please point me to the PR that implements this? It would be really helpful. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472056#comment-17472056 ] pralabhkumar commented on SPARK-37491: -- Let's take the example of pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas"). pser.asof([5, 20]) will give the output [NaN, 1], while ps.from_pandas(pser).asof([5, 20]) will give the output [NaN, 2]. *Explanation* This is the DataFrame created after applying the condition F.when(index_scol <= SF.lit(index).cast(index_type), ...), without applying the max aggregation:

+-----+------+-----------------+
|col_5|col_20|__index_level_0__|
+-----+------+-----------------+
|null |2.0   |10               |
|null |1.0   |20               |
|null |null  |30               |
|null |null  |40               |
+-----+------+-----------------+

Since we are taking max, the output comes out as 2. What we actually need is the last non-null value of each column, in increasing order of __index_level_0__. To implement that logic, I am planning to build the DataFrame below from the one above, using explode, partitioning and row_number over __index_level_0__:

+-----------------+----------+-----+----------+
|__index_level_0__|identifier|value|row_number|
+-----------------+----------+-----+----------+
|40               |col_5     |null |1         |
|30               |col_5     |null |2         |
|20               |col_5     |null |3         |
|10               |col_5     |null |4         |
|40               |col_20    |2    |1         |
|30               |col_20    |1    |2         |
|20               |col_20    |null |3         |
|10               |col_20    |null |4         |
+-----------------+----------+-----+----------+

Then filter on row_number = 1. There are other things to take care of, but the majority of the logic is this. Please let me know whether this is the correct direction (it is actually passing all the asof test cases, including the case described in this Jira). [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
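The pandas side of the example in this comment can be checked directly, with no Spark involved; a minimal sketch of the expected behaviour:

```python
import numpy as np
import pandas as pd

# Reproduce the example above: pandas' asof returns the last non-NaN
# value at or before each requested index position.
pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
result = pser.asof([5, 20])

# No index <= 5 exists, so the first answer is NaN; for 20 the last
# non-NaN value at or before index 20 is 1.0 — not the column max, 2.0,
# which is what the F.max-based implementation returns.
print(result.tolist())  # → [nan, 1.0]
```

This is the reference behaviour the proposed explode/row_number rewrite has to match.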
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17471314#comment-17471314 ] pralabhkumar commented on SPARK-37491: -- I am working on it . > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469471#comment-17469471 ] pralabhkumar commented on SPARK-37491: -- I would like to work on this . Basically the problem is in series.py , finding Max . cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37491) Fix Series.asof when values of the series is not sorted
[ https://issues.apache.org/jira/browse/SPARK-37491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469471#comment-17469471 ] pralabhkumar edited comment on SPARK-37491 at 1/5/22, 6:27 PM: --- I would like to work on this . Basically the problem is in series.py . We should not find max here. cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] was (Author: pralabhkumar): I would like to work on this . Basically the problem is in series.py , finding Max . cond = [ F.max(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column)) for index in where ] cc [~hyukjin.kwon] [~itholic] > Fix Series.asof when values of the series is not sorted > --- > > Key: SPARK-37491 > URL: https://issues.apache.org/jira/browse/SPARK-37491 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > https://github.com/apache/spark/pull/34737#discussion_r758223279 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446305#comment-17446305 ] pralabhkumar commented on SPARK-37188: -- [~hyukjin.kwon] Working on it . Thx > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445978#comment-17445978 ] pralabhkumar edited comment on SPARK-37188 at 11/18/21, 2:56 PM: - IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout(title=kwargs.get("title")) , title issue is not happening(provided user passes title). I think , we should pass all the arguments provided by user and expected by go.Layout. Similarly for go.Bar [~yikunkero] [~hyukjin.kwon] . Please let me know , if I my understanding is correct , I can create a PR for it . was (Author: pralabhkumar): IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout(title=kwargs.get("title")) , title issue is not happening(provided user passes title). I think , we should pass all the arguments provided by user and expected by go.Layout. Similarly for go.Bar [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
[ https://issues.apache.org/jira/browse/SPARK-37188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445978#comment-17445978 ] pralabhkumar commented on SPARK-37188: -- IMHO, the issue is in the plot_histogram method of pyspark.pandas.plot.plotly: the keyword arguments (kwargs) passed by the user are not forwarded to plotly when creating the figure. So this issue is not limited to title; it can happen with other arguments such as "activeshape" and "font". Once I pass the user argument through go.Layout(title=kwargs.get("title")), the title issue no longer occurs (provided the user passes a title). I think we should forward all user-provided arguments that go.Layout expects, and similarly for go.Bar. [~yikunkero] Please let me know if my understanding is correct; if so, I can create a PR for it. > pyspark.pandas histogram accepts the title option but does not add a title to > the plot > -- > > Key: SPARK-37188 > URL: https://issues.apache.org/jira/browse/SPARK-37188 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
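The fix sketched in this comment amounts to forwarding user-supplied kwargs into the layout object instead of dropping them. A minimal, library-agnostic sketch of that forwarding pattern; `build_layout` and `LAYOUT_KEYS` are illustrative names, not the actual pyspark.pandas or plotly code:

```python
# Options the (hypothetical) layout object understands; in the real fix
# this set would be the keyword arguments accepted by go.Layout.
LAYOUT_KEYS = {"title", "font"}

def build_layout(**kwargs):
    # Forward only the layout-level options; everything else (e.g. bins)
    # is consumed elsewhere by the plotting code.
    return {k: v for k, v in kwargs.items() if k in LAYOUT_KEYS}

layout = build_layout(bins=20, title="US Counties -- FullVaxPer100")
print(layout)  # → {'title': 'US Counties -- FullVaxPer100'}
```

The same filtering would apply to trace-level options (the go.Bar case mentioned above).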
[jira] [Comment Edited] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445960#comment-17445960 ] pralabhkumar edited comment on SPARK-37189 at 11/18/21, 2:52 PM: - was (Author: pralabhkumar): IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout , title issue is not happening(provided user passes title). [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189 ] pralabhkumar deleted comment on SPARK-37189: -- was (Author: pralabhkumar): > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445960#comment-17445960 ] pralabhkumar commented on SPARK-37189: -- IMHO , the issue is in pyspark.pandas.plot plotly.py plot_histogram method . arguments (kwargs) , passed by user are not passed to plotly when creating the figure. Therefore this issue is not just with title but can happen with other arguments like "activeshape" , "font". Once I passes the user argument to go.Layout , title issue is not happening(provided user passes title). [~yikunkero] . Please let me know , if I my understanding is correct , I can create a PR for it . > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range. All the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but line above should work also. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444256#comment-17444256 ] pralabhkumar commented on SPARK-37181: -- [~yikunkero] [~chconnell] . I'll work on this and will create a PR > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] , [~yikunkero] Let me know , if my understanding is correct . If yes, then I can work on this h1. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] , [~yikunkero] Let me know , if I can work on this h1. > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However, from the user's point of view, if the user specifies latin-1 in pyspark.pandas, then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon] [~yikunkero] Let me know if I can work on this. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon]
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443764#comment-17443764 ] pralabhkumar commented on SPARK-37181: -- However, from the user's point of view, if the user specifies latin-1 in pyspark.pandas, then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon]
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443720#comment-17443720 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 10:32 AM: -- from pyspark import pandas as ps. The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("<>", encoding='ISO-8859-1') [~chconnell] was (Author: pralabhkumar): from pyspark import pandas as ps latin-1 encoding is same as ISO-8859-1. You can mentioned the same . ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding ='ISO-8859-1')
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443720#comment-17443720 ] pralabhkumar commented on SPARK-37181: -- from pyspark import pandas as ps. The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding='ISO-8859-1')
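The equivalence claimed above can be checked with Python's standard codecs registry, which already canonicalizes encoding aliases. A minimal sketch of how pyspark.pandas could normalize a user-supplied name before handing it to the JVM reader; the `normalize_encoding` helper is hypothetical, not part of Spark:

```python
import codecs

def normalize_encoding(name):
    """Map an encoding alias (e.g. 'latin-1') to its canonical codec name."""
    return codecs.lookup(name).name

# 'latin-1' and 'ISO-8859-1' resolve to the same canonical codec,
# while Windows-1252 is a genuinely different (if similar) encoding.
print(normalize_encoding("latin-1"))       # iso8859-1
print(normalize_encoding("ISO-8859-1"))    # iso8859-1
print(normalize_encoding("windows-1252"))  # cp1252
```

Normalizing through the codec registry would accept every standard alias, not just latin-1, without maintaining a hand-written mapping.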
[jira] [Commented] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432812#comment-17432812 ] pralabhkumar commented on SPARK-30537: -- Thanks [~hyukjin.kwon], working on this. > toPandas gets wrong dtypes when applied on empty DF when Arrow enabled > -- > > Key: SPARK-30537 > URL: https://issues.apache.org/jira/browse/SPARK-30537 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Same issue with SPARK-29188 persists when Arrow optimization is enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30537) toPandas gets wrong dtypes when applied on empty DF when Arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-30537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432805#comment-17432805 ] pralabhkumar commented on SPARK-30537: -- [~hyukjin.kwon] I would like to work on this; please let me know if I can take it.
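The wrong dtypes arise because, for an empty DataFrame, pandas has no data to infer column types from, so they must instead be derived from the Spark schema. A rough illustration of that idea, using an assumed and much-simplified type mapping rather than Spark's actual conversion table:

```python
# Hypothetical Spark-SQL-type -> pandas dtype mapping (illustrative subset).
_SPARK_TO_PANDAS = {
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "string": "object",
}

def empty_frame_dtypes(schema):
    """Derive pandas dtypes for an empty result from a Spark schema.

    schema: list of (column_name, spark_type_name) pairs.
    Unknown types fall back to 'object', mirroring pandas' default.
    """
    return {name: _SPARK_TO_PANDAS.get(spark_type, "object")
            for name, spark_type in schema}

print(empty_frame_dtypes([("id", "bigint"), ("flag", "boolean")]))
# {'id': 'int64', 'flag': 'bool'}
```

Applying such a mapping in the empty-partition branch of toPandas would keep dtypes consistent between the Arrow and non-Arrow paths.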
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17431140#comment-17431140 ] pralabhkumar commented on SPARK-32285: -- [~bryanc] Please review the PR. > Add PySpark support for nested timestamps with arrow > > > Key: SPARK-32285 > URL: https://issues.apache.org/jira/browse/SPARK-32285 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently with arrow optimizations, there is post-processing done in pandas > for timestamp columns to localize timezone. This is not done for nested > columns with timestamps such as StructType or ArrayType. > Adding support for this is needed for Apache Arrow 1.0.0 upgrade due to use > of structs with timestamps in groupedby key over a window. > As a simple first step, timestamps with 1 level nesting could be done first > and this will satisfy the immediate need. > NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing > with pyarrow.array.cast, which could be easier done than in pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430099#comment-17430099 ] pralabhkumar commented on SPARK-32161: -- [~hyukjin.kwon] Since the PR has been merged, please update the status of the Jira and assign it to me. > Hide JVM traceback for SparkUpgradeException > > > Key: SPARK-32161 > URL: https://issues.apache.org/jira/browse/SPARK-32161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We added {{SparkUpgradeException}} which the JVM traceback is pretty useless. > See also https://github.com/apache/spark/pull/28736/files#r449184881 > It should better also whitelist and hide JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
[ https://issues.apache.org/jira/browse/SPARK-32161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428185#comment-17428185 ] pralabhkumar commented on SPARK-32161: -- [~hyukjin.kwon] Please let me know if I can work on this. IMHO it is a change in the convert_exception method (sql/utils.py) to handle SparkUpgradeException.
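The whitelisting idea described here, suppressing the JVM traceback for known exception classes whose Python-side message is self-explanatory, can be sketched in plain Python. The class set and helper name below are illustrative and not the actual signature of convert_exception in PySpark:

```python
# Exception classes for which the JVM traceback adds no value (assumed list).
HIDE_JVM_TRACEBACK = {
    "org.apache.spark.SparkUpgradeException",
}

def render_exception(java_class, message, jvm_traceback):
    """Return the error text shown to the Python user.

    Whitelisted classes get only the message; everything else keeps
    the full JVM traceback for debugging.
    """
    if java_class in HIDE_JVM_TRACEBACK:
        return message  # traceback hidden: the message is enough
    return message + "\n" + jvm_traceback

out = render_exception("org.apache.spark.SparkUpgradeException",
                       "upgrade error", "at org.apache.spark...")
print(out)  # upgrade error
```

The real fix would thread this decision through PySpark's Py4J exception conversion so users of SparkUpgradeException see only the actionable message.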
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414232#comment-17414232 ] pralabhkumar commented on SPARK-32285: -- [~hyukjin.kwon] I just added the initial version for converting a Spark DataFrame to pandas for ArrayType(TimestampType) via Arrow. It is not the complete PR; I would like your early opinion. Please let me know if it is going in the right direction, and I'll complete the rest of the work.
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413679#comment-17413679 ] pralabhkumar commented on SPARK-32285: -- Thanks, I will share the PR shortly.
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413482#comment-17413482 ] pralabhkumar commented on SPARK-32285: -- [~hyukjin.kwon] [~emkornfi...@gmail.com] I would like to work on this; I have most of the logic ready for ArrayType(TimestampType). Please let me know if I can take it.
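Conceptually, the missing post-processing for nested timestamp columns is a recursive timezone localization: walk into arrays and structs and attach the session timezone to naive timestamps. A pure-Python sketch of that recursion, independent of Arrow and pandas, with all names assumed:

```python
from datetime import datetime, timezone

def localize_nested(value, tz=timezone.utc):
    """Recursively attach tz to naive datetimes inside lists/dicts.

    Lists model ArrayType, dicts model StructType; tz-aware datetimes
    and non-timestamp leaves pass through unchanged.
    """
    if isinstance(value, list):
        return [localize_nested(v, tz) for v in value]
    if isinstance(value, dict):
        return {k: localize_nested(v, tz) for k, v in value.items()}
    if isinstance(value, datetime) and value.tzinfo is None:
        return value.replace(tzinfo=tz)
    return value

row = {"events": [datetime(2021, 1, 1, 12, 0)]}
print(localize_nested(row)["events"][0].tzinfo)  # UTC
```

As the issue notes, with Arrow 1.0.0 the same effect might be achievable vectorized via pyarrow.Array.cast rather than by per-value recursion in pandas.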
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411859#comment-17411859 ] pralabhkumar commented on SPARK-36622: -- [~angerszhuuu] [~tgraves] Please review the PR > spark.history.kerberos.principal doesn't take value _HOST > - > > Key: SPARK-36622 > URL: https://issues.apache.org/jira/browse/SPARK-36622 > Project: Spark > Issue Type: Improvement > Components: Deploy, Security, Spark Core >Affects Versions: 3.0.1, 3.1.2, 3.2.0 >Reporter: pralabhkumar >Priority: Minor > > spark.history.kerberos.principal doesn't understand value _HOST. > It says failure to login for principal : spark/_HOST@realm . > It will be helpful to take _HOST value via config file and change it with > current hostname(similar to what Hive does) . This will also help to run SHS > on multiple machines without hardcoding principal hostname. > .spark.history.kerberos.principal > > It require minor change in HistoryServer.scala in initSecurity method . > > Please let me know , if this request make sense , I'll create the PR . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410578#comment-17410578 ] pralabhkumar commented on SPARK-36622: -- [~angerszhuuu] [~tgraves] [~hyukjin.kwon] I have created the PR. Please review.
[jira] [Updated] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-36622: - Affects Version/s: 3.2.0
[jira] [Commented] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
[ https://issues.apache.org/jira/browse/SPARK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409617#comment-17409617 ] pralabhkumar commented on SPARK-36622: -- [~thejdeep] It's better to support _HOST; it has been common practice for HiveServer and similar projects. [~tgraves] Agreed. Please let me know if you are OK with this; I can create the PR.
[jira] [Created] (SPARK-36622) spark.history.kerberos.principal doesn't take value _HOST
pralabhkumar created SPARK-36622: Summary: spark.history.kerberos.principal doesn't take value _HOST Key: SPARK-36622 URL: https://issues.apache.org/jira/browse/SPARK-36622 Project: Spark Issue Type: Improvement Components: Deploy, Security, Spark Core Affects Versions: 3.1.2, 3.0.1 Reporter: pralabhkumar  spark.history.kerberos.principal doesn't understand the value _HOST; it fails with "failure to login for principal: spark/_HOST@realm". It would be helpful to take the _HOST value from the config file and replace it with the current hostname (similar to what Hive does). This would also make it possible to run the SHS on multiple machines without hardcoding the principal hostname in spark.history.kerberos.principal. It requires a minor change in the initSecurity method of HistoryServer.scala. Please let me know if this request makes sense; I'll create the PR.
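The proposed substitution mirrors what Hadoop's SecurityUtil.getServerPrincipal does: replace the _HOST token in the configured principal with the machine's canonical hostname at startup. A small Python sketch of the logic (the actual change would be in Scala, in HistoryServer's initSecurity; the helper name is hypothetical):

```python
import socket

def resolve_principal(principal, hostname=None):
    """Replace the _HOST token in a Kerberos principal with the real hostname.

    Defaults to this machine's FQDN, lowercased per Kerberos convention.
    """
    hostname = hostname or socket.getfqdn()
    return principal.replace("_HOST", hostname.lower())

print(resolve_principal("spark/_HOST@EXAMPLE.COM", hostname="shs1.example.com"))
# spark/shs1.example.com@EXAMPLE.COM
```

With this, the same spark.history.kerberos.principal value works unchanged on every machine hosting the History Server.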
[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong
[ https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236743#comment-17236743 ] pralabhkumar commented on SPARK-32924: -- [~rakson] [~hyukjin.kwon] Can I open a PR for this? > Web UI sort on duration is wrong > > > Key: SPARK-32924 > URL: https://issues.apache.org/jira/browse/SPARK-32924 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.6 >Reporter: t oo >Priority: Major > Attachments: ui_sort.png > > > See attachment, 9 s(econds) is showing as larger than 8.1min -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
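The symptom in this issue is the classic pitfall of sorting formatted duration strings lexicographically instead of sorting the underlying millisecond values. A small demonstration (the formatting rules are assumed, not Spark's exact ones):

```python
def format_duration(ms):
    """Render milliseconds the way a UI might: '9 s' or '8.1 min'."""
    if ms < 60_000:
        return f"{ms / 1000:.0f} s"
    return f"{ms / 60_000:.1f} min"

durations_ms = [486_000, 9_000]  # 8.1 min and 9 s
by_string = sorted(format_duration(d) for d in durations_ms)
by_value = [format_duration(d) for d in sorted(durations_ms)]
print(by_string)  # ['8.1 min', '9 s']  -- wrong: string compare puts '8...' first
print(by_value)   # ['9 s', '8.1 min']  -- correct: sort on the raw milliseconds
```

The fix is for the table's sort key to be the raw numeric duration, with the human-readable string used only for display.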
[jira] [Commented] (SPARK-29103) CheckAnalysis for data source V2 ALTER TABLE ignores case sensitivity
[ https://issues.apache.org/jira/browse/SPARK-29103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934368#comment-16934368 ] pralabhkumar commented on SPARK-29103: -- [~joseph.torres] In the findNestedField method of StructType.scala, one can apply fieldNames.headOption.map(_.toLowerCase(Locale.ROOT)), or in CheckAnalysis.scala, under case alter: AlterTable, one can lowercase fieldName before passing it to table.schema.findNestedField(fieldName, includeCollections = true). Let me know if this approach is fine; I can create the PR for it. > CheckAnalysis for data source V2 ALTER TABLE ignores case sensitivity > - > > Key: SPARK-29103 > URL: https://issues.apache.org/jira/browse/SPARK-29103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jose Torres >Priority: Blocker > > For each column referenced, we run > ```val field = table.schema.findNestedField(fieldName, includeCollections = > true)``` > and fail analysis if the field isn't there. This check is always > case-sensitive on column names, even if the underlying catalog is case > insensitive, so it will sometimes throw on ALTER operations which the catalog > supports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
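The first option above, normalizing field names before comparison, can be illustrated with a case-insensitive nested-field lookup. The sketch below uses a plain dict standing in for StructType, and all names are hypothetical:

```python
def find_nested_field(schema, path, case_sensitive=False):
    """Walk a nested dict-of-dicts schema along `path`.

    When case_sensitive is False, names are compared lowercased,
    matching how a case-insensitive catalog resolves columns.
    """
    norm = (lambda s: s) if case_sensitive else str.lower
    node = schema
    for name in path:
        if not isinstance(node, dict):
            return None
        node = next((v for k, v in node.items() if norm(k) == norm(name)), None)
        if node is None:
            return None
    return node

schema = {"Point": {"X": "double", "Y": "double"}}
print(find_nested_field(schema, ["point", "x"]))                       # double
print(find_nested_field(schema, ["point", "x"], case_sensitive=True))  # None
```

Driving the `case_sensitive` flag from the analyzer's resolver configuration would make the ALTER TABLE check agree with the catalog's behavior.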
[jira] [Commented] (SPARK-25788) Elastic net penalties for GLMs
[ https://issues.apache.org/jira/browse/SPARK-25788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887858#comment-16887858 ] pralabhkumar commented on SPARK-25788: -- [~shahid] I can work on this; please let me know if that's OK. > Elastic net penalties for GLMs > --- > > Key: SPARK-25788 > URL: https://issues.apache.org/jira/browse/SPARK-25788 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.2 >Reporter: Christian Lorentzen >Priority: Major > > Currently, both LinearRegression and LogisticRegression support an elastic > net penality (setElasticNetParam), i.e. L1 and L2 penalties. This feature > could and should also be added to GeneralizedLinearRegression. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
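For reference, the elastic net penalty requested here combines L1 and L2 terms under a mixing parameter, in the same parameterization LinearRegression exposes via regParam and elasticNetParam. A sketch of the penalty computation, illustrative rather than Spark's actual optimizer code:

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """alpha * (rho * ||w||_1 + (1 - rho)/2 * ||w||_2^2).

    reg_param is alpha (overall strength); elastic_net_param is rho
    (the L1/L2 mix: 1.0 -> pure lasso, 0.0 -> pure ridge).
    """
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    return reg_param * (elastic_net_param * l1
                        + 0.5 * (1.0 - elastic_net_param) * l2)

# w = [1, -2]: ||w||_1 = 3, ||w||_2^2 = 5
print(elastic_net_penalty([1.0, -2.0], reg_param=1.0, elastic_net_param=0.5))
# 2.75
```

Adding this term to GeneralizedLinearRegression's objective would bring it in line with the two existing estimators.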