[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239990434

--- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala ---

@@ -16,10 +16,13 @@
  */
 package org.apache.spark.deploy.k8s.features

-import scala.collection.JavaConverters._
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.nio.file.Files

 import io.fabric8.kubernetes.api.model._
 import org.scalatest.BeforeAndAfter
+import scala.collection.JavaConverters._

--- End diff --

Think I fixed this.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23252

@vanzin

> (Also, another nit, Spark authentication is not necessarily SASL-based.)

The security doc was updated to reflect this. Thanks!
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239944338

--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---

@@ -380,6 +400,12 @@ private[spark] class SecurityManager(
     }
   }

+  private def readSecretFile(secretFilePath: String): String = {
+    val secretFile = new File(secretFilePath)
+    require(secretFile.isFile, s"No file found containing the secret key at $secretFilePath.")
+    Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))

--- End diff --

Will check non-empty only
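For illustration, the logic under discussion can be sketched outside the PR's Scala. This is a hypothetical Python rendering (function name and error messages are illustrative, not the merged code), including the non-empty check promised in the comment:

```python
import base64
from pathlib import Path


def read_secret_file(secret_file_path: str) -> str:
    """Read a raw secret key from a file and return it base64-encoded.

    Mirrors the shape of the readSecretFile helper under review, with the
    non-empty validation the comment says will be added.
    """
    path = Path(secret_file_path)
    if not path.is_file():
        raise ValueError(
            f"No file found containing the secret key at {secret_file_path}.")
    raw = path.read_bytes()
    if not raw:
        raise ValueError(f"Secret key file is empty: {secret_file_path}")
    # Base64 lets an arbitrary binary key travel through string-typed config.
    return base64.b64encode(raw).decode("ascii")
```

Encoding the raw bytes rather than assuming UTF-8 text means the mounted file can contain any binary key material.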
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23252

> How is this file protected in kubernetes?

More precisely, it's a core concept in Kubernetes that container boundaries provide strong enough isolation that data doesn't leak between containers. Outside of fringe cases, like `hostPath` volumes and shared persistent volume mounts backed by NFS, containers are assumed to have isolated views of disk and mounted volumes.
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23252

@tgravescs thanks for the feedback!

> How is this file protected in kubernetes?

One can mount secrets into files in Kubernetes, and it's up to Kubernetes to mount these securely. See [here](https://kubernetes.io/docs/concepts/configuration/secret/). We already support user-mounted secrets, so it's up to the user to create the secret bytes and point to the right mount location. There are alternative methods, e.g. using Vault. Persistent volume claims that are permissioned properly can also work.

> I actually don't like this idea at least for yarn and other deployments, I see people abusing it (accidentally) and using it in non-secure manner.

Sure, we can forbid this being used in non-Kubernetes contexts.

> Do you need updates to the kurbernetes specific docs on how users use this?

Docs are updated in the `security.md` file. Specifying how the file is mounted is up to the user. We can suggest using Kubernetes secrets, but again, in Kubernetes other mounting options are available.
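As a concrete sketch of the secret-mounting approach described above — the secret name, key, and mount path below are illustrative examples, not something the PR prescribes:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spark-auth-secret            # hypothetical name
type: Opaque
data:
  spark-auth-key: dGVzdC1zZWNyZXQ=   # base64 of the raw key bytes
---
# Fragment of a driver/executor pod spec mounting the secret as a file,
# which the new file-based config could then point at:
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-driver
      volumeMounts:
        - name: auth-secret-vol
          mountPath: /opt/spark/secrets
          readOnly: true
  volumes:
    - name: auth-secret-vol
      secret:
        secretName: spark-auth-secret
```

Any other mechanism that materializes the same file in both containers (Vault agents, permissioned persistent volumes) would work equally well with a file-based config.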
[GitHub] spark pull request #23174: [SPARK-26194][k8s] Auto generate auth secret for ...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23174#discussion_r239894356

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala ---

@@ -87,44 +88,61 @@ private[spark] class BasicExecutorFeatureStep(kubernetesConf: KubernetesExecutor
     val executorCpuQuantity = new QuantityBuilder(false)
       .withAmount(executorCoresRequest)
       .build()
-    val executorExtraClasspathEnv = executorExtraClasspath.map { cp =>
-      new EnvVarBuilder()
-        .withName(ENV_CLASSPATH)
-        .withValue(cp)
-        .build()
-    }
-    val executorExtraJavaOptionsEnv = kubernetesConf
-      .get(EXECUTOR_JAVA_OPTIONS)
-      .map { opts =>
-        val subsOpts = Utils.substituteAppNExecIds(opts, kubernetesConf.appId,
-          kubernetesConf.executorId)
-        val delimitedOpts = Utils.splitCommandString(subsOpts)
-        delimitedOpts.zipWithIndex.map {
-          case (opt, index) =>
-            new EnvVarBuilder().withName(s"$ENV_JAVA_OPT_PREFIX$index").withValue(opt).build()
+
+    val executorEnv: Seq[EnvVar] = {
+      (Seq(
+        (ENV_DRIVER_URL, driverUrl),
+        (ENV_EXECUTOR_CORES, executorCores.toString),
+        (ENV_EXECUTOR_MEMORY, executorMemoryString),
+        (ENV_APPLICATION_ID, kubernetesConf.appId),
+        // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
+        (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
+        (ENV_EXECUTOR_ID, kubernetesConf.executorId)
+      ) ++ kubernetesConf.environment).map { case (k, v) =>
+        new EnvVarBuilder()
+          .withName(k)
+          .withValue(v)
+          .build()
       }
-      }.getOrElse(Seq.empty[EnvVar])
-    val executorEnv = (Seq(
-      (ENV_DRIVER_URL, driverUrl),
-      (ENV_EXECUTOR_CORES, executorCores.toString),
-      (ENV_EXECUTOR_MEMORY, executorMemoryString),
-      (ENV_APPLICATION_ID, kubernetesConf.appId),
-      // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
-      (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
-      (ENV_EXECUTOR_ID, kubernetesConf.executorId)) ++
-      kubernetesConf.environment)
-      .map(env => new EnvVarBuilder()
-        .withName(env._1)
-        .withValue(env._2)
-        .build()
-      ) ++ Seq(
-      new EnvVarBuilder()
-        .withName(ENV_EXECUTOR_POD_IP)
-        .withValueFrom(new EnvVarSourceBuilder()
-          .withNewFieldRef("v1", "status.podIP")
+    } ++ {
+      Seq(new EnvVarBuilder()
+        .withName(ENV_EXECUTOR_POD_IP)
+        .withValueFrom(new EnvVarSourceBuilder()
+          .withNewFieldRef("v1", "status.podIP")
+          .build())
+        .build())
-        .build()
-    ) ++ executorExtraJavaOptionsEnv ++ executorExtraClasspathEnv.toSeq
+    } ++ {
+      Option(secMgr.getSecretKey()).map { authSecret =>
+        new EnvVarBuilder()
+          .withName(SecurityManager.ENV_AUTH_SECRET)
+          .withValue(authSecret)

--- End diff --

> Anyway, this isn't different from someone else being able to read secrets in the same namespace as the pod.

It isn't in theory, but in practice my understanding is that secrets are often permissioned very differently from pod objects in the cluster. We should be optimizing for the more common use case, which will work out of the box for more users and is also more secure in the context of more common configurations.
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239884995

--- Diff: core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala ---

@@ -440,12 +473,27 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties {
             intercept[IllegalArgumentException] {
               mgr.getSecretKey()
             }
+          case FILE =>
+            val secretFile = createTempSecretFile()
+            conf.set(AUTH_SECRET_FILE, secretFile.getAbsolutePath)
+            mgr.initializeAuth()
+            assert(encodeFileAsBase64(secretFile) === mgr.getSecretKey())
         }
       }
     }
   )
   }
 }

+  private def encodeFileAsBase64(secretFile: File) = {
+    Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
+  }
+
+  private def createTempSecretFile(contents: String = "test-secret"): File = {
+    val secretDir = Utils.createTempDir("temp-secrets")
+    val secretFile = new File(secretDir, "temp-secret.txt")
+    Files.write(secretFile.toPath, contents.getBytes(StandardCharsets.UTF_8))
+    secretFile

--- End diff --

Hm, can you clarify? The temporary directory should be generated with a random string. I believe this is sufficient for this test.
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23252 retest this please
[GitHub] spark pull request #23174: [SPARK-26194][k8s] Auto generate auth secret for ...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23174#discussion_r239702559

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala ---

@@ -87,44 +88,61 @@ private[spark] class BasicExecutorFeatureStep(kubernetesConf: KubernetesExecutor
     val executorCpuQuantity = new QuantityBuilder(false)
       .withAmount(executorCoresRequest)
       .build()
-    val executorExtraClasspathEnv = executorExtraClasspath.map { cp =>
-      new EnvVarBuilder()
-        .withName(ENV_CLASSPATH)
-        .withValue(cp)
-        .build()
-    }
-    val executorExtraJavaOptionsEnv = kubernetesConf
-      .get(EXECUTOR_JAVA_OPTIONS)
-      .map { opts =>
-        val subsOpts = Utils.substituteAppNExecIds(opts, kubernetesConf.appId,
-          kubernetesConf.executorId)
-        val delimitedOpts = Utils.splitCommandString(subsOpts)
-        delimitedOpts.zipWithIndex.map {
-          case (opt, index) =>
-            new EnvVarBuilder().withName(s"$ENV_JAVA_OPT_PREFIX$index").withValue(opt).build()
+
+    val executorEnv: Seq[EnvVar] = {
+      (Seq(
+        (ENV_DRIVER_URL, driverUrl),
+        (ENV_EXECUTOR_CORES, executorCores.toString),
+        (ENV_EXECUTOR_MEMORY, executorMemoryString),
+        (ENV_APPLICATION_ID, kubernetesConf.appId),
+        // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
+        (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
+        (ENV_EXECUTOR_ID, kubernetesConf.executorId)
+      ) ++ kubernetesConf.environment).map { case (k, v) =>
+        new EnvVarBuilder()
+          .withName(k)
+          .withValue(v)
+          .build()
      }
-      }.getOrElse(Seq.empty[EnvVar])
-    val executorEnv = (Seq(
-      (ENV_DRIVER_URL, driverUrl),
-      (ENV_EXECUTOR_CORES, executorCores.toString),
-      (ENV_EXECUTOR_MEMORY, executorMemoryString),
-      (ENV_APPLICATION_ID, kubernetesConf.appId),
-      // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
-      (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
-      (ENV_EXECUTOR_ID, kubernetesConf.executorId)) ++
-      kubernetesConf.environment)
-      .map(env => new EnvVarBuilder()
-        .withName(env._1)
-        .withValue(env._2)
-        .build()
-      ) ++ Seq(
-      new EnvVarBuilder()
-        .withName(ENV_EXECUTOR_POD_IP)
-        .withValueFrom(new EnvVarSourceBuilder()
-          .withNewFieldRef("v1", "status.podIP")
+    } ++ {
+      Seq(new EnvVarBuilder()
+        .withName(ENV_EXECUTOR_POD_IP)
+        .withValueFrom(new EnvVarSourceBuilder()
+          .withNewFieldRef("v1", "status.podIP")
+          .build())
+        .build())
-        .build()
-    ) ++ executorExtraJavaOptionsEnv ++ executorExtraClasspathEnv.toSeq
+    } ++ {
+      Option(secMgr.getSecretKey()).map { authSecret =>
+        new EnvVarBuilder()
+          .withName(SecurityManager.ENV_AUTH_SECRET)
+          .withValue(authSecret)

--- End diff --

I filed https://issues.apache.org/jira/browse/SPARK-26301 to suggest the alternative scheme. Unlike SPARK-26139, this would change the functionality that was merged here.
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23252#discussion_r239701821

--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---

@@ -367,11 +372,26 @@ private[spark] class SecurityManager(
       case _ =>
         require(sparkConf.contains(SPARK_AUTH_SECRET_CONF),
-          s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config.")
+          s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config")
         return
     }

-    secretKey = Utils.createSecret(sparkConf)
+    if (sparkConf.get(AUTH_SECRET_FILE_DRIVER).isDefined

--- End diff --

Also considered adding a validation that file-based auth can only be used for Kubernetes. Omitted it for simplicity, but file-based auth seems dubious for non-K8s contexts.
[GitHub] spark pull request #23252: [SPARK-26239] File-based secret key loading for S...
GitHub user mccheah opened a pull request: https://github.com/apache/spark/pull/23252

[SPARK-26239] File-based secret key loading for SASL.

## What changes were proposed in this pull request?

This proposes an alternative way to load secret keys into a Spark application that is running on Kubernetes. Instead of automatically generating the secret, the secret key can reside in a file that is shared between both the driver and executor containers.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/palantir/spark auth-secret-with-file

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23252.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23252

commit 957cb15a2d48b4cf2b5c7f1a8c124df3a53bf4d9
Author: mcheah
Date: 2018-12-07T05:04:56Z

    [SPARK-26239] File-based secret key loading for SASL.

    This proposes an alternative way to load secret keys into a Spark application that is running on Kubernetes. Instead of automatically generating the secret, the secret key can reside in a file that is shared between both the driver and executor containers.
[GitHub] spark issue #23252: [SPARK-26239] File-based secret key loading for SASL.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23252 @vanzin @vinooganesh @ifilonenko
[GitHub] spark pull request #23174: [SPARK-26194][k8s] Auto generate auth secret for ...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23174#discussion_r239694862

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala ---

@@ -87,44 +88,61 @@ private[spark] class BasicExecutorFeatureStep(kubernetesConf: KubernetesExecutor
     val executorCpuQuantity = new QuantityBuilder(false)
       .withAmount(executorCoresRequest)
       .build()
-    val executorExtraClasspathEnv = executorExtraClasspath.map { cp =>
-      new EnvVarBuilder()
-        .withName(ENV_CLASSPATH)
-        .withValue(cp)
-        .build()
-    }
-    val executorExtraJavaOptionsEnv = kubernetesConf
-      .get(EXECUTOR_JAVA_OPTIONS)
-      .map { opts =>
-        val subsOpts = Utils.substituteAppNExecIds(opts, kubernetesConf.appId,
-          kubernetesConf.executorId)
-        val delimitedOpts = Utils.splitCommandString(subsOpts)
-        delimitedOpts.zipWithIndex.map {
-          case (opt, index) =>
-            new EnvVarBuilder().withName(s"$ENV_JAVA_OPT_PREFIX$index").withValue(opt).build()
+
+    val executorEnv: Seq[EnvVar] = {
+      (Seq(
+        (ENV_DRIVER_URL, driverUrl),
+        (ENV_EXECUTOR_CORES, executorCores.toString),
+        (ENV_EXECUTOR_MEMORY, executorMemoryString),
+        (ENV_APPLICATION_ID, kubernetesConf.appId),
+        // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
+        (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
+        (ENV_EXECUTOR_ID, kubernetesConf.executorId)
+      ) ++ kubernetesConf.environment).map { case (k, v) =>
+        new EnvVarBuilder()
+          .withName(k)
+          .withValue(v)
+          .build()
      }
-      }.getOrElse(Seq.empty[EnvVar])
-    val executorEnv = (Seq(
-      (ENV_DRIVER_URL, driverUrl),
-      (ENV_EXECUTOR_CORES, executorCores.toString),
-      (ENV_EXECUTOR_MEMORY, executorMemoryString),
-      (ENV_APPLICATION_ID, kubernetesConf.appId),
-      // This is to set the SPARK_CONF_DIR to be /opt/spark/conf
-      (ENV_SPARK_CONF_DIR, SPARK_CONF_DIR_INTERNAL),
-      (ENV_EXECUTOR_ID, kubernetesConf.executorId)) ++
-      kubernetesConf.environment)
-      .map(env => new EnvVarBuilder()
-        .withName(env._1)
-        .withValue(env._2)
-        .build()
-      ) ++ Seq(
-      new EnvVarBuilder()
-        .withName(ENV_EXECUTOR_POD_IP)
-        .withValueFrom(new EnvVarSourceBuilder()
-          .withNewFieldRef("v1", "status.podIP")
+    } ++ {
+      Seq(new EnvVarBuilder()
+        .withName(ENV_EXECUTOR_POD_IP)
+        .withValueFrom(new EnvVarSourceBuilder()
+          .withNewFieldRef("v1", "status.podIP")
+          .build())
+        .build())
-        .build()
-    ) ++ executorExtraJavaOptionsEnv ++ executorExtraClasspathEnv.toSeq
+    } ++ {
+      Option(secMgr.getSecretKey()).map { authSecret =>
+        new EnvVarBuilder()
+          .withName(SecurityManager.ENV_AUTH_SECRET)
+          .withValue(authSecret)

--- End diff --

Ah, I thought about this a bit more and realized that this is more insecure than I originally read it to be. If the secret is put directly in the environment variable field itself, then anyone who has permission to get the pod metadata from the Kubernetes API server can read the secret generated by this application. In practice, permissioning on pod specs is often far looser than permissioning on Kubernetes secret objects. In this solution the administrator has to restrict access to pod specs to only the user.

I think at the very least we want this to be configured by creating a Kubernetes secret object, then [loading the environment variable](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables) to point to the secret object.

In the meantime I'm going to push the PR that allows secrets to be specified as file paths directly. I will also file a Spark ticket to avoid putting the environment variable directly in the pod spec object itself.
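The safer variant described above, using the Kubernetes docs' secret-backed environment variables rather than an inline value, would look roughly like this pod-spec fragment (all names here are illustrative):

```yaml
# Fragment of an executor container spec. The env var's value is resolved
# by the kubelet from a Secret object, so the pod spec itself never
# contains the secret material.
env:
  - name: SPARK_AUTH_SECRET        # illustrative; the real name lives in SecurityManager
    valueFrom:
      secretKeyRef:
        name: spark-auth-secret    # hypothetical Secret object name
        key: spark-auth-key
```

With this shape, reading the pod spec reveals only a reference; the reader still needs `get` permission on the Secret object itself.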
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 Ok that's fine. Will merge to master if there are no further comments in the near future.
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174

It matters because we're discussing direction - that is, what opinion Spark wants to take regarding how to set up security on K8s. It's not obvious from our discussion on SPARK-26239 that we agree we should allow such optionality for other authentication schemes. In other words, if we just merge this PR without further discussion and consensus on SPARK-26239, we're effectively communicating that Spark is locked in to authentication backed by K8s secrets.

I want to emphasize that it's important to agree on the direction for the bigger picture early on, so that we can then say this patch still fits into the bigger vision. I also intend to take this patch and check that work on SPARK-26239 would work nicely with it; to the best of my knowledge the additional options should layer on top of this default one just fine. Would like some concrete prototyping to confirm this, though.
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174

Ok, that's fine, with the caveat that we merge the subsequent optionality soon. I'll work on the file-based secret authentication and encryption this week. I'm very concerned that we'll ship with this but with no other security options if we're not rigorously moving SPARK-26239 forward.

Merging to master in a few hours; letting it stay open for a bit for any other commentary. @gdearment for SA.
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 It's just to have the assurance that we will have some way to bypass this for auth at least for 3.x. I'd like to concretely determine this before merging if possible. But I hope that the suggestion proposed in SPARK-26239 could be agreed upon fairly quickly?
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 I think as long as we have one alternate mechanism proposed in SPARK-26239 this is ok to merge. I proposed one in [this comment](https://issues.apache.org/jira/browse/SPARK-26239?focusedCommentId=16705273&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16705273).
[GitHub] spark issue #22959: [SPARK-25876][k8s] Simplify kubernetes configuration typ...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22959 Ah I forgot to merge, sorry! Merging into master.
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23057 Is this ready to merge?
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r238036776

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/PartitionTransforms.java ---

@@ -0,0 +1,229 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+/**
+ * A standard set of transformations that are passed to data sources during table creation.
+ *
+ * @see PartitionTransform
+ */
+public class PartitionTransforms {
+  private PartitionTransforms() {
+  }
+
+  /**
+   * Create a transform for a column with the given name.
+   *
+   * This transform is used to pass named transforms that are not known to Spark.
+   *
+   * @param transform a name of the transform to apply to the column
+   * @param colName a column name
+   * @return an Apply transform for the column
+   */
+  public static PartitionTransform apply(String transform, String colName) {
+    if ("identity".equals(transform)) {

--- End diff --

Is it possible to defer partition support, or is it fundamentally important enough to get correct now because we will be building on it in, e.g., the very next evolution of this API and its uses?

I'm thinking about how to minimize the amount of API we're proposing per change, particularly when the choices aren't obvious.
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237985697

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/TableCatalog.java ---

@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+import org.apache.spark.sql.catalyst.TableIdentifier;
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+
+public interface TableCatalog extends CatalogProvider {

--- End diff --

I'm ok with that!
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237979620

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/v2/V1MetadataTable.scala ---

@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2
+
+import java.util
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.catalog.v2.PartitionTransforms.{bucket, identity}
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils}
+import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, WriteSupport}
+import org.apache.spark.sql.sources.v2.reader.DataSourceReader
+import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An implementation of catalog v2 [[Table]] to expose v1 table metadata.
+ */
+private[sql] class V1MetadataTable(
+    v1Table: CatalogTable,
+    v2Source: Option[DataSourceV2]) extends Table {
+
+  def readDelegate: ReadSupport = v2Source match {
+    case r: ReadSupport => r
+    case _ => throw new UnsupportedOperationException(s"Source does not support reads: $v2Source")
+  }
+
+  def writeDelegate: WriteSupport = v2Source match {
+    case w: WriteSupport => w
+    case _ => throw new UnsupportedOperationException(s"Source does not support writes: $v2Source")
+  }
+
+  lazy val options: Map[String, String] = {
+    v1Table.storage.locationUri match {

--- End diff --

On a second read over this, I don't think we necessarily have to use the `Option` lambdas here; in fact it may be less legible, varying from developer to developer. But if one were to do so, it'd be something like this:

```
v1Table.storage.properties ++
  v1Table.storage.locationUri.map(uri => Map("path" -> CatalogUtils.URIToString(uri))).getOrElse(Map.empty)
```
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237978582

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/TableCatalog.java ---

@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalog.v2;
+
+import org.apache.spark.sql.catalyst.TableIdentifier;
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
+import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+
+public interface TableCatalog extends CatalogProvider {

--- End diff --

Perhaps something as simple as `Catalog` as the top level then, which is sub-interfaced by `TableCatalog`, `FunctionCatalog`, and other "kinds" of catalogs. They can all share the same `initialize` method, which is defined by `Catalog`. That sounds like the simplest idea, perhaps?
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21306 @stczwd that's my understanding yeah. Others can feel free to correct me otherwise.
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 The trouble is the API proposed here and how it would have to change for future features. If we wanted to add the option to support authentication via mounted files later, what would the API for that look like, and how would that change the API for users who were relying on this authentication mechanism? That's why it's important to lay out the optionality now, so it can be clear to us what our options are and how we are going to use them. A proposed scheme is to have `spark.authenticate.k8s.secret.provider=autok8ssecret`, then document what that does. Perhaps that's the default mode. Then add another scheme, say `spark.authenticate.k8s.secret.provider=files`, with further options for specifying where that file is located on both the driver and the executors. It's helpful to put this patch in the context of where we want to go for authentication in general moving forward. Otherwise this feature, taken in isolation, will make it appear that Spark is being opinionated about using Kubernetes secrets and environment variables for authentication. If it's not introduced in this patch, then at least we should file JIRA tickets, reference them as future add-ons to this, and have a roadmap for what SASL on K8s will look like in the bigger picture for 3.x. ---
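As a concrete illustration, the configuration surface proposed in this comment might read as follows. The two `provider` values are the commenter's proposal, not an existing Spark API, and the file-location keys are hypothetical placeholders:

```properties
# Opt in to Spark authentication
spark.authenticate=true

# Default: auto-generate the secret and distribute it as a Kubernetes secret
spark.authenticate.k8s.secret.provider=autok8ssecret

# Alternative: read the secret from files pre-mounted by the user.
# The two file-location keys below are hypothetical, for illustration only.
# spark.authenticate.k8s.secret.provider=files
# spark.authenticate.k8s.secret.driverFile=/mnt/secrets/spark-auth
# spark.authenticate.k8s.secret.executorFile=/mnt/secrets/spark-auth
```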
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 > There doesn't need to be a single solution. This patch going in does not preclude adding more features later, one of which might be reading this from a pre-defined secret. The way it's written now, if the user opts in to using `spark.authenticate` with Kubernetes, the application will _always_ automatically generate the secret and use that as the security mechanism. I think we'd prefer to see the various options that are available up front, and this patch should probably introduce both the automatic secret-creation version (if we agree that it is secure) and the manual-provision version. If this change is merged into 3.x without any other changes, users will be forced to use the K8s-secret-based SASL scheme, and this feature will be unusable for other users, such as in the vault case discussed above. ---
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21306 @stczwd my understanding here is that a table isn't a streaming table or a batch table, but rather that a table points to data that can either be scanned in stream or in batch, and that the table is responsible for returning either streaming scanners or batch scanners when the logical plan calls for it. The reason why I believe this is the case is because of https://github.com/apache/spark/pull/23086/files#diff-d111d7e2179b55465840c9a81ea004f2R65 and its eventual analogous streaming variant. In the new abstractions we propose here and in [our proposal](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit), the catalog gets a reference to a `Table` object that can build `Scan`s over that table. In other words, the crucial overarching theme in all of the following matters is that a Table isn't inherently a streaming or a batch table, but rather a Table supports returning streaming and/or batch scans. The table returned by the catalog is a pointer to the data, and the Scan defines how one reads that data. > Source needs to be defined for stream table The catalog returns an instance of `Table` that can create `Scan`s that support the `toStream` method. > Stream table requires a special flags to indicate that it is a stream table. When one gets back a `Scan`, calling its `toStream` method will indicate that the table's data is about to be scanned in a streaming manner. > User and Program need to be aware of whether this table is a stream table. Probably would be done from the SQL code side. But not as certain about this, can you elaborate? > What would we do if the user wants to change the stream table to batch table or convert the batch table to stream table? The new abstraction handles this at the `Scan` level instead of the `Table` level. 
`Table`s are themselves not streamed or batched, but rather they construct scans that can read them in either stream or batch; the Scan implements `toBatch` and/or `toStream` to support the appropriate read method. > What does the stream table metadata you define look like? What is the difference between stream table metadata and batch table metadata? This I don't think is as clear given what has been proposed so far. Will let others offer comment here. Others should feel free to offer more commentary or correct anything from above. ---
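The Table/Scan split described above can be sketched like this. It is a toy model: the `Table`, `Scan`, `toBatch`, and `toStream` names follow the proposal under discussion, while `ScanBuilder`'s shape and the `BatchRead`/`StreamRead` result types are stand-ins:

```java
// A Table is only a pointer to data; it is not inherently batch or streaming.
interface Table {
    ScanBuilder newScanBuilder();
}

interface ScanBuilder {
    Scan build();
}

// The Scan decides how the data is read: in batch, as a stream, or both.
interface Scan {
    default BatchRead toBatch() {
        throw new UnsupportedOperationException("batch reads not supported");
    }

    default StreamRead toStream() {
        throw new UnsupportedOperationException("streaming reads not supported");
    }
}

interface BatchRead {}
interface StreamRead {}

// A table whose scans support both read modes.
class DualModeTable implements Table {
    @Override
    public ScanBuilder newScanBuilder() {
        return () -> new Scan() {
            @Override
            public BatchRead toBatch() { return new BatchRead() {}; }

            @Override
            public StreamRead toStream() { return new StreamRead() {}; }
        };
    }
}
```

The point of the shape is that "streaming vs. batch" is decided per read, at the `Scan` level, rather than being a property of the table itself.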
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237721926 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/TableCatalog.java --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.catalyst.TableIdentifier; +import org.apache.spark.sql.catalyst.analysis.NoSuchTableException; +import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException; +import org.apache.spark.sql.types.StructType; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import java.util.Map; + +public interface TableCatalog extends CatalogProvider { --- End diff -- The semantics here aren't clear at least to me. Typically a `Provider` is a class that can instantiate a `something`. Here it appears to provide itself? The abstraction I would imagine would be to have either: * `CatalogProvider` has a method called `get(options, sqlConf)` which returns a `TableCatalog` configured with the given options and `SQLConf`, or * Remove `CatalogProvider` entirely and put `initialize` in this interface. 
Every `TableCatalog` instance must be initialized before calling any other methods like `loadTable`. ---
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237723417 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/Table.java --- @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.types.StructType; + +import java.util.List; +import java.util.Map; + +/** + * Represents table metadata from a {@link TableCatalog} or other table sources. + */ +public interface Table { + /** + * Return the table properties. + * @return this table's map of string properties + */ + Map properties(); + + /** + * Return the table schema. + * @return this table's schema as a struct type + */ + StructType schema(); + + /** + * Return the table partitioning transforms. + * @return this table's partitioning transforms + */ + List partitioning(); --- End diff -- Should we default this to no partitioning? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237722772 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/PartitionTransforms.java --- @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +/** + * A standard set of transformations that are passed to data sources during table creation. + * + * @see PartitionTransform + */ +public class PartitionTransforms { + private PartitionTransforms() { + } + + /** + * Create a transform for a column with the given name. + * + * This transform is used to pass named transforms that are not known to Spark. + * + * @param transform a name of the transform to apply to the column + * @param colName a column name + * @return an Apply transform for the column + */ + public static PartitionTransform apply(String transform, String colName) { +if ("identity".equals(transform)) { --- End diff -- Is it necessary to support all `PartitionTransform` types for a first pass? Though, I would imagine we'd have to for the v2 to v1 catalog adapter. 
If it weren't for that, I would suggest supporting only a simple set of `PartitionTransform`, such as only `identity`, to keep this PR focused on the catalog API and not the partitions API. ---
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237723294 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/Table.java --- @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.types.StructType; + +import java.util.List; +import java.util.Map; + +/** + * Represents table metadata from a {@link TableCatalog} or other table sources. + */ +public interface Table { --- End diff -- Looking at this patch in comparison to the other again (updated to https://github.com/apache/spark/pull/23086) it looks like this work should be rebased on top of the batch read refactor's PR in order to not have two `Table` classes that do the same thing - is this the right assessment? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237723354 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/Table.java --- @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.types.StructType; + +import java.util.List; +import java.util.Map; + +/** + * Represents table metadata from a {@link TableCatalog} or other table sources. + */ +public interface Table { + /** + * Return the table properties. + * @return this table's map of string properties + */ + Map properties(); --- End diff -- Should we `default` this to an empty Map? I don't think all tables will support custom properties. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
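Both defaulting suggestions (an empty map for `properties()` here, and no partitioning in the earlier `partitioning()` comment) could be expressed as Java `default` methods. A sketch, with `PartitionTransform` stubbed out and the other `Table` members omitted:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

interface PartitionTransform {}

interface Table {
    // Not every table supports custom properties; default to an empty map.
    default Map<String, String> properties() {
        return Collections.emptyMap();
    }

    // Likewise, treat a table as unpartitioned unless it says otherwise.
    default List<PartitionTransform> partitioning() {
        return Collections.emptyList();
    }
}
```

With these defaults, minimal table implementations only have to provide a schema, and callers can rely on never receiving `null` for either accessor.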
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237722264 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/v2/V1MetadataTable.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2 + +import java.util + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.catalog.v2.PartitionTransforms.{bucket, identity} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.DataSourceReader +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType + +/** + * An implementation of catalog v2 [[Table]] to expose v1 table metadata. 
+ */ +private[sql] class V1MetadataTable( +v1Table: CatalogTable, +v2Source: Option[DataSourceV2]) extends Table { + + def readDelegate: ReadSupport = v2Source match { +case r: ReadSupport => r +case _ => throw new UnsupportedOperationException(s"Source does not support reads: $v2Source") + } + + def writeDelegate: WriteSupport = v2Source match { +case w: WriteSupport => w +case _ => throw new UnsupportedOperationException(s"Source does not support writes: $v2Source") + } + + lazy val options: Map[String, String] = { +v1Table.storage.locationUri match { --- End diff -- Consider `.map(...).getOrElse`, but we haven't been consistent on using or not using `match` on `Option` types throughout Spark in general anyways so it's fine to leave as-is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
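The `.map(...).getOrElse` suggestion above, shown here as a Java `Optional` analogue of the Scala pattern (the URI-to-path conversion and the fallback string are stand-ins for whatever `V1MetadataTable` actually derives from `locationUri`):

```java
import java.net.URI;
import java.util.Optional;

class LocationOptions {
    // match-style: explicit branching on presence
    static String viaBranching(Optional<URI> locationUri) {
        if (locationUri.isPresent()) {
            return locationUri.get().getPath();
        } else {
            return "<no location>";
        }
    }

    // map/orElse-style: the more compact form suggested in the review
    static String viaMap(Optional<URI> locationUri) {
        return locationUri.map(URI::getPath).orElse("<no location>");
    }
}
```

Both forms are equivalent; the second just avoids the explicit presence check, which is the thrust of the review comment.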
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237722352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/v2/V1MetadataTable.scala --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2 + +import java.util + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.catalog.v2.PartitionTransforms.{bucket, identity} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogUtils} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.DataSourceReader +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType + +/** + * An implementation of catalog v2 [[Table]] to expose v1 table metadata. 
+ */ +private[sql] class V1MetadataTable( +v1Table: CatalogTable, +v2Source: Option[DataSourceV2]) extends Table { + + def readDelegate: ReadSupport = v2Source match { +case r: ReadSupport => r +case _ => throw new UnsupportedOperationException(s"Source does not support reads: $v2Source") + } + + def writeDelegate: WriteSupport = v2Source match { +case w: WriteSupport => w +case _ => throw new UnsupportedOperationException(s"Source does not support writes: $v2Source") + } + + lazy val options: Map[String, String] = { +v1Table.storage.locationUri match { --- End diff -- Also any particular reason this and the following variables have to be `lazy`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r237711409 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/CatalogProvider.java --- @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.internal.SQLConf; + +/** + * A marker interface to provide a catalog implementation for Spark. + * + * Implementations can provide catalog functions by implementing additional interfaces, like + * {@link TableCatalog} to expose table operations. + * + * Catalog implementations must implement this marker interface to be loaded by + * {@link Catalogs#load(String, SQLConf)}. The loader will instantiate catalog classes using the + * required public no-arg constructor. After creating an instance, it will be configured by calling + * {@link #initialize(CaseInsensitiveStringMap)}. + * + * Catalog implementations are registered to a name by adding a configuration option to Spark: + * {@code spark.sql.catalog.catalog-name=com.example.YourCatalogClass}. 
All configuration properties + * in the Spark configuration that share the catalog name prefix, + * {@code spark.sql.catalog.catalog-name.(key)=(value)} will be passed in the case insensitive + * string map of options in initialization with the prefix removed. An additional property, + * {@code name}, is also added to the options and will contain the catalog's name; in this case, + * "catalog-name". + */ +public interface CatalogProvider { --- End diff -- As we discussed, will these APIs now live in the `sql-api` package? Also at what point are we going to introduce this new Maven module and package? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
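The prefix-stripping contract described in that javadoc can be sketched as follows. This is a toy version of what a loader like `Catalogs.load` might do; the class and method names are illustrative, and the case-insensitivity of the real option map is elided:

```java
import java.util.HashMap;
import java.util.Map;

class CatalogOptionLoader {
    /**
     * Collect spark.sql.catalog.<name>.<key>=<value> entries from the conf,
     * strip the shared prefix, and add the catalog's name under the "name" key.
     */
    static Map<String, String> optionsFor(String catalogName, Map<String, String> conf) {
        String prefix = "spark.sql.catalog." + catalogName + ".";
        Map<String, String> options = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                options.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        options.put("name", catalogName);
        return options;
    }
}
```

Note that the registration key itself (`spark.sql.catalog.catalog-name`, with no trailing dot) names the implementation class and is not part of the option map.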
[GitHub] spark issue #22959: [SPARK-25876][k8s] Simplify kubernetes configuration typ...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22959 Given that `Utils.isTesting` is used elsewhere, I'm fine with merging this. Going to merge in a few hours if there are no further comments. ---
[GitHub] spark pull request #21978: [SPARK-25006][SQL] Add CatalogTableIdentifier.
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21978#discussion_r237656841 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala --- @@ -18,48 +18,106 @@ package org.apache.spark.sql.catalyst /** - * An identifier that optionally specifies a database. + * An identifier that optionally specifies a database and catalog. * * Format (unquoted): "name" or "db.name" * Format (quoted): "`name`" or "`db`.`name`" */ -sealed trait IdentifierWithDatabase { +sealed trait IdentifierWithOptionalDatabaseAndCatalog { val identifier: String def database: Option[String] + def catalog: Option[String] + /* * Escapes back-ticks within the identifier name with double-back-ticks. */ private def quoteIdentifier(name: String): String = name.replace("`", "``") def quotedString: String = { -val replacedId = quoteIdentifier(identifier) -val replacedDb = database.map(quoteIdentifier(_)) - -if (replacedDb.isDefined) s"`${replacedDb.get}`.`$replacedId`" else s"`$replacedId`" +// database is required if catalog is present +assert(database.isDefined || catalog.isEmpty) +def q(s: String): String = s"`${quoteIdentifier(s)}`" +Seq(catalog.map(q), database.map(q), Some(q(identifier))).flatten.mkString(".") } def unquotedString: String = { -if (database.isDefined) s"${database.get}.$identifier" else identifier +Seq(catalog, database, Some(identifier)).flatten.mkString(".") } override def toString: String = quotedString } +object CatalogTableIdentifier { + def apply(table: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, None, None) + + def apply(table: String, database: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, Some(database), None) + + def apply(table: String, database: String, catalog: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, Some(database), Some(catalog)) +} + /** - * Identifies a table in a database. 
- * If `database` is not defined, the current database is used. - * When we register a permanent function in the FunctionRegistry, we use - * unquotedString as the function name. + * Identifies a table in a database and catalog. + * If `database` is not defined, the current catalog's default database is used. + * If `catalog` is not defined, the current catalog is used. --- End diff -- Sounds good. When we add the logical side of leveraging catalogs we can revisit the API of how to set the current catalog. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 > Why? And how are mounted files better? Environment variables leak far more easily than file contents. One can accidentally `printenv` in a shell attached to the container and get the secret contents. `System.getenv` has a similar effect within the application code itself. For what it's worth, I'm also not sure whether the secret would be listed under the environment variables in the Spark UI (would have to test this). ---
[GitHub] spark issue #23174: [SPARK-26194][k8s] Auto generate auth secret for k8s app...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23174 Would it be possible to also provide support for passing this via a mounted file? Some users would prefer to avoid propagating sensitive information via environment variables. Also the user should be able to specify their own mounted file; spark-submit shouldn't always mount an auto-generated secret for the user. ---
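A file-based load along these lines, mirroring the `readSecretFile` helper from the related SPARK-26239 patch quoted above (translated from Scala to Java as a sketch; the non-empty check follows the review feedback on that patch, and the error wrapping is my own choice):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

class FileSecretLoader {
    // Read the raw secret bytes from a mounted file and return them
    // base64-encoded, failing fast if the file is missing or empty.
    static String readSecretFile(String secretFilePath) {
        Path secretFile = Paths.get(secretFilePath);
        if (!Files.isRegularFile(secretFile)) {
            throw new IllegalArgumentException(
                "No file found containing the secret key at " + secretFilePath);
        }
        try {
            byte[] bytes = Files.readAllBytes(secretFile);
            if (bytes.length == 0) {
                throw new IllegalArgumentException(
                    "Secret file is empty: " + secretFilePath);
            }
            return Base64.getEncoder().encodeToString(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the secret only ever lives on disk with file-system permissions, it avoids the `printenv`/`System.getenv` exposure discussed in this thread.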
[GitHub] spark pull request #21978: [SPARK-25006][SQL] Add CatalogTableIdentifier.
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21978#discussion_r237578911 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala --- @@ -18,48 +18,106 @@ package org.apache.spark.sql.catalyst /** - * An identifier that optionally specifies a database. + * An identifier that optionally specifies a database and catalog. * * Format (unquoted): "name" or "db.name" * Format (quoted): "`name`" or "`db`.`name`" */ -sealed trait IdentifierWithDatabase { +sealed trait IdentifierWithOptionalDatabaseAndCatalog { val identifier: String def database: Option[String] + def catalog: Option[String] + /* * Escapes back-ticks within the identifier name with double-back-ticks. */ private def quoteIdentifier(name: String): String = name.replace("`", "``") def quotedString: String = { -val replacedId = quoteIdentifier(identifier) -val replacedDb = database.map(quoteIdentifier(_)) - -if (replacedDb.isDefined) s"`${replacedDb.get}`.`$replacedId`" else s"`$replacedId`" +// database is required if catalog is present +assert(database.isDefined || catalog.isEmpty) +def q(s: String): String = s"`${quoteIdentifier(s)}`" +Seq(catalog.map(q), database.map(q), Some(q(identifier))).flatten.mkString(".") } def unquotedString: String = { -if (database.isDefined) s"${database.get}.$identifier" else identifier +Seq(catalog, database, Some(identifier)).flatten.mkString(".") } override def toString: String = quotedString } +object CatalogTableIdentifier { + def apply(table: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, None, None) + + def apply(table: String, database: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, Some(database), None) + + def apply(table: String, database: String, catalog: String): CatalogTableIdentifier = +new CatalogTableIdentifier(table, Some(database), Some(catalog)) +} + /** - * Identifies a table in a database. 
- * If `database` is not defined, the current database is used. - * When we register a permanent function in the FunctionRegistry, we use - * unquotedString as the function name. + * Identifies a table in a database and catalog. + * If `database` is not defined, the current catalog's default database is used. + * If `catalog` is not defined, the current catalog is used. --- End diff -- "current" meaning "global"? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
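The quoting logic in the diff above, rendered as a runnable Java sketch with the same rules: back-ticks inside a part are doubled, parts are joined with dots, and a catalog may only be set when a database is also set (the `Identifier` class name and `Optional` fields are illustrative stand-ins for the Scala trait):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class Identifier {
    final Optional<String> catalog;
    final Optional<String> database;
    final String name;

    Identifier(Optional<String> catalog, Optional<String> database, String name) {
        // database is required if catalog is present
        assert database.isPresent() || !catalog.isPresent();
        this.catalog = catalog;
        this.database = database;
        this.name = name;
    }

    // Escape back-ticks within a part with double back-ticks, then wrap it.
    private static String q(String s) {
        return "`" + s.replace("`", "``") + "`";
    }

    String quotedString() {
        List<String> parts = new ArrayList<>();
        catalog.ifPresent(c -> parts.add(q(c)));
        database.ifPresent(d -> parts.add(q(d)));
        parts.add(q(name));
        return String.join(".", parts);
    }

    String unquotedString() {
        List<String> parts = new ArrayList<>();
        catalog.ifPresent(parts::add);
        database.ifPresent(parts::add);
        parts.add(name);
        return String.join(".", parts);
    }
}
```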
[GitHub] spark pull request #21978: [SPARK-25006][SQL] Add CatalogTableIdentifier.
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21978#discussion_r237578805 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala --- @@ -18,48 +18,106 @@ package org.apache.spark.sql.catalyst /** - * An identifier that optionally specifies a database. + * An identifier that optionally specifies a database and catalog. * * Format (unquoted): "name" or "db.name" --- End diff -- Update formats in these scaladocs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22959: [SPARK-25876][k8s] Simplify kubernetes configurat...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22959#discussion_r237313976 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala --- @@ -112,125 +72,139 @@ private[spark] case class KubernetesConf[T <: KubernetesRoleSpecificConf]( def getOption(key: String): Option[String] = sparkConf.getOption(key) } +private[spark] class KubernetesDriverConf( +sparkConf: SparkConf, +val appId: String, +val mainAppResource: MainAppResource, +val mainClass: String, +val appArgs: Array[String], +val pyFiles: Seq[String]) + extends KubernetesConf(sparkConf) { + + override val resourceNamePrefix: String = { +val custom = if (Utils.isTesting) get(KUBERNETES_DRIVER_POD_NAME_PREFIX) else None --- End diff -- Possibly inject this in the test so that we don't have to use `Utils.isTesting`? Preference against using test flags to override behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
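The injection alternative suggested in this comment could look like the following sketch: instead of branching on a global testing flag like `Utils.isTesting`, the test-only override is passed in explicitly, so production code paths never consult test state (the class and parameter names are illustrative, not the actual `KubernetesDriverConf` API):

```java
import java.util.Optional;

class DriverConf {
    private final String resourceNamePrefix;

    // Production callers pass Optional.empty(); tests inject a fixed prefix.
    DriverConf(String appId, Optional<String> customPrefix) {
        this.resourceNamePrefix = customPrefix.orElseGet(() -> "spark-" + appId);
    }

    String resourceNamePrefix() {
        return resourceNamePrefix;
    }
}
```

The trade-off is one extra constructor parameter in exchange for behavior that is identical under test and in production.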
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r236500170 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.Evolving; +import org.apache.spark.sql.sources.v2.reader.Scan; +import org.apache.spark.sql.sources.v2.reader.ScanBuilder; +import org.apache.spark.sql.types.StructType; + +/** + * An interface representing a logical structured data set of a data source. For example, the + * implementation can be a directory on the file system, a topic of Kafka, or a table in the + * catalog, etc. + * + * This interface can mixin the following interfaces to support different operations: + * + * {@link SupportsBatchRead}: this table can be read in batch queries. + * + */ +@Evolving +public interface Table { + + /** + * Returns the schema of this table. + */ + StructType schema(); + + /** + * Returns a {@link ScanBuilder} which can be used to build a {@link Scan} later. Spark will call + * this method for each data scanning query. 
+ * + * The builder can take some query specific information to do operators pushdown, and keep these + * information in the created {@link Scan}. + */ + ScanBuilder newScanBuilder(DataSourceOptions options); --- End diff -- Makes sense to me - `DataSourceOptions` was carrying along identifiers that really belong to a table identifier and that should be interpreted at the catalog level, not the data read level. In other words the implementation of this `Table` should already know _what_ locations to look up (e.g. "files comprising dataset D"), now it's a matter of _how_ (e.g. pushdown, filter predicates). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
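The "what vs. how" separation described in the comment can be sketched with simplified stand-ins: the table already knows what it reads (resolved at the catalog level), while the scan builder handles how (per-query pushdown such as column pruning). None of these types are the actual v2 API classes:

```scala
// Simplified stand-ins for the v2 read-path types; illustrative only.
final class SimpleScan(val readSchema: Seq[String])

final class SimpleScanBuilder(allColumns: Seq[String]) {
  private var required: Seq[String] = allColumns
  // Per-query pushdown: keep only the requested columns.
  def pruneColumns(cols: Seq[String]): SimpleScanBuilder = { required = cols; this }
  def build(): SimpleScan = new SimpleScan(required)
}

final class FileTableSketch(path: String, columns: Seq[String]) {
  // No table identifiers arrive through scan options; the table was already
  // resolved to a concrete location before this point.
  def newScanBuilder(): SimpleScanBuilder = new SimpleScanBuilder(columns)
}
```

Under this shape, identifier-like information never needs to travel through `DataSourceOptions` at read time, which is the point made above.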
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r236490800 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; --- End diff -- Moving this to the Catalyst package would set a precedent for user-overridable behavior to live in the catalyst project. I'm not aware of anything in the Catalyst package being considered as public API right now. Are we allowed to start such a convention at this juncture? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21978: [SPARK-25006][SQL] Add CatalogTableIdentifier.
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21978 Wanted to follow up here - are we planning on merging this or are there more things we need to discuss?
[GitHub] spark issue #23053: [SPARK-25957][K8S] Make building alternate language bind...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23053 Ok that makes sense. I will merge this into master then.
[GitHub] spark issue #23053: [SPARK-25957][K8S] Make building alternate language bind...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23053 So now the utility must always build the Java image, and optionally will build python and R - is that correct? Is there any way to opt-out of building Java?
[GitHub] spark issue #23053: [SPARK-25957][K8S] Add ability to skip building optional...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23053 My understanding is that K8s is still marked as experimental, and something like this can be changed with the expectation that we're hardening in 3.0. If anything, we should be making this breaking change now because we want to set the standard for 3.0.
[GitHub] spark issue #23053: [SPARK-25957][K8S] Add ability to skip building optional...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23053 Hm, reviewing this again I think a better way to do it is to only opt-in to building images of different bindings. So for example the Python image should only be built when `-p` is specified. It seems strange that by default the utility is building all images. Perhaps a better interface for this entry point is this: ``` bin/docker-image-tool.sh -f -j -p ``` What does everyone think?
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r234793177 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java --- @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.sources.DataSourceRegister; +import org.apache.spark.sql.types.StructType; + +/** + * The base interface for v2 data sources which don't have a real catalog. Implementations must + * have a public, 0-arg constructor. + * + * The major responsibility of this interface is to return a {@link Table} for read/write. + */ +@InterfaceStability.Evolving +// TODO: do not extend `DataSourceV2`, after we finish the API refactor completely. +public interface TableProvider extends DataSourceV2 { + + /** + * Return a {@link Table} instance to do read/write with user-specified options. + * + * @param options the user-specified options that can identify a table, e.g. file path, Kafka + *topic name, etc. It's an immutable case-insensitive string-to-string map. 
+ */ + Table getTable(DataSourceOptions options); + + /** + * Return a {@link Table} instance to do read/write with user-specified schema and options. + * + * By default this method throws {@link UnsupportedOperationException}, implementations should + * override this method to handle user-specified schema. + * + * @param options the user-specified options that can identify a table, e.g. file path, Kafka + *topic name, etc. It's an immutable case-insensitive string-to-string map. + * @param schema the user-specified schema. + */ + default Table getTable(DataSourceOptions options, StructType schema) { --- End diff -- Basically just saying we should just push down this requested schema into the `ScanBuilder`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23013: [SPARK-25023] More detailed security guidance for K8S
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23013 The SASL mechanisms still work, though it's more similar to something like standalone mode than having the robustness of YARN. We could generate the SASL secrets and put them in K8s secrets as a k8s specific solution, for example.
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r234736735 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java --- @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.sources.DataSourceRegister; +import org.apache.spark.sql.types.StructType; + +/** + * The base interface for v2 data sources which don't have a real catalog. Implementations must + * have a public, 0-arg constructor. + * + * The major responsibility of this interface is to return a {@link Table} for read/write. + */ +@InterfaceStability.Evolving +// TODO: do not extend `DataSourceV2`, after we finish the API refactor completely. +public interface TableProvider extends DataSourceV2 { + + /** + * Return a {@link Table} instance to do read/write with user-specified options. + * + * @param options the user-specified options that can identify a table, e.g. file path, Kafka + *topic name, etc. It's an immutable case-insensitive string-to-string map. 
+ */ + Table getTable(DataSourceOptions options); + + /** + * Return a {@link Table} instance to do read/write with user-specified schema and options. + * + * By default this method throws {@link UnsupportedOperationException}, implementations should + * override this method to handle user-specified schema. + * + * @param options the user-specified options that can identify a table, e.g. file path, Kafka + *topic name, etc. It's an immutable case-insensitive string-to-string map. + * @param schema the user-specified schema. + */ + default Table getTable(DataSourceOptions options, StructType schema) { --- End diff -- I know that this is from prior DataSourceV2 semantics, but what's the difference between providing the `schema` here and the column pruning aspect of `ScanBuilder`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
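The distinction being asked about can be sketched like this: a user-specified schema tells a schemaless source how to interpret its data when the table is created, while column pruning narrows an already-known schema per query. All types below are simplified stand-ins, not the real v2 API:

```scala
// Illustrative only; not Spark's StructType or ScanBuilder.
case class FieldSketch(name: String, dataType: String)

final class CsvLikeTable(userSchema: Seq[FieldSketch]) {
  // The user-specified schema is fixed for the table's lifetime: without it,
  // a headerless CSV source would not know its column names or types at all.
  def schema: Seq[FieldSketch] = userSchema

  // Pruning is a per-query narrowing of a schema that is already known.
  def prunedSchema(keep: Set[String]): Seq[FieldSketch] =
    schema.filter(f => keep(f.name))
}
```

In other words, the `getTable(options, schema)` overload addresses "what does this data mean", while `ScanBuilder` pruning addresses "which of those columns does this particular query need".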
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r234736935 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/Batch.java --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.reader; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A physical representation of a data source scan for batch queries. This interface is used to + * provide physical information, like how many partitions the scanned data has, and how to read + * records from the partitions. + */ +@InterfaceStability.Evolving +public interface Batch { --- End diff -- `BatchScan`, perhaps? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22547: [SPARK-25528][SQL] data source V2 read side API refactor...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22547 @cloud-fan @rdblue I believe we've converged on an appropriate API as per our last sync. Do we have a plan to move this forward, with the separated smaller patches?
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r233257218 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStepSuite.scala --- @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features + +import org.mockito.{Mock, MockitoAnnotations} +import org.mockito.Matchers.{eq => Eq} +import org.mockito.Mockito.when +import org.mockito.invocation.InvocationOnMock +import org.mockito.stubbing.Answer +import org.scalatest.BeforeAndAfter + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesExecutorSpecificConf, SparkPod} +import org.apache.spark.deploy.k8s.Constants.HADOOP_CONFIG_MAP_NAME +import org.apache.spark.deploy.k8s.features.hadooputils.HadoopBootstrapUtil + +class HadoopConfExecutorFeatureStepSuite extends SparkFunSuite with BeforeAndAfter { + private val hadoopConfMapName = "HADOOP_CONF_NAME" + private val sparkPod = SparkPod.initialPod() + + @Mock + private var kubernetesConf: KubernetesConf[KubernetesExecutorSpecificConf] = _ + + @Mock + private var hadoopBootstrapUtil: HadoopBootstrapUtil = _ + + before { +MockitoAnnotations.initMocks(this) +val sparkConf = new SparkConf(false) + .set(HADOOP_CONFIG_MAP_NAME, hadoopConfMapName) +when(kubernetesConf.sparkConf).thenReturn(sparkConf) + } + + test("bootstrapHadoopConf being applied to a base spark pod") { +when(hadoopBootstrapUtil.bootstrapHadoopConfDir( --- End diff -- There is a point of over-mocking, I agree. I think `KubernetesConf` shouldn't be mocked in most cases because it's just a structure. This is similar to not mocking something like a case class or a hash map. I also don't think we need to use Mockito Answers here - that could be done with stub implementations of the submodule. I didn't have a strong enough conviction of not using Mockito answers, but in general I think we should favor stub implementations over `Answer`; it's more readable. I think it's fine though that we want to test "We're calling this submodule with these parameters", because the rest of the module's correctness is unit-test covered in the unit tests of that submodule. 
The general premise we'd like to follow is that a unit test should only execute code that is in the class under test. In other words, in this concrete case, since `HadoopBootstrapUtil` is a separate class, no code in `HadoopBootstrapUtil` should be run as part of the test. The class under test is responsible for calling the utility submodule with certain arguments but is not concerned about what that submodule actually does. Thus the test also doesn't have to be concerned here about what that submodule actually does, but should check that the submodule was actually called. If we want to check the submodule's correctness, we unit test the submodule. Now - did we need a submodule to begin with? We could have, for example, kept HadoopBootstrapUtil not really be a submodule at all, but just a `static` call on a utility method. I think that's debatable - it makes each test need to cover a larger amount of code; that is, we no longer test the utility method in isolation, which is what we get here. Therefore it's unclear if HadoopBootstrapUtil being its own object that can be stubbed, is the right approach. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
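The stub-over-`Answer` idea described above can be shown in miniature: a hand-written stub records the interaction, so the test asserts that the step called its submodule with the right arguments without running any submodule logic and without Mockito. All names are illustrative, not the real feature-step classes:

```scala
// A recording stub in place of a Mockito mock + Answer; illustrative names.
trait BootstrapUtilSketch {
  def bootstrapHadoopConfDir(confDirName: String): String
}

final class RecordingBootstrapStub extends BootstrapUtilSketch {
  var calledWith: Option[String] = None
  def bootstrapHadoopConfDir(confDirName: String): String = {
    calledWith = Some(confDirName) // record the interaction for assertion
    "stubbed-pod"                  // canned result; no real bootstrap runs
  }
}

// The class under test only delegates; its unit test covers exactly that.
final class HadoopConfStepSketch(util: BootstrapUtilSketch) {
  def configurePod(confDirName: String): String =
    util.bootstrapHadoopConfDir(confDirName)
}
```

A test then instantiates the stub, exercises the step, and asserts both the returned value and `calledWith`, which is the "was the submodule actually called with these parameters" check the comment describes.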
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r233198612 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStepSuite.scala --- @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features + +import org.mockito.{Mock, MockitoAnnotations} +import org.mockito.Matchers.{eq => Eq} +import org.mockito.Mockito.when +import org.mockito.invocation.InvocationOnMock +import org.mockito.stubbing.Answer +import org.scalatest.BeforeAndAfter + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesExecutorSpecificConf, SparkPod} +import org.apache.spark.deploy.k8s.Constants.HADOOP_CONFIG_MAP_NAME +import org.apache.spark.deploy.k8s.features.hadooputils.HadoopBootstrapUtil + +class HadoopConfExecutorFeatureStepSuite extends SparkFunSuite with BeforeAndAfter { + private val hadoopConfMapName = "HADOOP_CONF_NAME" + private val sparkPod = SparkPod.initialPod() + + @Mock + private var kubernetesConf: KubernetesConf[KubernetesExecutorSpecificConf] = _ + + @Mock + private var hadoopBootstrapUtil: HadoopBootstrapUtil = _ + + before { +MockitoAnnotations.initMocks(this) +val sparkConf = new SparkConf(false) + .set(HADOOP_CONFIG_MAP_NAME, hadoopConfMapName) +when(kubernetesConf.sparkConf).thenReturn(sparkConf) + } + + test("bootstrapHadoopConf being applied to a base spark pod") { +when(hadoopBootstrapUtil.bootstrapHadoopConfDir( --- End diff -- Instead of mocks would we implement stub interfaces then? The purpose of the mocks is to make every unit test run only code that is in the class under test, plus utility methods. I think the only alternative to mocks is writing stub interfaces, and it's not clear how much more / less boilerplate that would be also. If the stub interfaces lead to less boilerplate than mocks then we should go with that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23013: [SPARK-25023] More detailed security guidance for K8S
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/23013 I'm curious as to how much specific K8s guidance we should be giving in Spark documentation. There is such a thing as over-documenting; though we should certainly bias towards writing more documentation. But users can research Kubernetes-specific concepts using the Kubernetes docs. For example, when using the volume mounts feature, the user has to provide a volume type. These volume type strings have very specific semantics and functionality that is well defined in the Kubernetes docs. So, if the user provides a `hostPath` volume, can we expect that the user would have already looked at Kubernetes documentation to know what those kinds of volumes can do?
[GitHub] spark issue #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerberos Suppo...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22760 This looks good to me. @vanzin want to sign off?
[GitHub] spark pull request #22547: [SPARK-25528][SQL] data source V2 read side API r...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22547#discussion_r230973917 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousInputStream.scala --- @@ -46,17 +45,22 @@ import org.apache.spark.sql.types.StructType * scenarios, where some offsets after the specified initial ones can't be * properly read. */ -class KafkaContinuousReadSupport( +class KafkaContinuousInputStream( --- End diff -- Makes sense. I really consider this to be a blocker on getting this merged and approved. It's difficult to have confidence in a review over such a large change. Thoughts @cloud-fan @rdblue?
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r230973235 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/TableChange.java --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.types.DataType; + +/** + * TableChange subclasses represent requested changes to a table. These are passed to + * {@link TableCatalog#alterTable}. For example, + * + * import TableChange._ + * val catalog = source.asInstanceOf[TableSupport].catalog() + * catalog.alterTable(ident, + * addColumn("x", IntegerType), + * renameColumn("a", "b"), + * deleteColumn("c") + * ) + * + */ +public interface TableChange { --- End diff -- Would it be a valid operation to change the partitioning of the table without dropping the entire table and re-creating it? E.g. change the bucket size for such and such column. Seems pretty difficult to do in practice though since the underlying data layout would have to change as part of the modification. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
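For illustration, a partitioning change in the `TableChange` style could be expressed as another request object in the same sealed hierarchy. This variant is NOT part of the proposed API; it only sketches what such a request could carry, and the catalog would still have to decide whether rewriting the data layout is feasible:

```scala
// Hypothetical TableChange-style requests; illustrative names only.
sealed trait TableChangeSketch
case class AddColumnSketch(name: String, dataType: String) extends TableChangeSketch
// A partitioning change as a request object: the hard part (rewriting the
// underlying data layout) is deliberately out of scope for the message itself.
case class SetBucketingSketch(column: String, numBuckets: Int) extends TableChangeSketch
```

The API shape is cheap to extend this way; the open question raised above is whether any implementation could honor such a change without effectively recreating the table.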
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/21306#discussion_r230972839 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/Table.java --- @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalog.v2; + +import org.apache.spark.sql.types.StructType; + +import java.util.List; +import java.util.Map; + +/** + * Represents table metadata from a {@link TableCatalog} or other table sources. + */ +public interface Table { --- End diff -- The nomenclature here appears to conflict with @cloud-fan's refactor in https://github.com/apache/spark/pull/22547/files#diff-45399ef5eed5c873d5f12bf0f1671b8fR40. Maybe we can call this `TableMetadata` or `TableDescription`? Or perhaps we rename the other construct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r230965800 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/KubernetesFeaturesTestUtils.scala --- @@ -63,4 +63,66 @@ object KubernetesFeaturesTestUtils { def containerHasEnvVar(container: Container, envVarName: String): Boolean = { container.getEnv.asScala.exists(envVar => envVar.getName == envVarName) } + + def containerHasEnvVars(container: Container, envs: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](envs.toSet, + container.getEnv.asScala +.map { e => (e.getName, e.getValue) }.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def containerHasVolumeMounts(container: Container, vms: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](vms.toSet, + container.getVolumeMounts.asScala +.map { vm => (vm.getName, vm.getMountPath) }.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def podHasLabels(pod: Pod, labels: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](labels.toSet, pod.getMetadata.getLabels.asScala.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def podHasVolumes(pod: Pod, volumes: Seq[Volume]): Unit = { +assertHelper[Set[Volume]](volumes.toSet, pod.getSpec.getVolumes.asScala.toSet, + subsetOfElem[Set[Volume], Volume], "a subset of") + } + + // Mocking bootstrapHadoopConfDir + def hadoopConfBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-hconf", "true") + .endMetadata() +.build(), + inputPod.container) + + // Mocking bootstrapKerberosPod + def krbBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-kerberos", "true") + .endMetadata() +.build(), + inputPod.container) + + // Mocking bootstrapSparkUserPod + def 
userBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-user", "true") + .endMetadata() +.build(), + inputPod.container) + + def subsetOfElem[T <: Set[B], B <: Any]: (T, T) => Boolean = (a, b) => a.subsetOf(b) + def subsetOfTup[T <: Set[(B, B)], B <: Any]: (T, T) => Boolean = (a, b) => a.subsetOf(b) + + def assertHelper[T](con1: T, con2: T, + expr: (T, T) => Boolean = (a: T, b: T) => a == b, exprMsg: String = "equal to"): Unit = { +assert(expr(con1, con2), s"$con1 is not $exprMsg $con2 as expected") --- End diff -- I think here the simpler approach is preferred - on each assertion line I want to see right then and there what error message I'll get back, not having to go through a helper method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
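The inline-assertion style being argued for looks like this: the failure message is written at the call site, so a reader sees right there what a failure would report instead of tracing through a generic helper. The values are illustrative:

```scala
// Inline assertion with its message at the call site; no assertHelper needed.
object InlineAssertDemo extends App {
  val expected = Set("SPARK_USER" -> "alice")
  val actual   = Set("SPARK_USER" -> "alice", "HOME" -> "/home/alice")
  assert(expected.subsetOf(actual),
    s"$expected is not a subset of $actual as expected")
  println("ok")
}
```

The trade-off is a little repetition across tests in exchange for each assertion line being self-explanatory.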
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r230941006 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/hadooputils/HadoopBootstrapUtilSuite.scala --- @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features.hadooputils + +import java.io.File +import java.util.UUID + +import scala.collection.JavaConverters._ + +import com.google.common.io.Files +import io.fabric8.kubernetes.api.model._ +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.SparkPod +import org.apache.spark.deploy.k8s.features.KubernetesFeaturesTestUtils._ +import org.apache.spark.util.Utils + +class HadoopBootstrapUtilSuite extends SparkFunSuite with BeforeAndAfter{ + private val sparkPod = SparkPod.initialPod() + private val hadoopBootstrapUtil = new HadoopBootstrapUtil + private var tmpDir: File = _ + private var tmpFile: File = _ + + before { +tmpDir = Utils.createTempDir() +tmpFile = File.createTempFile(s"${UUID.randomUUID().toString}", ".txt", tmpDir) +Files.write("contents".getBytes, tmpFile) + } + + after { +tmpFile.delete() +tmpDir.delete() + } + + test("bootstrapKerberosPod with file location specified for krb5.conf file") { +val dtSecretName = "EXAMPLE_SECRET_NAME" +val dtSecretItemKey = "EXAMPLE_ITEM_KEY" +val userName = "SPARK_USER_NAME" +val fileLocation = Some(tmpFile.getAbsolutePath) +val stringPath = tmpFile.getName +val newKrb5ConfName = Some("/etc/krb5.conf") +val resultingPod = hadoopBootstrapUtil.bootstrapKerberosPod( + dtSecretName, + dtSecretItemKey, + userName, + fileLocation, + newKrb5ConfName, + None, + sparkPod) +val expectedVolumes = Seq( + new VolumeBuilder() +.withName(KRB_FILE_VOLUME) +.withNewConfigMap() + .withName(newKrb5ConfName.get) + .withItems(new KeyToPathBuilder() +.withKey(stringPath) +.withPath(stringPath) +.build()) +.endConfigMap() +.build(), + new VolumeBuilder() +.withName(SPARK_APP_HADOOP_SECRET_VOLUME_NAME) +.withNewSecret() + .withSecretName(dtSecretName) + .endSecret() +.build() +) +podHasVolumes(resultingPod.pod, expectedVolumes) +containerHasEnvVars(resultingPod.container, Map( + 
ENV_HADOOP_TOKEN_FILE_LOCATION -> s"$SPARK_APP_HADOOP_CREDENTIALS_BASE_DIR/$dtSecretItemKey", + ENV_SPARK_USER -> userName) +) +containerHasVolumeMounts(resultingPod.container, Map( + SPARK_APP_HADOOP_SECRET_VOLUME_NAME -> SPARK_APP_HADOOP_CREDENTIALS_BASE_DIR, + KRB_FILE_VOLUME -> (KRB_FILE_DIR_PATH + "/krb5.conf")) +) + } + + test("bootstrapKerberosPod with pre-existing configMap specified for krb5.conf file") { +val dtSecretName = "EXAMPLE_SECRET_NAME" --- End diff -- These can be constants at the top of the file or in a companion object. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r230942826 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/KubernetesFeaturesTestUtils.scala --- @@ -63,4 +63,66 @@ object KubernetesFeaturesTestUtils { def containerHasEnvVar(container: Container, envVarName: String): Boolean = { container.getEnv.asScala.exists(envVar => envVar.getName == envVarName) } + + def containerHasEnvVars(container: Container, envs: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](envs.toSet, + container.getEnv.asScala +.map { e => (e.getName, e.getValue) }.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def containerHasVolumeMounts(container: Container, vms: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](vms.toSet, + container.getVolumeMounts.asScala +.map { vm => (vm.getName, vm.getMountPath) }.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def podHasLabels(pod: Pod, labels: Map[String, String]): Unit = { +assertHelper[Set[(String, String)]](labels.toSet, pod.getMetadata.getLabels.asScala.toSet, + subsetOfTup[Set[(String, String)], String], "a subset of") + } + + def podHasVolumes(pod: Pod, volumes: Seq[Volume]): Unit = { +assertHelper[Set[Volume]](volumes.toSet, pod.getSpec.getVolumes.asScala.toSet, + subsetOfElem[Set[Volume], Volume], "a subset of") + } + + // Mocking bootstrapHadoopConfDir + def hadoopConfBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-hconf", "true") + .endMetadata() +.build(), + inputPod.container) + + // Mocking bootstrapKerberosPod + def krbBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-kerberos", "true") + .endMetadata() +.build(), + inputPod.container) + + // Mocking bootstrapSparkUserPod + def 
userBootPod(inputPod: SparkPod): SparkPod = +SparkPod( + new PodBuilder(inputPod.pod) +.editOrNewMetadata() + .addToLabels("bootstrap-user", "true") + .endMetadata() +.build(), + inputPod.container) + + def subsetOfElem[T <: Set[B], B <: Any]: (T, T) => Boolean = (a, b) => a.subsetOf(b) + def subsetOfTup[T <: Set[(B, B)], B <: Any]: (T, T) => Boolean = (a, b) => a.subsetOf(b) + + def assertHelper[T](con1: T, con2: T, + expr: (T, T) => Boolean = (a: T, b: T) => a == b, exprMsg: String = "equal to"): Unit = { +assert(expr(con1, con2), s"$con1 is not $exprMsg $con2 as expected") --- End diff -- This interaction seems really awkward to me. Why can't we just embed the error message directly into the assertion on each assertion statement? I'm also not familiar enough with the ScalaTest framework but perhaps we can explore other options the library has to offer. (In Java I'm used to using [AssertJ](https://github.com/joel-costigliola/assertj-core) for fluent assertions but I'm not sure if that will be of any help for Scala objects, particularly Scala collections without converting everything.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerbero...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22760#discussion_r230941669 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/hadooputils/HadoopBootstrapUtilSuite.scala --- @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features.hadooputils + +import java.io.File +import java.util.UUID + +import scala.collection.JavaConverters._ + +import com.google.common.io.Files +import io.fabric8.kubernetes.api.model._ +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.SparkPod +import org.apache.spark.deploy.k8s.features.KubernetesFeaturesTestUtils._ +import org.apache.spark.util.Utils + +class HadoopBootstrapUtilSuite extends SparkFunSuite with BeforeAndAfter{ + private val sparkPod = SparkPod.initialPod() + private val hadoopBootstrapUtil = new HadoopBootstrapUtil + private var tmpDir: File = _ + private var tmpFile: File = _ + + before { +tmpDir = Utils.createTempDir() +tmpFile = File.createTempFile(s"${UUID.randomUUID().toString}", ".txt", tmpDir) +Files.write("contents".getBytes, tmpFile) + } + + after { +tmpFile.delete() +tmpDir.delete() + } + + test("bootstrapKerberosPod with file location specified for krb5.conf file") { +val dtSecretName = "EXAMPLE_SECRET_NAME" --- End diff -- These can be constants in a companion object or at the top of the class. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22946 @cloud-fan for review.
[GitHub] spark issue #22946: [SPARK-25943][SQL] Fail if mismatching nested struct fie...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22946 Ok to test
[GitHub] spark issue #22897: [SPARK-25875][k8s] Merge code to set up driver command i...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22897 Ok, I am merging into master. Thanks!
[GitHub] spark pull request #22547: [SPARK-25528][SQL] data source V2 read side API r...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22547#discussion_r230505785 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousInputStream.scala --- @@ -46,17 +45,22 @@ import org.apache.spark.sql.types.StructType * scenarios, where some offsets after the specified initial ones can't be * properly read. */ -class KafkaContinuousReadSupport( +class KafkaContinuousInputStream( --- End diff -- +1 for this. A lot of the changes right now are for moving around the streaming code especially, which makes it harder to isolate just the proposed API for review. An alternative is to split this PR into separate commits that individually may not compile because of mismatched signatures, but that taken together would compile; each commit could then be reviewed on its own, first for the API and then for the implementation. For example I'd propose 3 PRs: * Batch reading, with a commit for the interface changes and a separate commit for the implementation changes * Micro Batch Streaming read, with a commit for the interface changes and a separate commit for the implementation changes * Continuous streaming read, similar to above Thoughts?
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230476657 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh --- @@ -96,22 +96,6 @@ case "$SPARK_K8S_CMD" in "$@" ) ;; - driver-py) --- End diff -- I think this is fine and what we want to do. But at some point we're going to want to make the API between spark submit and launched containers stable. Using this as an example, if a user upgraded their spark-submit version to 3.0 but didn't upgrade the version of Spark in their docker image, the docker container's command will attempt to look up these old environment variables that are no longer being set by spark submit. We should be thinking about making this contract stable from 3.0 onwards. For this coming release I think this is fine.
[GitHub] spark issue #22608: [SPARK-25750][K8S][TESTS] Kerberos Support Integration T...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22608 It depends on how we're getting the Hadoop images. If we're building everything from scratch, we could run everything in one container - though having a container run more than one process simultaneously isn't common. It's more common to have a single container have a single responsibility / process. But you can group multiple containers that have related responsibilities into a single pod, hence we'll use 3 containers in one pod here. If we're pulling Hadoop images from elsewhere - which it sounds like we aren't doing in the Apache ecosystem in general though - then we'd need to build our own separate image for the KDC anyway. Multiple containers in the same pod all share the same resource footprint and limit boundaries.
[GitHub] spark issue #22608: [SPARK-25750][K8S][TESTS] Kerberos Support Integration T...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22608 > You seem to be running different pods for KDC, NN and DN. Is there an advantage to that? > > Seems to me you could do the same thing with a single pod and simplify things here. > > The it README also mentions "3 CPUs and 4G of memory". Is that still enough with these new things that are run? Think we want different images for each, but that's fine - just run a pod with those three containers in it.
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230181127 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder} + +import org.apache.spark.deploy.k8s._ +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit._ +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.util.Utils + +/** + * Creates the driver command for running the user app, and propagates needed configuration so + * executors can also find the app code. 
+ */ +private[spark] class DriverCommandFeatureStep(conf: KubernetesConf[KubernetesDriverSpecificConf]) + extends KubernetesFeatureConfigStep { + + private val driverConf = conf.roleSpecificConf + + override def configurePod(pod: SparkPod): SparkPod = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(_)) | None => +configureForJava(pod) + + case Some(PythonMainAppResource(res)) => +configureForPython(pod, res) + + case Some(RMainAppResource(res)) => +configureForR(pod, res) +} + } + + override def getAdditionalPodSystemProperties(): Map[String, String] = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(res)) => +additionalJavaProperties(res) + + case Some(PythonMainAppResource(res)) => +additionalPythonProperties(res) + + case Some(RMainAppResource(res)) => +additionalRProperties(res) + + case None => +Map.empty --- End diff -- Sounds great, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230167668 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder} + +import org.apache.spark.deploy.k8s._ +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit._ +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.util.Utils + +/** + * Creates the driver command for running the user app, and propagates needed configuration so + * executors can also find the app code. 
+ */ +private[spark] class DriverCommandFeatureStep(conf: KubernetesConf[KubernetesDriverSpecificConf]) + extends KubernetesFeatureConfigStep { + + private val driverConf = conf.roleSpecificConf + + override def configurePod(pod: SparkPod): SparkPod = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(_)) | None => +configureForJava(pod) + + case Some(PythonMainAppResource(res)) => +configureForPython(pod, res) + + case Some(RMainAppResource(res)) => +configureForR(pod, res) +} + } + + override def getAdditionalPodSystemProperties(): Map[String, String] = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(res)) => +additionalJavaProperties(res) + + case Some(PythonMainAppResource(res)) => +additionalPythonProperties(res) + + case Some(RMainAppResource(res)) => +additionalRProperties(res) + + case None => +Map.empty --- End diff -- We can create a new main app resource type which encodes NO_RESOURCE, and then make the main app resource non-optional in KubernetesConf. Think it would simplify some of the case matching we do here. Thoughts? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
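The proposal above can be sketched as follows. This is purely illustrative, with hypothetical names rather than Spark's actual classes: fold `SparkLauncher.NO_RESOURCE` into a sealed hierarchy as its own case, so the main app resource becomes non-optional and the match no longer needs `Some`/`None` arms:

```scala
// Hypothetical sealed ADT; NoResource stands in for SparkLauncher.NO_RESOURCE.
sealed trait MainAppResource
case object NoResource extends MainAppResource
final case class JavaMainAppResource(res: String) extends MainAppResource
final case class PythonMainAppResource(res: String) extends MainAppResource
final case class RMainAppResource(res: String) extends MainAppResource

// The match is exhaustive over the sealed trait without any Option wrapping.
def resourceType(r: MainAppResource): String = r match {
  case NoResource | _: JavaMainAppResource => "java"
  case _: PythonMainAppResource            => "python"
  case _: RMainAppResource                 => "r"
}
```

With this shape the compiler flags any unhandled resource kind, which is the simplification of the case matching being discussed.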
[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22909 Maybe @srowen on the SBT side?
[GitHub] spark issue #22145: [SPARK-25152][K8S] Enable SparkR Integration Tests for K...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22145 Just wanted to ping on this - how close are we to getting this working on CI?
[GitHub] spark pull request #22904: [SPARK-25887][K8S] Configurable K8S context suppo...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22904#discussion_r230127363 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -23,6 +23,18 @@ import org.apache.spark.internal.config.ConfigBuilder private[spark] object Config extends Logging { + val KUBERNETES_CONTEXT = +ConfigBuilder("spark.kubernetes.context") + .doc("The desired context from your K8S config file used to configure the K8S " + +"client for interacting with the cluster. Useful if your config file has " + +"multiple clusters or user identities defined. The client library used " + +"locates the config file via the KUBECONFIG environment variable or by defaulting " + +"to .kube/config under your home directory. If not specified then your current " + +"context is used. You can always override specific aspects of the config file " + +"provided configuration using other Spark on K8S configuration options.") + .stringConf + .createWithDefault(null) --- End diff -- Default of `null` seems strange, surely we can make this `Optional` and just handle `empty` in the places that use this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
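A standalone sketch of the `Option`-based alternative being suggested (a simplified stand-in for a config entry, not Spark's actual `ConfigBuilder` class; in Spark this pattern is typically expressed with `stringConf.createOptional` rather than a `null` default):

```scala
// Simplified stand-in for an optional config entry with no value by default.
final case class ConfigEntry[T](key: String, default: Option[T])

val kubernetesContext = ConfigEntry[String]("spark.kubernetes.context", None)

// Callers handle absence explicitly instead of null-checking:
def resolveContext(configured: Option[String]): String =
  configured.getOrElse("<current kubeconfig context>")
```

The `None` case is handled in one place with `getOrElse`, so no consumer ever sees a `null`.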
[GitHub] spark issue #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerberos Suppo...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22760 In that case @ifilonenko can we add said test here? Seems like the right place to do it.
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230119272 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder} + +import org.apache.spark.deploy.k8s._ +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit._ +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.util.Utils + +/** + * Creates the driver command for running the user app, and propagates needed configuration so + * executors can also find the app code. 
+ */ +private[spark] class DriverCommandFeatureStep(conf: KubernetesConf[KubernetesDriverSpecificConf]) + extends KubernetesFeatureConfigStep { + + private val driverConf = conf.roleSpecificConf + + override def configurePod(pod: SparkPod): SparkPod = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(_)) | None => --- End diff -- Again - don't match on options, use the option functional primitives. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230115856 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala --- @@ -47,10 +48,24 @@ private[spark] class BasicDriverFeatureStep( // Memory settings private val driverMemoryMiB = conf.get(DRIVER_MEMORY) + + // The memory overhead factor to use. If the user has not set it, then use a different + // value for non-JVM apps. This value is propagated to executors. + private val overheadFactor = conf.roleSpecificConf.mainAppResource match { +case Some(_: NonJVMResource) => --- End diff -- I was under the impression that generally we don't want to match against option types - instead we should be using `option.map.getOrElse`? More just my impression of the Scala idiomatic style than anything.
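The idiom being suggested can be sketched standalone (the values here are hypothetical, not the real memory settings): instead of pattern-matching on the `Option`, fold over it with `filter`/`map`/`getOrElse`:

```scala
// Hypothetical inputs standing in for APP_RESOURCE_TYPE and memory settings.
val appResourceType: Option[String] = Some("python")
val baseMemoryMiB    = 1024
val pysparkMemoryMiB = 512

// Pattern-match style, as in the diff:
val viaMatch = appResourceType match {
  case Some("python") => baseMemoryMiB + pysparkMemoryMiB
  case _              => baseMemoryMiB
}

// Combinator style: no Some/None cases, absence handled by getOrElse.
val viaFold = appResourceType
  .filter(_ == "python")
  .map(_ => baseMemoryMiB + pysparkMemoryMiB)
  .getOrElse(baseMemoryMiB)
```

Both compute the same result; the combinator form avoids destructuring the `Option` and keeps the default in a single `getOrElse`.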
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230119092 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.features + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder} + +import org.apache.spark.deploy.k8s._ +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit._ +import org.apache.spark.internal.config._ +import org.apache.spark.launcher.SparkLauncher +import org.apache.spark.util.Utils + +/** + * Creates the driver command for running the user app, and propagates needed configuration so + * executors can also find the app code. 
+ */ +private[spark] class DriverCommandFeatureStep(conf: KubernetesConf[KubernetesDriverSpecificConf]) + extends KubernetesFeatureConfigStep { + + private val driverConf = conf.roleSpecificConf + + override def configurePod(pod: SparkPod): SparkPod = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(_)) | None => +configureForJava(pod) + + case Some(PythonMainAppResource(res)) => +configureForPython(pod, res) + + case Some(RMainAppResource(res)) => +configureForR(pod, res) +} + } + + override def getAdditionalPodSystemProperties(): Map[String, String] = { +driverConf.mainAppResource match { + case Some(JavaMainAppResource(res)) => +additionalJavaProperties(res) + + case Some(PythonMainAppResource(res)) => +additionalPythonProperties(res) + + case Some(RMainAppResource(res)) => +additionalRProperties(res) + + case None => +Map.empty --- End diff -- Is this for `SparkLauncher.NO_RESOURCE`? Or otherwise should this even be a valid state? If it isn't then we should throw an exception here. I'm wondering if we should encode `SparkLauncher.NO_RESOURCE` as its own resource type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22897: [SPARK-25875][k8s] Merge code to set up driver co...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22897#discussion_r230115469 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala --- @@ -58,16 +58,13 @@ private[spark] class BasicExecutorFeatureStep( (kubernetesConf.get(MEMORY_OVERHEAD_FACTOR) * executorMemoryMiB).toInt, MEMORY_OVERHEAD_MIN_MIB)) private val executorMemoryWithOverhead = executorMemoryMiB + memoryOverheadMiB - private val executorMemoryTotal = kubernetesConf.sparkConf -.getOption(APP_RESOURCE_TYPE.key).map{ res => - val additionalPySparkMemory = res match { -case "python" => - kubernetesConf.sparkConf -.get(PYSPARK_EXECUTOR_MEMORY).map(_.toInt).getOrElse(0) -case _ => 0 - } -executorMemoryWithOverhead + additionalPySparkMemory - }.getOrElse(executorMemoryWithOverhead) + private val executorMemoryTotal = kubernetesConf.get(APP_RESOURCE_TYPE) match { +case Some("python") => --- End diff -- I was under the impression that generally we don't want to match against option types - instead we should be using `option.map.getOrElse`? More just my impression of the Scala idiomatic style than anything.
[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22909 Looks good to me but would like +1 from someone more familiar with SBT.
[GitHub] spark pull request #22909: [SPARK-25897][k8s] Hook up k8s integration tests ...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22909#discussion_r230113708 --- Diff: resource-managers/kubernetes/integration-tests/pom.xml --- @@ -145,14 +145,10 @@ test +none test - --- End diff -- Do we not need this anymore because the integration test is hidden behind the profile?
[GitHub] spark pull request #22608: [SPARK-25750][K8S][TESTS] Kerberos Support Integr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22608#discussion_r230109316 --- Diff: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/kerberos/KerberosPVWatcherCache.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.k8s.integrationtest.kerberos + +import io.fabric8.kubernetes.api.model.{PersistentVolume, PersistentVolumeClaim} +import io.fabric8.kubernetes.client.{KubernetesClientException, Watch, Watcher} +import io.fabric8.kubernetes.client.Watcher.Action +import org.scalatest.Matchers +import org.scalatest.concurrent.Eventually +import scala.collection.JavaConverters._ + +import org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.{INTERVAL, TIMEOUT} +import org.apache.spark.internal.Logging + +/** + * This class is responsible for ensuring that the persistent volume claims are bounded + * to the correct persistent volume and that they are both created before launching the --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22608: [SPARK-25750][K8S][TESTS] Kerberos Support Integr...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22608#discussion_r230109287 --- Diff: resource-managers/kubernetes/integration-tests/kerberos-yml/kerberos-set.yml --- @@ -0,0 +1,49 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: kerberos
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      name: hdfs-kerberos
+      kerberosService: kerberos
+      job: kerberostest
+  template:
+    metadata:
+      annotations:
+        pod.beta.kubernetes.io/hostname: kerberos
+      labels:
+        name: hdfs-kerberos
+        kerberosService: kerberos
+        job: kerberostest
+    spec:
+      containers:
+      - command: ["sh"]
+        args: ["/start-kdc.sh"]
+        name: kerberos
+        imagePullPolicy: IfNotPresent
+        volumeMounts:
+        - mountPath: /var/keytabs
+          name: kerb-keytab
+      restartPolicy: Always
+      volumes:
+      - name: kerb-keytab
+        persistentVolumeClaim:
--- End diff -- +1
[GitHub] spark issue #22760: [SPARK-25751][K8S][TEST] Unit Testing for Kerberos Suppo...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22760 I know that `HadoopBootstrapUtil` is being reworked soon - is it worth adding a unit test for that class here? @vanzin
[GitHub] spark issue #22805: [SPARK-25809][K8S][TEST] New K8S integration testing bac...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22805 Yes, this can be merged. Going into master now.
[GitHub] spark issue #22911: [SPARK-25815][k8s] Support kerberos in client mode, keyt...
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22911 @ifilonenko @skonto
[GitHub] spark pull request #22805: [SPARK-25809][K8S][TEST] New K8S integration test...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22805#discussion_r229512765 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala --- @@ -63,6 +66,8 @@ private[spark] object SparkKubernetesClientFactory { .getOption(s"$kubernetesAuthConfPrefix.$CLIENT_CERT_FILE_CONF_SUFFIX") val dispatcher = new Dispatcher( ThreadUtils.newDaemonCachedThreadPool("kubernetes-dispatcher")) + +// TODO Create builder in a way that respects configurable context --- End diff -- Can we add a Spark ticket here?
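On the TODO itself, one possible shape for the fix is a sketch only, under the assumption that fabric8's `Config.autoConfigure` is acceptable here and that a context name would arrive via some hypothetical setting such as `spark.kubernetes.context` (this is not the merged implementation):

```scala
import io.fabric8.kubernetes.client.{Config, DefaultKubernetesClient}

// Sketch: resolve the base client config from a named kubeconfig context
// when one is configured, falling back to the current context otherwise.
// Spark's own auth overrides would still be applied on top of baseConfig.
def buildClient(contextOpt: Option[String]): DefaultKubernetesClient = {
  // Config.autoConfigure(null) uses the current context from ~/.kube/config.
  val baseConfig: Config = Config.autoConfigure(contextOpt.orNull)
  new DefaultKubernetesClient(baseConfig)
}
```

The key design point is that the context is chosen before Spark's explicit overrides (master URL, auth files) are layered on, so an explicitly configured value always wins over the kubeconfig.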
[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22146 OK, I'm going to merge this into master. Thanks, everyone, for the feedback. Follow-up discussions around validation can be addressed in follow-up patches.
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r229367166 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala --- @@ -74,8 +74,16 @@ private[spark] object Constants { val ENV_R_PRIMARY = "R_PRIMARY" val ENV_R_ARGS = "R_APP_ARGS" + // Pod spec templates + val EXECUTOR_POD_SPEC_TEMPLATE_FILE_NAME = "pod-spec-template.yml" + val EXECUTOR_POD_SPEC_TEMPLATE_MOUNTHPATH = "/opt/spark/pod-template" + val POD_TEMPLATE_VOLUME = "podspec-volume" --- End diff -- Ping here
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r229367415 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesDriverBuilder.scala --- @@ -16,11 +16,17 @@ */ package org.apache.spark.deploy.k8s.submit -import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesDriverSpec, KubernetesDriverSpecificConf, KubernetesRoleSpecificConf} -import org.apache.spark.deploy.k8s.features.{BasicDriverFeatureStep, DriverKubernetesCredentialsFeatureStep, DriverServiceFeatureStep, EnvSecretsFeatureStep, LocalDirsFeatureStep, MountSecretsFeatureStep, MountVolumesFeatureStep} +import java.io.File + +import io.fabric8.kubernetes.client.KubernetesClient + +import org.apache.spark.{SparkConf, SparkException} +import org.apache.spark.deploy.k8s._ +import org.apache.spark.deploy.k8s.features._ --- End diff -- Ping here
[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22146 retest this please
[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22146 @shaneknapp can we get help diagnosing the permission denied error on the Jenkins box?
[GitHub] spark issue #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/22146 Retest this please
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r229024855 --- Diff: docs/running-on-kubernetes.md --- @@ -799,4 +815,168 @@ specific to Spark on Kubernetes. This sets the major Python version of the docker image used to run the driver and executor containers. Can either be 2 or 3.

New configuration entries:

| Property | Default | Meaning |
|---|---|---|
| `spark.kubernetes.driver.podTemplateFile` | (none) | Specify the local file that contains the driver [pod template](#pod-template). For example `spark.kubernetes.driver.podTemplateFile=/path/to/driver-pod-template.yaml` |
| `spark.kubernetes.executor.podTemplateFile` | (none) | Specify the local file that contains the executor [pod template](#pod-template). For example `spark.kubernetes.executor.podTemplateFile=/path/to/executor-pod-template.yaml` |

### Pod template properties

See the table below for the full list of pod specifications that will be overwritten by Spark.

### Pod Metadata

| Pod metadata key | Modified value | Description |
|---|---|---|
| name | Value of `spark.kubernetes.driver.pod.name` | The driver pod name will be overwritten with either the configured or default value of `spark.kubernetes.driver.pod.name`. The executor pod names will be unaffected. |
| namespace | Value of `spark.kubernetes.namespace` | Spark makes strong assumptions about the driver and executor namespaces. Both driver and executor namespaces will be replaced by either the configured or default spark conf value. |
| labels | Adds the labels from `spark.kubernetes.{driver,executor}.label.*` | Spark will add additional labels specified by the spark configuration. |
| annotations | Adds the annotations from `spark.kubernetes.{driver,executor}.annotation.*` | Spark will add additional annotations specified by the spark configuration. |

### Pod Spec

| Pod spec key | Modified value | Description |
|---|---|---|
| imagePullSecrets | Adds image pull secrets from `spark.kubernetes.container.image.pullSecrets` | Additional pull secrets will be added from the spark configuration to both executor pods. |
| nodeSelector | Adds node selectors from `spark.kubernetes.node.selector.*` | Additional node selectors will be added from the spark configuration to both executor pods. |
| restartPolicy | `"never"` | Spark assumes that both drivers and executors never restart. |
| serviceAccount | Value of `spark.kubernetes.authenticate.driver.serviceAccountName` | Spark will override `serviceAccount` with the value of the spark configuration for only driver pods, and only if the spark configuration is specified. Executor pods will remain unaffected. |
| serviceAccountName | Value of `spark.kubernetes.authenticate.driver.serviceAccountName` | Spark will override `serviceAccountName` with the value of the spark configuration for only driver pods, and only if the spark configuration is specified. Executor pods will remain unaffected. |
| volumes | Adds volumes from `spark.kubernetes.{driver,executor}.volumes.[VolumeType].[VolumeName].mount.path` | Spark will add volumes as specified by the spark conf, as well as additional volumes necessary for passing |

--- End diff -- One way we can avoid conflicting volumes entirely is to randomize the names of the volumes added by features, e.g. by appending a UUID or at least a large random integer. I think keeping running documentation on every volume we add from features is too much overhead. If we run into these conflicts often, we can do this, but I don't think it needs to block merging. Either way, the validation piece can be done separately from this PR, and I wouldn't consider that documentation as blocking this merge. Thoughts?
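The randomization suggested above can be sketched as a small helper (hypothetical, not part of the PR). Note that Kubernetes volume names must be DNS-1123 labels of at most 63 characters, so the base name is truncated before the suffix is appended:

```scala
import java.util.UUID

object VolumeNames {
  // Append a short random token so a feature-added volume cannot collide
  // with a volume of the same name in a user-supplied pod template.
  def uniquify(base: String): String = {
    val suffix = UUID.randomUUID().toString.take(8)
    // Leave room for the 9-character "-<suffix>" within the 63-char limit.
    s"${base.take(54)}-$suffix"
  }
}
```

For example, `VolumeNames.uniquify("spark-local-dir-1")` might yield something like `spark-local-dir-1-3f2a9c1d`, which a user template is very unlikely to declare.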
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r229013867 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala --- @@ -74,8 +74,16 @@ private[spark] object Constants { val ENV_R_PRIMARY = "R_PRIMARY" val ENV_R_ARGS = "R_APP_ARGS" + // Pod spec templates + val EXECUTOR_POD_SPEC_TEMPLATE_FILE_NAME = "pod-spec-template.yml" + val EXECUTOR_POD_SPEC_TEMPLATE_MOUNTHPATH = "/opt/spark/pod-template" --- End diff -- Quick ping here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/22146#discussion_r229013753 --- Diff: docs/running-on-kubernetes.md --- @@ -185,6 +185,22 @@ To use a secret through an environment variable use the following options to the --conf spark.kubernetes.executor.secretKeyRef.ENV_NAME=name:key ``` +## Pod Template +Kubernetes allows defining pods from [template files](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/#pod-templates). +Spark users can similarly use template files to define the driver or executor pod configurations that Spark configurations do not support. +To do so, specify the spark properties `spark.kubernetes.driver.podTemplateFile` and `spark.kubernetes.executor.podTemplateFile` +to point to local files accessible to the `spark-submit` process. To allow the driver pod access the executor pod template --- End diff -- I don't think we considered this, but it's an interesting proposal. I think that can be a follow-up feature if requested.
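To make the feature under discussion concrete, a minimal executor pod template might look like the following (the file name, labels, and settings here are invented for illustration, not taken from the PR):

```yaml
# executor-pod-template.yaml (hypothetical example)
apiVersion: v1
kind: Pod
metadata:
  labels:
    team: data-platform
spec:
  containers:
    - name: executor
      # Container-level settings that Spark configuration does not expose
      securityContext:
        runAsNonRoot: true
```

It would then be passed with `--conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-pod-template.yaml`; fields that Spark itself manages, such as the pod name and namespace, are overwritten by Spark regardless of what the template declares.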