[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello,

I'm very new to the Spark ecosystem, apologies if this question is a bit
simple.

I want to modify a custom fork of Spark to remove function support. For
example, I want to remove the query runner's ability to call reflect and
java_method. I saw that there exists a data structure in spark-sql called
FunctionRegistry that seems to act as an allowlist of the functions Spark
can execute. If I remove a function from the registry, is that enough to
guarantee that the function can "never" be invoked in Spark, or are there
other areas that would need to be changed as well?
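
To make the question concrete, here is a minimal Python sketch of the mental
model I have. It is a toy, not Spark's code: ToyRegistry, drop, and lookup are
hypothetical names, and the builders are plain lambdas rather than real
expressions.

```python
class ToyRegistry:
    """Hypothetical name -> builder map; a stand-in, not Spark's FunctionRegistry."""

    def __init__(self, builders):
        # Store names lower-cased so that lookups in this toy ignore case.
        self._builders = {name.lower(): fn for name, fn in builders.items()}

    def drop(self, name):
        # Returns True if the name was registered and has now been removed.
        return self._builders.pop(name.lower(), None) is not None

    def lookup(self, name, *args):
        # Resolve a function by name, failing if it is not registered.
        builder = self._builders.get(name.lower())
        if builder is None:
            raise LookupError(f"undefined function: {name}")
        return builder(*args)


registry = ToyRegistry({
    "upper": lambda s: s.upper(),
    "reflect": lambda cls, method: f"would call {cls}.{method}",
})

# Removing the entry makes any later lookup by that name fail.
registry.drop("reflect")
```

In this model a query that names reflect fails once the entry is gone. What
the sketch cannot show is whether any code path builds the underlying
expression without going through the name lookup, which is the part I am
unsure about.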

Thanks,
Matthew McMillian


should OutputCommitCoordinator fail stages for authorized committer failures when using s3a optimized committers?

2024-04-17 Thread Dylan McClelland
In https://issues.apache.org/jira/browse/SPARK-39195,
OutputCommitCoordinator was modified to fail a stage if an authorized
committer task fails.

We run our Spark jobs on a k8s cluster managed by Karpenter and mostly
built from spot instances. As a result, our executors are frequently
killed. With the above change, that leads to expensive stage failures at
the final write stage.

I think I understand why the above is needed when using
FileOutputCommitter, but it seems like we can handle things like the magic
s3a committer differently. For those, we could instead abort the task
attempt, which will discard the data files that are awaiting the final PUT
operation, and remove them from the list of files to be completed during
the job commit phase.

Does this seem reasonable? I think the change could go in
OutputCommitCoordinator (as a case in the taskCompleted block), but there
are other options as well.
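
To make the proposal concrete, here is a minimal Python sketch of the decision
I have in mind. It is a toy model, not Spark's OutputCommitCoordinator:
ToyCoordinator, abortable_committer, can_commit, and task_failed are
hypothetical names, and a real change would also have to ask the committer to
abort the failed attempt's pending uploads.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyCoordinator:
    # Hypothetical flag: True when a failed task attempt can be aborted safely.
    abortable_committer: bool
    # Maps partition id -> attempt number that currently holds the commit lock.
    authorized: Dict[int, int] = field(default_factory=dict)
    stage_failed: bool = False

    def can_commit(self, partition: int, attempt: int) -> bool:
        # The first attempt to ask for a partition becomes its authorized committer.
        holder = self.authorized.setdefault(partition, attempt)
        return holder == attempt

    def task_failed(self, partition: int, attempt: int) -> str:
        # Failures of attempts that do not hold the lock need no special handling.
        if self.authorized.get(partition) != attempt:
            return "ignored"
        if self.abortable_committer:
            # Proposed behaviour: release the lock so another attempt can commit.
            del self.authorized[partition]
            return "lock released"
        # Behaviour described above for SPARK-39195: fail the whole stage.
        self.stage_failed = True
        return "stage failed"


relaxed = ToyCoordinator(abortable_committer=True)
relaxed.can_commit(partition=0, attempt=1)
outcome_relaxed = relaxed.task_failed(partition=0, attempt=1)

strict = ToyCoordinator(abortable_committer=False)
strict.can_commit(partition=0, attempt=1)
outcome_strict = strict.task_failed(partition=0, attempt=1)
```

With the flag set, a later attempt for the same partition can still take the
commit lock after the authorized attempt dies; without it, the stage is marked
as failed.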

Any other ideas on how to stop individual failures of authorized committer
tasks from failing the whole job?