[Spark SQL]: Python Data Source API and spark.sql.execution.pyspark.python

Ilya Thu, 24 Jul 2025 05:00:14 -0700

Dear Spark Community,

Why does the Python Data Source API (pyspark.sql.datasource.DataSource) not
use the "spark.sql.execution.pyspark.python" config, while UDFs do?


With a DataSource:
1) the executor always looks for "python3", ignoring the
"spark.sql.execution.pyspark.python" config
2) so the provided dependencies are not loaded
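
A minimal sketch of how the interpreter difference described above could be checked. The helper name describe_python_env is hypothetical and not part of the original code; it only uses the Python standard library and reports the interpreter path of the current process and whether a given module (here "ijson", as an assumption based on the error below) can be found.

```python
import importlib.util
import sys


def describe_python_env(module_name: str = "ijson") -> dict:
    """Report which interpreter is running and whether a module is importable.

    Hypothetical diagnostic helper, not part of the Spark API.
    """
    return {
        # Path of the interpreter executing this code, as reported by Python.
        "executable": sys.executable,
        # True if the module can be located on this interpreter's import path.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(describe_python_env())
```

Calling such a helper once from a UDF and once from the data source code, then comparing the two results, would show whether both run under the same interpreter.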

Using the same Docker image on both master and executors:
spark:4.0.0-scala2.13-java21-python3-ubuntu

spark.addArtifact("pyspark_pex_env.pex", file=True) # ijson included
spark.conf.set("spark.sql.execution.pyspark.python", "pyspark_pex_env.pex")

spark.dataSource.register(MyDataSource)

ModuleNotFoundError: No module named 'ijson'
2025-07-24T09:26:21.941789290Z  SQLSTATE: 38000

JVM stacktrace:
2025-07-24T09:26:21.941800171Z org.apache.spark.sql.AnalysisException
2025-07-24T09:26:21.941802296Z at org.apache.spark.sql.errors.QueryCompilationErrors$.pythonDataSourceError(QueryCompilationErrors.scala:2206)
2025-07-24T09:26:21.941804593Z at org.apache.spark.sql.execution.datasources.v2.python.UserDefinedPythonDataSourceRunner.receiveFromPython(UserDefinedPythonDataSource.scala:279)
2025-07-24T09:26:21.941806864Z at org.apache.spark.sql.execution.datasources.v2.python.UserDefinedPythonDataSourceRunner.receiveFromPython(UserDefinedPythonDataSource.scala:244)
2025-07-24T09:26:21.941808801Z at org.apache.spark.sql.execution.python.PythonPlannerRunner.runInPython(PythonPlannerRunner.scala:118)
2025-07-24T09:26:21.941824039Z at org.apache.spark.sql.execution.datasources.v2.python.UserDefinedPythonDataSource.createDataSourceInPython(UserDefinedPythonDataSource.scala:61)
2025-07-24T09:26:21.941826618Z at org.apache.spark.sql.execution.datasources.v2.python.PythonDataSourceV2.getOrCreateDataSourceInPython(PythonDataSourceV2.scala:50)
2025-07-24T09:26:21.941828912Z at org.apache.spark.sql.execution.datasources.v2.python.PythonDataSourceV2.inferSchema(PythonDataSourceV2.scala:56)
2025-07-24T09:26:21.941831393Z at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:96)
2025-07-24T09:26:21.941833963Z at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:147)
2025-07-24T09:26:21.941835876Z at org.apache.spark.sql.catalyst.analysis.ResolveDataSource$$anonfun$apply$1.$anonfun$applyOrElse$1(ResolveDataSource.scala:60)
2025-07-24T09:26:21.941837708Z at scala.Option.flatMap(Option.scala:283)

