Parser error when running PySpark on Windows connecting to GCS

Richard Smith Sat, 04 Nov 2023 05:38:41 -0700

Hi All,

I've just encountered and worked around a problem that is pretty obscure and unlikely to affect many people, but I thought I'd better report it anyway.

All the data I'm using is inside Google Cloud Storage buckets (path starts with gs://) and I'm running Spark 3.5.0 locally (for testing, the real thing is on serverless Dataproc) on a Windows 10 laptop. The job fails when reading metadata via the machine learning scripts.

The error is /org.apache.hadoop.shaded.com.google.re2j.PatternSyntaxException: error parsing regexp: invalid escape sequence: '\m'/


I tracked it down to /site-packages/pyspark/ml/util.py/ line 578

metadataPath = os.path.join(path,"metadata")

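which seems innocuous, but because I'm on Windows, os.path.join is inserting a backslash as the separator, whilst the GCS path uses forward slashes like Linux. The resulting "\metadata" at the end of the path is what the parser then rejects as the invalid escape sequence '\m'.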

I hacked the code to explicitly use a forward slash if the path contains gs: and the job now runs successfully.
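
For anyone who wants to see the effect without a Windows machine, here is a minimal sketch of the problem and of the kind of workaround described above. It is not the actual change made to pyspark/ml/util.py: the function name metadata_path, the local_join parameter and the bucket name are all made up for illustration. It relies on ntpath, the Windows flavour of os.path in the standard library, which can be imported on any OS.

```python
import ntpath      # Windows implementation of os.path, importable on any OS
import os
import posixpath   # always joins with "/"


def metadata_path(path, local_join=os.path.join):
    """Hypothetical helper: build the metadata path for a model directory.

    Assumption: any path containing "gs:" is a GCS URI and must be joined
    with forward slashes whatever the local OS is. Everything else is
    joined with the local OS rules.
    """
    if "gs:" in path:
        return posixpath.join(path, "metadata")
    return local_join(path, "metadata")


# What os.path.join does on Windows: the separator is a backslash, so the
# path now ends in "\metadata".
broken = ntpath.join("gs://my-bucket/model", "metadata")

# With the check in place the GCS URI keeps forward slashes, while a local
# Windows path is still joined with backslashes.
fixed = metadata_path("gs://my-bucket/model", local_join=ntpath.join)
local = metadata_path("C:\\models\\m1", local_join=ntpath.join)
```

Here broken contains a backslash before "metadata", which matches the '\m' in the error, while fixed uses forward slashes throughout. Checking for a URI scheme more generally (for example "://") might be a better test than looking for gs: only, but I have not tried that.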


Richard
