Hi All,

I've just encountered and worked around a problem that is pretty obscure and unlikely to affect many people, but I thought I'd better report it anyway.

All the data I'm using is inside Google Cloud Storage buckets (paths start with gs://), and I'm running Spark 3.5.0 locally (for testing; the real thing runs on serverless Dataproc) on a Windows 10 laptop. The job fails when reading metadata via the machine learning scripts.

The error is /org.apache.hadoop.shaded.com.google.re2j.PatternSyntaxException: error parsing regexp: invalid escape sequence: '\m'/

I tracked it down to /site-packages/pyspark/ml/util.py/ line 578

metadataPath = os.path.join(path, "metadata")

which seems innocuous, but because I'm on Windows, os.path.join appends a backslash separator, whereas the GCS path uses forward slashes like Linux. The resulting "\metadata" suffix is presumably where the invalid '\m' escape in the error comes from.
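The mismatch is easy to demonstrate without Windows, since ntpath is the Windows implementation of os.path (the bucket name below is made up):

```python
# Reproduce the separator mismatch on any platform.
import ntpath      # Windows-style os.path
import posixpath   # Linux/URI-style os.path

gcs_path = "gs://my-bucket/model"  # hypothetical GCS path

# On Windows, os.path.join inserts a backslash...
windows_result = ntpath.join(gcs_path, "metadata")
print(windows_result)   # gs://my-bucket/model\metadata

# ...whereas a GCS/HDFS URI needs a forward slash:
posix_result = posixpath.join(gcs_path, "metadata")
print(posix_result)     # gs://my-bucket/model/metadata
```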

I hacked the code to explicitly use a forward slash when the path contains gs:, and the job now runs successfully.
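For anyone hitting the same thing, the workaround amounts to something like the sketch below (the function name is mine, not PySpark's; I check for "://" rather than just gs: so it also covers hdfs://, s3a://, etc.):

```python
import os
import posixpath

def join_metadata_path(path):
    """Join 'metadata' onto path, keeping '/' separators for URI-style paths."""
    if "://" in path:
        # gs://, hdfs://, s3a://, ... - always use forward slashes.
        return posixpath.join(path, "metadata")
    # Plain local path - let the platform decide the separator.
    return os.path.join(path, "metadata")
```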

Richard
