Hi All,
I've just encountered and worked around a fairly obscure problem that
is unlikely to affect many people, but I thought I'd better report it
anyway.
All the data I'm using is in Google Cloud Storage buckets (paths start
with gs://), and I'm running Spark 3.5.0 locally on a Windows 10 laptop
(for testing; the real thing runs on serverless Dataproc). The job
fails when the pyspark.ml code reads metadata.
The error is:
    org.apache.hadoop.shaded.com.google.re2j.PatternSyntaxException:
    error parsing regexp: invalid escape sequence: '\m'
I tracked it down to site-packages/pyspark/ml/util.py, line 578:
    metadataPath = os.path.join(path, "metadata")
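which seems innocuous, but because I'm on Windows, os.path.join joins
the two parts with a backslash (shown as a double backslash when the
string is escaped), whilst the GCS path uses forward slashes like Linux.
The backslash followed by "metadata" is what ends up being parsed as the
invalid '\m' escape.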
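
To illustrate, the join behaviour can be reproduced on any OS, because
Python exposes the Windows rules as ntpath and the Linux rules as
posixpath (os.path is simply one or the other depending on platform).
The bucket and model names below are made up for the example.

```python
import ntpath      # Windows path rules, importable on any OS
import posixpath   # POSIX path rules, importable on any OS

# Hypothetical GCS location, used only for illustration.
model_path = "gs://my-bucket/models/my-model"

# What os.path.join does on Windows: the separator is a backslash.
windows_joined = ntpath.join(model_path, "metadata")

# What os.path.join does on Linux: the separator is a forward slash.
posix_joined = posixpath.join(model_path, "metadata")

print(windows_joined)
print(posix_joined)
```

Only the second form is a valid gs:// path; the first contains the
backslash that triggers the regexp error.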
I hacked the code to explicitly use a forward slash if the path
contains gs:, and the job now runs successfully.
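
For anyone hitting the same thing, here is a minimal sketch of that
kind of workaround. It is not the exact change made to util.py, and
join_metadata_path is a hypothetical helper name, not part of PySpark.
It assumes that any path starting with gs: should always be joined
with forward slashes.

```python
import os
import posixpath


def join_metadata_path(path: str) -> str:
    """Hypothetical helper: build the metadata path for a model location.

    GCS paths always use forward slashes, so join them with posixpath
    regardless of the local OS. Other paths keep the platform default.
    """
    if path.startswith("gs:"):
        return posixpath.join(path, "metadata")
    return os.path.join(path, "metadata")


# Example with a made-up bucket name.
gcs_metadata = join_metadata_path("gs://my-bucket/models/my-model")
print(gcs_metadata)
```

The same check could presumably be widened to other URI schemes, but
gs: is the only one I've needed.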
Richard