The reason os.path.join appends a backslash on Windows is that this is how Windows paths are represented. GCS paths, however, being Hadoop Compatible File System (HCFS) paths, use forward slashes as on Linux. This mismatch causes problems when a Windows-built path is passed to a Spark job, *because Spark assumes that all paths use forward slashes*.
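The mismatch is easy to demonstrate on any platform, because Python exposes the Windows path logic as ntpath (which is what os.path resolves to on Windows) and the Linux-style logic as posixpath:

```python
import ntpath      # the Windows implementation of os.path
import posixpath   # the POSIX implementation, always uses '/'

path = "gs://etcbucket/data-file"

# What os.path.join does on Windows (os.path is ntpath there):
print(ntpath.join(path, "metadata"))     # gs://etcbucket/data-file\metadata

# What HCFS URIs such as gs:// actually expect:
print(posixpath.join(path, "metadata"))  # gs://etcbucket/data-file/metadata
```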
A way to avoid this problem is to build the path with forward slashes explicitly, rather than with the platform-dependent os.path functions. Note that os.path.normpath is not the right tool here: on Windows it converts forward slashes to backslashes, which makes things worse for gs:// URIs. Using posixpath (Python) or Hadoop's own Path class (Scala) keeps the separators correct on every platform.

*In Python*

import posixpath

# posixpath always joins with '/', regardless of platform
path = "gs://etcbucket/data-file"
metadata_path = posixpath.join(path, "metadata")
# Pass metadata_path to Spark

*In Scala*

import org.apache.hadoop.fs.Path

// Hadoop's Path understands URI-style paths and keeps '/' separators
val path = "gs://etcbucket/data-file"
val normalizedPath = new Path(path).toString
// Pass normalizedPath to Spark

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sat, 4 Nov 2023 at 12:28, Richard Smith <richard530sm...@btinternet.com.invalid> wrote:

> Hi All,
>
> I've just encountered and worked around a problem that is pretty obscure
> and unlikely to affect many people, but I thought I'd better report it
> anyway....
>
> All the data I'm using is inside Google Cloud Storage buckets (path starts
> with gs://) and I'm running Spark 3.5.0 locally (for testing, real thing is
> on serverless Dataproc) on a Windows 10 laptop. The job fails when reading
> metadata via the machine learning scripts.
>
> The error is *org.apache.hadoop.shaded.com.google.re2j.PatternSyntaxException:
> error parsing regexp: invalid escape sequence: '\m'*
>
> I tracked it down to *site-packages/pyspark/ml/util.py* line 578
>
> metadataPath = os.path.join(path, "metadata")
>
> which seems innocuous, but what's happening is that because I'm on Windows,
> os.path.join is appending a double backslash, whilst the gcs path uses
> forward slashes like Linux.
>
> I hacked the code to explicitly use forward slash if the path contains gs:
> and the job now runs successfully.
>
> Richard
>
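For what it's worth, Richard's workaround can be sketched like this (a hypothetical join_path helper for illustration, not the actual util.py patch):

```python
import os
import posixpath

def join_path(path: str, *parts: str) -> str:
    """Join path components, using '/' for URI-style paths.

    Hypothetical helper sketching the workaround: URI-style paths
    (gs://, s3a://, hdfs://, ...) always use forward slashes, so
    join them with posixpath; local paths keep os.path.join.
    """
    if "://" in path:
        return posixpath.join(path, *parts)
    return os.path.join(path, *parts)

print(join_path("gs://etcbucket/data-file", "metadata"))
# gs://etcbucket/data-file/metadata
```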