As in Java/Scala, in Python you'll need to escape the backslashes with \\: writing "\\[" in the string literal means the regex engine sees "\[", a literal "[". I think you could also prefix the string literal with 'r' (a raw string) to disable Python's handling of escape sequences.
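For illustration, a minimal sketch of the escaping point using Python's re module (not Spark itself): a doubled backslash in a normal string and a single backslash in a raw string produce the same pattern.

```python
import re

# In a normal Python string the backslash must be doubled so the
# regex engine receives \[ (a literal opening bracket):
pattern_escaped = '\\[([0-9]+)\\]'

# A raw string leaves backslashes untouched, so one backslash suffices:
pattern_raw = r'\[([0-9]+)\]'

# Both literals denote the exact same pattern string.
assert pattern_escaped == pattern_raw

m = re.search(pattern_raw, '[1234] [3333]')
print(m.group(1))
```

The same doubling (or the r-prefix) applies to the pattern string you pass to regexp_extract, since Spark hands that string to the Java regex engine unchanged.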
On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka <connectsac...@gmail.com> wrote:
> Hi All,
>
> I am using Pyspark to get the value from a column on basis of regex.
>
> Following is the regex which I am using:
>
> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>
> df = spark.createDataFrame([("[1234] [3333] [4444] [66]",),
> ("abcd",)], ["stringValue"])
>
> result = df.withColumn('extracted value',
>     F.regexp_extract(F.col('stringValue'),
>     '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',
>     1))
>
> I have tried with spark.sql as well. It is giving empty output.
>
> I have tested this regex, it is working fine on an online regex tester.
> But it is not working in Spark. I know Spark needs Java-based regex,
> hence I tried escaping also; that gave an exception:
>
> : java.util.regex.PatternSyntaxException: Unknown inline modifier near
> index 21
>
> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>
> Can you please help here?
>
> Kind Regards,
> Sachit Murarka
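For what it's worth, beyond the escaping issue: the "Unknown inline modifier" error points at the conditional group (?(1)...|...), which PCRE and Python's re support but java.util.regex does not. A sketch of a conditional-free equivalent, unrolled into two plain alternatives (this rewrite is an assumption on my part, verified below only with Python's re, not with Spark):

```python
import re

# Hypothetical rewrite of the original pattern without the (?(1)...)
# conditional: spell out both cases as ordinary alternatives.
pattern = (r'^\[OrderID:\s.*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*'
           r'|\[.*\]\s\[([a-z0-9A-Z]*)\].*')

m = re.search(pattern, '[1234] [3333] [4444] [66]')
# The first alternative requires a literal "[OrderID:", so the sample
# input is matched by the second alternative, captured in group 2.
print(m.group(2))
```

Note the group numbering differs from the conditional version, so the index passed to regexp_extract would need to account for which alternative matched.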