Hi All,
I am using PySpark to extract a value from a column based on a regex.
This is the regex I am using:
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
from pyspark.sql import functions as F
# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame([("[1234] [3333] [4444] [66]",),
                            ("abcd",)], ["stringValue"])
result = df.withColumn('extracted value',
                       F.regexp_extract(F.col('stringValue'),
                                        r'(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',
                                        1))
I have tried this with spark.sql as well. The output is empty.
I have tested this regex in an online regex tester, where it works fine,
but it does not work in Spark. I know Spark needs Java-based regex, so I
also tried escaping the pattern, which raised this exception:
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
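A possible lead, offered as a sketch rather than a confirmed fix: as far as I can tell, java.util.regex has no conditional construct of the form (?(1)yes|no), so Java reads "(?(" as the start of an inline flag group, which would match the "Unknown inline modifier" error. The snippet below uses Python's re module only to illustrate an alternation-based rewrite that avoids the conditional. The helper name extract_id and the OrderID sample string are hypothetical, and the pattern has not been verified in Spark here.

```python
import re

# Alternation instead of the (?(1)...) conditional, using only constructs
# that Java's regex engine is also expected to accept (an assumption).
# Group 1: UniqueID value when the string starts with "[OrderID: ".
# Group 2: content of the last "[...]" block otherwise.
PATTERN = re.compile(
    r"^\[OrderID:\s.*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*"
    r"|\[.*\]\s\[([a-z0-9A-Z]*)\].*"
)


def extract_id(value):
    """Hypothetical helper: return the captured ID, or "" if nothing matches."""
    m = PATTERN.search(value)
    if m is None:
        return ""
    # Only one of the two groups can participate in a given match.
    return m.group(1) or m.group(2) or ""


print(extract_id("[1234] [3333] [4444] [66]"))
print(extract_id("[OrderID: 12] [UniqueID: abc9]"))
print(extract_id("abcd"))
```

Note that the original call asks regexp_extract for group 1, which in the original pattern is only the optional "[OrderID: " prefix and not one of the ID groups. With a pattern like the one above, the IDs would be in groups 1 and 2, so in Spark one would presumably call regexp_extract once per group and combine the results.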
Can you please help here?
Kind Regards,
Sachit Murarka