Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
This means there is something wrong with your regex vs what Java supports.
Do you mean "(?:" rather than "(?" around where the error is? This is not
related to Spark.
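
Concretely, `(?(1)...)` is a conditional group, a PCRE/.NET feature that
java.util.regex does not support, which is why Java reports "Unknown inline
modifier" right at the `(?(`. A rough conditional-free sketch of the same idea,
checked only with plain Python `re` here (the intended extraction target is my
guess from the samples, and the `extract` helper below just mimics
regexp_extract's empty-string-on-no-match behavior):

```python
import re

# Conditional-free rewrite: the first branch handles
# "[OrderID: ...] [UniqueID: xxx]" lines, the second keeps the original
# greedy "\[.*\]\s\[" fallback. The ID is capture group 1 in both branches.
PATTERN = r"^(?:\[OrderID:\s.*\]\s\[UniqueID:\s|\[.*\]\s\[)([a-zA-Z0-9]*)\]"

def extract(s: str) -> str:
    """Mimic regexp_extract: group 1 on a match, '' otherwise."""
    m = re.search(PATTERN, s)
    return m.group(1) if m else ""

print(extract("[OrderID: 9] [UniqueID: abc123] [x]"))  # abc123
print(extract("[11] [22] [33]"))                       # 33
print(extract("abcd"))                                 # (empty string)
```

In PySpark that would be F.regexp_extract(F.col('stringValue'), PATTERN, 1),
with group 1 now holding the ID in either branch (untested against Spark here,
but the pattern uses only constructs java.util.regex supports).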

On Wed, Dec 2, 2020 at 9:45 AM Sachit Murarka 
wrote:

> Hi Sean,
>
> Thanks for the quick response!
>
> I have tried prefixing the string literal with 'r'; that also gave an empty
> result:
> spark.sql(r"select regexp_extract('[11] [22]
> [33]','(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',1)
> as anyid").show()
>
> and, as I mentioned, when I use two backslashes it throws the following
> exception:
>   : java.util.regex.PatternSyntaxException: Unknown inline modifier near
> index 21
>
> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>
> Kind Regards,
> Sachit Murarka
>
>
> On Wed, Dec 2, 2020 at 9:07 PM Sean Owen  wrote:
>
>> As in Java/Scala, in Python you'll need to escape the backslashes with
>> \\. "\[" means just "[" in a string. I think you could also prefix the
>> string literal with 'r' to disable Python's handling of escapes.
>>
>> On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am using PySpark to extract a value from a column on the basis of a regex.
>>>
>>> The following is the regex I am using:
>>>
>>> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>>>
>>> from pyspark.sql import functions as F
>>>
>>> df = spark.createDataFrame([("[1234] [] [] [66]",),
>>> ("abcd",)], ["stringValue"])
>>>
>>> result = df.withColumn('extracted value',
>>> F.regexp_extract(F.col('stringValue'),
>>> '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',
>>> 1))
>>>
>>> I have tried with spark.sql as well; it also gives empty output.
>>>
>>> I have tested this regex and it works fine on an online regex tester,
>>> but it does not work in Spark. I know Spark needs Java-style regexes,
>>> so I also tried escaping; that gave this exception:
>>> : java.util.regex.PatternSyntaxException: Unknown inline modifier near
>>> index 21
>>>
>>> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>>>
>>>
>>> Can you please help here?
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>>
>>
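
One footnote on the escaping question, since it came up twice in this thread:
in Python specifically, unrecognized escape sequences such as \[ and \s keep
their backslash, so the single-backslash, raw-string, and double-backslash
spellings all denote the same pattern text. This is a plain-Python
illustration only; spark.sql string literals add their own SQL-level escaping
layer on top by default:

```python
import re

# Python leaves unrecognized escapes like \[ and \s untouched (newer
# Pythons emit a SyntaxWarning for them), so these three literals are equal:
plain = '\[UniqueID:\s'      # backslashes survive, with a warning on 3.12+
raw = r'\[UniqueID:\s'       # raw string: backslashes always literal
doubled = '\\[UniqueID:\\s'  # explicit escaping: always safe, warning-free

print(plain == raw == doubled)                   # True
print(bool(re.search(raw, "[UniqueID: abc]")))   # True
```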

