Thanks, everyone. While working on tagging I stumbled upon another setback. There are about 5,000 regexes involved, a couple hundred of which use variable-length lookbehind (originally these ran on a JVM). To use them from a Python/PySpark UDF, we need to either rewrite those rules so they work with Python's regex engine or move the tagging to a Scala/Java-based implementation.
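To illustrate the "rewrite the rules" option: Python's standard-library `re` module rejects lookbehinds whose alternatives have different lengths (the third-party `regex` package does support them). One mechanical rewrite is to consume the lookbehind as a non-capturing group and pull the intended match out of a capturing group. The pattern below is a hypothetical stand-in, not one of the actual rules:

```python
import re

# A JVM regex engine accepts variable-width lookbehind such as
# r'(?<=USD|EURO)\d+', but Python's stdlib re raises
# "look-behind requires fixed-width pattern" for it.
try:
    re.compile(r"(?<=USD|EURO)\d+")
    stdlib_supports_it = True
except re.error:
    stdlib_supports_it = False

# Mechanical rewrite: consume the former lookbehind as a non-capturing
# group and capture the part the rule was really meant to match.
pattern = re.compile(r"(?:USD|EURO)(\d+)")

def first_match(text):
    m = pattern.search(text)
    return m.group(1) if m else None
```

One caveat: unlike a true lookbehind, the rewritten prefix consumes characters, so overlapping matches behave differently; rules that rely on zero-width context need case-by-case review.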
Does anyone have any experience with variable-length lookbehind (quantifiers or alternations of variable length) in Python/PySpark? Any suggestions are much appreciated!

Thanks,
-Rishi

On Thu, May 14, 2020 at 2:57 PM Netanel Malka <netanel...@gmail.com> wrote:

> For Elasticsearch you can use the official elasticsearch-hadoop connector:
> https://www.elastic.co/what-is/elasticsearch-hadoop
>
> Elastic Spark connector docs:
> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
>
> On Thu, May 14, 2020, 21:14 Amol Umbarkar <amolumbar...@gmail.com> wrote:
>
>> Check out Spark NLP for tokenization. I am not sure about Solr or
>> ElasticSearch, though.
>>
>> On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>
>>> This is great, thank you Zhang & Amol!
>>>
>>> Yes, we can have multiple tags per row, and multiple regexes applied to
>>> a single row as well. Would you have any example of working with Spark
>>> and search engines like Solr or ElasticSearch? Does Spark ML provide the
>>> expected tokenization support? (I have yet to try Spark ML; I'm still a
>>> beginner.)
>>>
>>> Is there any other reference material you found useful while working on
>>> a similar problem? I appreciate all the help!
>>>
>>> Thanks,
>>> -Rishi
>>>
>>> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:
>>>
>>>> Rishi,
>>>> Just adding to Zhang's questions:
>>>>
>>>> Are you expecting multiple tags per row?
>>>> Do you check multiple regexes for a single tag?
>>>>
>>>> Let's say you had only one tag; then theoretically you should be able
>>>> to do this:
>>>>
>>>> 1. Remove stop words or any irrelevant content.
>>>> 2. Split the text into equal-sized chunk columns (e.g., if the max
>>>>    length is 1000 chars, split into 20 columns of 50 chars).
>>>> 3. Distribute the work for each column, producing a boolean
>>>>    (true/false) result for the single tag.
>>>> 4. Merge the 20 resulting columns.
>>>> 5. Repeat for the other tags, or run steps 3 and 4 for them in
>>>>    parallel.
>>>>
>>>> Note on step 3: if you expect a single tag per row, you can repeat
>>>> step 3 column by column and skip rows that were already tagged in a
>>>> prior step.
>>>>
>>>> Secondly, if you expect some similarity in the text, you could work on
>>>> the unique text values only (this might require a shuffle, hence it is
>>>> expensive) and then join the result back to the original data, using a
>>>> hash of some kind as the join key. I would take this approach only if
>>>> the chance of similar text is very high (it could be in your case,
>>>> since this is transactional data).
>>>>
>>>> Not the full answer to your question, but I hope this helps you
>>>> brainstorm further.
>>>>
>>>> Thanks,
>>>> Amol
>>>>
>>>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>>> Thanks, Zhang! Please find the details below:
>>>>>
>>>>> Number of rows: ~25B; row size is around ~3-5 MB (the data is in
>>>>> Parquet format, so we only need to worry about the columns being
>>>>> tagged).
>>>>>
>>>>> Average length of the text to be parsed: ~300 characters.
>>>>>
>>>>> Unfortunately I don't have sample data or regexes that I can share
>>>>> freely. As for the data being parsed: assume these are purchases made
>>>>> online and we are trying to parse the transaction details, e.g.
>>>>> purchases made on Amazon can be tagged to Amazon as well as to other
>>>>> vendors.
>>>>>
>>>>> I appreciate your response!
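Amol's chunked-scan steps quoted above can be sketched in plain Python. The pattern is a hypothetical stand-in for a single tag's rule; one detail added here is an overlap between chunks of at least the longest expected match, so a match straddling a chunk boundary (e.g. `ama|zon`) is not lost by a naive fixed split. In Spark, a per-row function like this would be wrapped in a UDF:

```python
import re

# Hypothetical single-tag rule; the real rules come from the rule set.
PATTERN = re.compile(r"amazon|amzn", re.IGNORECASE)

def chunks(text, size=50, overlap=10):
    # Step 2: equal-sized chunks, each extended by `overlap` characters
    # so a match that straddles a boundary is still fully inside one chunk.
    return [text[i:i + size + overlap] for i in range(0, len(text), size)]

def has_tag(text):
    # Steps 3-4: evaluate each chunk independently (distributable), then
    # merge the per-chunk booleans with a logical OR.
    return any(PATTERN.search(c) for c in chunks(text))
```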
>>>>>
>>>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>>>
>>>>>> May I get some requirement details?
>>>>>>
>>>>>> Such as:
>>>>>> 1. The row count and the size of one row
>>>>>> 2. The average length of the text to be parsed by the regexes
>>>>>> 3. A sample of the text format to be parsed
>>>>>> 4. A sample of the current regexes
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> -z
>>>>>>
>>>>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I have a tagging problem at hand where we currently use regular
>>>>>> > expressions to tag records. Is there a recommended way to
>>>>>> > distribute and tag? The data is about 10 TB.
>>>>>> >
>>>>>> > --
>>>>>> > Regards,
>>>>>> >
>>>>>> > Rishi Shah
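For the original question of distributing the tagging, a common PySpark pattern is to compile the rule set once at module import time, so each executor process pays the compilation cost once rather than per row, and wrap the matcher in a UDF. A minimal sketch in plain Python, with a hypothetical `RULES` dict standing in for the real ~5,000 rules:

```python
import re

# Hypothetical tag -> pattern rules standing in for the real rule set.
RULES = {
    "amazon":  r"\bamzn\b|amazon",
    "grocery": r"whole\s*foods|safeway",
}

# Compile once at import time: each Spark executor process then compiles
# the rule set a single time instead of once per row.
COMPILED = {tag: re.compile(pat, re.IGNORECASE) for tag, pat in RULES.items()}

def tag_text(text):
    """Return every tag whose pattern matches the transaction text."""
    return [tag for tag, pat in COMPILED.items() if pat.search(text)]

# In PySpark this would be registered along the lines of:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   tag_udf = udf(tag_text, ArrayType(StringType()))
#   df = df.withColumn("tags", tag_udf("description"))
```

Returning a list supports the multiple-tags-per-row case discussed above; rows with no match simply get an empty array.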