Check out Spark NLP for tokenization. I am not sure about Solr or
Elasticsearch, though.
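As a plain-Python sketch of the kind of regex tokenization Spark ML's RegexTokenizer performs (the function name and pattern here are illustrative, not Spark's API):

```python
import re

def tokenize(text, pattern=r"\W+"):
    """Lowercase, then split on non-word characters and drop empties,
    mirroring what a regex-based tokenizer does with a '\\W' pattern."""
    return [t for t in re.split(pattern, text.lower()) if t]

print(tokenize("Purchase at AMAZON.COM, order #123"))
# → ['purchase', 'at', 'amazon', 'com', 'order', '123']
```

In Spark ML the same idea is wrapped as a transformer that turns a text column into an array-of-tokens column.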

On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:

> This is great, thank you Zhang & Amol!!
>
> Yes, we can have multiple tags per row, and multiple regexes applied to a
> single row as well. Would you have any example of working with Spark and
> search engines like Solr or Elasticsearch? Does Spark ML provide
> tokenization support as expected (I am yet to try Spark ML, still a
> beginner)?
>
> Any other reference material you found useful while working on a similar
> problem? Appreciate all the help!
>
> Thanks,
> -Rishi
>
>
> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com>
> wrote:
>
>> Rishi,
>> Just adding to zhang's questions.
>>
>> Are you expecting multiple tags per row?
>> Do you check multiple regex for a single tag?
>>
>> Let's say you had only one tag; then theoretically you should be able to
>> do this -
>>
>> 1. Remove stop words or any irrelevant stuff
>> 2. Split the text into equal-sized chunk columns (e.g., if the max length
>> is 1000 chars, split into 20 columns of 50 chars)
>> 3. Distribute the work for each column, which would result in a binary
>> (true/false) value for a single tag
>> 4. Merge the 20 resulting columns
>> 5. Repeat steps 3 and 4 for the other tags, or run them in parallel
>>
>> Note on 3: If you expect a single tag per row, then you can repeat step 3
>> column by column and skip rows that already got tags in a prior step.
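The chunk-and-merge idea in steps 2-4 can be sketched in plain Python (names here are illustrative; in Spark each chunk would be a column processed independently, and note that a match spanning a chunk boundary would be missed):

```python
import re

def chunks(text, size):
    # Step 2: split the text into fixed-size chunks (columns, in Spark).
    return [text[i:i + size] for i in range(0, len(text), size)]

def tag_row(text, pattern, size=50):
    # Step 3: test the regex on each chunk independently...
    # Step 4: ...then merge the per-chunk booleans with a logical OR.
    return any(re.search(pattern, c) for c in chunks(text, size))

print(tag_row("paid at amazon marketplace", r"amazon"))  # → True
print(tag_row("paid at etsy", r"amazon"))                # → False
```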
>>
>> Secondly, if you expect similarity in the text (of some kind), then you
>> could just work on the unique text values (might require a shuffle, hence
>> expensive) and then join the end result back to the original data. You
>> could use a hash of some kind to join back. Though I would go for this
>> approach only if the chances of similarity in the text are very high
>> (which could be the case for transactional data).
>>
>> Not a full answer to your question, but I hope this helps you brainstorm
>> more.
>>
>> Thanks,
>> Amol
>>
>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Thanks ZHANG! Please find details below:
>>>
>>> # of rows: ~25B; row size would be somewhere around ~3-5 MB (it's
>>> parquet-formatted data, so we only need to worry about the columns to be
>>> tagged)
>>>
>>> Avg length of the text to be parsed: ~300
>>>
>>> Unfortunately, I don't have sample data or regexes which I can share
>>> freely. However, about the data being parsed: assume these are purchases
>>> made online and we are trying to parse the transaction details. For
>>> example, purchases made on Amazon can be tagged to Amazon as well as to
>>> other vendors, etc.
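That kind of vendor tagging can be sketched in plain Python, with each row allowed to collect multiple tags (the patterns below are hypothetical stand-ins for the real tagging rules; in Spark this would run as a UDF or per-column expression):

```python
import re

# Hypothetical vendor patterns; the real regexes are the tagging rules.
VENDOR_PATTERNS = {
    "amazon": re.compile(r"\bamazon\b", re.IGNORECASE),
    "paypal": re.compile(r"\bpaypal\b", re.IGNORECASE),
}

def tag_transaction(text):
    # A single row can receive multiple tags, one per matching regex.
    return [vendor for vendor, pat in VENDOR_PATTERNS.items()
            if pat.search(text)]

print(tag_transaction("AMAZON marketplace via PayPal"))
# → ['amazon', 'paypal']
```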
>>>
>>> Appreciate your response!
>>>
>>>
>>>
>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>
>>>> May I get some requirement details?
>>>>
>>>> Such as:
>>>> 1. The row count and one row data size
>>>> 2. The avg length of text to be parsed by RegEx
>>>> 3. The sample format of text to be parsed
>>>> 4. The sample of current RegEx
>>>>
>>>> --
>>>> Cheers,
>>>> -z
>>>>
>>>> On Mon, 11 May 2020 18:40:49 -0400
>>>> Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > I have a tagging problem at hand where we currently use regular
>>>> expressions
>>>> > to tag records. Is there a recommended way to distribute & tag? Data
>>>> is
>>>> > about 10TB large.
>>>> >
>>>> > --
>>>> > Regards,
>>>> >
>>>> > Rishi Shah
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>
