Re: [PySpark] Tagging descriptions

Amol Umbarkar Thu, 14 May 2020 03:12:03 -0700

Rishi,
Just adding to ZHANG's questions.

Are you expecting multiple tags per row?
Do you check multiple regexes for a single tag?


Let's say you had only one tag; then theoretically you should be able to
do this:

1 Remove stop words and any other irrelevant text
2 Split the text into equal-sized chunk columns (e.g. if the max length is
1000 chars, split into 20 columns of 50 chars)
3 Distribute the work per column, producing a binary (true/false) result
for the single tag
4 Merge the 20 resulting columns (e.g. with a logical OR)
5 Repeat steps 3 and 4 for the other tags, or run them in parallel

Note on 3: If you expect a single tag per row, you can run step 3 column
by column and skip rows that were already tagged in an earlier step.
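
A rough sketch of steps 1 to 5 in plain Python, in case it helps to see them
together. Everything named here is an assumption for illustration: the
patterns, the stop-word list, the chunk size and the helper names are made
up, since the real regexes were not shared. The small overlap between chunks
is also my addition, so that a short literal match sitting on a chunk
boundary is not lost; it does not cover patterns longer than the overlap.

```python
import re

# Hypothetical tag -> pattern table (the real patterns are not in this thread).
TAG_PATTERNS = {
    "amazon": re.compile(r"amazon|amzn", re.IGNORECASE),
    "walmart": re.compile(r"wal-?mart", re.IGNORECASE),
}

# Placeholder stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "at"}


def clean(text):
    """Step 1: drop stop words."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def split_chunks(text, size=50, overlap=10):
    """Step 2: fixed-size chunks; consecutive chunks share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]


def tag_row(text):
    """Steps 3-5: one boolean per chunk and tag, merged per tag with OR."""
    chunks = split_chunks(clean(text))
    return {
        tag: any(pattern.search(chunk) for chunk in chunks)
        for tag, pattern in TAG_PATTERNS.items()
    }


print(tag_row("Payment to the AMZN Marketplace at Seattle"))
```

In PySpark, a function like `tag_row` could be wrapped in a UDF, or each
pattern could be applied with `Column.rlike` to get one boolean column per
tag. For the single-tag-per-row case, the loop could stop at the first tag
that matches instead of evaluating all of them.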

Secondly, if you expect many repeated text values, you could run the
regexes on the unique values only (this might require a shuffle, hence
expensive) and then join the result back to the original data. You could
use a hash of the text as the join key. I would go for this approach only
if the chance of repeated text is very high (which it could be in your
case, since this is transactional data).
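
A minimal sketch of that idea in plain Python, assuming exact duplicates
rather than fuzzy similarity. The pattern, the sample rows and the function
names are hypothetical. The regex runs once per distinct text, and a hash of
the text serves as the join key.

```python
import hashlib
import re

# Hypothetical pattern for a single tag.
PATTERN = re.compile(r"amazon|amzn", re.IGNORECASE)


def text_key(text):
    """Hash of the text, used as the join key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def tag_unique_then_join(rows):
    """rows is a list of (row_id, text); returns (row_id, text, tagged)."""
    # "Distinct" step: keep one copy of each text, keyed by its hash.
    unique = {text_key(text): text for _, text in rows}
    # Run the regex once per distinct text.
    tagged = {key: bool(PATTERN.search(text)) for key, text in unique.items()}
    # "Join" step: look each original row up by the hash of its text.
    return [(row_id, text, tagged[text_key(text)]) for row_id, text in rows]


rows = [(1, "AMZN Mktp US"), (2, "Corner cafe"), (3, "AMZN Mktp US")]
print(tag_unique_then_join(rows))
```

On a DataFrame the same shape would roughly be a `distinct()` on the text
column, the tagging on that smaller set, and a `join` back on the hash or on
the text itself. Whether this pays off depends on how many duplicates there
really are compared with the cost of the shuffle.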

Not the full answer to your question but hope this helps you brainstorm
more.

Thanks,
Amol





On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com>
wrote:

> Thanks ZHANG! Please find details below:
>
> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a parquet
> formatted data so, need to worry about only the columns to be tagged)
>
> avg length of the text to be parsed : ~300
>
> Unfortunately don't have sample data or regex which I can share freely.
> However about data being parsed - assume these are purchases made online
> and we are trying to parse the transaction details. Like purchases made on
> amazon can be tagged to amazon as well as other vendors etc.
>
> Appreciate your response!
>
>
>
> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>
>> May I get some requirement details?
>>
>> Such as:
>> 1. The row count and one row data size
>> 2. The avg length of text to be parsed by RegEx
>> 3. The sample format of text to be parsed
>> 4. The sample of current RegEx
>>
>> --
>> Cheers,
>> -z
>>
>> On Mon, 11 May 2020 18:40:49 -0400
>> Rishi Shah <rishishah.s...@gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > I have a tagging problem at hand where we currently use regular
>> > expressions to tag records. Is there a recommended way to distribute &
>> > tag? Data is about 10TB large.
>> >
>> > --
>> > Regards,
>> >
>> > Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>
