Check out Spark NLP for tokenization. I am not sure about Solr or Elasticsearch, though.
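[Editor's note: a plain-Python sketch of the kind of regex-based tokenization Spark ML's RegexTokenizer performs (lowercase, then split on a pattern, then drop short tokens). This is an illustration of the idea only, not Spark code; the function name and defaults are hypothetical.]

```python
import re

def regex_tokenize(text, pattern=r"\W+", min_token_length=1, lowercase=True):
    """Split text into tokens on a regex pattern, similar in spirit to
    Spark ML's RegexTokenizer when it splits on gaps between tokens."""
    if lowercase:
        text = text.lower()
    # Drop empty/too-short fragments produced by leading or repeated separators.
    return [t for t in re.split(pattern, text) if len(t) >= min_token_length]

print(regex_tokenize("Purchase at AMAZON.COM, order #1234"))
```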
On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:

> This is great, thank you Zhang & Amol!!
>
> Yes, we can have multiple tags per row, and multiple regexes applied to a
> single row as well. Would you have any examples of working with Spark and
> search engines like Solr or Elasticsearch? Does Spark ML provide
> tokenization support as expected? (I am yet to try Spark ML, still a
> beginner.)
>
> Any other reference material you found useful while working on a similar
> problem? I appreciate all the help!
>
> Thanks,
> -Rishi
>
>
> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com>
> wrote:
>
>> Rishi,
>> Just adding to Zhang's questions:
>>
>> Are you expecting multiple tags per row?
>> Do you check multiple regexes for a single tag?
>>
>> Let's say you had only one tag; then theoretically you should be able to
>> do this:
>>
>> 1. Remove stop words or any irrelevant stuff.
>> 2. Split the text into equal-sized chunk columns (e.g., if the max
>>    length is 1000 chars, split into 20 columns of 50 chars).
>> 3. Distribute the work for each column, which results in a binary
>>    (true/false) value for the single tag.
>> 4. Merge the 20 resulting columns.
>> 5. Repeat for the other tags, or do steps 3 and 4 for them in parallel.
>>
>> Note on 3: if you expect a single tag per row, then you can repeat step 3
>> column by column and skip rows that already got tags in a prior step.
>>
>> Secondly, if you expect similarity in the text (of some kind), then you
>> could just work on the unique text values (this might require a shuffle,
>> hence expensive) and then join the end result back to the original data.
>> You could use a hash of some kind to join back. Though I would go for
>> this approach only if the chances of similarity in the text are very
>> high (which could be so in your case, this being transactional data).
>>
>> Not the full answer to your question, but I hope this helps you
>> brainstorm more.
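[Editor's note: Amol's steps 1-5 can be sketched in plain Python. This is an illustration of the chunk-and-merge idea only, not Spark code; the tag patterns, stop-word list, and function names are hypothetical. The `overlap` parameter is added because naive equal-sized chunking can split a match across a chunk boundary.]

```python
import re

# Hypothetical tag -> regex mapping.
TAG_PATTERNS = {
    "amazon": re.compile(r"\bamazon\b", re.IGNORECASE),
    "uber": re.compile(r"\buber\b", re.IGNORECASE),
}

STOP_WORDS = {"the", "a", "at", "on"}  # step 1: remove irrelevant tokens

def clean(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def chunk(text, n_chunks, overlap=0):
    """Step 2: split text into roughly equal-sized chunks ("columns").
    Chunks overlap by `overlap` chars so a match spanning a boundary
    is still found in one of the chunks."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    return [text[i:i + size + overlap] for i in range(0, len(text), size)]

def tag_row(text, n_chunks=4, overlap=8):
    """Steps 3-5: test each chunk against each tag's regex, then merge
    the per-chunk booleans with a logical OR."""
    chunks = chunk(clean(text), n_chunks, overlap)
    return {tag: any(p.search(c) for c in chunks)
            for tag, p in TAG_PATTERNS.items()}

print(tag_row("Purchase at AMAZON marketplace, order 1234"))
```

In real Spark code, step 3 would map to independent column expressions (or tasks) evaluated in parallel, and step 4 to an OR across the resulting boolean columns; keep `overlap` at least as long as the longest pattern you search for.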
>>
>> Thanks,
>> Amol
>>
>>
>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Thanks ZHANG! Please find details below:
>>>
>>> Number of rows: ~25B; row size is somewhere around ~3-5 MB (it's
>>> parquet-formatted data, so we need to worry only about the columns to
>>> be tagged).
>>>
>>> Avg length of the text to be parsed: ~300.
>>>
>>> Unfortunately I don't have sample data or regexes which I can share
>>> freely. However, about the data being parsed: assume these are
>>> purchases made online and we are trying to parse the transaction
>>> details. For example, purchases made on Amazon can be tagged to Amazon
>>> as well as to other vendors, etc.
>>>
>>> Appreciate your response!
>>>
>>>
>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>
>>>> May I get some requirement details?
>>>>
>>>> Such as:
>>>> 1. The row count and one-row data size
>>>> 2. The avg length of the text to be parsed by regex
>>>> 3. A sample of the format of the text to be parsed
>>>> 4. A sample of the current regex
>>>>
>>>> --
>>>> Cheers,
>>>> -z
>>>>
>>>> On Mon, 11 May 2020 18:40:49 -0400
>>>> Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > I have a tagging problem at hand where we currently use regular
>>>> > expressions to tag records. Is there a recommended way to
>>>> > distribute and tag? The data is about 10 TB large.
>>>> >
>>>> > --
>>>> > Regards,
>>>> >
>>>> > Rishi Shah
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>
> --
> Regards,
>
> Rishi Shah
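[Editor's note: Amol's second suggestion, tagging only the unique text values and joining the result back via a hash key, can be sketched in plain Python as follows. This is an illustration only, not Spark code; in Spark the same idea would be a `distinct` on the text column, tagging the distinct set, then a join back on the key. The pattern, row data, and function names are hypothetical.]

```python
import hashlib
import re

AMAZON = re.compile(r"\bamazon\b", re.IGNORECASE)

def text_key(text):
    """Stable hash key used to join tags back onto the original rows."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

rows = [
    {"id": 1, "text": "AMAZON order 1234"},
    {"id": 2, "text": "coffee shop"},
    {"id": 3, "text": "AMAZON order 1234"},  # duplicate text: tagged only once
]

# Run the (expensive) regex exactly once per unique text value.
unique_tags = {}
for row in rows:
    key = text_key(row["text"])
    if key not in unique_tags:
        unique_tags[key] = bool(AMAZON.search(row["text"]))

# Join the tags back to the original rows via the hash key.
tagged = [{**row, "amazon": unique_tags[text_key(row["text"])]} for row in rows]
print(tagged)
```

This pays off only when duplicate texts are common, as Amol notes, since building the distinct set requires a shuffle at Spark scale.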