Thanks, everyone. While working on tagging I stumbled upon another setback. There are about 5,000 regexes involved, a couple hundred of which use variable-length lookbehind (originally these ran on a JVM). To use them from a Python/PySpark UDF, we need to either rewrite those rules so they work with Python's regex engine or move the tagging to a Scala/Java-based implementation.
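To illustrate the "rewrite the rules" option: Python's standard-library `re` module rejects lookbehinds whose alternatives have different lengths (the third-party `regex` package does support them). One mechanical rewrite is to consume the lookbehind as a non-capturing group and pull the intended match out of a capturing group. The pattern below is a hypothetical stand-in, not one of the actual rules:

```python
import re

# A JVM regex engine accepts variable-width lookbehind such as
# r'(?<=USD|EURO)\d+', but Python's stdlib re raises
# "look-behind requires fixed-width pattern" for it.
try:
    re.compile(r"(?<=USD|EURO)\d+")
    stdlib_supports_it = True
except re.error:
    stdlib_supports_it = False

# Mechanical rewrite: consume the former lookbehind as a non-capturing
# group and capture the part the rule was really meant to match.
pattern = re.compile(r"(?:USD|EURO)(\d+)")

def first_match(text):
    m = pattern.search(text)
    return m.group(1) if m else None
```

One caveat: unlike a true lookbehind, the rewritten prefix consumes characters, so overlapping matches behave differently; rules that rely on zero-width context need case-by-case review.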
Does anyone have any experience with variable-length lookbehind (quantifiers or alternations of variable length) in Python/PySpark? Any suggestions are much appreciated!

Thanks,
-Rishi

On Thu, May 14, 2020 at 2:57 PM Netanel Malka <netanel...@gmail.com> wrote:

> For Elasticsearch you can use the official elasticsearch-hadoop connector:
> https://www.elastic.co/what-is/elasticsearch-hadoop
>
> Elastic Spark connector docs:
> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
>
> On Thu, May 14, 2020, 21:14 Amol Umbarkar <amolumbar...@gmail.com> wrote:
>
>> Check out Spark NLP for tokenization. I am not sure about Solr or
>> ElasticSearch, though.
>>
>> On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>
>>> This is great, thank you Zhang & Amol!
>>>
>>> Yes, we can have multiple tags per row, and multiple regexes applied to
>>> a single row as well. Would you have any example of working with Spark
>>> and search engines like Solr or ElasticSearch? Does Spark ML provide the
>>> expected tokenization support? (I have yet to try Spark ML; I'm still a
>>> beginner.)
>>>
>>> Is there any other reference material you found useful while working on
>>> a similar problem? I appreciate all the help!
>>>
>>> Thanks,
>>> -Rishi
>>>
>>> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:
>>>
>>>> Rishi,
>>>> Just adding to Zhang's questions:
>>>>
>>>> Are you expecting multiple tags per row?
>>>> Do you check multiple regexes for a single tag?
>>>>
>>>> Let's say you had only one tag; then theoretically you should be able
>>>> to do this:
>>>>
>>>> 1. Remove stop words or any irrelevant content.
>>>> 2. Split the text into equal-sized chunk columns (e.g., if the max
>>>>    length is 1000 chars, split into 20 columns of 50 chars).
>>>> 3. Distribute the work for each column, producing a boolean
>>>>    (true/false) result for the single tag.
>>>> 4. Merge the 20 resulting columns.
>>>> 5. Repeat for the other tags, or run steps 3 and 4 for them in
>>>>    parallel.
>>>>
>>>> Note on step 3: if you expect a single tag per row, you can repeat
>>>> step 3 column by column and skip rows that were already tagged in a
>>>> prior step.
>>>>
>>>> Secondly, if you expect some similarity in the text, you could work on
>>>> the unique text values only (this might require a shuffle, hence it is
>>>> expensive) and then join the result back to the original data, using a
>>>> hash of some kind as the join key. I would take this approach only if
>>>> the chance of similar text is very high (it could be in your case,
>>>> since this is transactional data).
>>>>
>>>> Not the full answer to your question, but I hope this helps you
>>>> brainstorm further.
>>>>
>>>> Thanks,
>>>> Amol
>>>>
>>>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>>> Thanks, Zhang! Please find the details below:
>>>>>
>>>>> Number of rows: ~25B; row size is around ~3-5 MB (the data is in
>>>>> Parquet format, so we only need to worry about the columns being
>>>>> tagged).
>>>>>
>>>>> Average length of the text to be parsed: ~300 characters.
>>>>>
>>>>> Unfortunately I don't have sample data or regexes that I can share
>>>>> freely. As for the data being parsed: assume these are purchases made
>>>>> online and we are trying to parse the transaction details, e.g.
>>>>> purchases made on Amazon can be tagged to Amazon as well as to other
>>>>> vendors.
>>>>>
>>>>> I appreciate your response!
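Amol's chunked-scan steps quoted above can be sketched in plain Python. The pattern is a hypothetical stand-in for a single tag's rule; one detail added here is an overlap between chunks of at least the longest expected match, so a match straddling a chunk boundary (e.g. `ama|zon`) is not lost by a naive fixed split. In Spark, a per-row function like this would be wrapped in a UDF:

```python
import re

# Hypothetical single-tag rule; the real rules come from the rule set.
PATTERN = re.compile(r"amazon|amzn", re.IGNORECASE)

def chunks(text, size=50, overlap=10):
    # Step 2: equal-sized chunks, each extended by `overlap` characters
    # so a match that straddles a boundary is still fully inside one chunk.
    return [text[i:i + size + overlap] for i in range(0, len(text), size)]

def has_tag(text):
    # Steps 3-4: evaluate each chunk independently (distributable), then
    # merge the per-chunk booleans with a logical OR.
    return any(PATTERN.search(c) for c in chunks(text))
```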
>>>>>
>>>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>>>
>>>>>> May I get some requirement details?
>>>>>>
>>>>>> Such as:
>>>>>> 1. The row count and the size of one row
>>>>>> 2. The average length of the text to be parsed by the regexes
>>>>>> 3. A sample of the text format to be parsed
>>>>>> 4. A sample of the current regexes
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>> -z
>>>>>>
>>>>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I have a tagging problem at hand where we currently use regular
>>>>>> > expressions to tag records. Is there a recommended way to
>>>>>> > distribute and tag? The data is about 10 TB.
>>>>>> >
>>>>>> > --
>>>>>> > Regards,
>>>>>> >
>>>>>> > Rishi Shah
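For the original question of distributing the tagging, a common PySpark pattern is to compile the rule set once at module import time, so each executor process pays the compilation cost once rather than per row, and wrap the matcher in a UDF. A minimal sketch in plain Python, with a hypothetical `RULES` dict standing in for the real ~5,000 rules:

```python
import re

# Hypothetical tag -> pattern rules standing in for the real rule set.
RULES = {
    "amazon":  r"\bamzn\b|amazon",
    "grocery": r"whole\s*foods|safeway",
}

# Compile once at import time: each Spark executor process then compiles
# the rule set a single time instead of once per row.
COMPILED = {tag: re.compile(pat, re.IGNORECASE) for tag, pat in RULES.items()}

def tag_text(text):
    """Return every tag whose pattern matches the transaction text."""
    return [tag for tag, pat in COMPILED.items() if pat.search(text)]

# In PySpark this would be registered along the lines of:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#   tag_udf = udf(tag_text, ArrayType(StringType()))
#   df = df.withColumn("tags", tag_udf("description"))
```

Returning a list supports the multiple-tags-per-row case discussed above; rows with no match simply get an empty array.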