Re: [PySpark] Tagging descriptions

2020-06-04 Thread Rishi Shah
Thanks everyone. While working on tagging I stumbled upon another setback: there are about 5,000 regexes I am dealing with, out of which a couple of hundred have variable-length lookbehinds (originally these worked in a JVM). In order to use these with a Python/PySpark udf - we need to either modify these
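This is the classic Java-to-Python regex gap: Java's engine accepts variable-length lookbehind, but Python's stdlib `re` does not (the third-party `regex` module does support it, if adding a dependency is acceptable). A minimal sketch of one common workaround, rewriting the lookbehind as a capturing group — the `ID:` pattern here is hypothetical, purely for illustration:

```python
import re

# Java accepts a variable-length lookbehind like (?<=ID:\s{1,5})\d+,
# but Python's stdlib `re` rejects it at compile time:
try:
    re.compile(r"(?<=ID:\s{1,5})\d+")
except re.error as exc:
    print("stdlib re rejects it:", exc)

# Workaround: fold the lookbehind into the pattern and capture only
# the part you actually want (hypothetical pattern for illustration).
pattern = re.compile(r"ID:\s{1,5}(\d+)")

def extract(text):
    m = pattern.search(text)
    return m.group(1) if m else None

print(extract("order ID:   12345 shipped"))  # -> 12345
```

The rewrite changes what `match.start()` points at, so it only works when you need the captured text rather than exact match positions.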

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Netanel Malka
For Elasticsearch you can use the official Elastic connector. https://www.elastic.co/what-is/elasticsearch-hadoop Elastic spark connector docs: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html On Thu, May 14, 2020, 21:14 Amol Umbarkar wrote: > Check out sparkNLP for
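For reference, reading an index through that connector is mostly configuration. A minimal sketch, assuming the elasticsearch-hadoop jar is on the Spark classpath and a cluster is reachable on localhost:9200; the index name `descriptions` is hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the connector jar was supplied, e.g. via
# --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>
spark = SparkSession.builder.appName("es-demo").getOrCreate()

# "descriptions" is a hypothetical index name used for illustration.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("descriptions"))
df.printSchema()
```

This is a configuration sketch rather than a runnable demo: it needs a live Elasticsearch cluster to actually return rows.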

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Check out sparkNLP for tokenization. I am not sure about Solr or Elasticsearch though On Thu, May 14, 2020 at 9:02 PM Rishi Shah wrote: > This is great, thanks you Zhang & Amol !! > > Yes we can have multiple tags per row and multiple regex applied to single > row as well. Would you have any

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Rishi Shah
This is great, thank you Zhang & Amol !! Yes we can have multiple tags per row and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr or Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Rishi, Just adding to Zhang's questions. Are you expecting multiple tags per row? Do you check multiple regexes for a single tag? Let's say you had only one tag; then theoretically you should be able to do this - 1 Remove stop words or any irrelevant stuff 2 Split text into equal-sized chunk columns (eg -
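The first two steps above can be sketched in plain Python before distributing anything; the stop-word list and chunk size here are illustrative choices, not from the thread:

```python
# Toy sketch of steps 1-2: strip stop words, then split the remaining
# tokens into equal-sized chunks that could each become their own
# column, so regexes run over shorter strings.
STOP_WORDS = {"a", "an", "the", "and", "of"}

def clean_and_chunk(text, chunk_size=3):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

chunks = clean_and_chunk("the quick brown fox and the lazy dog")
print(chunks)  # -> ['quick brown fox', 'lazy dog']
```

Chunking only helps when a tag never spans a chunk boundary, so the chunk size has to respect the longest phrase any regex can match.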

Re: [PySpark] Tagging descriptions

2020-05-13 Thread ZHANG Wei
AFAICT, from the data size (25B rows, key cell a ~300-character string), it looks like a common Spark job. But the regex might be complex; I guess there are lots of items to match, as (apple|banana|cola|...), from the purchase list. Regex matching is a CPU-intensive computing task. If the current performance with
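A toy sketch of the alternation pattern Zhang describes — compiled once and reused, since recompiling per row would waste exactly the CPU that regex matching already dominates (the item list is made up for illustration):

```python
import re

# Toy purchase-item list; the real list would be much longer.
ITEMS = ["apple", "banana", "cola"]

# Build and compile the alternation once; word boundaries avoid
# matching "cola" inside "colas" or "chocolate".
ITEM_RE = re.compile(r"\b(" + "|".join(map(re.escape, ITEMS)) + r")\b")

def find_items(description):
    return ITEM_RE.findall(description.lower())

print(find_items("One apple and two colas, plus a banana"))
# -> ['apple', 'banana']
```

With thousands of alternatives, the flat alternation can get slow; grouping items by common prefixes, or switching to a dedicated multi-pattern matcher, are the usual next steps.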

Re: [PySpark] Tagging descriptions

2020-05-12 Thread Rishi Shah
Thanks ZHANG! Please find details below: # of rows: ~25B; row size: somewhere around ~3-5MB (it's parquet-formatted data, so we need to worry only about the columns to be tagged); avg length of the text to be parsed: ~300. Unfortunately I don't have sample data or regexes which I can share

Re: [PySpark] Tagging descriptions

2020-05-12 Thread ZHANG Wei
May I get some requirement details? Such as: 1. The row count and one row data size 2. The avg length of text to be parsed by RegEx 3. The sample format of text to be parsed 4. The sample of current RegEx -- Cheers, -z On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah wrote: > Hi All, > > I

[PySpark] Tagging descriptions

2020-05-11 Thread Rishi Shah
Hi All, I have a tagging problem at hand where we currently use regular expressions to tag records. Is there a recommended way to distribute & tag? The data is about 10TB. -- Regards, Rishi Shah
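One common shape for this kind of job, sketched under assumptions (the tag names and patterns are hypothetical): keep the tagging logic a plain function so it can be unit-tested, then wrap it in a PySpark UDF to distribute it over the 10TB.

```python
import re

# Hypothetical (tag, compiled pattern) pairs for illustration.
TAG_PATTERNS = [
    ("fruit", re.compile(r"\b(apple|banana)\b")),
    ("drink", re.compile(r"\bcola\b")),
]

def tag_record(description):
    """Return every tag whose pattern matches the description."""
    text = description.lower()
    return [tag for tag, pat in TAG_PATTERNS if pat.search(text)]

# In PySpark this plain function would be wrapped in a UDF, e.g.:
#   from pyspark.sql import functions as F, types as T
#   tag_udf = F.udf(tag_record, T.ArrayType(T.StringType()))
#   df = df.withColumn("tags", tag_udf("description"))
print(tag_record("Apple juice and a cola"))  # -> ['fruit', 'drink']
```

Because the patterns are compiled at module import, each executor compiles them once per Python worker rather than once per row, which matters at this row count.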