Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
No, it means # HTTP calls = # executor slots. But even then, you're
welcome to, say, use thread pools to execute even more calls concurrently,
since most of them are I/O bound. Your code can do what you want.
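
As a rough illustration of the thread-pool idea, here is a minimal sketch in
Scala. It assumes a SparkSession named spark, an RDD of URLs, and a plain
blocking GET via scala.io.Source; the URLs, pool size, and timeout are
arbitrary placeholders, not a prescription:

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// Hypothetical input; in a real job the URLs would come from your data.
val urls = spark.sparkContext.parallelize(Seq(
  "https://api.example.com/a",
  "https://api.example.com/b"
))

val responses = urls.mapPartitions { iter =>
  // A small pool per task lets one executor slot keep several
  // I/O-bound HTTP calls in flight at the same time.
  val pool = Executors.newFixedThreadPool(8)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  val futures = iter.map { url =>
    Future {
      val src = scala.io.Source.fromURL(url)
      try src.mkString finally src.close()
    }
  }.toList                                   // materialize so all futures start
  val results = futures.map(f => Await.result(f, 30.seconds))
  pool.shutdown()
  results.iterator
}

The pool is created inside mapPartitions on purpose: thread pools are not
serializable, so each task builds and tears down its own.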

On Thu, May 14, 2020 at 6:14 PM Chetan Khatri
 wrote:
>
> Thanks, that means number of executor = number of http calls, I can make. I 
> can't boost more number of http calls in single executors, I mean - I can't 
> go beyond the threashold of number of executors.
>
> On Thu, May 14, 2020 at 6:26 PM Sean Owen  wrote:
>>
>> Default is not 200, but the number of executor slots. Yes you can only 
>> simultaneously execute as many tasks as slots regardless of partitions.
>>
>> On Thu, May 14, 2020, 5:19 PM Chetan Khatri  
>> wrote:
>>>
>>> Thanks Sean, Jerry.
>>>
>>> Default Spark DataFrame partitions are 200 right? does it have relationship 
>>> with number of cores? 8 cores - 4 workers. is not it like I can do only 8 * 
>>> 4 = 32 http calls. Because in Spark number of partitions = number cores is 
>>> untrue.
>>>
>>> Thanks
>>>
>>> On Thu, May 14, 2020 at 6:11 PM Sean Owen  wrote:

 Yes any code that you write in code that you apply with Spark runs in
 the executors. You would be running as many HTTP clients as you have
 partitions.

 On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov  
 wrote:
 >
 > I believe that if you do this within the context of an operation that is 
 > already parallelized such as a map, the work will be distributed to 
 > executors and they will do it in parallel. I could be wrong about this 
 > as I never investigated this specific use case, though.
 >
 > On Thu, May 14, 2020 at 5:24 PM Chetan Khatri 
 >  wrote:
 >>
 >> Thanks for the quick response.
 >>
 >> I am curious to know whether would it be parallel pulling data for 100+ 
 >> HTTP request or it will only go on Driver node? the post body would be 
 >> part of DataFrame. Think as I have a data frame of employee_id, 
 >> employee_name now the http GET call has to be made for each employee_id 
 >> and DataFrame is dynamic for each spark job run.
 >>
 >> Does it make sense?
 >>
 >> Thanks
 >>
 >>
 >> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov  
 >> wrote:
 >>>
 >>> Hi Chetan,
 >>>
 >>> You can pretty much use any client to do this. When I was using Spark 
 >>> at a previous job, we used OkHttp, but I'm sure there are plenty of 
 >>> others. In our case, we had a startup phase in which we gathered 
 >>> metadata via a REST API and then broadcast it to the workers. I think 
 >>> if you need all the workers to have access to whatever you're getting 
 >>> from the API, that's the way to do it.
 >>>
 >>> Jerry
 >>>
 >>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri 
 >>>  wrote:
 
  Hi Spark Users,
 
  How can I invoke the Rest API call from Spark Code which is not only 
  running on Spark Driver but distributed / parallel?
 
  Spark with Scala is my tech stack.
 
  Thanks
 
 
 >>>
 >>>
 >>> --
 >>> http://www.google.com/profiles/grapesmoker
 >
 >
 >
 > --
 > http://www.google.com/profiles/grapesmoker




Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks, that means number of executors = number of HTTP calls I can make. I
can't push more HTTP calls through a single executor; I mean, I can't
go beyond the threshold set by the number of executors.

On Thu, May 14, 2020 at 6:26 PM Sean Owen  wrote:

> Default is not 200, but the number of executor slots. Yes you can only
> simultaneously execute as many tasks as slots regardless of partitions.
>
> On Thu, May 14, 2020, 5:19 PM Chetan Khatri 
> wrote:
>
>> Thanks Sean, Jerry.
>>
>> Default Spark DataFrame partitions are 200 right? does it have
>> relationship with number of cores? 8 cores - 4 workers. is not it like I
>> can do only 8 * 4 = 32 http calls. Because in Spark number of partitions =
>> number cores is untrue.
>>
>> Thanks
>>
>> On Thu, May 14, 2020 at 6:11 PM Sean Owen  wrote:
>>
>>> Yes any code that you write in code that you apply with Spark runs in
>>> the executors. You would be running as many HTTP clients as you have
>>> partitions.
>>>
>>> On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov 
>>> wrote:
>>> >
>>> > I believe that if you do this within the context of an operation that
>>> is already parallelized such as a map, the work will be distributed to
>>> executors and they will do it in parallel. I could be wrong about this as I
>>> never investigated this specific use case, though.
>>> >
>>> > On Thu, May 14, 2020 at 5:24 PM Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>> >>
>>> >> Thanks for the quick response.
>>> >>
>>> >> I am curious to know whether would it be parallel pulling data for
>>> 100+ HTTP request or it will only go on Driver node? the post body would be
>>> part of DataFrame. Think as I have a data frame of employee_id,
>>> employee_name now the http GET call has to be made for each employee_id and
>>> DataFrame is dynamic for each spark job run.
>>> >>
>>> >> Does it make sense?
>>> >>
>>> >> Thanks
>>> >>
>>> >>
>>> >> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov <
>>> grapesmo...@gmail.com> wrote:
>>> >>>
>>> >>> Hi Chetan,
>>> >>>
>>> >>> You can pretty much use any client to do this. When I was using
>>> Spark at a previous job, we used OkHttp, but I'm sure there are plenty of
>>> others. In our case, we had a startup phase in which we gathered metadata
>>> via a REST API and then broadcast it to the workers. I think if you need
>>> all the workers to have access to whatever you're getting from the API,
>>> that's the way to do it.
>>> >>>
>>> >>> Jerry
>>> >>>
>>> >>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>> 
>>>  Hi Spark Users,
>>> 
>>>  How can I invoke the Rest API call from Spark Code which is not
>>> only running on Spark Driver but distributed / parallel?
>>> 
>>>  Spark with Scala is my tech stack.
>>> 
>>>  Thanks
>>> 
>>> 
>>> >>>
>>> >>>
>>> >>> --
>>> >>> http://www.google.com/profiles/grapesmoker
>>> >
>>> >
>>> >
>>> > --
>>> > http://www.google.com/profiles/grapesmoker
>>>
>>


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Default is not 200, but the number of executor slots. Yes you can only
simultaneously execute as many tasks as slots regardless of partitions.
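
For what it's worth, both numbers are easy to inspect; a small sketch, assuming
an existing SparkSession spark and a DataFrame df (the 200 default applies only
to spark.sql.shuffle.partitions, i.e. DataFrames produced by a shuffle):

// Total task slots the cluster can run at once; on most cluster managers
// defaultParallelism reports executors * cores per executor.
val slots = spark.sparkContext.defaultParallelism

// Partition count of this particular DataFrame; only `slots` of these
// run at any one moment, the rest queue up as pending tasks.
val partitions = df.rdd.getNumPartitions

// To target a specific number of tasks, repartition explicitly,
// e.g. 4 workers * 8 cores = 32:
val repartitioned = df.repartition(32)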

On Thu, May 14, 2020, 5:19 PM Chetan Khatri 
wrote:

> Thanks Sean, Jerry.
>
> Default Spark DataFrame partitions are 200 right? does it have
> relationship with number of cores? 8 cores - 4 workers. is not it like I
> can do only 8 * 4 = 32 http calls. Because in Spark number of partitions =
> number cores is untrue.
>
> Thanks
>
> On Thu, May 14, 2020 at 6:11 PM Sean Owen  wrote:
>
>> Yes any code that you write in code that you apply with Spark runs in
>> the executors. You would be running as many HTTP clients as you have
>> partitions.
>>
>> On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov 
>> wrote:
>> >
>> > I believe that if you do this within the context of an operation that
>> is already parallelized such as a map, the work will be distributed to
>> executors and they will do it in parallel. I could be wrong about this as I
>> never investigated this specific use case, though.
>> >
>> > On Thu, May 14, 2020 at 5:24 PM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>> >>
>> >> Thanks for the quick response.
>> >>
>> >> I am curious to know whether would it be parallel pulling data for
>> 100+ HTTP request or it will only go on Driver node? the post body would be
>> part of DataFrame. Think as I have a data frame of employee_id,
>> employee_name now the http GET call has to be made for each employee_id and
>> DataFrame is dynamic for each spark job run.
>> >>
>> >> Does it make sense?
>> >>
>> >> Thanks
>> >>
>> >>
>> >> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov 
>> wrote:
>> >>>
>> >>> Hi Chetan,
>> >>>
>> >>> You can pretty much use any client to do this. When I was using Spark
>> at a previous job, we used OkHttp, but I'm sure there are plenty of others.
>> In our case, we had a startup phase in which we gathered metadata via a
>> REST API and then broadcast it to the workers. I think if you need all the
>> workers to have access to whatever you're getting from the API, that's the
>> way to do it.
>> >>>
>> >>> Jerry
>> >>>
>> >>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>> 
>>  Hi Spark Users,
>> 
>>  How can I invoke the Rest API call from Spark Code which is not only
>> running on Spark Driver but distributed / parallel?
>> 
>>  Spark with Scala is my tech stack.
>> 
>>  Thanks
>> 
>> 
>> >>>
>> >>>
>> >>> --
>> >>> http://www.google.com/profiles/grapesmoker
>> >
>> >
>> >
>> > --
>> > http://www.google.com/profiles/grapesmoker
>>
>


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks Sean, Jerry.

Default Spark DataFrame partitions are 200, right? Does that have a relationship
with the number of cores? With 8 cores and 4 workers, isn't it that I can only
do 8 * 4 = 32 HTTP calls? Because in Spark, number of partitions = number of
cores is untrue.

Thanks

On Thu, May 14, 2020 at 6:11 PM Sean Owen  wrote:

> Yes any code that you write in code that you apply with Spark runs in
> the executors. You would be running as many HTTP clients as you have
> partitions.
>
> On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov 
> wrote:
> >
> > I believe that if you do this within the context of an operation that is
> already parallelized such as a map, the work will be distributed to
> executors and they will do it in parallel. I could be wrong about this as I
> never investigated this specific use case, though.
> >
> > On Thu, May 14, 2020 at 5:24 PM Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
> >>
> >> Thanks for the quick response.
> >>
> >> I am curious to know whether would it be parallel pulling data for 100+
> HTTP request or it will only go on Driver node? the post body would be part
> of DataFrame. Think as I have a data frame of employee_id, employee_name
> now the http GET call has to be made for each employee_id and DataFrame is
> dynamic for each spark job run.
> >>
> >> Does it make sense?
> >>
> >> Thanks
> >>
> >>
> >> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov 
> wrote:
> >>>
> >>> Hi Chetan,
> >>>
> >>> You can pretty much use any client to do this. When I was using Spark
> at a previous job, we used OkHttp, but I'm sure there are plenty of others.
> In our case, we had a startup phase in which we gathered metadata via a
> REST API and then broadcast it to the workers. I think if you need all the
> workers to have access to whatever you're getting from the API, that's the
> way to do it.
> >>>
> >>> Jerry
> >>>
> >>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
> 
>  Hi Spark Users,
> 
>  How can I invoke the Rest API call from Spark Code which is not only
> running on Spark Driver but distributed / parallel?
> 
>  Spark with Scala is my tech stack.
> 
>  Thanks
> 
> 
> >>>
> >>>
> >>> --
> >>> http://www.google.com/profiles/grapesmoker
> >
> >
> >
> > --
> > http://www.google.com/profiles/grapesmoker
>


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Yes, any code that you write in functions you apply with Spark runs in
the executors. You would be running as many HTTP clients as you have
partitions.
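
For the employee_id example from the question quoted below, that pattern looks
roughly like this; a minimal sketch, with the endpoint and columns made up and
a SparkSession named spark assumed:

import spark.implicits._

case class Employee(employee_id: String, employee_name: String)

// Hypothetical input; in the real job this Dataset is built fresh per run.
val employees = Seq(
  Employee("e1", "Alice"),
  Employee("e2", "Bob")
).toDS()

// mapPartitions runs on the executors, so one GET is issued per row,
// in parallel across however many partitions the Dataset has.
val withResponses = employees.mapPartitions { rows =>
  rows.map { e =>
    val src = scala.io.Source.fromURL(
      s"https://api.example.com/employees/${e.employee_id}")
    val body = try src.mkString finally src.close()
    (e.employee_id, body)
  }
}.toDF("employee_id", "response")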

On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov  wrote:
>
> I believe that if you do this within the context of an operation that is 
> already parallelized such as a map, the work will be distributed to executors 
> and they will do it in parallel. I could be wrong about this as I never 
> investigated this specific use case, though.
>
> On Thu, May 14, 2020 at 5:24 PM Chetan Khatri  
> wrote:
>>
>> Thanks for the quick response.
>>
>> I am curious to know whether would it be parallel pulling data for 100+ HTTP 
>> request or it will only go on Driver node? the post body would be part of 
>> DataFrame. Think as I have a data frame of employee_id, employee_name now 
>> the http GET call has to be made for each employee_id and DataFrame is 
>> dynamic for each spark job run.
>>
>> Does it make sense?
>>
>> Thanks
>>
>>
>> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov  
>> wrote:
>>>
>>> Hi Chetan,
>>>
>>> You can pretty much use any client to do this. When I was using Spark at a 
>>> previous job, we used OkHttp, but I'm sure there are plenty of others. In 
>>> our case, we had a startup phase in which we gathered metadata via a REST 
>>> API and then broadcast it to the workers. I think if you need all the 
>>> workers to have access to whatever you're getting from the API, that's the 
>>> way to do it.
>>>
>>> Jerry
>>>
>>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri  
>>> wrote:

 Hi Spark Users,

 How can I invoke the Rest API call from Spark Code which is not only 
 running on Spark Driver but distributed / parallel?

 Spark with Scala is my tech stack.

 Thanks


>>>
>>>
>>> --
>>> http://www.google.com/profiles/grapesmoker
>
>
>
> --
> http://www.google.com/profiles/grapesmoker




Using Spark Accumulators with Structured Streaming

2020-05-14 Thread Something Something
In my structured streaming job I am updating Spark Accumulators in the
updateAcrossEvents method but they are always 0 when I try to print them in
my StreamingListener. Here's the code:

.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
  updateAcrossEvents
)

The accumulators get incremented in 'updateAcrossEvents'. I have a
StreamingListener which writes the values of the accumulators in its
'onQueryProgress' method, but in that method the accumulators are ALWAYS
ZERO!

When I added log statements in the updateAcrossEvents, I could see that
these accumulators are getting incremented as expected.

This only happens when I run in 'Cluster' mode. In Local mode it works
fine, which implies that the accumulators are not getting distributed
correctly, or something like that!

Note: I've seen quite a few answers on the Web that tell me to perform an
"Action". That's not a solution here. This is a 'Stateful Structured
Streaming' job. Yes, I am also 'registering' them in SparkContext.
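
For reference, the registration/update pattern being described looks roughly
like the sketch below (names, schema, and the rate source are all hypothetical;
this just illustrates the wiring, it is not a fix for the cluster-mode
behaviour):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String, value: Long)
case class MyState(count: Long)

// Registered on the driver, incremented from the executors.
val eventCounter = spark.sparkContext.longAccumulator("eventCounter")

def updateAcrossEvents(key: String,
                       events: Iterator[Event],
                       state: GroupState[MyState]): MyState = {
  val batch = events.toSeq
  eventCounter.add(batch.size)          // this add() happens on an executor
  val updated = MyState(state.getOption.map(_.count).getOrElse(0L) + batch.size)
  state.update(updated)
  updated
}

import spark.implicits._

val events = spark.readStream.format("rate").load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "value")
  .as[Event]

val counts = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)

counts.writeStream.outputMode("update").format("console").start()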


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
I believe that if you do this within the context of an operation that is
already parallelized such as a map, the work will be distributed to
executors and they will do it in parallel. I could be wrong about this as I
never investigated this specific use case, though.

On Thu, May 14, 2020 at 5:24 PM Chetan Khatri 
wrote:

> Thanks for the quick response.
>
> I am curious to know whether would it be parallel pulling data for 100+
> HTTP request or it will only go on Driver node? the post body would be part
> of DataFrame. Think as I have a data frame of employee_id, employee_name
> now the http GET call has to be made for each employee_id and DataFrame is
> dynamic for each spark job run.
>
> Does it make sense?
>
> Thanks
>
>
> On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov 
> wrote:
>
>> Hi Chetan,
>>
>> You can pretty much use any client to do this. When I was using Spark at
>> a previous job, we used OkHttp, but I'm sure there are plenty of others. In
>> our case, we had a startup phase in which we gathered metadata via a REST
>> API and then broadcast it to the workers. I think if you need all the
>> workers to have access to whatever you're getting from the API, that's the
>> way to do it.
>>
>> Jerry
>>
>> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> How can I invoke the Rest API call from Spark Code which is not only
>>> running on Spark Driver but distributed / parallel?
>>>
>>> Spark with Scala is my tech stack.
>>>
>>> Thanks
>>>
>>>
>>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>

-- 
http://www.google.com/profiles/grapesmoker


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks for the quick response.

I am curious to know whether it would pull data in parallel for 100+
HTTP requests or whether it will only run on the driver node? The POST body
would be part of the DataFrame. Think of it as: I have a data frame of
employee_id, employee_name; now the HTTP GET call has to be made for each
employee_id, and the DataFrame is dynamic for each Spark job run.

Does it make sense?

Thanks


On Thu, May 14, 2020 at 5:12 PM Jerry Vinokurov 
wrote:

> Hi Chetan,
>
> You can pretty much use any client to do this. When I was using Spark at a
> previous job, we used OkHttp, but I'm sure there are plenty of others. In
> our case, we had a startup phase in which we gathered metadata via a REST
> API and then broadcast it to the workers. I think if you need all the
> workers to have access to whatever you're getting from the API, that's the
> way to do it.
>
> Jerry
>
> On Thu, May 14, 2020 at 5:03 PM Chetan Khatri 
> wrote:
>
>> Hi Spark Users,
>>
>> How can I invoke the Rest API call from Spark Code which is not only
>> running on Spark Driver but distributed / parallel?
>>
>> Spark with Scala is my tech stack.
>>
>> Thanks
>>
>>
>>
>
> --
> http://www.google.com/profiles/grapesmoker
>


Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
Hi Chetan,

You can pretty much use any client to do this. When I was using Spark at a
previous job, we used OkHttp, but I'm sure there are plenty of others. In
our case, we had a startup phase in which we gathered metadata via a REST
API and then broadcast it to the workers. I think if you need all the
workers to have access to whatever you're getting from the API, that's the
way to do it.
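
A minimal sketch of that setup with OkHttp, assuming the okhttp3 dependency is
on the executors' classpath and a SparkSession named spark; the URLs and the
RDD of ids are placeholders:

import okhttp3.{OkHttpClient, Request}

// Driver side: fetch shared metadata once and broadcast it.
val driverClient = new OkHttpClient()
val metadataJson = driverClient.newCall(
  new Request.Builder().url("https://api.example.com/metadata").build()
).execute().body().string()
val metadataBc = spark.sparkContext.broadcast(metadataJson)

// Executor side: OkHttpClient is not serializable, so build one per
// partition; the broadcast value is read without another API call.
val ids = spark.sparkContext.parallelize(Seq("1", "2", "3"))
val enriched = ids.mapPartitions { iter =>
  val client = new OkHttpClient()
  val meta = metadataBc.value
  iter.map { id =>
    val resp = client.newCall(
      new Request.Builder().url(s"https://api.example.com/items/$id").build()
    ).execute()
    try (id, meta, resp.body().string()) finally resp.close()
  }
}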

Jerry

On Thu, May 14, 2020 at 5:03 PM Chetan Khatri 
wrote:

> Hi Spark Users,
>
> How can I invoke the Rest API call from Spark Code which is not only
> running on Spark Driver but distributed / parallel?
>
> Spark with Scala is my tech stack.
>
> Thanks
>
>
>

-- 
http://www.google.com/profiles/grapesmoker


Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Hi Spark Users,

How can I invoke a REST API call from Spark code so that it runs not only
on the Spark driver but distributed / in parallel?

Spark with Scala is my tech stack.

Thanks


Re: [PySpark] Tagging descriptions

2020-05-14 Thread Netanel Malka
For Elasticsearch you can use the official Elastic connector:
https://www.elastic.co/what-is/elasticsearch-hadoop

Elastic spark connector docs:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
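
A minimal read/write sketch with that connector, assuming the
elasticsearch-spark artifact is on the classpath; the host, port, index name,
and the df/spark values are placeholders:

// Write a DataFrame out to an index.
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host.example.com")
  .option("es.port", "9200")
  .mode("append")
  .save("my-index")

// Read an index back as a DataFrame.
val fromEs = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host.example.com")
  .load("my-index")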



On Thu, May 14, 2020, 21:14 Amol Umbarkar  wrote:

> Check out sparkNLP for tokenization. I am not sure about solar or elastic
> search though
>
> On Thu, May 14, 2020 at 9:02 PM Rishi Shah 
> wrote:
>
>> This is great, thanks you Zhang & Amol !!
>>
>> Yes we can have multiple tags per row and multiple regex applied to
>> single row as well. Would you have any example of working with spark &
>> search engines like Solar, ElasticSearch? Does Spark ML provide
>> tokenization support as expected (I am yet to try SparkML, still a
>> beginner)?
>>
>> Any other reference material you found useful while working on similar
>> problem? appreciate all the help!
>>
>> Thanks,
>> -Rishi
>>
>>
>> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar 
>> wrote:
>>
>>> Rishi,
>>> Just adding to zhang's questions.
>>>
>>> Are you expecting multiple tags per row?
>>> Do you check multiple regex for a single tag?
>>>
>>> Let's say you had only one tag then theoretically you should be do this -
>>>
>>> 1 Remove stop words or any irrelevant stuff
>>> 2 split text into equal sized chunk column (eg - if max length is
>>> 1000chars, split into 20 columns of 50 chars)
>>> 3 distribute work for each column that would result in binary
>>> (true/false) for a single tag
>>> 4 merge the 20 resulting columns
>>> 5 repeat for other tags or do them in parallel 3 and 4 for them
>>>
>>> Note on 3: If you expect single tag per row, then you can repeat 3
>>> column by column and skip rows that have got tags in prior step.
>>>
>>> Secondly, if you expect similarity in text (of some kind) then you could
>>> jus work on unique text values (might require shuffle, hence expensive) and
>>> then join the end result back to the original data.  You could use hash of
>>> some kind to join back. Though I would go for this approach only if the
>>> chances of similarity in text are very high (it could be in your case for
>>> being transactional data).
>>>
>>> Not the full answer to your question but hope this helps you brainstorm
>>> more.
>>>
>>> Thanks,
>>> Amol
>>>
>>>
>>>
>>>
>>>
>>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah 
>>> wrote:
>>>
 Thanks ZHANG! Please find details below:

 # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a
 parquet formatted data so, need to worry about only the columns to be
 tagged)

 avg length of the text to be parsed : ~300

 Unfortunately don't have sample data or regex which I can share freely.
 However about data being parsed - assume these are purchases made online
 and we are trying to parse the transaction details. Like purchases made on
 amazon can be tagged to amazon as well as other vendors etc.

 Appreciate your response!



 On Tue, May 12, 2020 at 6:23 AM ZHANG Wei  wrote:

> May I get some requirement details?
>
> Such as:
> 1. The row count and one row data size
> 2. The avg length of text to be parsed by RegEx
> 3. The sample format of text to be parsed
> 4. The sample of current RegEx
>
> --
> Cheers,
> -z
>
> On Mon, 11 May 2020 18:40:49 -0400
> Rishi Shah  wrote:
>
> > Hi All,
> >
> > I have a tagging problem at hand where we currently use regular
> expressions
> > to tag records. Is there a recommended way to distribute & tag? Data
> is
> > about 10TB large.
> >
> > --
> > Regards,
> >
> > Rishi Shah
>


 --
 Regards,

 Rishi Shah

>>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>


Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Check out Spark NLP for tokenization. I am not sure about Solr or Elasticsearch,
though.
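
If built-in tokenization is enough, Spark ML ships a RegexTokenizer; a minimal
sketch in Scala (the PySpark API is analogous), with made-up column names and
data:

import org.apache.spark.ml.feature.RegexTokenizer
import spark.implicits._

val df = Seq(
  (1L, "Purchase at AMAZON MKTPLACE 123-456"),
  (2L, "POS DEBIT grocery store #42")
).toDF("id", "description")

val tokenizer = new RegexTokenizer()
  .setInputCol("description")
  .setOutputCol("tokens")
  .setPattern("\\W+")          // split on runs of non-word characters

tokenizer.transform(df).select("id", "tokens").show(false)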

On Thu, May 14, 2020 at 9:02 PM Rishi Shah  wrote:

> This is great, thanks you Zhang & Amol !!
>
> Yes we can have multiple tags per row and multiple regex applied to single
> row as well. Would you have any example of working with spark & search
> engines like Solar, ElasticSearch? Does Spark ML provide tokenization
> support as expected (I am yet to try SparkML, still a beginner)?
>
> Any other reference material you found useful while working on similar
> problem? appreciate all the help!
>
> Thanks,
> -Rishi
>
>
> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar 
> wrote:
>
>> Rishi,
>> Just adding to zhang's questions.
>>
>> Are you expecting multiple tags per row?
>> Do you check multiple regex for a single tag?
>>
>> Let's say you had only one tag then theoretically you should be do this -
>>
>> 1 Remove stop words or any irrelevant stuff
>> 2 split text into equal sized chunk column (eg - if max length is
>> 1000chars, split into 20 columns of 50 chars)
>> 3 distribute work for each column that would result in binary
>> (true/false) for a single tag
>> 4 merge the 20 resulting columns
>> 5 repeat for other tags or do them in parallel 3 and 4 for them
>>
>> Note on 3: If you expect single tag per row, then you can repeat 3 column
>> by column and skip rows that have got tags in prior step.
>>
>> Secondly, if you expect similarity in text (of some kind) then you could
>> jus work on unique text values (might require shuffle, hence expensive) and
>> then join the end result back to the original data.  You could use hash of
>> some kind to join back. Though I would go for this approach only if the
>> chances of similarity in text are very high (it could be in your case for
>> being transactional data).
>>
>> Not the full answer to your question but hope this helps you brainstorm
>> more.
>>
>> Thanks,
>> Amol
>>
>>
>>
>>
>>
>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah 
>> wrote:
>>
>>> Thanks ZHANG! Please find details below:
>>>
>>> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a
>>> parquet formatted data so, need to worry about only the columns to be
>>> tagged)
>>>
>>> avg length of the text to be parsed : ~300
>>>
>>> Unfortunately don't have sample data or regex which I can share freely.
>>> However about data being parsed - assume these are purchases made online
>>> and we are trying to parse the transaction details. Like purchases made on
>>> amazon can be tagged to amazon as well as other vendors etc.
>>>
>>> Appreciate your response!
>>>
>>>
>>>
>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei  wrote:
>>>
 May I get some requirement details?

 Such as:
 1. The row count and one row data size
 2. The avg length of text to be parsed by RegEx
 3. The sample format of text to be parsed
 4. The sample of current RegEx

 --
 Cheers,
 -z

 On Mon, 11 May 2020 18:40:49 -0400
 Rishi Shah  wrote:

 > Hi All,
 >
 > I have a tagging problem at hand where we currently use regular
 expressions
 > to tag records. Is there a recommended way to distribute & tag? Data
 is
 > about 10TB large.
 >
 > --
 > Regards,
 >
 > Rishi Shah

>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>


Re: [PySpark] Tagging descriptions

2020-05-14 Thread Rishi Shah
This is great, thank you Zhang & Amol!!

Yes, we can have multiple tags per row and multiple regexes applied to a single
row as well. Would you have any example of working with Spark & search
engines like Solr or Elasticsearch? Does Spark ML provide tokenization
support as expected (I am yet to try Spark ML, still a beginner)?

Any other reference material you found useful while working on a similar
problem? Appreciate all the help!

Thanks,
-Rishi


On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar 
wrote:

> Rishi,
> Just adding to zhang's questions.
>
> Are you expecting multiple tags per row?
> Do you check multiple regex for a single tag?
>
> Let's say you had only one tag then theoretically you should be do this -
>
> 1 Remove stop words or any irrelevant stuff
> 2 split text into equal sized chunk column (eg - if max length is
> 1000chars, split into 20 columns of 50 chars)
> 3 distribute work for each column that would result in binary (true/false)
> for a single tag
> 4 merge the 20 resulting columns
> 5 repeat for other tags or do them in parallel 3 and 4 for them
>
> Note on 3: If you expect single tag per row, then you can repeat 3 column
> by column and skip rows that have got tags in prior step.
>
> Secondly, if you expect similarity in text (of some kind) then you could
> jus work on unique text values (might require shuffle, hence expensive) and
> then join the end result back to the original data.  You could use hash of
> some kind to join back. Though I would go for this approach only if the
> chances of similarity in text are very high (it could be in your case for
> being transactional data).
>
> Not the full answer to your question but hope this helps you brainstorm
> more.
>
> Thanks,
> Amol
>
>
>
>
>
> On Wed, May 13, 2020 at 10:17 AM Rishi Shah 
> wrote:
>
>> Thanks ZHANG! Please find details below:
>>
>> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a
>> parquet formatted data so, need to worry about only the columns to be
>> tagged)
>>
>> avg length of the text to be parsed : ~300
>>
>> Unfortunately don't have sample data or regex which I can share freely.
>> However about data being parsed - assume these are purchases made online
>> and we are trying to parse the transaction details. Like purchases made on
>> amazon can be tagged to amazon as well as other vendors etc.
>>
>> Appreciate your response!
>>
>>
>>
>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei  wrote:
>>
>>> May I get some requirement details?
>>>
>>> Such as:
>>> 1. The row count and one row data size
>>> 2. The avg length of text to be parsed by RegEx
>>> 3. The sample format of text to be parsed
>>> 4. The sample of current RegEx
>>>
>>> --
>>> Cheers,
>>> -z
>>>
>>> On Mon, 11 May 2020 18:40:49 -0400
>>> Rishi Shah  wrote:
>>>
>>> > Hi All,
>>> >
>>> > I have a tagging problem at hand where we currently use regular
>>> expressions
>>> > to tag records. Is there a recommended way to distribute & tag? Data is
>>> > about 10TB large.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Rishi Shah
>>>
>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>

-- 
Regards,

Rishi Shah


Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Rishi,
Just adding to Zhang's questions.

Are you expecting multiple tags per row?
Do you check multiple regexes for a single tag?

Let's say you had only one tag; then, theoretically, you should be able to do this:

1 Remove stop words or any irrelevant stuff
2 Split the text into equal-sized chunk columns (e.g. if the max length is
1000 chars, split into 20 columns of 50 chars each)
3 Distribute the work for each column, which would result in a binary
(true/false) value for a single tag
4 Merge the 20 resulting columns
5 Repeat for the other tags, or run steps 3 and 4 for them in parallel

Note on 3: If you expect a single tag per row, then you can repeat step 3 column
by column and skip rows that already got a tag in a prior step.

Secondly, if you expect similarity in the text (of some kind), then you could
just work on the unique text values (might require a shuffle, hence expensive)
and then join the end result back to the original data. You could use a hash of
some kind to join back. Though I would go for this approach only if the
chances of similarity in the text are very high (which could be the case for
you, since it is transactional data).

Not the full answer to your question, but I hope this helps you brainstorm
more.
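
A minimal sketch of the per-tag boolean-column idea with plain Spark SQL
functions, shown in Scala (the PySpark equivalents are one-to-one); the tags,
patterns, and sample data are made up:

import org.apache.spark.sql.functions._
import spark.implicits._

val txns = Seq(
  (1L, "AMAZON MKTPLACE US*1A2B3C"),
  (2L, "UBER TRIP HELP.UBER.COM"),
  (3L, "STARBUCKS STORE 00123")
).toDF("id", "description")

// One regex per tag; each becomes an independent boolean column, so the
// whole thing stays a narrow, fully parallel transformation.
val tagPatterns = Map(
  "amazon" -> "(?i)amazon",
  "uber"   -> "(?i)uber",
  "coffee" -> "(?i)starbucks|coffee"
)

val tagged = tagPatterns.foldLeft(txns) { case (acc, (tag, pattern)) =>
  acc.withColumn(s"tag_$tag", col("description").rlike(pattern))
}

tagged.show(false)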

Thanks,
Amol





On Wed, May 13, 2020 at 10:17 AM Rishi Shah 
wrote:

> Thanks ZHANG! Please find details below:
>
> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a parquet
> formatted data so, need to worry about only the columns to be tagged)
>
> avg length of the text to be parsed : ~300
>
> Unfortunately don't have sample data or regex which I can share freely.
> However about data being parsed - assume these are purchases made online
> and we are trying to parse the transaction details. Like purchases made on
> amazon can be tagged to amazon as well as other vendors etc.
>
> Appreciate your response!
>
>
>
> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei  wrote:
>
>> May I get some requirement details?
>>
>> Such as:
>> 1. The row count and one row data size
>> 2. The avg length of text to be parsed by RegEx
>> 3. The sample format of text to be parsed
>> 4. The sample of current RegEx
>>
>> --
>> Cheers,
>> -z
>>
>> On Mon, 11 May 2020 18:40:49 -0400
>> Rishi Shah  wrote:
>>
>> > Hi All,
>> >
>> > I have a tagging problem at hand where we currently use regular
>> expressions
>> > to tag records. Is there a recommended way to distribute & tag? Data is
>> > about 10TB large.
>> >
>> > --
>> > Regards,
>> >
>> > Rishi Shah
>>
>
>
> --
> Regards,
>
> Rishi Shah
>