Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Hi Spark Users, How can I invoke a REST API call from Spark code so that it runs not only on the Spark driver but distributed / in parallel? Spark with Scala is my tech stack. Thanks

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks for the quick response. I am curious to know whether it would pull data in parallel for 100+ HTTP requests or only run on the driver node. The post body would be part of the DataFrame. Think of it as: I have a DataFrame of employee_id, employee_name, and now the HTTP GET call has to be made for each
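
Since the stated stack is Scala, here is a minimal sketch of the pattern the replies below converge on: put the call inside a Dataset transformation so it runs on the executors rather than the driver. The Employee case class, the endpoint URL, and the use of java.net.HttpURLConnection are illustrative assumptions; spark is an active SparkSession.

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    case class Employee(employee_id: String, employee_name: String)

    import spark.implicits._
    // employees: Dataset[Employee] built elsewhere
    val enriched = employees.map { e =>
      // Runs on an executor: one GET per row against a hypothetical endpoint.
      val conn = new URL(s"https://api.example.com/employees/${e.employee_id}")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("GET")
      val body =
        try Source.fromInputStream(conn.getInputStream).mkString
        finally conn.disconnect()
      (e.employee_id, body)
    }.toDF("employee_id", "response")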

Using Spark Accumulators with Structured Streaming

2020-05-14 Thread Something Something
In my structured streaming job I am updating Spark Accumulators in the updateAcrossEvents method but they are always 0 when I try to print them in my StreamingListener. Here's the code: .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents) The
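
For reference, a sketch of the pattern being described, since the full code is not shown; Event, SessionState, and the grouping key are hypothetical stand-ins. The accumulator is registered on the driver and added to inside the state function, which runs on the executors.

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class Event(id: String, value: Long)    // hypothetical input type
    case class SessionState(count: Long)         // hypothetical state type

    val eventCounter = spark.sparkContext.longAccumulator("eventCounter")

    def updateAcrossEvents(id: String,
                           events: Iterator[Event],
                           state: GroupState[SessionState]): SessionState = {
      val n = events.size
      eventCounter.add(n)   // executors add; Spark merges the value back on the driver
      val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + n)
      state.update(updated)
      updated
    }

    import spark.implicits._
    val updates = events          // events: Dataset[Event] from a streaming source
      .groupByKey(_.id)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)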

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Netanel Malka
For Elasticsearch you can use the official Elastic connector. https://www.elastic.co/what-is/elasticsearch-hadoop Elastic Spark connector docs: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html On Thu, May 14, 2020, 21:14 Amol Umbarkar wrote: > Check out sparkNLP for
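
A short sketch of reading and writing through that connector; the artifact coordinates, host, and index names are assumptions to be adapted.

    // build.sbt (assumed; match the artifact to your Spark and ES versions):
    // libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "7.7.0"
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-connector-example")
      .config("es.nodes", "localhost")   // assumed Elasticsearch host
      .config("es.port", "9200")
      .getOrCreate()

    // Read an index into a DataFrame.
    val docs = spark.read.format("org.elasticsearch.spark.sql").load("descriptions")

    // Write results back to another index.
    docs.write.format("org.elasticsearch.spark.sql").save("descriptions_tagged")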

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Check out sparkNLP for tokenization. I am not sure about Solr or Elasticsearch though. On Thu, May 14, 2020 at 9:02 PM Rishi Shah wrote: > This is great, thank you Zhang & Amol!! > > Yes we can have multiple tags per row and multiple regexes applied to a single > row as well. Would you have any
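
A minimal Spark NLP tokenization sketch along the lines Amol suggests; the input column name is an assumption, and the exact import paths may vary by library version.

    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("description")   // assumed text column
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")

    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))
    val tokenized = pipeline.fit(df).transform(df)   // df has a "description" column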

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
Hi Chetan, You can pretty much use any client to do this. When I was using Spark at a previous job, we used OkHttp, but I'm sure there are plenty of others. In our case, we had a startup phase in which we gathered metadata via a REST API and then broadcast it to the workers. I think if you need
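
A sketch of that startup-phase pattern: fetch once with OkHttp on the driver, then broadcast so executors reuse the result instead of re-calling the API. The endpoint and the enrichment step are hypothetical.

    import okhttp3.{OkHttpClient, Request}

    // Driver side: fetch metadata once.
    val client = new OkHttpClient()
    val request = new Request.Builder()
      .url("https://api.example.com/metadata")   // hypothetical endpoint
      .build()
    val response = client.newCall(request).execute()
    val metadataJson = try response.body().string() finally response.close()

    // Broadcast; each executor gets a local copy.
    val metadataBc = spark.sparkContext.broadcast(metadataJson)

    val enriched = df.rdd.map { row =>
      val metadata = metadataBc.value   // no extra HTTP call on the executors
      // ... enrich the row using metadata ...
      row
    }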

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Jerry Vinokurov
I believe that if you do this within the context of an operation that is already parallelized such as a map, the work will be distributed to executors and they will do it in parallel. I could be wrong about this as I never investigated this specific use case, though. On Thu, May 14, 2020 at 5:24

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Amol Umbarkar
Rishi, Just adding to Zhang's questions. Are you expecting multiple tags per row? Do you check multiple regexes for a single tag? Let's say you had only one tag; then theoretically you should be able to do this - 1 Remove stop words or any irrelevant stuff 2 split text into equal sized chunk column (eg -
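
A sketch of steps 1 and 2 with Spark ML, under the assumption (the recipe is truncated) that step 2 means splitting the token array into fixed-size chunks; df is assumed to already have a tokenized array column "words".

    import org.apache.spark.ml.feature.StopWordsRemover
    import org.apache.spark.sql.functions.{col, udf}

    // Step 1: drop stop words.
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val cleaned = remover.transform(df)

    // Step 2 (assumed interpretation): equal-sized chunks of the remaining tokens.
    val chunkSize = 10
    val chunkUdf = udf((tokens: Seq[String]) => tokens.grouped(chunkSize).toSeq)
    val chunked = cleaned.withColumn("chunks", chunkUdf(col("filtered")))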

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
No, it means # HTTP calls = # executor slots. But even then, you're welcome to, say, use thread pools to execute even more concurrently as most are I/O bound. Your code can do what you want. On Thu, May 14, 2020 at 6:14 PM Chetan Khatri wrote: > > Thanks, that means number of executor = number
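
A sketch of the thread-pool idea: inside each partition, fan out concurrent requests so a single task slot drives many in-flight calls. The pool size, timeout, and URL column are assumptions.

    import java.net.{HttpURLConnection, URL}
    import java.util.concurrent.Executors
    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.io.Source

    val responses = urlDF.rdd.mapPartitions { rows =>
      val pool = Executors.newFixedThreadPool(16)   // assumed per-task concurrency
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

      // Launch every call in the partition concurrently; they are I/O bound.
      val futures = rows.map { row =>
        Future {
          val conn = new URL(row.getString(0)).openConnection().asInstanceOf[HttpURLConnection]
          try Source.fromInputStream(conn.getInputStream).mkString
          finally conn.disconnect()
        }
      }.toList

      val results = futures.map(f => Await.result(f, 30.seconds))
      pool.shutdown()
      results.iterator
    }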

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Yes, any code that you write and apply with Spark runs in the executors. You would be running as many HTTP clients as you have partitions. On Thu, May 14, 2020 at 4:31 PM Jerry Vinokurov wrote: > > I believe that if you do this within the context of an operation that is > already

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Sean Owen
Default is not 200, but the number of executor slots. Yes, you can only simultaneously execute as many tasks as there are slots, regardless of partitions. On Thu, May 14, 2020, 5:19 PM Chetan Khatri wrote: > Thanks Sean, Jerry. > > Default Spark DataFrame partitions are 200 right? does it have >
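
To make the distinction concrete: spark.sql.shuffle.partitions (default 200) only sets the partition count after shuffles, while concurrency is capped by total executor slots. A small sketch:

    // 200 is the default only for shuffle outputs (joins, aggregations):
    spark.conf.get("spark.sql.shuffle.partitions")    // "200" by default

    // Rough proxy for total task slots (executors * cores per executor):
    val slots = spark.sparkContext.defaultParallelism

    // More partitions than slots is fine; tasks simply queue and run in waves.
    val repartitioned = df.repartition(100)
    println(repartitioned.rdd.getNumPartitions)       // 100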

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks Sean, Jerry. The default number of Spark DataFrame partitions is 200, right? Does it have a relationship with the number of cores? With 8 cores and 4 workers, isn't it that I can make only 8 * 4 = 32 HTTP calls? Because in Spark, number of partitions = number of cores is untrue. Thanks On Thu, May 14, 2020 at 6:11 PM

Re: Calling HTTP Rest APIs from Spark Job

2020-05-14 Thread Chetan Khatri
Thanks, that means number of executors = number of HTTP calls I can make. I can't push more HTTP calls through a single executor; I mean, I can't go beyond the threshold of the number of executors. On Thu, May 14, 2020 at 6:26 PM Sean Owen wrote: > Default is not 200, but the number of

Re: [PySpark] Tagging descriptions

2020-05-14 Thread Rishi Shah
This is great, thank you Zhang & Amol!! Yes we can have multiple tags per row and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr, Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try
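
On the Spark ML question: yes, it ships Tokenizer and RegexTokenizer in org.apache.spark.ml.feature. A minimal sketch, with the column names as assumptions:

    import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}

    // Simple whitespace tokenization:
    val tokenizer = new Tokenizer().setInputCol("description").setOutputCol("words")
    val words = tokenizer.transform(df)

    // Pattern-based tokenization, which pairs naturally with regex tagging:
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("description")
      .setOutputCol("words")
      .setPattern("\\W+")   // split on non-word characters
    val regexWords = regexTokenizer.transform(df)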