Re: Custom Data Source for getting data from Rest based services

2017-12-28 Thread vaish02
We extensively use PubMed and clinical-trial databases for our work, which
involves making a large number of parametric REST API queries. When a data
download is large, the requests usually time out and we have to run the
queries in very small batches. We also run a large number (thousands) of
NLP queries for our ML work.
  Given that our content is quite large and we are constrained by the
public database interfaces, such a framework would be very beneficial for
our use case. Since I just stumbled on this post, I will try the package in
the context of our framework and let you know how using the library compares
with the way we do things conventionally. Thanks for sharing it with the
community.
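
For reference, a rough sketch of running parametric REST queries in small
batches from Spark. The endpoint, the ID list, and the batch size are all
made up for illustration, and a SparkSession named `spark` (as in
spark-shell) is assumed:

// Hypothetical endpoint and IDs; batches kept small so each response stays manageable.
val ids: Seq[String] = (1 to 5000).map(_.toString)
val batchSize = 100

val batchedUrls = ids.grouped(batchSize).map { batch =>
  s"https://api.example.com/records/fetch?ids=${batch.mkString(",")}"
}.toSeq

val results = spark.sparkContext
  .parallelize(batchedUrls, 8)
  .map { url =>
    val src = scala.io.Source.fromURL(url)
    try src.mkString finally src.close()   // blocking HTTP GET per small batch
  }

results.count()   // trigger the fetches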






Re: Custom Data Source for getting data from Rest based services

2017-12-24 Thread Jean Georges Perrin
If you need Java code, you can have a look @: 
https://github.com/jgperrin/net.jgp.labs.spark.datasources 


and:
https://databricks.com/session/extending-apache-sparks-ingestion-building-your-own-java-data-source
 




Re: Custom Data Source for getting data from Rest based services

2017-12-23 Thread Subarna Bhattacharyya
Hi Sourav,
Looks like this would be a good utility for developing large-scale,
data-driven products based on data services.

We are an early-stage startup called Climformatics, and we are building a
customized, high-resolution climate prediction tool. This effort requires
synthesizing large-scale data input from multiple data sources. A tool like
this can help in getting large volumes of data from multiple data services
through API calls, which tend to be limited for bulk use.

One feature that would help us further is a handle for setting limits on how
many data points are grabbed at once, since the data sources we access often
cap the number of service calls that can be made in a given period (say, per
minute).
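
For what it's worth, a rough client-side sketch of that kind of throttling,
assuming a made-up endpoint, an invented per-partition quota, and a
SparkSession `spark` as in spark-shell (this is not part of the package, just
an illustration of the behaviour we mean):

import scala.io.Source

// Assumed quota share: each of the two partitions makes ~10 calls per minute.
val callsPerMinutePerPartition = 10
val delayMs = 60000L / callsPerMinutePerPartition

val urls = spark.sparkContext.parallelize(Seq(
  "https://api.example.com/observations?station=A",
  "https://api.example.com/observations?station=B"), 2)

val responses = urls.map { url =>
  val src = Source.fromURL(url)                      // blocking HTTP GET
  val body = try src.mkString finally src.close()
  Thread.sleep(delayMs)                              // stay under the per-minute limit
  (url, body)
}

responses.collect()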

We also need a way to pass the parameter inputs (for multiple calls) through
the URL path itself. Many of the data sources we use require the parameters
to be embedded in the URI path rather than passed as key/value query
parameters. An example is https://www.wunderground.com/weather/api/d/docs.
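
For illustration, a small sketch of building per-call URLs with the
parameters embedded in the path; the endpoint shape and field names are made
up and only stand in for APIs of this kind:

// Hypothetical parameter record and path-style endpoint.
case class Query(apiKey: String, state: String, city: String, date: String)

val queries = Seq(
  Query("MYKEY", "CA", "San_Francisco", "20171224"),
  Query("MYKEY", "CA", "Oakland", "20171224"))

// Parameters go into the path segments rather than ?key=value pairs.
val urls = queries.map { q =>
  s"https://api.example.com/${q.apiKey}/history_${q.date}/q/${q.state}/${q.city}.json"
}
// These URLs can then be parallelized and fetched as in the sketch above.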

We will take a closer look at the GitHub link you provided and get back to
you with feedback.

Thanks,
Sincerely,
Subarna






Re: Custom Data Source for getting data from Rest based services

2017-11-27 Thread Sourav Mazumder
It would be great if you could elaborate on the bulk provisioning use case.

Regards,
Sourav

On Sun, Nov 26, 2017 at 11:53 PM, shankar.roy  wrote:

> This would be a useful feature.
> We can leverage it while doing bulk provisioning.


Re: Custom Data Source for getting data from Rest based services

2017-11-27 Thread smazumder
@sathich

Here are my thoughts on your points -

1. Yes, this should be able to handle any complex JSON structure returned by
the target REST API. Essentially, it returns Rows of that complex structure,
and one can then use Spark SQL to flatten it further with functions like
inline, explode, etc. (a small sketch follows below).

2. In my current implementation I have kept an option called
"callStrictlyOnce". It ensures that the REST API is called only once for each
set of parameter values, and the result is persisted/cached for subsequent
use.

3. I'm not sure exactly what you have in mind regarding extending this to
Spark Streaming. As it stands, this cannot be used as a Spark Streaming
receiver, because it does not implement the interfaces required for a custom
streaming receiver. But you can use it within a Spark Streaming application
as a regular data source and merge its output with the data you receive from
the streaming source.
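
To make point 1 concrete, here is a minimal sketch of flattening a nested
result with Spark SQL. The JSON shape is invented and only stands in for
whatever the REST API returns; a SparkSession `spark` (as in spark-shell) is
assumed:

import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

// Pretend this is one row returned by the REST data source.
val raw = spark.read.json(Seq(
  """{"query":"asthma","results":[{"id":1,"score":0.9},{"id":2,"score":0.7}]}"""
).toDS())

// Explode the nested array, then pull the struct fields up to top-level columns.
val flat = raw
  .select(col("query"), explode(col("results")).as("r"))
  .select(col("query"), col("r.id"), col("r.score"))

flat.show()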

Regards,
Sourav






Re: Custom Data Source for getting data from Rest based services

2017-11-26 Thread shankar.roy
This would be a useful feature.
We can leverage it while doing bulk provisioning.







Re: Custom Data Source for getting data from Rest based services

2017-11-22 Thread sathich
Hi Sourav,
This is quite a useful addition to the Spark family; it is a use case that
comes up more often than it is talked about, for example:
* getting third-party mapping data (geo coordinates),
* accessing database data through REST,
* downloading data from a bulk-data API service.


It will be really useful to be able to interact with the application layer
through a REST API and send data over to it (the POST-request case you
already mentioned).

I have a few follow-up thoughts:
1) What are your thoughts on the case where a REST API returns complex,
nested JSON data? Will it map seamlessly to a DataFrame, given that
DataFrames are flatter in nature?
2) How can this DataFrame be kept in a distributed cache on the Spark workers
so that it stays available and slow-changing data can be re-used (does
broadcast work on a DataFrame?)? This is related to your b).
3) The last case in my mind is how this can be extended to streaming:
controlling the frequency of the REST API calls and joining two DataFrames,
one slow-moving (maybe a lookup table in a database accessed over REST) and
one a fast-moving event stream (see the sketch after these points).
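
For point 3, a rough Structured Streaming sketch of a stream-static join with
the static side broadcast. The source, column names, and path are
placeholders, and this only shows one way the pattern could look, not
something the package does today:

import org.apache.spark.sql.functions.broadcast

// Slow-moving side: a small batch DataFrame (e.g. a snapshot fetched over REST),
// broadcast so every executor keeps a local copy; assumed to have a "key" column.
val lookup = spark.read.json("path/to/rest_snapshot.json")

// Fast-moving side: a streaming DataFrame of events with a "value" column.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Stream-static inner join; refreshing the lookup side means re-reading it
// on whatever schedule the slow-moving data changes.
val joined = events.join(broadcast(lookup), events("value") === lookup("key"))

val query = joined.writeStream.format("console").start()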


Thanks
Sathi