Re: DataSourceWriter V2 Api questions

2019-12-06 Thread Jungtaek Lim
Essentially each batch load goes into a separate MongoDB table and will result in view redefinition after a successful load. And finally, to avoid too many tables in a view, you may have to come up with a separate process to merge the underlying tables on a periodic basis.
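
A minimal sketch of that per-batch-collection-plus-view approach, using the MongoDB Java driver from Scala (all names here — the database, the view, and the batch collections — are hypothetical, and the $unionWith stage requires MongoDB 4.4+):

```scala
import com.mongodb.client.MongoClients
import org.bson.Document
import scala.collection.JavaConverters._

object RedefineView {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val db     = client.getDatabase("analytics")

    // Hypothetical per-batch collections produced by successful loads.
    val batches = Seq("events_batch_001", "events_batch_002", "events_batch_003")

    // Redefine the view after each successful load: drop the old definition,
    // then recreate it as a union over every loaded batch collection.
    db.getCollection("events_v").drop()
    val unionRest = batches.tail.map(c => new Document("$unionWith", c)).asJava
    db.createView("events_v", batches.head, unionRest)

    client.close()
  }
}
```

The "separate process to merge the underlying tables" mentioned above would periodically compact old batch collections into one and rerun this redefinition, keeping the view's pipeline short.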

Re: DataSourceWriter V2 Api questions

2019-12-05 Thread Wenchen Fan
…underlying tables on a periodic basis. It gets messy and probably moves you towards write-once-only tables, etc. Finally, using views in a generic MongoDB connector may not be good and flexible enough.

Re: DataSourceWriter V2 Api questions

2018-10-18 Thread Jungtaek Lim

Re: DataSourceWriter V2 Api questions

2018-09-13 Thread Thakrar, Jayesh

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Russell Spitzer

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Thakrar, Jayesh
I'm still not sure how the staging table helps for databases which do not have such atomicity guarantees. For example, in Cassandra, if you …

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Russell Spitzer
…writer. Actually it can be achieved only with an HDFS sink (or other filesystem-based sinks); other external storage is normally not feasible to implement it because there's no way to couple a …

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Arun Mahadevan
XA is also not a trivial one to get correct with the current execution model: Spark doesn't require writer tasks to run at the same time …

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Ross Lawley

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
…would also integrate 2PC with its checkpointing mechanism to guarantee completeness of the batch. And it might require different integration for continuous mode. — Jungtaek Lim (HeartSaVioR)

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Russell Spitzer
… 4:37 AM, Arun Mahadevan wrote: In some cases the implementations may be OK with eventual consistency (and do not care if the output is written out atomically). XA can be …

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Wenchen Fan
Maybe we need to discuss improvements at the DataSource V2 API level (e.g., individual tasks would "prepare" for commit, and once the driver receives "prepared" from all the tasks, a "commit" would be invoked at each of the individual tasks). Right now the responsibility of the final "commit" is with the driver, and it may not always be possible for the driver to take over the transactions started by the tasks.
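
A minimal sketch of the task-level two-phase commit being proposed here (a hypothetical interface, not the actual Spark API): each task stages its writes durably and reports "prepared", the driver collects all acknowledgements, and only then are the participants told to finalize.

```scala
// Hypothetical 2PC writer interface (an assumption, not part of Spark):
// phase 1 stages everything durably, phase 2 makes it visible.
trait TwoPhaseDataWriter[T] {
  def write(record: T): Unit

  // Phase 1, task side: flush and durably stage all buffered writes.
  // The returned handle lets the driver commit or abort this participant later.
  def prepare(): PreparedHandle

  // Phase 2, invoked per task once the driver has every "prepared" ack.
  // Must be idempotent: the driver may retry after a partial failure.
  def commit(handle: PreparedHandle): Unit

  // Roll back a staged-but-uncommitted participant.
  def abort(handle: PreparedHandle): Unit
}

// Opaque token identifying one participant's staged writes (assumption).
final case class PreparedHandle(participantId: String, epoch: Long)
```

The handle is what addresses the concern above: the driver never has to "take over" a task's transaction, it only needs a durable token it can replay commit or abort against.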

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
;>>>> receives "prepared" from all the tasks, a "commit" would be invoked at >>>>>> each >>>>>> of the individual tasks). Right now the responsibility of the final >>>>>> "commit" is with the driver and it may not always

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim
…That is, you can't have two clients doing writes and coordinating a single transaction. That's certainly the case for almost all relational databases. Spark, on the other hand, will have multiple clients (consider each task a client) writing to the …

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Dilip Biswal
Thanks a lot Reynold and Jungtaek Lim. It definitely helped me understand this better. Regards, Dilip Biswal

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
Perhaps we can explore the two-phase commit protocol (aka XA) for this? Not sure how easy it is to implement this though :-) — Dilip Biswal

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Jungtaek Lim

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Arun Mahadevan

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Dilip Biswal
This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a transaction is per client. That is, you can't have two clients doing writes and coordinating a single transaction. That's certainly the case for almost all relational databases.

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Russell Spitzer
I think I mentioned on the Design Doc that with the Cassandra connector we have similar issues. There is no "transaction" or "staging table" capable of really doing what the API requires.

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
I don't think the problem is just whether we have a starting point for write. As a matter of fact there's always a starting point for write, whether it is explicit or implicit. This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a transaction is per client …

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
Ross, I think the intent is to create a single transaction on the driver, write as part of it in each task, and then commit the transaction once the tasks complete. Is that possible in your implementation? I think that part of this is made more difficult by not having a clear starting point for a write …
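
For reference, this is roughly how commit responsibility is split in the V2 writer API Ryan is describing — a schematic sketch against the Spark 2.3-era org.apache.spark.sql.sources.v2.writer interfaces (the factory method's signature changed between 2.3 and 2.4, and the Mongo-flavored class names and method bodies are assumptions):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.writer._

// Driver side: commit() runs once, only after every task has committed.
class MongoDataSourceWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[Row] = new MongoWriterFactory

  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // Make the whole job's output visible in one step, e.g. publish what the
    // tasks staged (assumption -- the sink must supply this atomicity).
  }

  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // Clean up whatever the tasks staged.
  }
}

class MongoWriterFactory extends DataWriterFactory[Row] {
  // Task side: one DataWriter per partition attempt.
  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[Row] =
    new MongoDataWriter(partitionId)
}

class MongoDataWriter(partitionId: Int) extends DataWriter[Row] {
  override def write(row: Row): Unit = {
    // Stage the row somewhere invisible to readers (assumption).
  }

  // Task-side commit: report success back to the driver.
  override def commit(): WriterCommitMessage = MongoCommitMessage(partitionId)

  override def abort(): Unit = {
    // Discard this task's staged rows.
  }
}

case class MongoCommitMessage(partitionId: Int) extends WriterCommitMessage
```

The thread's disagreement is precisely about the driver-side commit: for sinks without transactions or atomic renames, there is nothing for DataSourceWriter.commit() to do that actually publishes all the staged writes at once.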

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Reynold Xin
Typically people do it via transactions, or staging tables. On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley wrote: > Hi all, I've been prototyping an implementation of the DataSource V2 writer for the MongoDB Spark Connector and I have a couple of questions about how it's intended to be …
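
A minimal sketch of the staging-table route for a MongoDB sink (all names are hypothetical; the point is that MongoDB's server-side renameCollection gives the driver a single publish step):

```scala
import com.mongodb.MongoNamespace
import com.mongodb.client.MongoClients
import com.mongodb.client.model.RenameCollectionOptions
import org.bson.Document

object StagingCommit {
  def main(args: Array[String]): Unit = {
    val client  = MongoClients.create("mongodb://localhost:27017")
    val db      = client.getDatabase("analytics")
    val staging = db.getCollection("events_staging_job42")

    // In the real flow, each Spark task would insert its partition here;
    // a single insert stands in for that.
    staging.insertOne(new Document("k", "v"))

    // Driver-side commit: one rename publishes the staged data in a single
    // step. dropTarget(true) replaces any previous "events" collection,
    // modeling overwrite semantics.
    staging.renameCollection(
      new MongoNamespace("analytics", "events"),
      new RenameCollectionOptions().dropTarget(true))

    client.close()
  }
}
```

This works for whole-collection overwrite; appending into an existing collection is where the thread's concerns about missing transactional guarantees bite.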