>>> Essentially each batch load goes into a separate MongoDB table and will
>>> result in view redefinition after a successful load.
>>>
>>> And finally, to avoid too many tables in a view, you may have to come up
>>> with a separate process to merge the underlying tables on a periodic basis.
>>
>> It gets messy and probably pushes you towards write-once-only tables,
>> etc.
>>
>>
>>
>> Finally, using views in a generic MongoDB connector may not be good or
>> flexible enough.
>>
>>
>
> *From: *Russell Spitzer
> *Date: *Tuesday, September 11, 2018 at 9:58 AM
> *To: *"Thakrar, Jayesh"
> *Cc: *Arun Mahadevan, Jungtaek Lim, Wenchen Fan, Reynold Xin,
> Ross Lawley, Ryan Blue, dev <dev@spark.apache.org>, dbis...@us.ibm.com
> *Subject: *Re: DataSourceWriter V2 Api
>
> *From: *Russell Spitzer
> *Date: *Tuesday, September 11, 2018 at 9:08 AM
> *To: *Arun Mahadevan
> *Cc: *Jungtaek Lim, Wenchen Fan, Reynold Xin, Ross Lawley,
> Ryan Blue, dev <dev@spark.apache.org>, dbis...@us.ibm.com
>
> *Subject: *Re: DataSourceWriter V2 Api
Subject: Re: DataSourceWriter V2 Api questions
I'm still not sure how the staging table helps for databases which do not have
such atomicity guarantees. For example in Cassandra if you
>>>>>>>>>> writer. Actually it can be achieved only with HDFS sink (or other
>>>>>>>>>> filesystem based sinks); other external storages are normally not
>>>>>>>>>> feasible to implement it with, because there's no way to couple a
>>>>>>>>>>
>>>>>>>>> XA is also not a trivial one to get correct with the current
>>>>>>>>> execution model: Spark doesn't require writer tasks to run at the
>>>>>>>>> same time, whereas XA would require every participant to be
>>>>>>>>> available up to the final commit.
>>>>>>>>>
>>>>>>>> Spark should also integrate 2PC with its checkpointing mechanism to
>>>>>>>> guarantee completeness of a batch. And it might require different
>>>>>>>> integration for continuous mode.
>>>>>>>>
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>> At 4:37 AM, Arun Mahadevan wrote:
>>>>>>>
>>>>>>>> In some cases the implementations may be ok with eventual
>>>>>>>> consistency (and do not care if the output is written out
>>>>>>>> atomically).
>>>>>>>>
>>>>>>> XA can be one of the options. Maybe we need to discuss improvements
>>>>>>> at the DataSource V2 API level (e.g. individual tasks would
>>>>>>> "prepare" for commit and once the driver receives "prepared" from
>>>>>>> all the tasks, a "commit" would be invoked at each of the individual
>>>>>>> tasks). Right now the responsibility of the final "commit" is with
>>>>>>> the driver and it may not always be possible for the driver to take
>>>>>>> over the transactions started by the tasks.
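The prepare/commit flow proposed above can be sketched without any Spark dependency. The following is a minimal, self-contained simulation; all class and method names are hypothetical and are not the real DataSource V2 interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two-phase flow: each task "prepares" (stages its rows)
// and the driver invokes "commit" on every task only after all of them
// reported prepared; otherwise everything is aborted.
public class TwoPhaseCommitSketch {

    static class TaskWriter {
        final int taskId;
        final List<String> staged = new ArrayList<>();
        boolean prepared = false;

        TaskWriter(int taskId) { this.taskId = taskId; }

        void write(String row) { staged.add(row); }

        // Phase 1: make staged rows durable and vote "prepared".
        boolean prepare() { prepared = true; return prepared; }

        // Phase 2: publish staged rows; only valid after a global "prepared".
        void commit(List<String> destination) {
            if (!prepared) throw new IllegalStateException("commit before prepare");
            destination.addAll(staged);
        }

        void abort() { staged.clear(); }
    }

    public static List<String> run(int numTasks) {
        List<String> table = new ArrayList<>();
        List<TaskWriter> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            TaskWriter t = new TaskWriter(i);
            t.write("row-" + i);
            tasks.add(t);
        }
        // Driver-side coordination: collect votes, then commit or abort all.
        boolean allPrepared = tasks.stream().allMatch(TaskWriter::prepare);
        if (allPrepared) {
            for (TaskWriter t : tasks) t.commit(table);
        } else {
            for (TaskWriter t : tasks) t.abort();
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(run(3));  // all-or-nothing publish of 3 task outputs
    }
}
```

The key difference from today's API, as discussed above, is that the commit decision is taken only after every task has voted, rather than each task committing independently.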
>>>>>
>>>>> On Mon, 10 Sep 2018 at 11:48, Dilip Biswal wrote:
>>>>>
>>>>> This is a pretty big challenge in general for data sources -- for the
>>>>> vast majority of data stores, the boundary of a transaction is per
>>>>> client. That is, you can't have two clients doing writes and
>>>>> coordinating a single transaction. That's certainly the case for
>>>>> almost all relational databases. Spark, on the other hand, will have
>>>>> multiple clients (consider each task a client) writing to the same
>>>>> table.
Thanks a lot, Reynold and Jungtaek Lim. It definitely helped me understand
this better.

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com

- Original message -
From: Reynold Xin
To: kabh...@gmail.com
Cc: ar...@apache.org, dbis...@us.ibm.com, dev, Ryan Blue, Ross Lawley
Subject: Re: DataSourceWriter V2 Api questions
>>> Perhaps we can explore the two-phase commit protocol (aka XA) for this?
>>> Not sure how easy it is to implement this though :-)
>>>
>>> Regards,
>>> Dilip Biswal
>>> Tel: 408-463-4980
>>> dbis...@us.ibm.com
>>>
>>
>> - Original message -
>> From: Reynold Xin
>> To: Ryan Blue
>> Cc: ross.law...@gmail.com, dev
>> Subject: Re: DataSourceWriter V2 Api questions
>> Date: Mon, Sep 10, 2018 10:26 AM
>>
>> I don't think the problem is just whether we have a starting point for
>> write. As a matter of fact there's always a starting point for write,
>> whether it is explicit or implicit.
This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a transaction is per client. That is, you can't have two clients doing writes and coordinating a single transaction. That's certainly the case for almost all relational databases.
I think I mentioned on the Design Doc that with the Cassandra connector we
have similar issues. There is no "transaction" or "staging table" capable
of really doing what the API requires.
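For a store like Cassandra with no cross-client transactions or staging tables, connectors typically fall back on idempotent upserts keyed by primary key, so that a retried or replayed task converges to the same state instead of duplicating rows. A toy sketch of that property (a `HashMap` stands in for the target table; all names are illustrative, not the connector's real API):

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent writes: re-running the same task attempt produces the same
// table state, because each row is an upsert by primary key
// (last-write-wins, like a Cassandra INSERT).
public class IdempotentWriteSketch {

    static void upsert(Map<String, String> table, String key, String value) {
        table.put(key, value);
    }

    public static Map<String, String> writeWithRetry() {
        Map<String, String> table = new HashMap<>();
        // The second pass simulates a failed task being retried and
        // replaying exactly the same rows.
        for (int attempt = 0; attempt < 2; attempt++) {
            upsert(table, "id-1", "alice");
            upsert(table, "id-2", "bob");
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(writeWithRetry().size());  // 2: the replay is harmless
    }
}
```

This gives at-least-once delivery with convergent results rather than atomic commit, which matches the "eventual consistency" position taken earlier in the thread.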
On Mon, Sep 10, 2018 at 12:26 PM Reynold Xin wrote:
I don't think the problem is just whether we have a starting point for
write. As a matter of fact there's always a starting point for write,
whether it is explicit or implicit.
This is a pretty big challenge in general for data sources -- for the vast
majority of data stores, the boundary of a transaction is per client.
Ross, I think the intent is to create a single transaction on the driver,
write as part of it in each task, and then commit the transaction once the
tasks complete. Is that possible in your implementation?
I think that part of this is made more difficult by not having a clear
starting point for a write.
Typically people do it via transactions, or staging tables.
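The staging-table approach mentioned here can be illustrated with its filesystem analogue: tasks write into a staging location, and the driver "commits" by atomically moving the staged output into the final location, so readers see either nothing or the complete result. A minimal sketch (paths and helper names are assumptions, not any connector's real API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Staging-then-publish: partial task output is only ever visible under
// _staging; one atomic rename by the driver publishes everything at once.
public class StagingCommitSketch {

    public static Path commitViaStaging(Path dir) throws IOException {
        Path staging = dir.resolve("_staging");
        Files.createDirectories(staging);
        // "Tasks" write their partitions into the staging directory.
        Files.writeString(staging.resolve("part-0"), "row-a\n");
        Files.writeString(staging.resolve("part-1"), "row-b\n");
        // Driver-side commit: a single atomic rename on the same filesystem.
        Path committed = dir.resolve("output");
        Files.move(staging, committed, StandardCopyOption.ATOMIC_MOVE);
        return committed;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staging-demo");
        Path committed = commitViaStaging(dir);
        System.out.println(Files.exists(committed.resolve("part-0")));  // true
    }
}
```

In a database the equivalent is writing to a staging table and committing with a rename or `INSERT ... SELECT` inside one transaction; the point of the thread is that stores like Cassandra offer no such atomic publish step.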
On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley wrote:
> Hi all,
>
> I've been prototyping an implementation of the DataSource V2 writer for
> the MongoDB Spark Connector and I have a couple of questions about how
> it's intended to be used.