Speakers needed for Apache DC Roadshow

2018-09-11 Thread Rich Bowen
We need your help to make the Apache Washington DC Roadshow on Dec 4th a success. What do we need most? Speakers! We're bringing a unique DC flavor to this event by mixing Open Source Software with talks about Apache projects as well as OSS CyberSecurity, OSS in Government and OSS Career

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Russell Spitzer
That only works assuming that Spark is the only client of the table. It will be impossible to force an outside user to respect the special metadata table when reading, so they will still see all of the data in transit. Additionally, this would force the incoming data to only be written into new

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Thakrar, Jayesh
So if Spark and the destination datastore are both non-transactional, you will have to resort to an external mechanism for “transactionality”. Here are some options for both RDBMS and non-transactional datastore destinations. For now, assuming that Spark is used in batch mode (and not streaming
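For illustration, a rough sketch of the staging-table option for a JDBC destination: the executors write into a staging table, and the driver then promotes the staged rows inside a single database transaction. The table names, URL and credentials below are placeholders, not anything taken from this thread.

import java.sql.DriverManager
import org.apache.spark.sql.{SaveMode, SparkSession}

object StagingTableWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("staging-write").getOrCreate()
    val df = spark.read.parquet("/data/input")            // placeholder input path

    val url   = "jdbc:postgresql://db-host:5432/mydb"     // placeholder connection
    val props = new java.util.Properties()
    props.setProperty("user", "writer")
    props.setProperty("password", "secret")

    // Phase 1: executors write (non-transactionally) into a staging table.
    df.write.mode(SaveMode.Overwrite).jdbc(url, "events_staging", props)

    // Phase 2: the driver promotes the staged rows inside one DB transaction;
    // this is where the "transactionality" actually comes from.
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      stmt.executeUpdate("INSERT INTO events SELECT * FROM events_staging")
      stmt.executeUpdate("DROP TABLE events_staging")
      conn.commit()
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }
  }
}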

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Russell Spitzer
I'm still not sure how the staging table helps for databases which do not have such atomicity guarantees. For example, in Cassandra, if you wrote all of the data temporarily to a staging table, we would still have the same problem in moving the data from the staging table into the real table. We

Is CBO broken?

2018-09-11 Thread emlyn
I was trying to enable CBO on one of our jobs (using Spark 2.3.1 with partitioned parquet data) but it seemed that the rowCount statistics were being ignored. I found this JIRA which seems to describe the same issue: https://issues.apache.org/jira/browse/SPARK-25185, but it has no response so far.
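For context, a minimal sketch of how CBO is normally switched on and fed statistics in Spark 2.3; the table and column names below are placeholders, not taken from the job in question.

import org.apache.spark.sql.SparkSession

// Sketch only: enable the cost-based optimizer and collect table/column stats.
// "sales", "customer_id" and "amount" are made-up names.
val spark = SparkSession.builder()
  .appName("cbo-check")
  .config("spark.sql.cbo.enabled", "true")
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()

// CBO only has rowCount/size estimates to work with after ANALYZE has run.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

// DESCRIBE EXTENDED shows the statistics recorded in the catalog, which is one
// way to confirm the ANALYZE commands took effect before debugging the plan.
spark.sql("DESCRIBE EXTENDED sales").show(100, truncate = false)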

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Arun Mahadevan
>Some say it is exactly-once when the output is eventually exactly-once, whereas others say there should be no side effects, e.g. a consumer shouldn't see a partial write. I guess 2PC is the former, since some partitions can commit earlier while other partitions fail to commit for some time.

Off Heap Memory

2018-09-11 Thread Jack Kolokasis
Hello, I recently started studying Spark's memory management system. More specifically, I want to understand how Spark uses off-heap memory. Internally, I saw that there are two types of off-heap memory (offHeapExecutionMemoryPool and offHeapStorageMemoryPool). How does Spark use the
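For reference, the user-facing knobs behind that split are the off-heap memory configs; a minimal sketch, with an arbitrary example size:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: turning on off-heap memory. With these settings the unified memory
// manager carves the off-heap region into an execution pool and a storage pool
// (the offHeapExecutionMemoryPool / offHeapStorageMemoryPool mentioned above),
// mirroring the on-heap split. The 2g size is just an example value.
val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()

// Caching a Dataset off-heap draws from the off-heap storage pool.
val df = spark.range(0, 1000000)
df.persist(StorageLevel.OFF_HEAP)
df.count()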

Re: DataSourceWriter V2 Api questions

2018-09-11 Thread Ross Lawley
Hi, Thanks all for the comments and discussion regarding the API! It sounds like the current expectation for database systems is to populate a staging table in the tasks and have the driver move that data when commit is called. That would work for many use cases that our users have with the MongoDB
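To make that flow concrete, here is a highly simplified sketch of the "stage in tasks, move on commit" pattern. It only mirrors the shape of the DataSourceWriter V2 contract (per-task writers returning commit messages, a single driver-side commit/abort); none of the type or method names below are the actual Spark or MongoDB connector API.

// Commit message produced by each task: where it staged its rows.
case class StagedCollection(name: String)

trait SketchWriter {
  // Runs on each executor task: write rows somewhere temporary.
  def writeTask(partitionId: Int, rows: Iterator[Map[String, Any]]): StagedCollection

  // Runs once on the driver after every task has succeeded.
  def commit(staged: Seq[StagedCollection]): Unit

  // Runs on the driver if any task failed.
  def abort(staged: Seq[StagedCollection]): Unit
}

class MongoLikeSketchWriter extends SketchWriter {
  override def writeTask(partitionId: Int, rows: Iterator[Map[String, Any]]): StagedCollection = {
    val staging = StagedCollection(s"target_staging_$partitionId")
    // Insert `rows` into the staging collection here; the driver never sees the data,
    // only this commit message.
    staging
  }

  override def commit(staged: Seq[StagedCollection]): Unit = {
    // Promote each staging collection into the real collection. For stores without
    // multi-collection transactions this step is not atomic, which is exactly the
    // concern raised earlier in the thread.
    staged.foreach(s => println(s"promote ${s.name} -> target"))
  }

  override def abort(staged: Seq[StagedCollection]): Unit =
    staged.foreach(s => println(s"drop ${s.name}"))
}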