Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-13 Thread Ryan Blue
. Anyone else want to raise an issue with the proposal, or is it about time to bring up a vote thread? rb On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue wrote: > I don’t think that we want to block this work until we have a public and > stable Expression. Like our decision to expose Internal

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Ryan Blue
I think I found a good solution to the problem of using Expression in the TableCatalog API and in the DeleteSupport API. For DeleteSupport, there is already a stable and public subset of Expression named Filter that can be used to pass filters. The reason why DeleteSupport would use Expression is
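
A minimal sketch of that idea, assuming a hypothetical DeleteSupport mixin and method name; only the Filter classes (EqualTo, GreaterThan, etc.) are Spark's existing public filter API:

import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Hypothetical mixin -- not the interface from the thread, only the shape of the idea.
trait DeleteSupport {
  // Delete every row that matches all of the given filters (ANDed together).
  def deleteWhere(filters: Array[Filter]): Unit
}

object DeleteSupportExample extends App {
  val table = new DeleteSupport {
    override def deleteWhere(filters: Array[Filter]): Unit =
      println(s"deleting rows where ${filters.mkString(" AND ")}")
  }
  // Roughly: DELETE FROM t WHERE day = '2018-08-15' AND hour > 10
  table.deleteWhere(Array[Filter](EqualTo("day", "2018-08-15"), GreaterThan("hour", 10)))
}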

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-08-15 Thread Ryan Blue
PI, similar to what we did for > dsv1. > > If we are depending on Expressions on the more common APIs in dsv2 > already, we should revisit that. > > > > > On Mon, Aug 13, 2018 at 1:59 PM Ryan Blue wrote: > >> Reynold, did you get a chance to look at my response about

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-20 Thread Ryan Blue
e should be supported anyway, I was > thinking we could just orthogonally proceed. If you guys think other issues > should be resolved first, I think we (at least I will) should take a look > for the set of catalog APIs. > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] USING syntax for Datasource V2

2018-08-21 Thread Ryan Blue
entation > > Thanks for your time, > Russ > > On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue > wrote: > >> Thanks for posting this discussion to the dev list, it would be great to >> hear what everyone thinks about the idea that USING should be a >> catalog-specific

[RESULT] [VOTE] SPIP: Standardize SQL logical plans

2018-07-20 Thread Ryan Blue
This vote passes with 4 binding +1s and 9 community +1s. Thanks for taking the time to vote, everyone! Binding votes: Wenchen Fan Xiao Li Reynold Xin Felix Cheung Non-binding votes: Ryan Blue John Zhuge Takeshi Yamamuro Marco Gaido Russel Spitzer Alessandro Solimando Henry Robinson Dongjoon

[VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
rk should adopt the SPIP [-1]: Spark should not adopt the SPIP because . . . Thanks for voting, everyone! -- Ryan Blue

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
ple can > jump > > in during the development. I'm interested in the new API and like to > work on > > it after the vote passes. > > > > Thanks, > > Wenchen > > > > On Fri, Jul 13, 2018 at 7:25 AM Ryan Blue wrote: > >> > >> Thanks! I'm a

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Ryan Blue
+1 (not binding) On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue wrote: > Hi everyone, > > From discussion on the proposal doc and the discussion thread, I think we > have consensus around the plan to standardize logical write operations for > DataSourceV2. I would like

[DISCUSS] Multiple catalog support

2018-07-23 Thread Ryan Blue
continue to use the property to determine the table’s data source or format implementation. Other table catalog implementations would be free to interpret the format string as they choose or to use it to choose a data source implementation as in the default catalog. rb ​ -- Ryan Blue Software Engineer Netflix
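
As a rough illustration of that behavior, with a hypothetical property name and catalog type (not code from the proposal):

// Hypothetical sketch: a catalog picking a source implementation from a table property.
case class TableMetadata(name: String, properties: Map[String, String])

object ProviderLookup {
  // The default catalog could keep today's behavior; other catalogs may interpret
  // the format string differently or ignore it. Property name is illustrative.
  def formatFor(table: TableMetadata): String =
    table.properties.getOrElse("provider", "parquet")

  def main(args: Array[String]): Unit =
    println(formatFor(TableMetadata("db.events", Map("provider" -> "orc"))))  // prints: orc
}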

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-18 Thread Ryan Blue
;>>>>> Note. RC2 was cancelled because of one blocking issue SPARK-24781 >>>>>>> during release preparation. >>>>>>> >>>>>>> FAQ >>>>>>> >>>>>>> = >>>>>>> How can I help test this release? >>>>>>> = >>>>>>> >>>>>>> If you are a Spark user, you can help us test this release by taking >>>>>>> an existing Spark workload and running on this release candidate, >>>>>>> then >>>>>>> reporting any regressions. >>>>>>> >>>>>>> If you're working in PySpark you can set up a virtual env and install >>>>>>> the current RC and see if anything important breaks, in the >>>>>>> Java/Scala >>>>>>> you can add the staging repository to your projects resolvers and >>>>>>> test >>>>>>> with the RC (make sure to clean up the artifact cache before/after so >>>>>>> you don't end up building with a out of date RC going forward). >>>>>>> >>>>>>> === >>>>>>> What should happen to JIRA tickets still targeting 2.3.2? >>>>>>> === >>>>>>> >>>>>>> The current list of open tickets targeted at 2.3.2 can be found at: >>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target >>>>>>> Version/s" = 2.3.2 >>>>>>> >>>>>>> Committers should look at those and triage. Extremely important bug >>>>>>> fixes, documentation, and API tweaks that impact compatibility should >>>>>>> be worked on immediately. Everything else please retarget to an >>>>>>> appropriate release. >>>>>>> >>>>>>> == >>>>>>> But my bug isn't fixed? >>>>>>> == >>>>>>> >>>>>>> In order to make timely releases, we will typically not hold the >>>>>>> release unless the bug in question is a regression from the previous >>>>>>> release. That being said, if there is something which is a regression >>>>>>> that has not been correctly targeted please ping me or a committer to >>>>>>> >>>>>> help target the issue. >>>>>>> >>>>>> >>>>>>> >>>>>>> -- >>>>>>> John Zhuge >>>>>>> >>>>>> -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Multiple catalog support

2018-07-25 Thread Ryan Blue
Quick update: I've updated my PR to add the table catalog API to implement this proposal. Here's the PR: https://github.com/apache/spark/pull/21306 On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote: > Lately, I’ve been working on implementing the new SQL logical plans. I’m > currently b

[DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-24 Thread Ryan Blue
IP is for the APIs and does not cover how multiple catalogs would be exposed. I started a separate discussion thread on how to access multiple catalogs and maintain compatibility with Spark’s current behavior (how to get the catalog instance in the above example). Please use this thread to discuss the proposed APIs. Thanks, everyone! rb ​ -- Ryan Blue Software Engineer Netflix
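
For orientation, a simplified sketch of what a table metadata catalog could look like; the method names and types below are assumptions for illustration, not the SPIP's exact API:

// Illustrative-only sketch of a table metadata catalog.
case class TableSpec(
    name: String,
    schema: Seq[(String, String)],              // column name -> type string
    properties: Map[String, String])

trait TableCatalog {
  def loadTable(name: String): TableSpec                                  // fail if missing
  def createTable(name: String,
                  schema: Seq[(String, String)],
                  properties: Map[String, String]): TableSpec
  def alterTable(name: String, newProperties: Map[String, String]): TableSpec
  def dropTable(name: String): Boolean                                    // true if it existed
}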

Re: data source api v2 refactoring

2018-08-31 Thread Ryan Blue
the above: >> >> 1. Creates an explicit Table abstraction, and an explicit Scan >> abstraction. >> >> 2. Have an explicit Stream level and makes it clear pushdowns and options >> are handled there, rather than at the individual scan (ReadSupport) level. >> Data source implementations don't need to worry about pushdowns or options >> changing mid-stream. For batch, those happen when the scan object is >> created. >> >> >> >> This email is just a high level sketch. I've asked Wenchen to prototype >> this, to see if it is actually feasible and the degree of hacks it removes, >> or creates. >> >> >> -- Ryan Blue Software Engineer Netflix

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Ryan Blue
th the > above? > > At a high level, I think the Heilmeier's Catechism emphasizes less about > the "how", and more the "why" and "what", which is what I'd argue SPIPs > should be about. The hows should be left in design docs for larger projects. > > > -- Ryan Blue Software Engineer Netflix

Re: data source api v2 refactoring

2018-09-01 Thread Ryan Blue
with ScanConfig. > For streaming source, stream is the one to take care of the pushdown > result. For batch source, it's the scan. > > It's a little tricky because stream is an abstraction for streaming source > only. Better ideas are welcome! > > On Sat, Sep 1, 2018 at 7:26 AM Ry

Fwd: data source api v2 refactoring

2018-09-04 Thread Ryan Blue
Latest from Wenchen in case it was dropped. -- Forwarded message - From: Wenchen Fan Date: Mon, Sep 3, 2018 at 6:16 AM Subject: Re: data source api v2 refactoring To: Cc: Ryan Blue , Reynold Xin , < dev@spark.apache.org> Hi Mridul, I'm not sure what's going on, my

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-09-04 Thread Ryan Blue
1 gives Spark the opportunity > to enforce column references are valid (but not the actual function names), > whereas option 2 would be up to the data sources to validate. > > > > On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue wrote: > >> I think I found a good solution to th

Re: data source api v2 refactoring

2018-09-06 Thread Ryan Blue
> trait Table { > LogicalWrite newAppendWrite(); > > LogicalWrite newDeleteWrite(deleteExprs); > } > > > It looks to me that the API is simpler without WriteConfig, what do you > think? > > Thanks, > Wenchen > > On Wed, Sep 5, 2018 at 4:24 AM Ryan Blue > wrote:

Re: Branch 2.4 is cut

2018-09-09 Thread Ryan Blue
> > > -- > Shane Knapp > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > -- Ryan Blue Software Engineer Netflix

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
dates for >>>>>> consideration): >>>>>> >>> >>>>>> >>> 1. Support Scala 2.12. >>>>>> >>> >>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) >>>>>> deprecated in Spark 2.x. >>>>>> >>> >>>>>> >>> 3. Shade all dependencies. >>>>>> >>> >>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL >>>>>> compliant, to prevent users from shooting themselves in the foot, e.g. >>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it >>>>>> less painful for users to upgrade here, I’d suggest creating a flag for >>>>>> backward compatibility mode. >>>>>> >>> >>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL >>>>>> more standard compliant, and have a flag for backward compatibility. >>>>>> >>> >>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already >>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, >>>>>> not >>>>>> Iterator”, “Prevent column name duplication in temporary view”). >>>>>> >>> >>>>>> >>> >>>>>> >>> Now the reality of a major version bump is that the world often >>>>>> thinks in terms of what exciting features are coming. I do think there >>>>>> are >>>>>> a number of major changes happening already that can be part of the 3.0 >>>>>> release, if they make it in: >>>>>> >>> >>>>>> >>> 1. Scala 2.12 support (listing it twice) >>>>>> >>> 2. Continuous Processing non-experimental >>>>>> >>> 3. Kubernetes support non-experimental >>>>>> >>> 4. A more flushed out version of data source API v2 (I don’t >>>>>> think it is realistic to stabilize that in one release) >>>>>> >>> 5. Hadoop 3.0 support >>>>>> >>> 6. ... >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the >>>>>> framework and whether it’d make sense to create Spark 3.0 as the next >>>>>> release, rather than the individual feature requests. Those are important >>>>>> but are best done in their own separate threads. >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Vaquar Khan >>>> +1 -224-436-0783 >>>> Greater Chicago >>>> >>> >>> >>> >>> -- >>> Regards, >>> Vaquar Khan >>> +1 -224-436-0783 >>> Greater Chicago >>> >> >> >> >> -- >> Regards, >> Vaquar Khan >> +1 -224-436-0783 >> Greater Chicago >> > -- Ryan Blue Software Engineer Netflix

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
a major version update to get it? > > I generally support moving on to 3.x so we can also jettison a lot of > older dependencies, code, fix some long standing issues, etc. > > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4) > > On Thu, Sep 6, 2018 at 9:10 AM

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
also jettison a lot of >> older dependencies, code, fix some long standing issues, etc. >> >> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4) >> >> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue >> wrote: >> >>> My concern is that the v2 data

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
> state. > > > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue > wrote: > >> It would be great to get more features out incrementally. For >> experimental features, do we have more relaxed constraints? >> >> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin wrote: >

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
and is >> discoverable - thereby breaking the documented contract. >> >> I was wondering how other databases systems plan to implement this API >> and meet the contract as per the Javadoc? >> >> Many thanks >> >> Ross >> > -- Ryan Blue Software Engineer Netflix

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
>>>>>> receives "prepared" from all the tasks, a "commit" would be invoked at >>>>>> each >>>>>> of the individual tasks). Right now the responsibility of the final >>>>>> "commit" is with the driver and it may not always

Re: DataSourceWriter V2 Api questions

2018-09-10 Thread Ryan Blue
"commit" would be invoked at >>>>>> each >>>>>> of the individual tasks). Right now the responsibility of the final >>>>>> "commit" is with the driver and it may not always be possible for the >>>>>> driver to take over the transact

Re: data source api v2 refactoring

2018-09-07 Thread Ryan Blue
8 at 3:02 PM Hyukjin Kwon wrote: > >> BTW, do we hold Datasource V2 related PRs for now until we finish this >> refactoring just for clarification? >> >> On Fri, Sep 7, 2018 at 12:52 AM, Ryan Blue wrote: >> >>> Wenchen, >>> >>> I'm not really su

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-12 Thread Ryan Blue
hanks, > Wenchen > > On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue > wrote: > >> Hi everyone, >> >> A few weeks ago, I wrote up a proposal to standardize SQL logical plans >> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?

Re: [Spark][Scheduler] Spark DAGScheduler scheduling performance hindered on JobSubmitted Event

2018-03-06 Thread Ryan Blue
g.apache.spark.scheduler.DAGScheduler#createResultStage >> >> >> >> I can see the effect of doing this may be that Job Submissions may not be >> FIFO depending on how much time Step 1 mentioned above is going to consume. >> >> >> >> Does above solution suffice for the problem described? And is there any >> other side effect of this solution? >> >> >> >> Regards >> >> Ajith >> > > -- Ryan Blue Software Engineer Netflix

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Ryan Blue
My guess is that we wouldn't want to upgrade to a new minor version of > Parquet for a Spark maintenance release, so asking for a Parquet > maintenance release makes sense. > > What does everyone think? > > Best, > Henry > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 write input requirements

2018-04-06 Thread Ryan Blue
Ted Yu <yuzhih...@gmail.com> wrote: > >> +1 >> >> ---- Original message >> From: Ryan Blue <rb...@netflix.com> >> Date: 3/30/18 2:28 PM (GMT-08:00) >> To: Patrick Woody <patrick.woo...@gmail.com> >> Cc: Russell Spitzer <

Re: [discuss][data source v2] remove type parameter in DataReader/WriterFactory

2018-04-18 Thread Ryan Blue
raits go away. And the ORC data source can also be simplified >> to >> >> class OrcReaderFactory(...) extends DataReaderFactory { >> def createUnsafeRowReader ... >> >> def createColumnarBatchReader ... >> } >> >> class OrcDataSourceReader extends DataSourceReader { >> def createReadFactories = ... // logic to prepare the parameters and >> create factories >> } >> >> We also have a potential benefit of supporting hybrid storage data >> source, which may keep real-time data in row format, and history data in >> columnar format. Then they can make some DataReaderFactory output >> InternalRow and some output ColumnarBatch. >> >> Thoughts? >> > > -- Ryan Blue Software Engineer Netflix

Re: Correlated subqueries in the DataFrame API

2018-04-19 Thread Ryan Blue
ache.org/jira/browse/SPARK-18455>, but it's not clear > to me whether they are "design-appropriate" for the DataFrame API. > > Are correlated subqueries a thing we can expect to have in the DataFrame > API? > > Nick > > -- Ryan Blue Software Engineer Netflix

[DISCUSS] SPIP: Standardize SQL logical plans

2018-04-19 Thread Ryan Blue
to get any remaining discussion going or get anyone that missed this to read through the docs. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ryan Blue
validation/assumptions of the table before attempting the write. > > Thanks! > Pat > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 write input requirements

2018-03-26 Thread Ryan Blue
an determine the order of Expression's by looking at what >> requiredOrdering() >> returns. >> >> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid> >> wrote: >> >>> Hi Pat, >>> >>> Thanks for starting the discussion on thi
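
A hedged sketch of the interface shape being discussed, using plain column names instead of Expression/SortOrder for brevity; all names below are hypothetical:

// Hypothetical sketch of declaring write input requirements.
sealed trait Direction
case object Ascending extends Direction
case class Order(column: String, direction: Direction = Ascending)

trait SupportsWriteRequirements {
  // Rows sharing these values should be routed to the same write task.
  def requiredClustering: Seq[String]
  // Within each task, rows should arrive in this order.
  def requiredOrdering: Seq[Order]
}

// Example: a sink that wants input clustered by day and sorted by (day, hour).
object ExampleSink extends SupportsWriteRequirements {
  def requiredClustering: Seq[String] = Seq("day")
  def requiredOrdering: Seq[Order] = Seq(Order("day"), Order("hour"))
}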

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
n a while, but does Clustering support allow > requesting that partitions contain elements in order as well? That would be > a useful trick for me. IE > Request/Require(SortedOn(Col1)) > Partition 1 -> ((A,1), (A, 2), (B,1) , (B,2) , (C,1) , (C,2)) > > On Tue, Mar 27,

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
ific one called > HashClusteredDistribution. > > So currently only Aggregate can benefit from SupportsReportPartitioning > and save shuffle. We can add a new interface to expose the hash function to > make it work for Join. > > On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netf

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
>> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote: >>> >>>> Interesting. >>>> >>>> Should requiredClustering return a Set of Expression's ? >>>> This way, we can determine the order of Expression's by looking at

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ryan Blue
izer which can decide which method to >> use rather than having the data source itself do it. This is probably in a >> far future version of the api. >> >> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote: >> >>> Cassandra can in

Re: DataSourceV2 write input requirements

2018-03-30 Thread Ryan Blue
ping that through the CBO effort we will continue to >>>> get more detailed statistics. Like on read we could be using sketch data >>>> structures to get estimates on unique values and density for each column. >>>> You may be right that the real way for this to be handl

Re: DataSourceV2 write input requirements

2018-03-28 Thread Ryan Blue
> On Tue, Mar 27, 2018 at 7:59 PM, Russell Spitzer < > russell.spit...@gmail.com> wrote: > >> Thanks for the clarification, definitely would want to require Sort but >> only recommend partitioning ... I think that would be useful to request >> based on details about the inc

Re: DataSourceV2 write input requirements

2018-03-29 Thread Ryan Blue
rhead. > > For the second, I wouldn't assume that a data source requiring a certain > write format would give any guarantees around reading the same data? In the > cases where it is a complete overwrite it would, but for independent writes > it could still be useful for statistics or c

[DISCUSS] Catalog APIs and multi-catalog support

2018-03-29 Thread Ryan Blue
jEoo/edit?usp=sharing> . Comments and feedback are welcome! Feel free to comment on the doc or reply to this thread. rb -- Ryan Blue Software Engineer Netflix

Re: Changing how we compute release hashes

2018-03-16 Thread Ryan Blue
he format of the SHA512 hash, can we add > a SHA256 hash to our releases in this format? > > I suppose if it’s not easy to update or add hashes to our existing > releases, it may be too difficult to change anything here. But I’m not > sure, so I thought I’d ask. > > Nick > ​ > -- Ryan Blue Software Engineer Netflix

Re: Any reason for not exposing internalCreateDataFrame or isStreaming beyond sql package?

2018-03-22 Thread Ryan Blue
usual API, it's not possible (or > difficult) to create custom structured streaming sources. > > > > Consequently, one has to create streaming sources in packages under > org.apache.spark.sql. > > > > Any pointers or info is greatly appreciated. > -- Ryan Blue Software Engineer Netflix

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-20 Thread Ryan Blue
save using DataFrameWriter, resulting 512k-block-size > > df_txt.write.mode('overwrite').format('parquet').save('hdfs: > //spark1/tmp/temp_with_df') > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

[DISCUSS] SPIP: Standardize SQL logical plans (SPARK-23521)

2018-02-26 Thread Ryan Blue
te set of those high-level logical operations, most of which are already defined in SQL or implemented by some write path in Spark. rb ​ -- Ryan Blue Software Engineer Netflix

Re: Time for 2.3.2?

2018-06-28 Thread Ryan Blue
>>>>>>> stream-stream >>>>>> > join. Users can hit this bug if one of the join side is partitioned >>>>>> by a >>>>>> > subset of the join keys. >>>>>> > >>>>>> > SPARK-24552: Task at

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
elson, Assaf >>> wrote: >>> >>> Could you add a fuller code example? I tried to reproduce it in my >>> environment and I am getting just one instance of the reader… >>> >>> >>> >>> Thanks, >>> >>> Assaf >

Re: DataSourceV2 hangouts sync

2018-10-29 Thread Ryan Blue
end up with so many people that we can't actually get the discussion going. Here's a link to the stream: https://stream.meet.google.com/stream/6be59d80-04c7-44dc-9042-4f3b597fc8ba Thanks! rb On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue wrote: > Hi everyone, > > There's been some great d

Re: DataSourceV2 hangouts sync

2018-10-26 Thread Ryan Blue
> I didn't know I live in the same timezone with you Wenchen :D. > Monday or Wednesday at 5PM PDT sounds good to me too FWIW. > > On Fri, Oct 26, 2018 at 8:29 AM, Ryan Blue wrote: > >> Good point. How about Monday or Wednesday at 5PM PDT then? >> >> Everyone, please repl

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-30 Thread Ryan Blue
; >> >> What should happen to JIRA tickets still targeting 2.4.0? >>> >> >> === >>> >> >> >>> >> >> The current list of open tickets targeted at 2.4.0 can be found at: >>> >> >> https://issues.apache.org/jira/projects/SPARK and search for >>> "Target Version/s" = 2.4.0 >>> >> >> >>> >> >> Committers should look at those and triage. Extremely important bug >>> >> >> fixes, documentation, and API tweaks that impact compatibility >>> should >>> >> >> be worked on immediately. Everything else please retarget to an >>> >> >> appropriate release. >>> >> >> >>> >> >> == >>> >> >> But my bug isn't fixed? >>> >> >> == >>> >> >> >>> >> >> In order to make timely releases, we will typically not hold the >>> >> >> release unless the bug in question is a regression from the >>> previous >>> >> >> release. That being said, if there is something which is a >>> regression >>> >> >> that has not been correctly targeted please ping me or a committer >>> to >>> >> >> help target the issue. >>> >> > >>> >> > >>> - >>> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >> > >>> >> >>> >> >>> >> - >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >> >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> >> >> -- >> [image: Spark+AI Summit North America 2019] >> <http://t.sidekickopen24.com/s1t/c/5/f18dQhb0S7lM8dDMPbW2n0x6l2B9nMJN7t5X-FfhMynN2z8MDjQsyTKW56dzQQ1-_gV6102?t=https%3A%2F%2Fdatabricks.com%2Fsparkaisummit%2Fnorth-america=undefined=406b8c9a-b648-4923-9ed1-9a51ffe213fa> >> > -- Ryan Blue Software Engineer Netflix

DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
-- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 hangouts sync

2018-11-01 Thread Ryan Blue
Thanks to everyone that attended the sync! We had some good discussions. Here are my notes for anyone that missed it or couldn’t join the live stream. If anyone wants to add to this, please send additional thoughts or corrections. *Attendees:* - Ryan Blue - Netflix - Using v2 to integrate

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
day at my side, it will be great if we can > pick a day from Monday to Thursday. > > On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue wrote: > >> Since not many people have replied with a time window, how about we aim >> for 5PM PDT? That should work for Wenchen and most peo

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
eting is definitely helpful to discuss, move certain effort >>>>> forward and keep people on the same page. Glad to see this kind of working >>>>> group happening. >>>>> >>>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote: >>>

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Ryan Blue
though Apache Spark provides the binary distributions, it would be > great if this succeeds out of the box. > > > > Bests, > > Dongjoon. > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

DataSourceV2 capability API

2018-11-08 Thread Ryan Blue
le. To fix this problem, I would use a table capability, like read-missing-columns-as-null. Any comments on this approach? rb -- Ryan Blue Software Engineer Netflix
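
A minimal sketch of such a check, assuming a hypothetical isSupported method and reusing the read-missing-columns-as-null capability string from above:

// Hypothetical sketch of a capability check; method name and capability string are illustrative.
trait CapabilityTable {
  def isSupported(capability: String): Boolean
}

object WriteValidation {
  def validate(table: CapabilityTable, missingColumns: Seq[String]): Unit = {
    // Only allow a write that omits columns if the table can read them back as null.
    if (missingColumns.nonEmpty && !table.isSupported("read-missing-columns-as-null")) {
      throw new IllegalArgumentException(
        s"Table cannot fill missing columns with null: ${missingColumns.mkString(", ")}")
    }
  }
}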

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
to > throw exceptions when they don't support a specific operation. > > > On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue wrote: > >> Do you have an example in mind where we might add a capability and break >> old versions of data sources? >> >> These are really for

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
pporting that property, and thus throwing an > exception. > > > On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue wrote: > >> I'd have two places. First, a class that defines properties supported and >> identified by Spark, like the SQLConf definitions. Second, in documentation >>

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
Another solution to the decimal case is using the capability API: use a capability to signal that the table knows about `supports-decimal`. So before the decimal support check, it would check `table.isSupported("type-capabilities")`. On Fri, Nov 9, 2018 at 12:45 PM Ryan B

Re: DataSourceV2 capability API

2018-11-08 Thread Ryan Blue
ll evolve (e.g. how many different > capabilities there will be). > > > On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue > wrote: > >> Hi everyone, >> >> I’d like to propose an addition to DataSourceV2 tables, a capability API. >> This API would allow Spark t

DataSourceV2 sync tomorrow

2018-11-13 Thread Ryan Blue
tPartition[] parts = stream.planInputPartitions(start) // returns when needsReconfiguration is true or all tasks finish runTasks(parts, factory, end) // the stream's current offset has been updated at the last epoch } -- Ryan Blue Software Engineer Netflix
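
The same loop reflowed as a hedged Scala sketch; every type below is a hypothetical stand-in, and the factory/end arguments from the pseudocode are omitted:

// Hypothetical stand-ins only; this mirrors the pseudocode above, not a real Spark API.
case class Offset(value: Long)
trait InputPartition
trait Stream {
  def currentOffset: Offset
  def planInputPartitions(start: Offset): Array[InputPartition]
}

object MicroBatchDriver {
  def run(stream: Stream,
          runTasks: Array[InputPartition] => Unit,
          shouldRun: () => Boolean): Unit = {
    while (shouldRun()) {
      val start = stream.currentOffset
      // returns when reconfiguration is needed or all tasks finish
      val parts = stream.planInputPartitions(start)
      runTasks(parts)
      // the stream's current offset was updated at the last epoch,
      // so the next iteration plans from where this one left off
    }
  }
}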

Re: DataSourceV2 sync tomorrow

2018-11-15 Thread Ryan Blue
a couple of tests, it looks like live streams only work within an organization. In the future, I won’t add a live stream since no one but people from Netflix can join. Last, here are the notes: *Attendees* Ryan Blue - Netflix John Zhuge - Netflix Yuanjian Li - Baidu - Interested in Catalog API Felix

Re: Test and support only LTS JDK release?

2018-11-06 Thread Ryan Blue
this in Spark community. >> >> Thanks, >> >> DB Tsai | Siri Open Source Technologies [not a contribution] |  >> Apple, Inc >> >> > -- > Robert Stupp > @snazy > > -- Ryan Blue Software Engineer Netflix

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Ryan Blue
hnologies [not a contribution] |  > Apple, Inc > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 capability API

2018-11-09 Thread Ryan Blue
ned? > > > -- > *From:* Ryan Blue > *Sent:* Thursday, November 8, 2018 2:09 PM > *To:* Reynold Xin > *Cc:* Spark Dev List > *Subject:* Re: DataSourceV2 capability API > > > Yes, we currently use traits that have methods. Something like

Re: Behavior of SaveMode.Append when table is not present

2018-11-09 Thread Ryan Blue
ting data* > > However it does not specify behavior when the table does not exist. > Does that throw exception or create the table or a NO-OP? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 sync tomorrow

2018-11-14 Thread Ryan Blue
the meet up. I'll also plan on joining earlier than I did last time, in case the meet/hangout needs to be up for people to view the live stream. rb On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue wrote: > Hi everyone, > I just wanted to send out a reminder that there’s a DSv2 sync tomorrow a

Re: DataSourceV2 sync tomorrow

2018-11-14 Thread Ryan Blue
ies of a micro-batch) and may be then the >> 'latest' offset is not needed at all. >> >> - Arun >> >> >> On Tue, 13 Nov 2018 at 16:01, Ryan Blue >> wrote: >> >>> Hi everyone, >>> I just wanted to send out a reminder that there’

Spark SQL parser and DDL

2018-10-04 Thread Ryan Blue
that converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis instead of in the parser itself. Are there objections to this approach for integrating v2 plans? -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Syntax for table DDL

2018-10-04 Thread Ryan Blue
add Hive compatible syntax later. > > On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue > wrote: > >> I'd say that it was important to be compatible with Hive in the past, but >> that's becoming less important over time. Spark is well established with >> Hadoop users and I think the f

Re: Data source V2 in spark 2.4.0

2018-10-04 Thread Ryan Blue
inded message, I will probably have more as I continue > to explore this. > > Thanks, >Assaf. > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Ryan Blue
. > > I am personally following this PR with a lot of interest, thanks for all > the work along this direction. > > Best regards, > Alessandro > > On Mon, 1 Oct 2018 at 20:21, Ryan Blue wrote: > >> What do you mean by consistent with the syntax in SqlBase.g4? These >>

[DISCUSS] Syntax for table DDL

2018-09-28 Thread Ryan Blue
if you have suggestions based on a different SQL engine or want this syntax to be different for another reason. Thanks! rb -- Ryan Blue Software Engineer Netflix

Re: Data source V2 in spark 2.4.0

2018-10-01 Thread Ryan Blue
---- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Syntax for table DDL

2018-10-01 Thread Ryan Blue
lowing the Hive DDL syntax: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column > > On Fri, Sep 28, 2018 at 3:47 PM, Ryan Blue wrote: > >> Hi everyone, >> >> I’m currently working on new table DDL statements for v2 tables. F

Re: data source api v2 refactoring

2018-09-19 Thread Ryan Blue
> > I ask because those are the most widely used data sources and have a lot > of effort and thinking behind them, and if they have ported over to V2, > then they can serve as excellent production examples of V2 API. > > > > Thanks, > > Jayesh > > > > *F

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
m generically in the API, > allowing pass-through commands to manipulate them, or by some other > means. > > Regards, > Dale. > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > > > -- Ryan Blue Software Engineer Netflix

Re: Datasource v2 Select Into support

2018-09-19 Thread Ryan Blue
er >> +- Project [Mort AS Mort#7, 1000 AS 1000#8] >>+- OneRowRelation >> >> My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport >> with ReadSupportWithSchema with WriteSupport >> >> I'm wondering if there is something I'm not implementing, or if there is >> a bug in my implementation or its an issue with Spark? >> >> Any pointers would be great, >> >> Ross >> > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
partition loading in Hive and Oracle. > > > > So in short, I agree that partition management should be an optional > interface. > > > > *From: *Ryan Blue > *Reply-To: *"rb...@netflix.com" > *Date: *Wednesday, September 19, 2018 at 2:58 PM >

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
t; >>>>>>>> > If you are a Spark user, you can help us test this release by >>>>>>>> taking >>>>>>>> > an existing Spark workload and running on this release candidate, >>>>>>>> then >>>>>>>> > reporting any regressions. >>>>>>>> > >>>>>>>> > If you're working in PySpark you can set up a virtual env and >>>>>>>> install >>>>>>>> > the current RC and see if anything important breaks, in the >>>>>>>> Java/Scala >>>>>>>> > you can add the staging repository to your projects resolvers and >>>>>>>> test >>>>>>>> > with the RC (make sure to clean up the artifact cache >>>>>>>> before/after so >>>>>>>> > you don't end up building with a out of date RC going forward). >>>>>>>> > >>>>>>>> > === >>>>>>>> > What should happen to JIRA tickets still targeting 2.3.2? >>>>>>>> > === >>>>>>>> > >>>>>>>> > The current list of open tickets targeted at 2.3.2 can be found >>>>>>>> at: >>>>>>>> > https://issues.apache.org/jira/projects/SPARK and search for >>>>>>>> "Target Version/s" = 2.3.2 >>>>>>>> > >>>>>>>> > Committers should look at those and triage. Extremely important >>>>>>>> bug >>>>>>>> > fixes, documentation, and API tweaks that impact compatibility >>>>>>>> should >>>>>>>> > be worked on immediately. Everything else please retarget to an >>>>>>>> > appropriate release. >>>>>>>> > >>>>>>>> > == >>>>>>>> > But my bug isn't fixed? >>>>>>>> > == >>>>>>>> > >>>>>>>> > In order to make timely releases, we will typically not hold the >>>>>>>> > release unless the bug in question is a regression from the >>>>>>>> previous >>>>>>>> > release. That being said, if there is something which is a >>>>>>>> regression >>>>>>>> > that has not been correctly targeted please ping me or a >>>>>>>> committer to >>>>>>>> > help target the issue. >>>>>>>> >>>>>>>> >>>>>>>> - >>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>>> >>>>>>>> >>>> >>>> -- >>>> --- >>>> Takeshi Yamamuro >>>> >>> >>> >>> -- >>> John >>> >> -- Ryan Blue Software Engineer Netflix

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-20 Thread Ryan Blue
> Hi, Ryan. > > Could you share the result on 2.3.1 since this is 2.3.2 RC? That would be > helpful to narrow down the scope. > > Bests, > Dongjoon. > > On Thu, Sep 20, 2018 at 11:56 Ryan Blue wrote: > >> -0 >> >> My DataSourceV2 implementation for Iceberg is f

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Ryan Blue
ion > items as well. > > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread Ryan Blue
ttp://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-19 Thread Ryan Blue
amespace that needs it. >> >> If the data source requires TLS support then we also need to support >> passing >> all the configuration values under "spark.ssl.*" >> >> What do people think? Placeholder Issue has been added at SPARK-25329. >> >> >> >> -- >> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Ryan Blue
the tune of 2-6%. Has anyone >> considered this before? >> >> Sean >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-16 Thread Ryan Blue
a long time (say .. until > Spark 4.0.0?). > > > > I know somehow it happened to be sensitive but to be just literally > honest to myself, I think we should make a try. > > > > > -- > Marcelo > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-17 Thread Ryan Blue
Any discussion on how Spark should manage identifiers when multiple catalogs are supported? I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2. On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue wrote: >
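
One way to frame the identifier question is how a multi-part name splits into catalog, namespace, and table; the rule below is a hypothetical sketch, not a settled design:

// Hypothetical resolution rule for multi-part identifiers.
case class Identifier(catalog: Option[String], namespace: Seq[String], name: String)

object ResolveIdentifiers {
  // If the first name part matches a registered catalog, treat it as the catalog;
  // otherwise fall back to the current/default catalog (None here).
  def resolve(parts: Seq[String], catalogs: Set[String]): Identifier = parts match {
    case head +: rest if rest.nonEmpty && catalogs.contains(head) =>
      Identifier(Some(head), rest.dropRight(1), rest.last)
    case _ =>
      Identifier(None, parts.dropRight(1), parts.last)
  }

  def main(args: Array[String]): Unit = {
    println(resolve(Seq("prod", "db", "events"), Set("prod")))  // Identifier(Some(prod),List(db),events)
    println(resolve(Seq("db", "events"), Set("prod")))          // Identifier(None,List(db),events)
  }
}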

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
we are super 100% dependent on Hive... >> >> >> -- >> *From:* Ryan Blue >> *Sent:* Tuesday, January 15, 2019 9:53 AM >> *To:* Xiao Li >> *Cc:* Yuming Wang; dev >> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4 >

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
st PR <https://github.com/apache/spark/pull/23552> does not >> contain the changes of hive-thriftserver. Please ignore the failed test in >> hive-thriftserver. >> >> The second PR <https://github.com/apache/spark/pull/23553> is complete >> changes. >> >> >> >> I have created a Spark distribution for Apache Hadoop 2.7, you might >> download it via Google Drive >> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu >> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>. >> >> Please help review and test. Thanks. >> > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Ryan Blue
scheme will need to play nice with column identifier as > well. > > > > > -- > > *From:* Ryan Blue > *Sent:* Thursday, January 17, 2019 11:38 AM > *To:* Spark Dev List > *Subject:* Re: [DISCUSS] Identifiers with multi-catalog support > &

Re: Self join

2018-12-11 Thread Ryan Blue
nges in the design, we can do that. > > Thoughts on this? > > Thanks, > Marco > -- Ryan Blue Software Engineer Netflix

Re: Pushdown in DataSourceV2 question

2018-12-11 Thread Ryan Blue
f = spark.read.json("s3://sample_bucket/people.json") >>>>> > df.printSchema() >>>>> > df.filter($"age" > 20).explain() >>>>> > >>>>> > root >>>>> > |-- age: long (nullable = true) >>>>> > |-- name: string (nullable = true) >>>>> > >>>>> > == Physical Plan == >>>>> > *Project [age#47L, name#48] >>>>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20)) >>>>> >+- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, >>>>> Location: InMemoryFileIndex[s3://sample_bucket/people.json], >>>>> PartitionFilters: [], PushedFilters: [IsNotNull(age), >>>>> GreaterThan(age,20)], >>>>> ReadSchema: struct >>>>> > >>>>> > # Comments >>>>> > As you can see, PushedFilter is shown even if input data is JSON. >>>>> > Actually this pushdown is not used. >>>>> > >>>>> > I'm wondering if it has been already discussed or not. >>>>> > If not, this is a chance to have such feature in DataSourceV2 >>>>> because it would require some API level changes. >>>>> > >>>>> > >>>>> > Warm regards, >>>>> > >>>>> > Noritaka Sekiyama >>>>> > >>>>> >>>>> - >>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>> >>>>> -- Ryan Blue Software Engineer Netflix
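
The contract being questioned is which filters a source actually evaluates; as a hedged sketch (hypothetical trait and types, not the exact DataSourceV2 interface), a reader could hand back the filters it cannot evaluate so Spark keeps applying them itself:

// Hypothetical sketch: a reader accepts candidate filters and returns the ones it
// cannot evaluate, so Spark knows it must still apply them after the scan.
case class ColumnFilter(column: String, op: String, value: Any)

trait SupportsFilterPushdown {
  def pushFilters(filters: Seq[ColumnFilter]): Seq[ColumnFilter]  // returns the leftovers
  def pushedFilters: Seq[ColumnFilter]                            // what the scan will apply
}

// A source that cannot evaluate filters (like the JSON case above) returns everything,
// so the plan would not advertise pushed filters that are silently ignored.
object JsonLikeReader extends SupportsFilterPushdown {
  def pushFilters(filters: Seq[ColumnFilter]): Seq[ColumnFilter] = filters
  def pushedFilters: Seq[ColumnFilter] = Seq.empty
}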

Re: Self join

2018-12-12 Thread Ryan Blue
people that are looking at it now are the ones already familiar with the problem. rb On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido wrote: > Thank you all for your answers. > > @Ryan Blue sure, let me state the problem more > clearly: imagine you have 2 dataframes with a co

Re: [SPARK-26160] Make assertNotBucketed call in DataFrameWriter::save optional

2018-12-10 Thread Ryan Blue
eply to the > sender that you have received this communication in error and then delete > it. > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Ryan Blue
l that we should follow RDBMS/SQL standard >> regarding the behavior? >> >> > pass the default through to the underlying data source >> >> This is one way to implement the behavior. >> >> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote: >> >>>

[DISCUSS] Function plugins

2018-12-14 Thread Ryan Blue
ave to solve challenges with function naming (whether there is a db component). Right now I’d like to think through the overall idea and not get too focused on those details. Thanks, rb -- Ryan Blue Software Engineer Netflix
