Anyone else want to raise
an issue with the proposal, or is it about time to bring up a vote thread?
rb
On Thu, Jul 26, 2018 at 5:00 PM Ryan Blue wrote:
> I don’t think that we want to block this work until we have a public and
> stable Expression. Like our decision to expose Internal
I think I found a good solution to the problem of using Expression in the
TableCatalog API and in the DeleteSupport API.
For DeleteSupport, there is already a stable and public subset of
Expression named Filter that can be used to pass filters. The reason why
DeleteSupport would use Expression is
PI, similar to what we did for
> dsv1.
>
> If we are depending on Expressions on the more common APIs in dsv2
> already, we should revisit that.
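
The idea of passing the stable, public Filter classes instead of Expression can be sketched roughly like this. This is a toy Python model; all class and method names are illustrative, not Spark's actual API.

```python
# Toy model of delete-by-filter: the table receives stable Filter objects
# (simple attribute/value predicates) rather than full Expression trees.
# Names here are illustrative only.

class EqualTo:
    """Stand-in for a public Filter subclass: attribute == value."""
    def __init__(self, attribute, value):
        self.attribute = attribute
        self.value = value

    def matches(self, row):
        return row.get(self.attribute) == self.value

class DeleteSupportTable:
    """A table that deletes every row matching all of the given filters."""
    def __init__(self, rows):
        self.rows = rows

    def delete_where(self, filters):
        self.rows = [r for r in self.rows
                     if not all(f.matches(r) for f in filters)]

table = DeleteSupportTable([{"day": "2018-07-26"}, {"day": "2018-07-27"}])
table.delete_where([EqualTo("day", "2018-07-26")])
```

Because a Filter only carries attribute names and literal values, it can stay stable across versions in a way a full Expression tree cannot.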
>
>
>
>
> On Mon, Aug 13, 2018 at 1:59 PM Ryan Blue wrote:
>
>> Reynold, did you get a chance to look at my response about
e should be supported anyway, I was
> thinking we could just proceed orthogonally. If you guys think other issues
> should be resolved first, I think we (at least I will) should take a look
> at the set of catalog APIs.
>
>
--
Ryan Blue
Software Engineer
Netflix
entation
>
> Thanks for your time,
> Russ
>
> On Mon, Aug 20, 2018 at 11:33 AM Ryan Blue
> wrote:
>
>> Thanks for posting this discussion to the dev list, it would be great to
>> hear what everyone thinks about the idea that USING should be a
>> catalog-specific
This vote passes with 4 binding +1s and 9 community +1s.
Thanks for taking the time to vote, everyone!
Binding votes:
Wenchen Fan
Xiao Li
Reynold Xin
Felix Cheung
Non-binding votes:
Ryan Blue
John Zhuge
Takeshi Yamamuro
Marco Gaido
Russell Spitzer
Alessandro Solimando
Henry Robinson
Dongjoon
[+1]: Spark should adopt the SPIP
[-1]: Spark should not adopt the SPIP because . . .
Thanks for voting, everyone!
--
Ryan Blue
ple can
> jump
> > in during the development. I'm interested in the new API and like to
> work on
> > it after the vote passes.
> >
> > Thanks,
> > Wenchen
> >
> > On Fri, Jul 13, 2018 at 7:25 AM Ryan Blue wrote:
> >>
> >> Thanks! I'm a
+1 (not binding)
On Tue, Jul 17, 2018 at 10:59 AM Ryan Blue wrote:
> Hi everyone,
>
> From discussion on the proposal doc and the discussion thread, I think we
> have consensus around the plan to standardize logical write operations for
> DataSourceV2. I would like
continue to use
the property to determine the table’s data source or format implementation.
Other table catalog implementations would be free to interpret the format
string as they choose or to use it to choose a data source implementation
as in the default catalog.
rb
--
Ryan Blue
Software Engineer
Netflix
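
As a rough illustration of the fragment above, a catalog that interprets the table's format property to pick a data source might look like this. This is a Python sketch; the "provider" property name and the source classes are assumptions, not Spark's actual code.

```python
# Sketch: a default catalog maps the table's format string to a data
# source implementation; other catalogs are free to interpret the
# string differently. The "provider" key is an assumed property name.

class ParquetSource:
    pass

class JsonSource:
    pass

class DefaultCatalog:
    SOURCES = {"parquet": ParquetSource, "json": JsonSource}

    def source_for(self, table_properties):
        # Fall back to a default format when none is set.
        fmt = table_properties.get("provider", "parquet")
        return self.SOURCES[fmt]()

catalog = DefaultCatalog()
source = catalog.source_for({"provider": "json"})
```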
>>>>>>> Note. RC2 was cancelled because of one blocking issue SPARK-24781
>>>>>>> during release preparation.
>>>>>>>
>>>>>>> FAQ
>>>>>>>
>>>>>>> =
>>>>>>> How can I help test this release?
>>>>>>> =
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running on this release candidate,
>>>>>>> then
>>>>>>> reporting any regressions.
>>>>>>>
>>>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>>>> the current RC and see if anything important breaks, in the
>>>>>>> Java/Scala
>>>>>>> you can add the staging repository to your project's resolvers and
>>>>>>> test
>>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>>
>>>>>>> ===
>>>>>>> What should happen to JIRA tickets still targeting 2.3.2?
>>>>>>> ===
>>>>>>>
>>>>>>> The current list of open tickets targeted at 2.3.2 can be found at:
>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>>>> Version/s" = 2.3.2
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>> appropriate release.
>>>>>>>
>>>>>>> ==
>>>>>>> But my bug isn't fixed?
>>>>>>> ==
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>> release. That being said, if there is something which is a regression
>>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>>>
>>>>>>> help target the issue.
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> John Zhuge
>>>>>>>
>>>>>>
--
Ryan Blue
Software Engineer
Netflix
Quick update: I've updated my PR to add the table catalog API to implement
this proposal. Here's the PR: https://github.com/apache/spark/pull/21306
On Mon, Jul 23, 2018 at 5:01 PM Ryan Blue wrote:
> Lately, I’ve been working on implementing the new SQL logical plans. I’m
> currently b
IP is for the APIs and does not cover how multiple catalogs would be
exposed. I started a separate discussion thread on how to access multiple
catalogs and maintain compatibility with Spark’s current behavior (how to
get the catalog instance in the above example).
Please use this thread to discuss the proposed APIs. Thanks, everyone!
rb
--
Ryan Blue
Software Engineer
Netflix
the above:
>>
>> 1. Creates an explicit Table abstraction, and an explicit Scan
>> abstraction.
>>
>> 2. Have an explicit Stream level and makes it clear pushdowns and options
>> are handled there, rather than at the individual scan (ReadSupport) level.
>> Data source implementations don't need to worry about pushdowns or options
>> changing mid-stream. For batch, those happen when the scan object is
>> created.
>>
>>
>>
>> This email is just a high level sketch. I've asked Wenchen to prototype
>> this, to see if it is actually feasible and the degree of hacks it removes,
>> or creates.
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
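
The Table/Scan/Stream split sketched in the quoted email might be modeled like this. It is a toy Python rendering under assumed names (the actual prototype was in Scala/Java); the point it shows is that pushdowns and options are frozen when the scan or stream is created.

```python
# Toy model: pushdowns and options are fixed when a Scan (batch) or
# Stream is created, so implementations never see them change mid-stream.

class Scan:
    def __init__(self, pushed_filters, options):
        # Frozen at creation time; never mutated afterwards.
        self.pushed_filters = tuple(pushed_filters)
        self.options = dict(options)

class Stream(Scan):
    """Streaming specialization: every micro-batch reuses the pushdown
    results decided when the stream was created."""
    def plan_batch(self, start, end):
        return [("task", start, end, self.pushed_filters)]

class Table:
    """Explicit table abstraction that hands out scans and streams."""
    def new_scan(self, filters=(), options=None):
        return Scan(filters, options or {})

    def new_stream(self, filters=(), options=None):
        return Stream(filters, options or {})

stream = Table().new_stream(filters=[("id", ">", 5)])
tasks = stream.plan_batch(0, 10)
```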
th the
> above?
>
> At a high level, I think the Heilmeier's Catechism emphasizes less about
> the "how", and more the "why" and "what", which is what I'd argue SPIPs
> should be about. The hows should be left in design docs for larger projects.
>
>
>
--
Ryan Blue
Software Engineer
Netflix
with ScanConfig.
> For streaming source, stream is the one to take care of the pushdown
> result. For batch source, it's the scan.
>
> It's a little tricky because stream is an abstraction for streaming source
> only. Better ideas are welcome!
>
> On Sat, Sep 1, 2018 at 7:26 AM Ry
Latest from Wenchen in case it was dropped.
---------- Forwarded message ---------
From: Wenchen Fan
Date: Mon, Sep 3, 2018 at 6:16 AM
Subject: Re: data source api v2 refactoring
To:
Cc: Ryan Blue, Reynold Xin, <dev@spark.apache.org>
Hi Mridul,
I'm not sure what's going on, my
1 gives Spark the opportunity
> to enforce column references are valid (but not the actual function names),
> whereas option 2 would be up to the data sources to validate.
>
>
>
> On Wed, Aug 15, 2018 at 2:27 PM Ryan Blue wrote:
>
>> I think I found a good solution to th
> trait Table {
> LogicalWrite newAppendWrite();
>
> LogicalWrite newDeleteWrite(deleteExprs);
> }
>
>
> It looks to me that the API is simpler without WriteConfig, what do you
> think?
>
> Thanks,
> Wenchen
>
> On Wed, Sep 5, 2018 at 4:24 AM Ryan Blue
> wrote:
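
Wenchen's WriteConfig-free shape above could be rendered as a toy model like this (Python, with illustrative names only; not the actual proposed interfaces):

```python
# Toy model of the proposal: the table returns a write object directly,
# with no intermediate WriteConfig between the table and the write.

class LogicalWrite:
    def __init__(self, mode, delete_exprs=None):
        self.mode = mode
        self.delete_exprs = list(delete_exprs or [])

class Table:
    def new_append_write(self):
        return LogicalWrite("append")

    def new_delete_write(self, delete_exprs):
        return LogicalWrite("delete", delete_exprs)

write = Table().new_delete_write([("day", "=", "2018-09-05")])
```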
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
--
Ryan Blue
Software Engineer
Netflix
dates for
>>>>>> consideration):
>>>>>> >>>
>>>>>> >>> 1. Support Scala 2.12.
>>>>>> >>>
>>>>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel)
>>>>>> deprecated in Spark 2.x.
>>>>>> >>>
>>>>>> >>> 3. Shade all dependencies.
>>>>>> >>>
>>>>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>>>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>>>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>>>>> backward compatibility mode.
>>>>>> >>>
>>>>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL
>>>>>> more standard compliant, and have a flag for backward compatibility.
>>>>>> >>>
>>>>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>>>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable,
>>>>>> not
>>>>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Now the reality of a major version bump is that the world often
>>>>>> thinks in terms of what exciting features are coming. I do think there
>>>>>> are
>>>>>> a number of major changes happening already that can be part of the 3.0
>>>>>> release, if they make it in:
>>>>>> >>>
>>>>>> >>> 1. Scala 2.12 support (listing it twice)
>>>>>> >>> 2. Continuous Processing non-experimental
>>>>>> >>> 3. Kubernetes support non-experimental
>>>>>> >>> 4. A more fleshed-out version of data source API v2 (I don’t
>>>>>> think it is realistic to stabilize that in one release)
>>>>>> >>> 5. Hadoop 3.0 support
>>>>>> >>> 6. ...
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Similar to the 2.0 discussion, this thread should focus on the
>>>>>> framework and whether it’d make sense to create Spark 3.0 as the next
>>>>>> release, rather than the individual feature requests. Those are important
>>>>>> but are best done in their own separate threads.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 -224-436-0783
>>>> Greater Chicago
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vaquar Khan
>>> +1 -224-436-0783
>>> Greater Chicago
>>>
>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
--
Ryan Blue
Software Engineer
Netflix
a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM
also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue
>> wrote:
>>
>>> My concern is that the v2 data
> state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue
> wrote:
>
>> It would be great to get more features out incrementally. For
>> experimental features, do we have more relaxed constraints?
>>
>> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin wrote:
>
and is
>> discoverable - thereby breaking the documented contract.
>>
>> I was wondering how other databases systems plan to implement this API
>> and meet the contract as per the Javadoc?
>>
>> Many thanks
>>
>> Ross
>>
>
--
Ryan Blue
Software Engineer
Netflix
>>>>>> receives "prepared" from all the tasks, a "commit" would be invoked at
>>>>>> each
>>>>>> of the individual tasks). Right now the responsibility of the final
>>>>>> "commit" is with the driver and it may not always
"commit" would be invoked at
>>>>>> each
>>>>>> of the individual tasks). Right now the responsibility of the final
>>>>>> "commit" is with the driver and it may not always be possible for the
>>>>>> driver to take over the transact
8 at 3:02 PM Hyukjin Kwon wrote:
>
>> BTW, do we hold Datasource V2 related PRs for now until we finish this
>> refactoring just for clarification?
>>
>> On Fri, Sep 7, 2018 at 12:52 AM, Ryan Blue wrote:
>>
>>> Wenchen,
>>>
>>> I'm not really su
Thanks,
> Wenchen
>
> On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> A few weeks ago, I wrote up a proposal to standardize SQL logical plans
>> <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?
g.apache.spark.scheduler.DAGScheduler#createResultStage
>>
>>
>>
>> I can see the effect of doing this may be that Job Submissions may not be
>> FIFO depending on how much time Step 1 mentioned above is going to consume.
>>
>>
>>
>> Does above solution suffice for the problem described? And is there any
>> other side effect of this solution?
>>
>>
>>
>> Regards
>>
>> Ajith
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
My guess is that we wouldn't want to upgrade to a new minor version of
> Parquet for a Spark maintenance release, so asking for a Parquet
> maintenance release makes sense.
>
> What does everyone think?
>
> Best,
> Henry
>
--
Ryan Blue
Software Engineer
Netflix
Ted Yu <yuzhih...@gmail.com> wrote:
>
>> +1
>>
>> -------- Original message --------
>> From: Ryan Blue <rb...@netflix.com>
>> Date: 3/30/18 2:28 PM (GMT-08:00)
>> To: Patrick Woody <patrick.woo...@gmail.com>
>> Cc: Russell Spitzer <
raits go away. And the ORC data source can also be simplified
>> to
>>
>> class OrcReaderFactory(...) extends DataReaderFactory {
>> def createUnsafeRowReader ...
>>
>> def createColumnarBatchReader ...
>> }
>>
>> class OrcDataSourceReader extends DataSourceReader {
>> def createReadFactories = ... // logic to prepare the parameters and
>> create factories
>> }
>>
>> We also have a potential benefit of supporting hybrid storage data
>> source, which may keep real-time data in row format, and history data in
>> columnar format. Then they can make some DataReaderFactory output
>> InternalRow and some output ColumnarBatch.
>>
>> Thoughts?
>>
>
>
--
Ryan Blue
Software Engineer
Netflix
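
The hybrid row/columnar idea in the quoted email can be sketched like this. It is a toy Python model; the factory names loosely follow the email's Scala sketch but are otherwise assumptions.

```python
# Toy model: each reader factory declares its output format, so one data
# source can serve recent data as rows and history as columnar batches.

class RowReaderFactory:
    output_format = "row"
    def create_reader(self):
        return iter([{"id": 1}, {"id": 2}])       # recent, row-oriented data

class ColumnarReaderFactory:
    output_format = "columnar"
    def create_reader(self):
        return iter([{"id": [3, 4, 5]}])          # historical, columnar data

class HybridDataSourceReader:
    def create_read_factories(self):
        # Mix both kinds of factories in a single scan.
        return [RowReaderFactory(), ColumnarReaderFactory()]

factories = HybridDataSourceReader().create_read_factories()
formats = [f.output_format for f in factories]
```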
ache.org/jira/browse/SPARK-18455>, but it's not clear
> to me whether they are "design-appropriate" for the DataFrame API.
>
> Are correlated subqueries a thing we can expect to have in the DataFrame
> API?
>
> Nick
>
>
--
Ryan Blue
Software Engineer
Netflix
to get any remaining discussion going or get anyone that
missed this to read through the docs.
Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
validation/assumptions of the table before attempting the write.
>
> Thanks!
> Pat
>
--
Ryan Blue
Software Engineer
Netflix
an determine the order of Expression's by looking at what
>> requiredOrdering()
>> returns.
>>
>> On Mon, Mar 26, 2018 at 5:45 PM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Pat,
>>>
>>> Thanks for starting the discussion on thi
n a while, but does Clustering support allow
> requesting that partitions contain elements in order as well? That would be
> a useful trick for me. IE
> Request/Require(SortedOn(Col1))
> Partition 1 -> ((A,1), (A, 2), (B,1) , (B,2) , (C,1) , (C,2))
>
> On Tue, Mar 27,
ific one called
> HashClusteredDistribution.
>
> So currently only Aggregate can benefit from SupportsReportPartitioning
> and save shuffle. We can add a new interface to expose the hash function to
> make it work for Join.
>
> On Tue, Mar 27, 2018 at 9:33 AM, Ryan Blue <rb...@netf
gt;> On Mon, Mar 26, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Interesting.
>>>>
>>>> Should requiredClustering return a Set of Expression's ?
>>>> This way, we can determine the order of Expression's by looking at
izer which can decide which method to
>> use rather than having the data source itself do it. This is probably in a
>> far future version of the api.
>>
>> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Cassandra can in
ping that through the CBO effort we will continue to
>>>> get more detailed statistics. Like on read we could be using sketch data
>>>> structures to get estimates on unique values and density for each column.
>>>> You may be right that the real way for this to be handl
> On Tue, Mar 27, 2018 at 7:59 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Thanks for the clarification, definitely would want to require Sort but
>> only recommend partitioning ... I think that would be useful to request
>> based on details about the inc
rhead.
>
> For the second, I wouldn't assume that a data source requiring a certain
> write format would give any guarantees around reading the same data? In the
> cases where it is a complete overwrite it would, but for independent writes
> it could still be useful for statistics or c
jEoo/edit?usp=sharing>
.
Comments and feedback are welcome! Feel free to comment on the doc or reply
to this thread.
rb
--
Ryan Blue
Software Engineer
Netflix
he format of the SHA512 hash, can we add
> a SHA256 hash to our releases in this format?
>
> I suppose if it’s not easy to update or add hashes to our existing
> releases, it may be too difficult to change anything here. But I’m not
> sure, so I thought I’d ask.
>
> Nick
>
>
--
Ryan Blue
Software Engineer
Netflix
usual API, its not possible (or
> difficult) to create custom structured streaming sources.
>
>
>
> Consequently, one has to create streaming sources in packages under
> org.apache.spark.sql.
>
>
>
> Any pointers or info is greatly appreciated.
>
--
Ryan Blue
Software Engineer
Netflix
save using DataFrameWriter, resulting 512k-block-size
>
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Ryan Blue
Software Engineer
Netflix
te set of those high-level logical
operations, most of which are already defined in SQL or implemented by some
write path in Spark.
rb
--
Ryan Blue
Software Engineer
Netflix
>>>>>> stream-stream
>>>>>> > join. Users can hit this bug if one of the join side is partitioned
>>>>>> by a
>>>>>> > subset of the join keys.
>>>>>> >
>>>>>> > SPARK-24552: Task at
elson, Assaf
>>> wrote:
>>>
>>> Could you add a fuller code example? I tried to reproduce it in my
>>> environment and I am getting just one instance of the reader…
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Assaf
>
end up with so many people that we can't
actually get the discussion going. Here's a link to the stream:
https://stream.meet.google.com/stream/6be59d80-04c7-44dc-9042-4f3b597fc8ba
Thanks!
rb
On Thu, Oct 25, 2018 at 1:09 PM Ryan Blue wrote:
> Hi everyone,
>
> There's been some great d
>
> I didn't know I live in the same timezone as you, Wenchen :D.
> Monday or Wednesday at 5PM PDT sounds good to me too FWIW.
>
> On Fri, Oct 26, 2018 at 8:29 AM, Ryan Blue wrote:
>
>> Good point. How about Monday or Wednesday at 5PM PDT then?
>>
>> Everyone, please repl
>>> >> >> What should happen to JIRA tickets still targeting 2.4.0?
>>> >> >> ===
>>> >> >>
>>> >> >> The current list of open tickets targeted at 2.4.0 can be found at:
>>> >> >> https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 2.4.0
>>> >> >>
>>> >> >> Committers should look at those and triage. Extremely important bug
>>> >> >> fixes, documentation, and API tweaks that impact compatibility
>>> should
>>> >> >> be worked on immediately. Everything else please retarget to an
>>> >> >> appropriate release.
>>> >> >>
>>> >> >> ==
>>> >> >> But my bug isn't fixed?
>>> >> >> ==
>>> >> >>
>>> >> >> In order to make timely releases, we will typically not hold the
>>> >> >> release unless the bug in question is a regression from the
>>> previous
>>> >> >> release. That being said, if there is something which is a
>>> regression
>>> >> >> that has not been correctly targeted please ping me or a committer
>>> to
>>> >> >> help target the issue.
>>> >> >
>>> >> >
>>>
>>
>> --
>>
>
--
Ryan Blue
Software Engineer
Netflix
--
Ryan Blue
Software Engineer
Netflix
Thanks to everyone that attended the sync! We had some good discussions.
Here are my notes for anyone that missed it or couldn’t join the live
stream. If anyone wants to add to this, please send additional thoughts or
corrections.
*Attendees:*
- Ryan Blue - Netflix - Using v2 to integrate
day at my side, it will be great if we can
> pick a day from Monday to Thursday.
>
> On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue wrote:
>
>> Since not many people have replied with a time window, how about we aim
>> for 5PM PDT? That should work for Wenchen and most peo
eting is definitely helpful to discuss, move certain effort
>>>>> forward and keep people on the same page. Glad to see this kind of working
>>>>> group happening.
>>>>>
>>>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote:
>>>
though Apache Spark provides the binary distributions, it would be
> great if this succeeds out of the box.
> >
> > Bests,
> > Dongjoon.
> >
>
>
>
--
Ryan Blue
Software Engineer
Netflix
le. To fix
this problem, I would use a table capability, like
read-missing-columns-as-null.
Any comments on this approach?
rb
--
Ryan Blue
Software Engineer
Netflix
to
> throw exceptions when they don't support a specific operation.
>
>
> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue wrote:
>
>> Do you have an example in mind where we might add a capability and break
>> old versions of data sources?
>>
>> These are really for
pporting that property, and thus throwing an
> exception.
>
>
> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue wrote:
>
>> I'd have two places. First, a class that defines properties supported and
>> identified by Spark, like the SQLConf definitions. Second, in documentation
>>
Another solution to the decimal case is using the capability API: use a
capability to signal that the table knows about `supports-decimal`. So
before the decimal support check, it would check
`table.isSupported("type-capabilities")`.
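
A minimal sketch of that capability check (Python; the capability strings are just the examples used in this thread, not a fixed list):

```python
# Sketch: a table advertises string capabilities, and Spark-side code
# probes them before relying on a behavior.

class Table:
    def __init__(self, capabilities):
        self._capabilities = set(capabilities)

    def is_supported(self, capability):
        return capability in self._capabilities

table = Table({"type-capabilities", "read-missing-columns-as-null"})
```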
On Fri, Nov 9, 2018 at 12:45 PM Ryan B
ll evolve (e.g. how many different
> capabilities there will be).
>
>
> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> I’d like to propose an addition to DataSourceV2 tables, a capability API.
>> This API would allow Spark t
InputPartition[] parts = stream.planInputPartitions(start)
// returns when needsReconfiguration is true or all tasks finish
runTasks(parts, factory, end)
// the stream's current offset has been updated at the last epoch
}
--
Ryan Blue
Software Engineer
Netflix
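
The loop fragment above can be fleshed out into a runnable toy (Python; the Stream class and its offset handling are simplified assumptions, not the real interfaces):

```python
# Toy version of the micro-batch loop: plan partitions from the current
# offset, run them, and stop once the stream needs reconfiguration.

class Stream:
    def __init__(self, end):
        self.offset = 0          # updated as epochs complete
        self.end = end

    def needs_reconfiguration(self):
        return self.offset >= self.end

    def plan_input_partitions(self, start):
        return [start, start + 1]        # two toy partitions per epoch

def run_tasks(parts, stream):
    # Pretend all tasks finished; advance the stream's current offset.
    stream.offset = max(parts) + 1

stream = Stream(end=4)
epochs = 0
while not stream.needs_reconfiguration():
    parts = stream.plan_input_partitions(stream.offset)
    run_tasks(parts, stream)
    epochs += 1
```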
a
couple of tests, it looks like live streams only work within an
organization. In the future, I won’t add a live stream since no one but
people from Netflix can join.
Last, here are the notes:
*Attendees*
Ryan Blue - Netflix
John Zhuge - Netflix
Yuanjian Li - Baidu - Interested in Catalog API
Felix
this in Spark community.
>>
>> Thanks,
>>
>> DB Tsai | Siri Open Source Technologies [not a contribution] |
>> Apple, Inc
>>
>>
> --
> Robert Stupp
> @snazy
>
>
--
Ryan Blue
Software Engineer
Netflix
hnologies [not a contribution] |
> Apple, Inc
>
>
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ned?
>
>
> --
> *From:* Ryan Blue
> *Sent:* Thursday, November 8, 2018 2:09 PM
> *To:* Reynold Xin
> *Cc:* Spark Dev List
> *Subject:* Re: DataSourceV2 capability API
>
>
> Yes, we currently use traits that have methods. Something like
ting data*
>
> However it does not specify behavior when the table does not exist.
> Does that throw exception or create the table or a NO-OP?
>
> Thanks,
> Shubham
>
--
Ryan Blue
Software Engineer
Netflix
the meet up.
I'll also plan on joining earlier than I did last time, in case the
meet/hangout needs to be up for people to view the live stream.
rb
On Tue, Nov 13, 2018 at 4:00 PM Ryan Blue wrote:
> Hi everyone,
> I just wanted to send out a reminder that there’s a DSv2 sync tomorrow a
ies of a micro-batch) and maybe then the
>> 'latest' offset is not needed at all.
>>
>> - Arun
>>
>>
>> On Tue, 13 Nov 2018 at 16:01, Ryan Blue
>> wrote:
>>
>>> Hi everyone,
>>> I just wanted to send out a reminder that there’
that converts from the parsed SQL plan to CatalogTable-based v1
plans. It is also cleaner to have the logic for converting to CatalogTable
in DataSourceAnalysis instead of in the parser itself.
Are there objections to this approach for integrating v2 plans?
--
Ryan Blue
Software Engineer
Netflix
add Hive compatible syntax later.
>
> On Tue, Oct 2, 2018 at 11:50 PM Ryan Blue
> wrote:
>
>> I'd say that it was important to be compatible with Hive in the past, but
>> that's becoming less important over time. Spark is well established with
>> Hadoop users and I think the f
inded message, I will probably have more as I continue
> to explore this.
>
> Thanks,
>Assaf.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
>
> I am personally following this PR with a lot of interest, thanks for all
> the work along this direction.
>
> Best regards,
> Alessandro
>
> On Mon, 1 Oct 2018 at 20:21, Ryan Blue wrote:
>
>> What do you mean by consistent with the syntax in SqlBase.g4? These
>>
if you have suggestions based on a different
SQL engine or want this syntax to be different for another reason. Thanks!
rb
--
Ryan Blue
Software Engineer
Netflix
>
>
--
Ryan Blue
Software Engineer
Netflix
lowing the Hive DDL syntax:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>
> Ryan Blue 于2018年9月28日周五 下午3:47写道:
>
>> Hi everyone,
>>
>> I’m currently working on new table DDL statements for v2 tables. F
>
> I ask because those are the most widely used data sources and have a lot
> of effort and thinking behind them, and if they have ported over to V2,
> then they can serve as excellent production examples of V2 API.
>
>
>
> Thanks,
>
> Jayesh
>
>
>
> *F
m generically in the API,
> allowing pass-through commands to manipulate them, or by some other
> means.
>
> Regards,
> Dale.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
>
--
Ryan Blue
Software Engineer
Netflix
er
>> +- Project [Mort AS Mort#7, 1000 AS 1000#8]
>>+- OneRowRelation
>>
>> My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport
>> with ReadSupportWithSchema with WriteSupport
>>
>> I'm wondering if there is something I'm not implementing, or if there is
>> a bug in my implementation or its an issue with Spark?
>>
>> Any pointers would be great,
>>
>> Ross
>>
>
--
Ryan Blue
Software Engineer
Netflix
partition loading in Hive and Oracle.
>
>
>
> So in short, I agree that partition management should be an optional
> interface.
>
>
>
> *From: *Ryan Blue
> *Reply-To: *"rb...@netflix.com"
> *Date: *Wednesday, September 19, 2018 at 2:58 PM
>
>>>>>>>> > If you are a Spark user, you can help us test this release by
>>>>>>>> taking
>>>>>>>> > an existing Spark workload and running on this release candidate,
>>>>>>>> then
>>>>>>>> > reporting any regressions.
>>>>>>>> >
>>>>>>>> > If you're working in PySpark you can set up a virtual env and
>>>>>>>> install
>>>>>>>> > the current RC and see if anything important breaks, in the
>>>>>>>> Java/Scala
>>>>>>>> > you can add the staging repository to your project's resolvers and
>>>>>>>> test
>>>>>>>> > with the RC (make sure to clean up the artifact cache
>>>>>>>> before/after so
>>>>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>>>>> >
>>>>>>>> > ===
>>>>>>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>>>>>>> > ===
>>>>>>>> >
>>>>>>>> > The current list of open tickets targeted at 2.3.2 can be found
>>>>>>>> at:
>>>>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>>> "Target Version/s" = 2.3.2
>>>>>>>> >
>>>>>>>> > Committers should look at those and triage. Extremely important
>>>>>>>> bug
>>>>>>>> > fixes, documentation, and API tweaks that impact compatibility
>>>>>>>> should
>>>>>>>> > be worked on immediately. Everything else please retarget to an
>>>>>>>> > appropriate release.
>>>>>>>> >
>>>>>>>> > ==
>>>>>>>> > But my bug isn't fixed?
>>>>>>>> > ==
>>>>>>>> >
>>>>>>>> > In order to make timely releases, we will typically not hold the
>>>>>>>> > release unless the bug in question is a regression from the
>>>>>>>> previous
>>>>>>>> > release. That being said, if there is something which is a
>>>>>>>> regression
>>>>>>>> > that has not been correctly targeted please ping me or a
>>>>>>>> committer to
>>>>>>>> > help target the issue.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>> --
>>> John
>>>
>>
--
Ryan Blue
Software Engineer
Netflix
> Hi, Ryan.
>
> Could you share the result on 2.3.1 since this is 2.3.2 RC? That would be
> helpful to narrow down the scope.
>
> Bests,
> Dongjoon.
>
> On Thu, Sep 20, 2018 at 11:56 Ryan Blue wrote:
>
>> -0
>>
>> My DataSourceV2 implementation for Iceberg is f
ion
> items as well.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
ttp://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
--
Ryan Blue
Software Engineer
Netflix
amespace that needs it.
>>
>> If the data source requires TLS support then we also need to support
>> passing
>> all the configuration values under "spark.ssl.*"
>>
>> What do people think? Placeholder Issue has been added at SPARK-25329.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
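
The spark.ssl.* idea in the quoted message boils down to forwarding configuration by prefix, roughly as follows (Python sketch; the key names are examples only):

```python
# Sketch: pass through to a data source only the settings under a
# given prefix, e.g. everything under "spark.ssl.".

def options_under(conf, prefix):
    return {k[len(prefix):]: v
            for k, v in conf.items() if k.startswith(prefix)}

conf = {
    "spark.ssl.keyStore": "/etc/ks.jks",
    "spark.ssl.enabled": "true",
    "spark.master": "local[*]",
}
ssl_options = options_under(conf, "spark.ssl.")
```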
the tune of 2-6%. Has anyone
>> considered this before?
>>
>> Sean
>>
>>
>>
--
Ryan Blue
Software Engineer
Netflix
a long time (say .. until
> Spark 4.0.0?).
> >
> > I know somehow it happened to be sensitive but to be just literally
> honest to myself, I think we should make a try.
> >
>
>
> --
> Marcelo
>
--
Ryan Blue
Software Engineer
Netflix
Any discussion on how Spark should manage identifiers when multiple
catalogs are supported?
I know this is an area where a lot of people are interested in making
progress, and it is a blocker for both multi-catalog support and CTAS in
DSv2.
On Sun, Jan 13, 2019 at 2:22 PM Ryan Blue wrote:
>
we are super 100% dependent on Hive...
>>
>>
>> --
>> *From:* Ryan Blue
>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>> *To:* Xiao Li
>> *Cc:* Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
st PR <https://github.com/apache/spark/pull/23552> does not
>> contain the changes of hive-thriftserver. Please ignore the failed test in
>> hive-thriftserver.
>>
>> The second PR <https://github.com/apache/spark/pull/23553> is complete
>> changes.
>>
>>
>>
>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>> download it via Google Drive
>> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> or Baidu
>> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
>>
>> Please help review and test. Thanks.
>>
>
--
Ryan Blue
Software Engineer
Netflix
scheme will need to play nice with column identifier as
> well.
>
>
>
>
> --
>
> *From:* Ryan Blue
> *Sent:* Thursday, January 17, 2019 11:38 AM
> *To:* Spark Dev List
> *Subject:* Re: [DISCUSS] Identifiers with multi-catalog support
>
nges in the design, we can do that.
>
> Thoughts on this?
>
> Thanks,
> Marco
>
--
Ryan Blue
Software Engineer
Netflix
df = spark.read.json("s3://sample_bucket/people.json")
>>>>> > df.printSchema()
>>>>> > df.filter($"age" > 20).explain()
>>>>> >
>>>>> > root
>>>>> > |-- age: long (nullable = true)
>>>>> > |-- name: string (nullable = true)
>>>>> >
>>>>> > == Physical Plan ==
>>>>> > *Project [age#47L, name#48]
>>>>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>>>>> >+- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>>>>> Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>>>>> PartitionFilters: [], PushedFilters: [IsNotNull(age),
>>>>> GreaterThan(age,20)],
>>>>> ReadSchema: struct
>>>>> >
>>>>> > # Comments
>>>>> > As you can see, PushedFilter is shown even if input data is JSON.
>>>>> > Actually this pushdown is not used.
>>>>> >
>>>>> > I'm wondering if it has been already discussed or not.
>>>>> > If not, this is a chance to have such feature in DataSourceV2
>>>>> because it would require some API level changes.
>>>>> >
>>>>> >
>>>>> > Warm regards,
>>>>> >
>>>>> > Noritaka Sekiyama
>>>>> >
>>>>>
>>>>>
>>>>>
--
Ryan Blue
Software Engineer
Netflix
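
The point in the quoted question, that PushedFilters can appear in the plan even when the source does not use them, comes from how filter pushdown negotiates residual filters. A toy model (Python; not Spark's API, class names are invented for illustration):

```python
# Toy model: a source is offered filters and returns the ones it cannot
# evaluate, and Spark must still apply those residual filters itself.
# A JSON-like source can't evaluate any filter during the scan.

class JsonLikeSource:
    def push_filters(self, filters):
        self.pushed = []                 # nothing evaluated at scan time
        return list(filters)             # everything is residual

class ParquetLikeSource:
    def push_filters(self, filters):
        self.pushed = list(filters)      # stats allow skipping data
        return []                        # no residual work for Spark

filters = ["IsNotNull(age)", "GreaterThan(age,20)"]
residual = JsonLikeSource().push_filters(filters)
```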
people that are
looking at it now are the ones already familiar with the problem.
rb
On Wed, Dec 12, 2018 at 2:05 AM Marco Gaido wrote:
> Thank you all for your answers.
>
> @Ryan Blue sure, let me state the problem more
> clearly: imagine you have 2 dataframes with a co
eply to the
> sender that you have received this communication in error and then delete
> it.
>
>
--
Ryan Blue
Software Engineer
Netflix
l that we should follow RDBMS/SQL standard
>> regarding the behavior?
>>
>> > pass the default through to the underlying data source
>>
>> This is one way to implement the behavior.
>>
>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue wrote:
>>
>>>
ave to
solve challenges with function naming (whether there is a db component).
Right now I’d like to think through the overall idea and not get too
focused on those details.
Thanks,
rb
--
Ryan Blue
Software Engineer
Netflix