There is no explicit limit, but a JVM string cannot be bigger than 2 GB. It
will also run out of memory at some point if the query plan tree is too big, or
become incredibly slow due to query planning complexity. I've seen queries that
are tens of MBs in size.
On Thu, Jul 11, 2019 at 5:01 AM, 李书明 <
Hi all,
In the past two years, pandas UDFs are perhaps the most important change
to Spark for Python data science. However, these functionalities have evolved
organically, leading to some inconsistencies and confusion among users. I
created a ticket and a document summarizing the issues,
That's a good idea. We should only be using squash.
On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
wrote:
>
> Hi, Apache Spark PMC members and committers.
>
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
>
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun
wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been long
+1 on Xiangrui’s plan.
On Thu, May 30, 2019 at 7:55 AM shane knapp wrote:
> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
>> from the build/test side, it will actually be pretty easy to continue
> suppo
Thanks Tom.
I finally had time to look at the updated SPIP 10 mins ago. I support the high
level idea and +1 on the SPIP.
That said, I think the proposed API is too complicated and an invasive change
to the existing internals. A much simpler API would be to expose a columnar batch
iterator interf
Can we push this to June 1st? I have been meaning to read it but
unfortunately keep traveling...
On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun
wrote:
> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai wrote:
>
>> +1 on exposing the APIs for columnar processing support.
>>
>
> Interested in thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
>
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
>
>>
d. I failed and gave up.
>>
>>
>> At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>>
>> On
Looks like a great idea to make changes in Spark 3.0 to prepare for Scala 2.13
upgrade.
Are there breaking changes that would require us to maintain two different
source trees for 2.12 vs 2.13?
On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:
>
> While that's not happe
I do feel it'd be better not to switch default Scala versions in a minor
release. I don't know how much this impacts downstream projects. Dotnet is a
good data point. Anybody else hit this issue?
On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:
>
> Very much interested in
"if others think it would be helpful, we can cancel this vote, update the SPIP
to clarify exactly what I am proposing, and then restart the vote after we have
gotten more agreement on what APIs should be exposed"
That'd be very useful. At least I was confused by what the SPIP was about. No
poin
normally wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin wrote:
> >
>
For Jackson - are you worrying about JSON parsing for users or internal
Spark functionality breaking?
On Wed, Apr 17, 2019 at 6:02 PM Sean Owen wrote:
> There's only one other item on my radar, which is considering updating
> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
Are you talking about the ones that are defined in a dictionary? If yes, that
was actually not that great in hindsight (makes it harder to read & change), so
I'm OK changing it.
E.g.
_functions = {
    'lit': _lit_doc,
    'col': 'Returns a :class:`Column` based on the given column name.',
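For readers following along, the pattern under discussion (functions generated from a dictionary of docstrings) can be sketched in plain Python; the names below are illustrative, not the actual pyspark internals:

```python
# A minimal sketch of the dictionary-driven pattern being discussed: each
# entry maps a function name to its docstring, and the actual functions are
# generated in a loop. Names here are illustrative, not pyspark internals.

def _create_function(name, doc):
    """Build a placeholder function carrying the given docstring."""
    def _(col):
        return (name, col)  # stand-in for constructing a real Column expression
    _.__name__ = name
    _.__doc__ = doc
    return _

_functions = {
    'lit': 'Creates a Column of literal value.',
    'col': 'Returns a Column based on the given column name.',
}

for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# A reader grepping the module for "def col" finds nothing, which is the
# readability complaint: the definitions only exist after the loop runs.
```

This is why the dictionary style is "harder to read & change": tooling and readers cannot find the definitions statically.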
>>>> Do you have a design doc? I'm also interested in this topic and want to help
>>>> contribute.
>>>>
>>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans < bobby@ apache. org (
>>>> bo...@apache.org ) > wrote:
>>>>
>>>>
I just realized I didn't make my stance here very clear ... here's another
try:
I think it's a no-brainer to have a good columnar UDF interface. This would
facilitate a lot of high-performance applications, e.g. GPU-based acceleration
for machine learning algorithms.
On rewriting the entir
As part of evolving the Scala language, the Scala team is considering removing
single-quote syntax for representing symbols. Single-quote syntax is one of the
ways to represent a column in Spark's DataFrame API. While I personally don't
use them (I prefer just using strings for column names, or
We tried enabling blacklisting for some customers in the cloud, and very
quickly they ended up having 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud
deployments, and shouldn't be on by default. The heart of the issue is that t
of DataType classes.
>>
>>
>> All of these options are likely to have implications for the catalyst
>> systems. I'm not sure if they are minor or more substantial.
>>
>>
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
>
All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor or more substantial.
>
>
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
>
>
>> Yes this is known a
Yes this is known and an issue for performance. Do you have any thoughts on
how to fix this?
On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:
> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating dat
26% improvement is underwhelming if it requires massive refactoring of the
codebase. Also you can't just add the benefits up this way, because:
- Both vectorization and codegen reduce the overhead in virtual function calls
- Vectorization code is more friendly to compilers / CPUs, but requires
n Kwon < gurwls223@ gmail. com (
> gurwls...@gmail.com ) > wrote:
>
>
>> BTW, I am working on the documentation related with this subject at https:/
>> / issues. apache. org/ jira/ browse/ SPARK-26022 (
>> https://issues.apache.org/jira/browse/SPARK-26022 ) to desc
s working on it - I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM R
We have been thinking about some of these issues. Some of them are harder
to do, e.g. Spark DataFrames are fundamentally immutable, and making the
logical plan mutable is a significant deviation from the current paradigm
that might confuse the hell out of some users. We are considering building
a s
At some point we should celebrate having the largest RC number ever in Spark ...
On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote:
>
> RC9 was just cut. Will send out another thread once the build is finished.
>
> Sincerely,
>
> DB Tsai
> ---
+1 on doing this in 3.0.
On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com >
wrote:
>
> I’m +1 if 3.0
>
> *From:* Sean Owen < srowen@ gmail. com ( sro...@gmail.com ) >
> *Sent:* Monday, March 25, 2019 6:48 PM
> *To:* Hyukjin Kwon
> *Cc:* dev; Bryan Cutler; Tak
This is more of a question for the connector. It depends on how the connector
is implemented. Some implement aggregate pushdown, but most don't.
On Mon, Mar 18, 2019 at 10:05 AM, asma zgolli < zgollia...@gmail.com > wrote:
>
> Hello,
>
>
> I'm executing using spark SQL an SQL workload on dat
If you use UDFs in Python, you would want to use Pandas UDF for better
performance.
On Mon, Mar 11, 2019 at 7:50 PM Jonathan Winandy
wrote:
> Thanks, I didn't know!
>
> That being said, any udf use seems to badly affect code generation (and
> the performance).
>
>
> On Mon, 11 Mar 2019, 15:13 Dy
Rather than calling it hash64, it'd be better to just call it xxhash64. The
reason being that ten years from now, we would probably look back and laugh at a
specific hash implementation. It'd be better to just name the expression what
it is.
On Wed, Mar 06, 2019 at 7:59 PM, < huon.wil...@data61.csir
I think they might be used in bucketing? Not 100% sure.
On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote:
>
> Hi,
>
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark
nyone is free to take on this, but I have no experience with R.
>
> If you folks agree with this, let us know, so we can move forward with the
> merge.
>
> Best.
>
> -- André.
reement that the intent
>> is not to make the exact names binding, we should be okay.
>>
>>
>> I can remove the user-facing API sketch, but I'd prefer to leave it in the
>> sketch section so we have it documented somewhere.
>>
>> On Fri, Mar 1, 20
Ryan - can you take the public user facing API part out of that SPIP?
In general it'd be better to have the SPIPs be higher level, and put the
detailed APIs in a separate doc. Alternatively, put them in the SPIP but
explicitly vote on the high level stuff and not the detailed APIs.
I don't wan
This should be fine. Dataset.groupByKey is a logical operation, not a
physical one (as in Spark wouldn’t always materialize all the groups in
memory).
On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot
wrote:
> Hi all,
>
> I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no
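The "logical, not physical" distinction above can be illustrated with a toy group-by in plain Python (an analogy only, not how Spark executes):

```python
from itertools import groupby

def group_by_key(pairs):
    """Yield (key, values) one group at a time from the input pairs.

    Sorting here stands in for Spark's shuffle; the point is that grouping
    is expressed logically, and only one group's values are materialized at
    a time rather than every group at once.
    """
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

groups = group_by_key([("a", 1), ("b", 2), ("a", 3)])
first = next(groups)  # ("a", [1, 3]); the "b" group has not been built yet
```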
We will have to fix that before we declare dsv2 stable, because
InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.
On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote:
> Will that then require an API break down the line? Do we save that for
> Spark 4?
>
> -Matt Cheah
The challenge with the Scala/Java API in the past is that when there are
multiple parameters, it'd lead to an explosion of function overloads.
On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com >
wrote:
>
> I hear three topics in this thread
>
>
> 1. I don’t think we s
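To make the overload-explosion point concrete: with n optional parameters and no defaults, a Scala/Java API needs an overload per parameter subset. A sketch in Python (parameter names invented for illustration):

```python
from itertools import combinations

# Without named/default parameters, every subset of n optional parameters
# needs its own overload: 2**n signatures for one logical function.
def overload_count(optional_params):
    return sum(1 for r in range(len(optional_params) + 1)
               for _ in combinations(optional_params, r))

n_overloads = overload_count(["seed", "maxIter", "outputCol"])  # 8 == 2**3

# With keyword arguments and defaults, a single signature covers them all;
# the parameter names here are made up for illustration.
def train(data, seed=42, max_iter=10, output_col="prediction"):
    return (data, seed, max_iter, output_col)
```

This is why languages with named/default parameters (Python, R) sidestep the problem that Scala/Java API design runs into.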
How is this different from materialized views?
On Sun, Feb 24, 2019 at 3:44 PM Daoyuan Wang wrote:
> Hi everyone,
>
> We'd like to discuss our proposal of Spark relational cache in this
> thread. Spark has native command for RDD caching, but the use of CACHE
> command in Spark SQL is limited, as
lol
On Fri, Feb 15, 2019 at 4:02 PM, Marcelo Vanzin < van...@cloudera.com.invalid >
wrote:
>
> You're talking about the spark-website script, right? The main repo's
> script has been working for me, the website one is broken.
>
> I think it was caused by this dude changing raw_inpu
This might be useful to do.
BTW, based on my experience with different build systems in the past few years
(extensively SBT/Maven/Bazel, and to a lesser extent Gradle/Cargo), I think the
longer term solution is to move to Bazel. It is so much easier to understand
and use, and also much more featu
Seems to make sense to have it false by default.
(I agree this deserves a dev list mention though, even if there is easy
consensus). We should make sure we tag the JIRA with the releasenotes label so
we can add it to the upgrade guide.
On Mon, Jan 28, 2019 at 8:47 AM Sean Owen wrote:
> Interesting notion at
If we can make the annotations compatible with Python 2, why don’t we add
type annotations to make life easier for users of Python 3 (with typing)?
On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz
wrote:
>
> Hello everyone,
>
> I'd like to revisit the topic of adding PySpark type annotations in 3.
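Two Python-2-compatible ways to add annotations come up in this discussion: type comments and stub files. A minimal sketch of the first (a generic example, not PySpark code):

```python
# PEP 484 type comments keep annotations Python-2-parseable: the annotation
# lives in a comment, so the code runs on both Python 2 and 3 while type
# checkers (e.g. mypy) still see it. This is a generic sketch, not the
# actual PySpark annotation plan.
from typing import List

def word_lengths(words):
    # type: (List[str]) -> List[int]
    return [len(w) for w in words]

lengths = word_lengths(["spark", "py"])  # [5, 2]

# The other Python-2-compatible route discussed for libraries is shipping
# separate .pyi stub files, keeping annotations out of the runtime code.
```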
ns of Hive metastore. Feel
>> free to ping me if we hit any issue about it.
>>
>> Cheers,
>>
>> Xiao
>>
>> Reynold Xin 于2019年1月22日周二 下午11:18写道:
>>
>>> Actually a non trivial fraction of users / customers I interact with
>>> still us
Actually a non-trivial fraction of users / customers I interact with still use
very old Hive metastores, because it’s very difficult to upgrade a Hive metastore
wholesale (it’d require all the production jobs that access the same metastore
to be upgraded at once). This is even harder than JVM upgrade
com ( sro...@gmail.com ) >
> *Sent:* Monday, January 21, 2019 10:42 AM
> *To:* Reynold Xin
> *Cc:* dev
> *Subject:* Re: Make proactive check for closure serializability optional?
>
> None except the bug / PR I linked to, which is really just a bug in
> the RowMatrix implementati
Did you actually observe a perf issue?
On Mon, Jan 21, 2019 at 10:04 AM Sean Owen wrote:
> The ClosureCleaner proactively checks that closures passed to
> transformations like RDD.map() are serializable, before they're
> executed. It does this by just serializing it with the JavaSerializer.
>
>
BTW the largest change to SS right now is probably the entire data source API
v2 effort, which aims to unify streaming and batch from data source
perspective, and provide a reliable, expressive source/sink API.
On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin < r...@databricks.com >
There are a few things to keep in mind:
1. Structured Streaming isn't an independent project. It actually (by design)
depends on all the rest of Spark SQL, and virtually all improvements to Spark
SQL benefit Structured Streaming.
2. The project as far as I can tell is relatively mature for core
Thanks for writing this up. Just to show why option 1 is not sufficient. MySQL
and Postgres are the two most popular open source database systems, and both
support database → schema → table 3 part identification, so Spark supporting
only 2 part name passing to the data source (option 1) isn't su
Committers,
When you merge tickets fixing correctness bugs, please make sure you tag the
tickets with "correctness" label. I've found multiple tickets today that didn't
do that.
On Fri, Aug 17, 2018 at 7:11 AM, Tom Graves < tgraves...@yahoo.com.invalid >
wrote:
>
> Since we haven't heard any
The issue with the offheap mode is that it is a pretty big behavior change and does
require additional setup (also for users that run with UDFs that allocate a lot
of heap memory, it might not be as good).
I can see us removing the legacy mode since it's been legacy for a long time
and perhaps very
Not sure how reputable or representative that paper is...
On Mon, Dec 31, 2018 at 10:57 AM Sean Owen wrote:
> https://github.com/apache/spark/pull/23401
>
> Interesting PR; I thought it was not worthwhile until I saw a paper
> claiming this can speed things up to the tune of 2-6%. Has anyone
> c
I'd only do any of the schema evolution things as add-on on top. This is an
extremely complicated area and we could risk never shipping anything because
there would be a lot of different requirements.
On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer < russell.spit...@gmail.com >
wrote:
>
> I
I added my comment there too!
On Wed, Dec 19, 2018 at 7:26 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote:
>
> Yea, that's a bit noisy .. I would just completely disable it to be
> honest. I failed https:/ / issues. apache. org/ jira/ browse/ INFRA-17469 (
> https://issues.apache.org/jira/browse
I think there is an infra ticket open for it right now.
On Wed, Dec 19, 2018 at 6:58 PM Nicholas Chammas
wrote:
> Can we somehow disable these new email alerts coming through for the Spark
> website repo?
>
> On Wed, Dec 19, 2018 at 8:25 PM GitBox wrote:
>
>> ueshin commented on a change in pul
Thanks for taking care of this, Shane!
On Wed, Dec 19, 2018 at 9:45 AM, shane knapp < skn...@berkeley.edu > wrote:
>
> master is back up and building.
>
> On Wed, Dec 19, 2018 at 9:31 AM shane knapp < sknapp@ berkeley. edu (
> skn...@berkeley.edu ) > wrote:
>
>
>> the jenkins process seems to
@gmail.com > wrote:
>
> This is at analysis time.
>
> On Tue, 18 Dec 2018, 17:32 Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) wrote:
>
>
>> Is this an analysis time thing or a runtime thing?
>>
>> On Tue, Dec 18, 2018 at 7:45 AM Mar
Is this an analysis time thing or a runtime thing?
On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido wrote:
> Hi all,
>
> as you may remember, there was a design doc to support operations
> involving decimals with negative scales. After the discussion in the design
> doc, now the related PR is blocked
easily consumed by a UDF?
>
> Otherwise +1 for trying to get this to work without Hive. I think even
> having something without codegen and optimized row formats is worthwhile if
> only because it’s easier to use than Hive UDFs.
>
> -Matt Cheah
>
Having a way to register UDFs that are not using Hive APIs would be great!
On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:
>
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a f
In SQLConf, for each config option, we declare them in two places:
First in the SQLConf object, e.g.:
val CSV_PARSER_COLUMN_PRUNING = buildConf(
  "spark.sql.csv.parser.columnPruning.enabled")
  .internal()
  .doc("If it is set to true, column names of the requested schema are passed to
CSV pa
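The fluent declaration quoted above follows a builder pattern; a minimal Python analogue (hypothetical method names that mirror the shape of the Scala code rather than reproducing it):

```python
class ConfigBuilder:
    """Minimal Python analogue of the buildConf(...).internal().doc(...)
    chain quoted above; names and structure are illustrative only."""

    def __init__(self, key):
        self.key = key
        self.doc_text = ""
        self.is_internal = False
        self.default = None

    def internal(self):
        self.is_internal = True
        return self  # returning self is what makes the fluent chaining work

    def doc(self, text):
        self.doc_text = text
        return self

    def create_with_default(self, value):
        self.default = value
        return self

CSV_PARSER_COLUMN_PRUNING = (
    ConfigBuilder("spark.sql.csv.parser.columnPruning.enabled")
    .internal()
    .doc("If set to true, only the requested columns are passed to the parser.")
    .create_with_default(True)
)
```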
Unfortunately I can't make it to the DSv2 sync today. Sending an email with my
thoughts instead. I spent a few hours thinking about this. It's evident that
progress has been slow, because this is an important API and people from
different perspectives have very different requirements, and the pr
A-17385 ( https://issues.apache.org/jira/browse/INFRA-17385 ) but no
> follow-up. Go ahead and open a new INFRA ticket.
>
> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
>
>
>> Thanks, Sean. Which INFRA ticket is it? It'
Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I want
to put some pressure myself there too.
On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen < sro...@apache.org > wrote:
>
> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
> noise.
>
> On
I used to, before each release during the RC phase, go through every single doc
page to make sure we don’t unintentionally leave things public. I no longer
have time to do that, unfortunately. I found it very useful because I always
caught some mistakes introduced through organic development.
> On Nov 13,
sed
>breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
>ticket that can be filtered down to somehow?
>
> Thanks,
>
> -Matt Cheah
>
> *From: *Vinoo Ganesh
> *Date: *Monday, November 12, 2018 at 2:48 PM
> *To: *Reynold Xin
PM Vinoo Ganesh wrote:
> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
> *From: *Xiao Li
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold
t a concern. When we
> add a capability, we add handling for it that old versions wouldn't be able
> to use anyway. The advantage is that we don't have to treat all sources the
> same.
>
> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin wrote:
>
>> How do we deal with
elix Cheung
> wrote:
>
>> One question is where will the list of capability strings be defined?
>>
>>
>> --
>> *From:* Ryan Blue
>> *Sent:* Thursday, November 8, 2018 2:09 PM
>> *To:* Reynold Xin
>> *Cc:* Spark D
Do you have a cached copy? I see it here
http://spark.apache.org/downloads.html
On Thu, Nov 8, 2018 at 4:12 PM Li Gao wrote:
> this is wonderful !
> I noticed the official spark download site does not have 2.4 download
> links yet.
>
> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote:
>
>>
This is currently accomplished by having traits that data sources can
extend, as well as runtime exceptions right? It's hard to argue one way vs
another without knowing how things will evolve (e.g. how many different
capabilities there will be).
On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue wrote:
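The trait-plus-runtime-exception approach described above, versus capability strings, can be sketched as follows (invented names, not the DSv2 API):

```python
# A sketch of the two capability styles being compared. Class names are
# invented for illustration; this is not the actual DSv2 API.

class DataSource:
    pass

class SupportsBatchRead(DataSource):
    """Marker 'trait': extending it advertises the capability."""

class CsvSource(SupportsBatchRead):
    pass

def plan_scan(source):
    # Trait style: the capability is discovered through the type system,
    # and unsupported operations surface as runtime errors.
    if not isinstance(source, SupportsBatchRead):
        raise TypeError("source does not support batch reads")
    return "scan planned"

def capabilities(source):
    # String style: capabilities reported as an open-ended set, so new
    # ones can be added without defining new marker types.
    return {"batch-read"} if isinstance(source, SupportsBatchRead) else set()
```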
The website is already up but I didn’t see any email announcement yet.
Have we deprecated Scala 2.11 already in an existing release?
On Tue, Nov 6, 2018 at 4:43 PM DB Tsai wrote:
> Supporting only Scala 2.12 in Spark 3 would be ideal.
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |
> Apple, Inc
>
> > On Nov 6, 2018, at 2:55 PM, Feli
What do OpenJDK and other non-Oracle VMs do? I know there were a lot of
discussions from Red Hat etc. about support.
On Tue, Nov 6, 2018 at 11:24 AM DB Tsai wrote:
> Given Oracle's new 6-month release model, I feel the only realistic option
> is to only test and support JDK such as JDK 11 LTS and f
Maybe deprecate and remove in the next version? It is bad to just remove a
method without a deprecation notice.
On Tue, Nov 6, 2018 at 5:44 AM Sean Owen wrote:
> See https://github.com/apache/spark/pull/22921#discussion_r230568058
>
> Methods like toDegrees, toRadians, approxCountDistinct were 'rename
+1
Look forward to the release!
On Mon, Oct 29, 2018 at 3:22 AM Wenchen Fan wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until November 1 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
I agree - it is very easy for users to shoot themselves in the foot if we
don't put in the safeguards, or mislead them by giving them the impression
that operations are cheap. DataFrame in Spark isn't like a single node
in-memory data structure.
Note that the repr string work is very different. Th
People do use it, and the maintenance cost is pretty low so I don't think
we should just drop it. We can be explicit that there is not a lot of
development going on and that we are unlikely to add a lot of new features to
it, and users are also welcome to use other JDBC/ODBC endpoint
implementations
I also think we should get this in:
https://github.com/apache/spark/pull/22841
It's to deprecate a confusing & broken window function API, so we can
remove them in 3.0 and redesign a better one. See
https://issues.apache.org/jira/browse/SPARK-25841 for more information.
On Thu, Oct 25, 2018 at 4
+1
On Thu, Oct 25, 2018 at 4:12 PM Li Jin wrote:
> Although I am not specifically involved in DSv2, I think having this kind
> of meeting is definitely helpful to discuss, move certain effort forward
> and keep people on the same page. Glad to see this kind of working group
> happening.
>
> On
I have some pretty serious concerns over this proposal. I agree that there
are many things that can be improved, but at the same time I also think the
cost of introducing a new IR in the middle is extremely high. Having
participated in designing some of the IRs in other systems, I've seen more
fail
e could argue that the litany of the questions are really a
>> double-click on the essence: why, what, how. The three interrogatives ought
>> to be the essence and distillation of any proposal or technical exposition.
>>
>> Cheers
>> Jules
>>
>> Sent from
Rounding.
On Wed, Oct 17, 2018 at 6:25 PM Sandeep Katta <
sandeep0102.opensou...@gmail.com> wrote:
> Hi Guys,
>
> I am trying to understand structured streaming code flow by doing so I
> came across below code flow
>
> def nextBatchTime(now: Long): Long = {
> if (intervalMs == 0) now else now /
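The "Rounding." answer refers to the integer-division arithmetic elided above. In plain Python (a sketch of the idea, not the Spark source):

```python
def next_batch_time(now, interval_ms):
    """Round `now` up to the next batch boundary.

    Integer division truncates, so now // interval_ms * interval_ms is the
    boundary at or before `now`; adding one interval gives the next trigger.
    """
    if interval_ms == 0:
        return now
    return now // interval_ms * interval_ms + interval_ms

t = next_batch_time(1050, 1000)  # 2000: the next multiple of 1000 after 1050
```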
We shouldn’t merge new features into release branches anymore.
On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse wrote:
> Right now the Kerberos support for Spark on K8S is only on master AFAICT
> i.e. the feature is not present on branch-2.4
>
>
>
> Therefore I don’t see any point in adding the tests i
Sounds like a good idea...
> On Oct 11, 2018, at 6:40 PM, Sean Owen wrote:
>
> Yep, that already exists as Bahir.
> Also, would anyone object to declaring Flume support at least
> deprecated in 2.4.0?
>> On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke wrote:
>>
>> I think it makes sense to remove
the seed value and we add
>> the seed name in the test case name. This can help us reproduce it.
>>
>> Xiao
>>
>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin wrote:
>>
>>> I'm personally not a big fan of doing it that way in the PR. It is
>>>
I'm personally not a big fan of doing it that way in the PR. It is
perfectly fine to employ randomized tests, and in this case it might even
be fine to just pick couple different timezones like the way it happened in
the PR, but we should:
1. Document in the code comment why we did it that way.
2
No, we used to have that (for views) but it wasn’t working well enough so we
removed it.
On Wed, Oct 3, 2018 at 6:41 PM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:
> Hi everyone,
> Is there any known way to go from a Spark SQL Logical Plan (optimised ?)
> Back to a SQL query ?
>
> R
Hi all,
The Apache Spark PMC has recently voted to add several new committers to
the project, for their contributions:
- Shane Knapp (contributor to infra)
- Dongjoon Hyun (contributor to ORC support and other parts of Spark)
- Kazuaki Ishizaki (contributor to Spark SQL)
- Xingbo Jiang (contribut
getting
>> everything right before we see the results of the new API being more widely
>> used, and too much cost in maintaining until the next major release
>> something that we come to regret for us to create new API in a fully frozen
>> state.
>>
Thoughts on what the API would look like?
On Thu, Sep 27, 2018 at 11:13 AM Russell Spitzer
wrote:
> While that's easy for some users, we basically want to load up some
> functions by default into all session catalogues regardless of who made
> them. We do this with certain rules and strategies us
That’s a pretty major architectural change and would be extremely difficult
to do at this stage.
On Tue, Sep 25, 2018 at 9:31 AM sandeep mehandru
wrote:
> Hi Folks,
>
>There is a use-case , where we are doing large computation on two large
> vectors. It is basically a scenario, where we run
We also only block if it is a new regression.
On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao wrote:
> Hi Marco,
>
> From my understanding of SPARK-25454, I don't think it is a block issue,
> it might be an corner case, so personally I don't want to block the release
> of 2.3.2 because of this issu
I'd just document it as a known limitation and move on for now, until there
are enough end users that need this. Spark is also very powerful with UDFs
and end users can easily work around this using UDFs.
--
excuse the brevity and lower case due to wrist injury
On Tue, Sep 18, 2018 at 11:14 PM s
i'd like to second that.
if we want to communicate timeline, we can add to the release notes saying
py2 will be deprecated in 3.0, and removed in a 3.x release.
--
excuse the brevity and lower case due to wrist injury
On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia
wrote:
> That’s a good point
Most of those are pretty difficult to add though, because they are
fundamentally difficult to do in a distributed setting and with lazy
execution.
We should add some but at some point there are fundamental differences
between the underlying execution engine that are pretty difficult to
reconcile.
makes sense - i'd make this as consistent with to_json / from_json as
possible.
how would this work in sql? i.e. how would passing options in work?
--
excuse the brevity and lower case due to wrist injury
On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk
wrote:
> Hi All,
>
> I would like to propose ne
we can also declare python 2 as deprecated and drop it in 3.x, not
necessarily 3.0.
--
excuse the brevity and lower case due to wrist injury
On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson wrote:
> I am probably splitting hairs to finely, but I was considering the
> difference between improvem
t be going to be duplicated.
>
> Ryan replied me as Iceberg and HBase MVCC timestamps can enable us to
> implement "commit" (his reply didn't hit dev. mailing list though) but I'm
> not an expert of both twos and I couldn't still imagine it can deal with
> v