Re: Output Committers for S3

2017-02-20 Thread Ryan Blue

Re: Will .count() always trigger an evaluation of each row?

2017-02-20 Thread Ryan Blue

Re: Output Committers for S3

2017-02-21 Thread Ryan Blue
On Tue, Feb 21, 2017 at 6:15 AM, Steve Loughran <ste...@hortonworks.com> wrote: On 21 Feb 2017, at 01:00, Ryan Blue <rb...@netflix.com.INVALID> wrote: "You'd have to encode the task ID in the output file name to identify files to roll back in the event …"

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
…"refresh progress"
    java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430)
        at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101)
        at org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71)
        at org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
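Since the trace points at ConsoleProgressBar in the driver, one low-risk mitigation (my assumption, not a conclusion from this thread) is to turn the console progress bar off entirely. A minimal sketch; the setting is a core config and must be in place before the SparkContext starts:

    import org.apache.spark.sql.SparkSession

    // spark.ui.showConsoleProgress disables the driver-side progress bar thread
    // (can also go in spark-defaults.conf); app name is a placeholder.
    val spark = SparkSession.builder()
      .appName("no-console-progress")
      .config("spark.ui.showConsoleProgress", "false")
      .getOrCreate()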

Re: Add hive-site.xml at runtime

2017-02-13 Thread Ryan Blue
…@spark.apache.org and my mail was bouncing each time, so Sean Owen suggested mailing dev (https://issues.apache.org/jira/browse/SPARK-19546). Please also give a solution to the above ticket if possible. Thanks, Shivam Sharma
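For context, one commonly used alternative to shipping a hive-site.xml at runtime is to set the metastore URI directly on the session builder. This is a hedged sketch, not the resolution of SPARK-19546; the thrift host is a placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-at-runtime")
      .config("hive.metastore.uris", "thrift://metastore-host:9083")  // hypothetical host
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()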

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
…the SPIP is recorded (rejected, accepted, etc.), and advising about the technical quality of the SPIP: this person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of …

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
…to ensure our quality. Spark is not application software. It is infrastructure software that is being used by many, many companies. We have to be very careful in the design and implementation, especially when adding/changing the external APIs.

When I developed mainframe infrastructure/middleware software in the past 6 years, I was involved in the discussions with external/internal customers. The to-do feature list was always above 100. Sometimes the customers feel frustrated when we are unable to deliver on time due to resource limits and other constraints. Even if they paid us billions, we still need to do it phase by phase, or sometimes they have to accept workarounds. That is the reality everyone has to face, I think.

Thanks,
Xiao Li

Re: Is it possible to get a job end kind of notification on the executor (slave)

2017-01-20 Thread Ryan Blue
…finished? I tried doing it in the JobProgressListener but it does not seem to work in a cluster. The event is not triggered in the worker. Regards, Keith. http://keith-chapman.com
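As the reply notes, listener events fire on the driver rather than on executors. A minimal sketch of the driver-side approach (illustrative only, not from the thread):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("job-end-listener").getOrCreate()

    // onJobEnd runs on the driver when each job finishes; executors never see it.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
        println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
      }
    })

    spark.range(100).count()  // triggers a job; the listener prints when it ends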

Re: Spark Improvement Proposals

2017-02-27 Thread Ryan Blue
…process improvement; it's like everything depends only on the shepherd. I also want to add the point that an SPIP should be time-bound with a defined SLA, else it will defeat the purpose. Regards, …

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
…type of JIRA called a SIP and have a link to a filter that shows all such JIRAs from http://spark.apache.org. I also like the idea of SIP and design doc templates (in fact many projects have them).

Matei

On Oct 7, 2016, at 10:38 AM, Reynold Xin <[hidden email]> wrote:

I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project.

Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all their time to Spark. In fact, it is in some ways similar to scaling an engineering team in a successful startup: old processes that worked well might not work so well when it gets to a certain size, cultures can get diluted, building culture vs building process, etc.

I would also really like to have a more visible process for larger changes, especially major user-facing API changes. Historically we upload design docs for major changes, but it is not always consistent, and it is difficult to ensure the quality of the docs, due to the volunteering nature of the organization.

Some of the more concrete ideas we discussed focus on building a culture to improve clarity:

- Process: Large changes should have design docs posted on JIRA. One thing Cody and I didn't discuss, but an idea that just came to me, is that we should create a design doc template for the project and ask everybody to follow it. The design doc template should also explicitly list goals and non-goals, to make design docs more consistent.

- Process: Email dev@ to solicit feedback. We have done this with some changes, but again very inconsistently. Just posting something on JIRA isn't sufficient, because there are simply too many JIRAs and the signal gets lost in the noise. While this is generally impossible to enforce because we can't force all volunteers to conform to a process (or they might not even be aware of it), those who are more familiar with the project can help by emailing the dev@ list when they see something that hasn't been.

- Culture: The design doc author(s) should be open to feedback. A design doc should serve as the base for discussion and is by no means the final design. Of course, this does not mean the author has to accept every piece of feedback. They should also be comfortable accepting / rejecting ideas on technical grounds.

- Process / Culture: For major ongoing projects, it can be useful to have some monthly Google hangouts that are open to the world. I am actually not sure how well this will work, because of the volunteering nature and the need to adjust for timezones for people across the globe, but it seems worth trying.

- Culture: Contributors (including committers) should be more direct in setting expectations, including whether they are working on a specific issue, whether they will be working on a specific issue, and whether an issue or PR or JIRA should be rejected. Most people I know in this community are nice and don't enjoy telling other people no, but it is often more annoying to a contributor to not know anything than to get a no.

On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia <[hidden email]> wrote:

Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users. However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like the Java standard library, the Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Ryan Blue
…this "claims" to handle, for example, Option[Set[Int]], but it really cannot handle Set, so it leads to a runtime exception. Would it be useful to make this a little more specific? I guess the challenge is going to be case classes, which unfortunately don't extend Product1, Product2, etc.
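A minimal sketch of the behavior being described, assuming the Spark 2.x-era encoder derivation this thread refers to (the exact exception text is my assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()
    import spark.implicits._

    // This compiles because an implicit product encoder is found for Option,
    // but encoder construction is expected to fail at runtime with an
    // unsupported-type error for Set[Int], matching the report above.
    val ds = Seq(Option(Set(1, 2, 3))).toDS()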

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-28 Thread Ryan Blue
…please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.

Re: Spark Improvement Proposals

2016-11-08 Thread Ryan Blue
…batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing. We need to make it a state-of-the-art engine for high speed streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:

Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, very good community - it's all here. But every project has to "fight" on the "framework market" to stay number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)

You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at StackOverflow "Flink vs" posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMCs) just post the same information about real-time streaming, delta iterations, etc. It looks smart and is very often marked as the answer, even if - in my opinion - the whole truth wasn't told.

My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying you're most visible in the community :) ) could run performance tests of:

- the streaming engine - probably Spark will lose because of the micro-batch model, however currently the difference should be much lower than in previous versions
- Machine Learning models
- batch jobs
- graph jobs
- SQL queries

People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X". Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease-of-use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us you're better, but Spark is still faster and is still getting faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.

Second: real-time streaming. I've written some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments that "Spark has too big a latency". Spark Streaming is doing a very good job with micro-batches, however I think it is possible to add more real-time processing as well.

Other people have said much more, and I agree with the proposal of a SIP. I'm also happy that the PMC is not saying that they will not listen to users, but that they really want to make Spark better for every user.

What do you think about these two topics? Especially, I'm looking at Cody (who started this topic) and the PMC :)

Pozdrawiam / Best regards,
Tomasz

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-09 Thread Ryan Blue
…https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/

Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present in 2.0.1, missing features, or bugs related to new features will not necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.

Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-14 Thread Ryan Blue

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Ryan Blue
    org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:722)

Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Ryan Blue
…Improvement Proposals exactly, but you need a documented process with a clear outcome (e.g. a vote). Passing around google docs after an implementation has largely been decided on doesn't cut it.

- All technical communication needs to be public. Things getting decided in private chat, or when 1/3 of the committers work for the same company and can just talk to each other... Yes, it's convenient, but it's ultimately detrimental to the health of the project. The way structured streaming has played out has shown that there are significant technical blind spots (myself included). One way to address that is to get the people who have domain knowledge involved, and listen to them.

- We need more committers, and more committer diversity. Per committer there are, what, more than 20 contributors and 10 new JIRA tickets a month? It's too much. There are people (I am _not_ referring to myself) who have been around for years, contributed thousands of lines of code, helped educate the public around Spark... and yet are never going to be voted in.

- We need a clear process for managing volunteer work. Too many tickets sit around unowned, unclosed, uncertain. If someone proposed something and it isn't up to snuff, tell them and close it. It may be blunt, but it's clearer than a "silent no". If someone wants to work on something, let them own the ticket and set a deadline. If they don't meet it, close it or reassign it.

This is not me putting on an Apache Bureaucracy hat. This is me saying, as a fellow hacker and loyal dissenter, something is wrong with the culture and process. Please, let's change it.

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Ryan Blue
…https://github.com/apache/spark/pull/15538 needs to make it into 2.1. The logging output issue is really bad. I would probably call it a blocker. — Michael. On Nov 1, 2016, at 1:22 PM, Ryan Blue <rb...@netflix.com> wrote: "I can when I'm finished with a coup…"

Re: Updating Parquet dep to 1.9

2016-11-01 Thread Ryan Blue
On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid> wrote: "1.9.0 includes some fixes intended specifically for Spark: * PARQUET-389: Evaluates push-down predicates for missing columns as though they are null. This is t…"
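To make the PARQUET-389 point concrete, here is an illustrative sketch (paths and column names are hypothetical, not from the thread): a filter on a column that exists only in newer files should treat the column as null for older files instead of misbehaving.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("parquet-389-demo").getOrCreate()

    // old files lack `new_col`; new files have it
    spark.range(10).toDF("id").write.parquet("/tmp/t/old")
    spark.range(10).toDF("id").selectExpr("id", "id AS new_col").write.parquet("/tmp/t/new")

    // mergeSchema combines both schemas; the pushed-down predicate on new_col is
    // evaluated as null for the old files, so they are filtered out safely.
    val df = spark.read.option("mergeSchema", "true").parquet("/tmp/t/old", "/tmp/t/new")
    df.filter("new_col > 5").show()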

Re: Spark Improvement Proposals

2016-10-11 Thread Ryan Blue
…other than committers can cast a meaningful vote; that's the reality. Beyond that, if people think it's more open to allow formal proposals from anyone, I'm not necessarily against it, but my main question would be thi…

Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-17 Thread Ryan Blue
Are these changes that the Hive community has rejected? I don't see a compelling reason to have a long-term Spark fork of Hive. rb. On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran <ste...@hortonworks.com> wrote: On 15 Oct 2016, at 01:28, Ryan Blue <rb...@netflix.com…

Parquet patch release

2017-01-06 Thread Ryan Blue
…1701.mbox/%3CCAO4re1mnWJ3%3Di0NpUmPU%2BwD8G%3DsG_%2BAA2PsFBzZv%3DwrUR1529g%40mail.gmail.com%3E on the Parquet dev list. If you're interested in reviewing what goes into 1.8.2 or have suggestions, please follow that thread on the Parquet list. Thanks! rb

Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread Ryan Blue
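Related to this thread's subject, one setting that exists in Spark 2.1+ for skipping unreadable files is sketched below; treat it as a general note rather than the answer quoted from this thread.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("skip-corrupt").getOrCreate()

    // Applies to file-based sources such as Parquet; files with unreadable
    // footers/blocks are logged and skipped instead of failing the query.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    val df = spark.read.parquet("/path/to/possibly-corrupt-table")  // hypothetical path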

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
…metastore, can you tell me which version is more compatible with Spark 2.0.2? Thanks
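A hedged sketch of the relevant settings for pointing Spark at a specific Hive metastore version (values are examples, not a recommendation from this thread); these are static configs, so they belong on the builder or in spark-defaults.conf:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-metastore-version")
      .config("spark.sql.hive.metastore.version", "1.2.1")   // match your metastore
      .config("spark.sql.hive.metastore.jars", "builtin")     // or a path to matching Hive jars
      .enableHiveSupport()
      .getOrCreate()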

Re: Output Committers for S3

2017-03-28 Thread Ryan Blue
    …dp.s3.S3PartitionedOutputCommitter not org.apache.parquet.hadoop.ParquetOutputCommitter
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2221)
        ... 28 more

Can you please point out my mistake? If possible, can you give a working example of saving a dataframe as a Parquet file in S3?
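A hedged sketch of the configuration involved (an assumption, not a verified fix from this thread): for Parquet output, the class set via spark.sql.parquet.output.committer.class must extend org.apache.parquet.hadoop.ParquetOutputCommitter, which is what the error above is complaining about; other file formats use a separate setting.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-committer")
      // must be a ParquetOutputCommitter subclass; a plain FileOutputCommitter subclass fails here
      .config("spark.sql.parquet.output.committer.class",
              "org.apache.parquet.hadoop.ParquetOutputCommitter")
      // non-Parquet file formats are configured through this separate key instead
      .config("spark.sql.sources.outputCommitterClass",
              "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter")
      .getOrCreate()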

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-10 Thread Ryan Blue
…https://repository.apache.org/content/repositories/orgapachespark-1227/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/

FAQ

How can I help test this release? If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.

What should happen to JIRA tickets still targeting 2.1.1? Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??! In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0.

What happened to RC1? There were issues with the release packaging and as a result it was skipped.

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-14 Thread Ryan Blue
ase >>>>> coordinator so I understand if that's not actually faster). >>>>> >>>>> On Mon, Apr 10, 2017 at 6:39 PM, DB Tsai <dbt...@dbtsai.com> wrote: >>>>> >>>>>> I backported the fix into both branch-2.1 and branch-2.0. Thank

Re: [Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-07 Thread Ryan Blue
    …Command$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
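One related knob, offered as an assumption rather than this thread's conclusion: the v2 file output commit algorithm does not remove the _temporary staging directory, but it commits task output directly into the destination, which avoids the slow final job-commit rename that usually motivates this question.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("commit-algorithm-v2")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()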

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
…for a stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs (http://spark.apache.org/docs/latest/configuration.html) and search for "blacklist" to see all the options. rb. On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <rb...@netflix.c…
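A minimal sketch of the settings mentioned above (Spark 2.1+ blacklisting); the values are illustrative, not recommendations from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("blacklist-settings")
      .config("spark.task.maxFailures", "4")                            // task failures before the stage fails
      .config("spark.blacklist.enabled", "true")
      .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")   // retry on a different executor
      .getOrCreate()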

Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

2017-08-01 Thread Ryan Blue
…good idea because of the following technical reasons. Thanks! -- Marcelo

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
…memoryOverhead. Driver memory=4g, executor mem=12g, num-executors=8, executor cores=8. Do you think the settings below can help me overcome the above issue: spark.default.parallelism=1000, spark.sql.shuffle.partitions=1000? Because the default max number of partitions is 1000.
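A hedged sketch of the knobs being discussed (values are illustrative only): spark.yarn.executor.memoryOverhead raises the off-heap headroom YARN allows per executor, and spark.sql.shuffle.partitions controls how many shuffle partitions the repartition/write produces.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("yarn-memory-tuning")
      .config("spark.yarn.executor.memoryOverhead", "2048")   // MiB; a submit-time setting
      .config("spark.sql.shuffle.partitions", "1000")
      .getOrCreate()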

Re: Output Committers for S3

2017-06-19 Thread Ryan Blue
…the above comments is probably (maybe Ryan or Steve can confirm this assumption) not applicable to the Netflix committer uploaded by Ryan Blue, because Ryan's committer uses multipart upload. So either the whole file is live or nothing is; partial data will not be available for read. What…

Re: Parquet vectorized reader DELTA_BYTE_ARRAY

2017-05-22 Thread Ryan Blue
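For context on this thread's subject: Spark's vectorized Parquet reader does not handle some Parquet v2 encodings such as DELTA_BYTE_ARRAY, and a commonly used workaround (my assumption, not quoted from the snippet) is to fall back to the non-vectorized reader at some performance cost.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("parquet-fallback").getOrCreate()
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    val df = spark.read.parquet("/path/to/delta-encoded-files")  // hypothetical path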

Re: Are release docs part of a release?

2017-06-08 Thread Ryan Blue
…definitely not a release blocker. In any event I just resolved SPARK-20507, as I don't believe any website updates are required for this release anyway. That fully resolves the ML QA umbrella (SPARK-20499).

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-01 Thread Ryan Blue
…expects to find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark already has to work around this for unit tests to pass. On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rb...@netflix.com> wrote: "Thanks for the extra context, Frank. I a…"

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
…if there is something which is a regression from 2.1.1 that has not been correctly targeted, please ping a committer to help target the issue (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2 in JIRA).

What are the unresolved issues targeted for 2.1.2? At the time of writing, there is one in-progress major issue, SPARK-21985 (https://issues.apache.org/jira/browse/SPARK-21985); I believe Andrew Ray & Hyukjin Kwon are looking into this one.

Re: CHAR implementation?

2017-09-15 Thread Ryan Blue
…the correct one in Spark, but Parquet has been the de-facto standard in Spark as well. (I'm not comparing this with the other DBMSs.) I'm wondering which way we need to go, or want to go, in Spark? Bests, Dongjoon.

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
…however the Jenkins scripts don't take a parameter for configuring who signs them; rather, they always sign them with Patrick's key. You can see this from previous releases which were managed by other folks but still signed by Patrick. On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue …

Re: Signing releases with pwendell or release manager's key?

2017-09-15 Thread Ryan Blue
"That's a good question. I built the release candidate, however the Jenkins scripts don't take a parameter for configuring who signs them; rather, they always sign them with Patrick's key. You can see this from previous releases which were managed by other folk…"

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
…with a list of predefined options, to save users from typing these options again and again for each query. If that's all, then everything is good and we don't need to add more interfaces to Data Source V2. However, data source tables provide special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data sources to have some extra ability. Currently these special operators only work for built-in file-based data sources, and I don't think we will extend that in the near future, so I propose to mark them as out of scope. Any comments are welcome! Thanks, Wenchen

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
…built-in file-based data sources, and I don't think we will extend that in the near future, so I propose to mark them as out of scope. Any comments are welcome! Thanks, Wenchen

Re: Signing releases with pwendell or release manager's key?

2017-09-19 Thread Ryan Blue
…PM, Marcelo Vanzin <van...@cloudera.com> wrote: "+1 to this. There should be a script in the Spark repo that has all the logic needed for a release. That script should take the RM's key…"

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Ryan Blue
…able parameters; right now it depends on Josh, as there are some scripts which generate the jobs which aren't public. I've done temporary fixes in the past with the Python packaging, but my understanding is that in the medium term it requires …

Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Ryan Blue
…the thread Sean Owen made. On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue <rb...@netflix.com> wrote: "I'm not familiar with the release procedure; can you send a link to this Jenkins job? Can anyone run this job, or is it limited to committers?"

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Ryan Blue
…https://github.com/apache/spark-website/pull/66 as I progress), however the chances of a mistake are higher with any change like this. If there is something you normally take for granted as correct when checking a release, please double check this time :)

Should I be committing code to branch-2.1? Thanks for asking! Please treat this stage in the RC process as "code freeze", so bug fixes only. If you're uncertain whether something should be back-ported, please reach out. If you do commit to branch-2.1, please tag your JIRA issue fix version as 2.1.3, and if we cut another RC I'll move the 2.1.3 fixes into 2.1.2 as appropriate.

What happened to RC3? Some R+zinc interactions kept it from getting out the door.

Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Ryan Blue
I also suggested it because this behavior appears to be the default for ASF projects. It wasn't clear why Spark was set up differently. On Thu, Oct 5, 2017 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote: "While I have also felt this frustration and understand the…"

Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Ryan Blue
…h...@gmail.com> wrote: It can stop reopening, but new JIRA issues with duplicate content will be created intentionally instead. Is that policy (privileged reopening) used in other Apache communities for that purpose?

On Wed, Oct 4, 2017 at 7:06 PM, Sean Owen <so...@cloudera.com> wrote: We have this problem occasionally, where a disgruntled user continually reopens an issue after it's closed. https://issues.apache.org/jira/browse/SPARK-21999 (Feel free to comment on this one if anyone disagrees.) Regardless of that particular JIRA, I'd like to disable the Closed -> Reopened transition for non-committers: https://issues.apache.org/jira/browse/INFRA-15221

Re: 2.1.2 maintenance release?

2017-09-08 Thread Ryan Blue
…tial credential to upload artifacts. On Thu, Sep 7, 2017 at 11:59 PM Holden Karau <hol...@pigscanfly.ca> wrote: "I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that) if people are OK with a committer / me running the release process rather than a full PMC member."

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
…essentially the same as I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. However, here we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually - and these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.
- Filtering with certain filters might affect my ability to push down others.
- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change my results because they might change later.

Really what would be good here is to pass all of the filters and sorts etc. at once, and then I return the parts I can't handle.

I'd prefer in general that this be implemented by passing some kind of query plan to the datasource which enables this kind of replacement. I explicitly don't want to give the whole query plan - that sounds painful - I would prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing - I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it.

I think this ends up being a more elegant API for consumers, and also far more intuitive.

James

On Mon, 28 Aug 2017 at 18:00, 蒋星博 <jiangxb1...@gmail.com> wrote: +1 (Non-binding)

Xiao Li <gatorsm...@gmail.com> wrote on 2017-08-28: +1

On 2017-08-28 at 12:45, Cody Koeninger <c...@koeninger.org> wrote: Just wanted to point out that because the JIRA isn't labeled SPIP, it won't have shown up linked from http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com> wrote: Hi all, it has been almost 2 weeks since I proposed the data source V2 for discussion, and we already got some feedback on the JIRA ticket and the prototype PR, so I'd like to call for a vote.

The full document of the Data Source API V2 is: https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit

Note that this vote should focus on the high-level design/framework, not specific APIs, as we can always change/improve specific APIs during development.

The vote will be up for the next 72 hours. Please reply with your vote:
+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!
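For readers unfamiliar with the existing (V1) mix-in James contrasts against, here is a toy sketch of PrunedFilteredScan showing its single push-down hook; the relation below is illustrative only, not a real connector.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    class ToyRelation(override val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
      override def schema: StructType = StructType(Seq(StructField("id", LongType)))

      // Filters returned here are re-applied by Spark; everything else is assumed handled.
      override def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

      // The only push-down entry point: required columns plus the filters Spark offers.
      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
        sqlContext.sparkContext.parallelize(0L until 10L).map(Row(_))
    }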

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
…future without breaking any APIs. I'd rather us ship something useful that might not be the most comprehensive set, than debate every single feature we should add and then create something super complicated that has unclear value. On Wed, Aug 3…

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Ryan Blue
…your vote: +1: Yeah, let's go forward and implement the SPIP. +0: Don't really care. -1: I don't think this is a good idea because of the following technical reasons. Thanks! -- Herman van Hövell, Software Engineer, Databricks Inc.

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Ryan Blue
I do appreciate your feedback/comments on the prototype; let's keep the discussion there. In the meantime, let's have more discussion on the overall framework, and drive this project together. Wenchen. On Thu, Aug 31, 2017 at 6:22 AM, Ryan Blue <rb...@netflix.com…

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Ryan Blue
…The same thing applies to Hadoop FS data sources; we need to pass metadata to the writer anyway. On Tue, Sep 26, 2017 at 1:08 AM, Ryan Blue <rb...@netflix.com> wrote: "However, without catalog federation, Spark doesn't have an API to ask an…"

Re: Should Flume integration be behind a profile?

2017-09-26 Thread Ryan Blue
…unless you generate the sources manually)

Re: [discuss] Data Source V2 write path

2017-10-02 Thread Ryan Blue
…that don't have a metastore. Personally I prefer proposal 3, because it's not blocked by catalog federation, so that we can develop it incrementally. And it makes the catalog support optional, so that simple data sources without a metastore can also implement data source v2.

Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-29 Thread Ryan Blue
This is the first release in a while not built on the AMPLAB Jenkins. This is good because it means future releases can more easily be built and signed securely (and I've been updating the documentation in https://github.com/apache/spark-website/pull/66 as I progress), however the chances of a mistake are higher with any change like this. If there is something you normally take for granted as correct when checking a release, please double check this time :)

Should I be committing code to branch-2.1? Thanks for asking! Please treat this stage in the RC process as "code freeze", so bug fixes only. If you're uncertain whether something should be back-ported, please reach out. If you do commit to branch-2.1, please tag your JIRA issue fix version as 2.1.3, and if we cut another RC I'll move the 2.1.3 fixes into 2.1.2 as appropriate.

Why the longer voting window? Since there is a large industry big data conference this week, I figured I'd add a little bit of extra buffer time just to make sure everyone has a chance to take a look.

Re: [discuss] Data Source V2 write path

2017-09-29 Thread Ryan Blue
…can implement some dirty features via options, e.g. file format data sources can take partitioning/bucketing from options, and a data source with a metastore can use a special flag in options to indicate a create table command (without writing data).

I can see how this would make changes smaller, but I don't think it is a good thing to do. If we do this, then I think we will not really accomplish what we want to with this (a clean write API).

In other words, Spark connects users to data sources with a clean protocol that only focuses on data, but this protocol has a backdoor: the data source options. Concrete data sources are free to define how to deal with metadata, e.g. the Cassandra data source can ask users to create the table on the Cassandra side first, then write data on the Spark side, or ask users to provide more details in options and do CTAS on the Spark side. These can all be done via options. After catalog federation, hopefully only file format data sources will still use this backdoor.

Why would file format sources use a back door after catalog federation?

rb

Re: [discuss] Data Source V2 write path

2017-09-27 Thread Ryan Blue
…free to define how to deal with metadata, e.g. the Cassandra data source can ask users to create the table on the Cassandra side first, then write data on the Spark side, or ask users to provide more details in options and do CTAS on the Spark side. These can be done via options. After catalog fe…

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Ryan Blue
…to take this information and create the table, or throw an exception if this information doesn't match the already-configured table. On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rb...@netflix.com> wrote: "> input data requirement. Clustering…"

Re: OutputMetrics empty for DF writes - any hints?

2017-12-12 Thread Ryan Blue
Great. What's the JIRA issue? On Mon, Dec 11, 2017 at 8:12 PM, Jason White <jason.wh...@shopify.com> wrote: "Yes, the fix has been merged and should make it into the 2.3 release." On Mon, Dec 11, 2017, 5:50 PM Ryan Blue <rb...@netflix.com> wrote: "Is anyon…"

Re: OutputMetrics empty for DF writes - any hints?

2017-12-11 Thread Ryan Blue
…see any commands with metrics, but I could be missing something.
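For reference, this is where the numbers in question surface once reported: output metrics are exposed on the listener bus. A minimal sketch (paths are hypothetical):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("output-metrics").getOrCreate()

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          val out = metrics.outputMetrics
          println(s"task wrote ${out.recordsWritten} records / ${out.bytesWritten} bytes")
        }
      }
    })

    spark.range(1000).write.mode("overwrite").parquet("/tmp/metrics_demo")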

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
I've opened SPARK-24215 to track this. On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote: "Yup. Sounds great. This is something simple Spark can do and provide huge value to the end users." On Tue, May 8, 2018 at 3:53 PM Ryan Blue …
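For readers tracing the outcome: the setting that eventually came out of SPARK-24215 is, to my understanding (naming per Spark 2.4, so treat it as an assumption in the context of this thread), an eager-evaluation flag that makes notebook/REPL cells render a DataFrame preview as soon as the expression is typed.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("eager-eval")
      .config("spark.sql.repl.eagerEval.enabled", "true")
      .getOrCreate()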

[DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
…InternalRow, then there is an easy performance win that also simplifies the v2 data source API. rb
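A small sketch of producing InternalRow directly, the cheaper path argued for above, as opposed to building UnsafeRow; note the internal string type is used instead of java.lang.String.

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    import org.apache.spark.unsafe.types.UTF8String

    // An InternalRow holding (id: Long, name: String) in Spark's internal representation.
    val row: InternalRow = new GenericInternalRow(Array[Any](42L, UTF8String.fromString("blue")))

    // Fields are read back with type-specific accessors.
    assert(row.getLong(0) == 42L)
    assert(row.getUTF8String(1).toString == "blue")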

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
…Reynold Xin <r...@databricks.com> wrote: "What the internal operators do is strictly internal. To take one step back, is the goal to design an API so the consumers of the API can directly produce what Spark expects internally, to cut down the perf cost?" On Tue, May 8, 2…

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
…also struggle in similar ways as these students. While eager execution is really not practical in big data, in learning environments or in development against small, sampled datasets it can be pretty helpful.

Re: eager execution and debuggability

2018-05-08 Thread Ryan Blue
…there are tools/ways to force the execution, helping in the debugging phase. So they can achieve the same result without much effort, but with a big difference: they are aware of what is really happening, which may help them later. Thanks

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Ryan Blue
…different types of rows, so we forced the conversion at input. Can't your "wish" be satisfied by having the public API produce the internals of UnsafeRow (without actually exposing UnsafeRow)? On Tue, May 8, 2018 at 4:16 PM Ryan Blue <rb...@netflix.com> …

Re: eager execution and debuggability

2018-05-10 Thread Ryan Blue
…end that triggered the error. I don't know how feasible this is, but addressing it would directly solve the issue of linking failures to the responsible transformation, as opposed to leaving the user to break up a chain of trans…

Re: Time for 2.3.1?

2018-05-10 Thread Ryan Blue
…the weekend). Thanks! -- Marcelo

Re: Time for 2.3.1?

2018-05-11 Thread Ryan Blue
…(by replying here or updating the bug in JIRA), otherwise I'm volunteering to prepare the first RC soon-ish (around the weekend). Thanks! -- Marcelo

Re: eager execution and debuggability

2018-05-21 Thread Ryan Blue
…" nodes and see how many jobs were triggered, and the number of tasks and their duration. Right now it's hard to debug, especially for newbies. Pozdrawiam / Best regards, Tomek Gawęda. On 2018-05-10 18:31, Ryan Blue wrote: "it would be fantastic if…"

Re: Very slow complex type column reads from parquet

2018-06-18 Thread Ryan Blue
…level of parallelism (searching for a given object id when sorted by time needs to scan all/more of the groups for larger times). One question here - is the Parquet reader reading & decoding the projection columns even if the predicate columns should filter the record out? Unfortunately…

Re: Very slow complex type column reads from parquet

2018-06-12 Thread Ryan Blue
…let me know if there is anybody currently working on it, or maybe you have it in a roadmap for the future? Or maybe you could give me some suggestions on how to avoid / resolve this problem? I'm using Spark 2.2.1. Best regards, Jakub Wozniak

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Ryan Blue
…re to Spark users. So a design that we can't migrate file sources to without a side channel would be worrying; won't we end up regressing to the same situation? On Mon, Apr 30, 2018 at 11:59 AM, Ryan Blue <rb...@netflix.com> wrote: "Should we really plan the API fo…"

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
    …ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349)

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
…After enabling checkpointing, I do see a folder being created under the checkpoint folder, but there's nothing else in there. Same question for write-ahead and recovery. And on a restart from a failed streaming session - who should set the offsets? The driver/Spark or the datasource? Any pointers to design docs would also be greatly appreciated. Thanks, Jayesh
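For orientation, here is a minimal sketch of how a streaming query's offsets and commits get persisted today: the engine, not the source, writes them under checkpointLocation. The rate source and paths are illustrative only.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("checkpoint-demo").getOrCreate()

    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val query = stream.writeStream
      .format("parquet")
      .option("path", "/tmp/rate_sink")                     // hypothetical sink path
      .option("checkpointLocation", "/tmp/rate_checkpoint") // offsets/ and commits/ appear here
      .start()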

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
…has ever read. In order to parse this efficiently, the stream connector needs detailed control over how it's stored; the current implementation even has complex compactification and retention logic. On Mon, Apr 30, 2018 at 10:48 AM, Ryan Blue <rb...@netflix.com> wrote…

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
or shouldn't > come. Let me know if this understanding is correct > > On Tue, May 1, 2018 at 9:37 PM, Ryan Blue <rb...@netflix.com> wrote: > >> This is usually caused by skew. Sometimes you can work around it by in >> creasing the number of partitions like you tri
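A sketch of the two usual mitigations for the skewed-shuffle case; the column names and salt width are made up, and an active SparkSession named `spark` is assumed.

```scala
import org.apache.spark.sql.functions._

// 1) More shuffle partitions: helps only if the oversized partition spans many keys.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// 2) Salting a hot key: aggregate per (key, salt) first, then per key, so one
//    key's data is spread across many tasks instead of one huge partition.
val df = spark.read.parquet("/data/traffic")          // hypothetical input
val bySalt = df
  .withColumn("salt", (rand() * 32).cast("int"))
  .groupBy(col("user_id"), col("salt"))
  .agg(sum(col("bytes")).as("partial"))
val byUser = bySalt
  .groupBy(col("user_id"))
  .agg(sum(col("partial")).as("bytes"))
```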

Re: Spark 3

2018-01-19 Thread Ryan Blue
f shading. However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x. I am not sure if that causes problems for apps. Normally I'd avoid any major-version change in a minor release. This one looked potentially entirely internal. I think if there are any doubts, we can leave it for Spark 3. There was a bug report that needed a fix from Kryo 4, but it might be minor after all.

PSA: Release and commit quality

2018-01-30 Thread Ryan Blue
ere we can improve. It is just far easier to get a branch committed as-is than to adhere to these guidelines, but these are important for our releases and downstream users. Thanks for reading, rb

DataSourceV2: support for named tables

2018-02-02 Thread Ryan Blue
that need to be in sync with the same convention. On the other hand, passing TableIdentifier to DataSourceV2Relation and relying on the relation to correctly set the options passed to readers and writers minimizes the number of places that conversion needs to happen. rb
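To make the second option concrete, an illustrative-only sketch of the relation doing the identifier-to-options conversion in one place; the option keys are not a proposed convention.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Single place that turns a table identifier into the options handed to a
// v2 reader/writer, so the convention isn't re-encoded at every call site.
def identifierOptions(ident: TableIdentifier): Map[String, String] =
  Map("table" -> ident.table) ++ ident.database.map(db => "database" -> db)
```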

Re: DataSourceV2: support for named tables

2018-02-02 Thread Ryan Blue
ces are between: > - Spark > - A (The?) metastore > - A data source > > If we pass in the table identifier is the data source then responsible for > talking directly to the metastore? Is that what we want? (I'm not sure) > > On Fri, Feb 2, 2018 at 10:39 AM, Ryan Blue <

Re: data source v2 online meetup

2018-02-01 Thread Ryan Blue
t 9:10 AM, Felix Cheung <felixcheun...@hotmail.com> wrote: > +1 hangout > > -- > *From:* Xiao Li <gatorsm...@gmail.com> > *Sent:* Wednesday, January 31, 2018 10:46:26 PM > *To:* Ryan Blue > *Cc:* Reynold Xin; dev; Wenchen Fen; Russell

SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-01 Thread Ryan Blue
and easier maintenance. rb

Re: data source v2 online meetup

2018-01-31 Thread Ryan Blue
; by implementing some new sources or porting an existing source over.

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-05 Thread Ryan Blue
r >> and easier maintenance. >> > Context aside, I really like these rules! I think having query planning be > the boundary for specialization makes a lot of sense. > > (RunnableCommand might also be my fault though sorry! :P)

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
; > Dong

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
aware of the issue > within a month, and we certainly don’t run as large data infrastructure > compared to Netflix. > > > > I will keep an eye on this issue. > > > > Thanks, > > > Dong > > > > *From: *Ryan Blue <rb...@netflix.com> > *Reply-

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
If we see the _SUCCESS file, does that suggest all data is > good? > > How can we prevent a recurrence? Can you share your experience? > > > > Thanks, > > > Dong > > > > *From: *Ryan Blue <rb...@netflix.com> > *Reply-To: *"rb...@netflix.com"
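_SUCCESS only records that the job committed; it does not checksum file contents. A cheap post-write validation sketch, with a hypothetical path, assuming that fully decoding every column would surface the corruption in the producing job rather than in a downstream reader:

```scala
import org.apache.spark.sql.functions._

// Force every column to be read and decoded right after the write; a corrupt
// page or column chunk typically fails here instead of weeks later.
val written = spark.read.parquet("/warehouse/events/dt=2018-02-05")  // hypothetical path
val probe = written
  .select(hash(written.columns.map(col): _*).as("h"))  // touches all columns
  .agg(sum("h"))
  .collect()
```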

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
file. The issue seems only impact one column, and > very hard to detect. It seems you have encountered this issue before, what > do you do to prevent a recurrence? > > > > Thanks, > > > > Dong > > > > *From: *Ryan Blue <rb...@netflix.com> > *Reply-T

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-06 Thread Ryan Blue
? A transaction made of from a delete and an insert would work. Is this what we want to use? How do we add this to v2? rb ​ -- Ryan Blue Software Engineer Netflix

Re: Corrupt parquet file

2018-02-12 Thread Ryan Blue
s not going to fix the problem —not if this really is corrupted local > HDD data

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-15 Thread Ryan Blue
on in FeatureHasher before >> FeatureHasher released in 2.3.0. >> >> https://issues.apache.org/jira/browse/SPARK-23381 >> https://github.com/apache/spark/pull/20568 >> >> I will fix it soon.

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-20 Thread Ryan Blue
Please see https://s.apache.org/oXKi. At the time of writing, there are currently no known release blockers.

How can I help test this release?
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions. If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.3.0?
Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

Why is my bug not fixed?
In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.2.0. That being said, if there is something which is a regression from 2.2.0 and has not been correctly targeted please ping me or a committer to help target the issue (you can see the open issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).

-- Sameer Agarwal, Computer Science | UC Berkeley, http://cs.berkeley.edu/~sameerag
-- Takuya UESHIN, Tokyo, Japan, http://twitter.com/ueshin

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Ryan Blue
er shuffle services, but it's pretty safe. >> >> >> >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal <samee...@apache.org> >> >> wrote: >> >> > This RC has failed due to >> >> > https://issues.apache.org/jira/browse/SPARK-23470. >&g

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Ryan Blue
we should learn from, that we should work > on stuff we want in the release before the RC, instead of after. > > On Thu, Feb 22, 2018 at 1:01 AM, Ryan Blue <rb...@netflix.com.invalid> > wrote: > >> What does everyone think about getting some of the newer DataSourceV2 &g

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Ryan Blue
Please see https://s.apache.org/oXKi. At the time of writing, there are currently no known release blockers.

-- Twitter: https://twitter.com/holdenkarau

Re: [DISCUSS] Multiple catalog support

2018-07-29 Thread Ryan Blue
Wenchen, what I'm suggesting is a bit of both of your proposals. I think that USING should be optional like your first option. USING (or format(...) in the DF side) should configure the source or implementation, while the catalog should be part of the table identifier. They serve two different
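A sketch of how that separation reads for users, assuming the multi-part identifier and catalog-registration support this proposal points toward (syntax as it eventually looks in Spark 3.x); the catalog name and its implementation class are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Register a catalog by name; the class is whatever implements the catalog API.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.prod", "com.example.ProdCatalog")  // hypothetical impl
  .getOrCreate()

// The catalog is part of the table identifier; USING selects the source/format
// and can be omitted when the catalog supplies a default.
spark.sql("CREATE TABLE prod.db.events (id BIGINT, ts TIMESTAMP) USING parquet")
spark.sql("INSERT INTO prod.db.events VALUES (1, current_timestamp())")
```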

Re: [DISCUSS] SPIP: APIs for Table Metadata Operations

2018-07-26 Thread Ryan Blue
ersions. > > > On Tue, Jul 24, 2018 at 9:26 AM Ryan Blue > wrote: > >> The recently adopted SPIP to standardize logical plans requires a way for >> to plug in providers for table metadata operations, so that the new plans >> can create and drop tables. I proposed an A
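For readers who want the shape of the proposal, a much-simplified sketch of such a metadata plug-in; the names and signatures here are simplified for illustration and are not the SPIP's exact API.

```scala
import org.apache.spark.sql.types.StructType

// Sketch: the metadata operations the standardized logical plans need
// (create/drop/load), delegated to a pluggable catalog implementation.
trait Table {
  def name: String
  def schema: StructType
}

trait TableMetadataCatalog {
  def loadTable(db: String, table: String): Table
  def createTable(db: String, table: String, schema: StructType,
                  properties: Map[String, String]): Table
  def dropTable(db: String, table: String): Boolean
}
```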

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Ryan Blue
rce can provide catalog > functionalities. > > Under the hood, I feel this proposal is very similar to my second > proposal, except that a catalog implementation must provide a default data > source/storage, and different rule for looking up tables. > > > On Sun, Jul 29,
