This has been an interesting topic, and I am sorry I could not participate more since my original post due to traveling. Stefán is obviously frustrated, and I can empathize with him. Being in a position of making architectural decisions as well, I know it can be difficult to help define a strategy for your org based on available documentation, be willing to work through problems (these are "new" projects), and still feel like you are yelling in a canyon. That level of frustration is real. I do think, as mentioned, the documentation for Avro should be updated ASAP.
To that end, here is a recommendation: Avro needs to be called out as experimental. On the documentation page, under "Query Data -> Querying a File System", let's add "Querying Avro Files". On this page, I think we should state in the first paragraph that Avro support has been moved to experimental and that, as of now, the Drill project is working through the following problems with Avro files. Basically, let's take Stefán's list and outline the problems, the JIRAs, and the errors that are coming up, as well as what works and how it works. I will be willing to work on this with Stefán. My reasoning is this: Avro support has obviously been implied in the docs thus far, so others who have chosen Avro based on the documentation may be going down the same path as Stefán and may end up in a similarly frustrated state. I want to avoid that. This situation has caused community tension, and it does nothing for the project if we don't look to fix it. Yes, this is a different approach than for other "experimental" type features in Drill, but I feel that, in order to avoid this situation with Avro in particular, it makes sense to call this out.
Now, this does not fix Stefán's current problem. As a user and community member who doesn't code Java, I often struggle to balance asking for help/changes with the fact that I personally can't force that change or write the change myself, and thus I am looking for ways to contribute in other ways. Stefán has been contributing, and I do think we need to acknowledge that. We are all busy and we all have commitments, from the developers, to those with day jobs, and even Stefán in his job. In this situation it's easy to point fingers and send the blame around, and yet I don't think any individual is completely to blame; there is a confluence of situations that has contributed here. Frustrations are high, but we can handle this, and I think we should be able to handle it in a way that ends positively for Drill, for the community, and for Stefán.
To that end, here are my suggestions for discussion:
1. My documentation suggestion above. It puts it clearly out there that Avro is experimental, and lets users know the risks of Avro. As the issues get knocked off the list, we can track them there as well as in the JIRAs. This is "extra" work, and one may ask "why can't we just use JIRA?". I think that, since the documentation has been wrong on this in the past, we should use the documentation in this special case to pull out of the situation. I commit to helping with this by facilitating the Avro page; I just need discussion and approval to go this route, and someone who has access to change the pages to work with me. In addition, it may help pull in others who have Java/Avro knowledge to contribute to some of the fixes.
2. Let's ensure that, going forward, we consider the challenges of new features like this and mark them as experimental for a while. I think for new plugins/readers we could develop a process where we mark them as experimental for a number of releases to help work out test cases from users. The issues that are brought up by users will help identify bugs as well as test cases we can use in the code, not only to ensure solid interfaces but also to help prevent regressions in future releases.
3. I know this one will be asking a lot, but if 1 and 2 seem reasonable, let's roll up our sleeves on the Avro stuff. Identify the "I can't use this" issues and separate them from the "I really want this" issues for prioritization, and work to resolve the issues starting with the blockers. For the Drill project, we say "our bad" on the supported nature of Avro in the docs, and instead of pulling back and forth on resources, commitments, etc. on user lists (which in my opinion really hurts community), we say, "this sucks, it puts everyone in a bad position, let's steer out of this and get on track". Based on some of the responses, I don't think this is unreasonable thus far.
I think Stefán, while I don't speak for him, understands the nature of what the community can provide to "him" and that the community doesn't work for him. At the same time, this is a really good opportunity for us to band together and right the course here. I welcome discussion here. Jacques and Julian, I know that there are some challenges around topics like this, and you've outlined them, and I can't disagree with your points. At the same time, I don't think anyone is saying that the project path, the project itself, or anything Dremio, MapR, or individual committers are doing is at fault, or that they should be responsible for fixing stuff on their own. I think, as I've stated before, we have a confluence of little things that have added up, and in the end looking for a community solution is our best path.
Cheers,
John
On Sat, Apr 2, 2016 at 1:37 AM, Stefán Baxter <[email protected]> wrote:
> Hi Jason,
>
> Thank you for writing this up, it's appreciated.
>
> First things first. We would be more than happy to help on these Avro related issues but the Drill code base is quite complex, with a fairly steep learning curve, and lately a lot of my time has been spent on dealing with the repercussions of having decided to use Avro for fresh/inbound data. (I realize some here might not see this as a contribution but I beg to differ.
> Any project requires regular users to put in the time to adapt new/unhardened projects to their solutions and in the case of using Avro with Drill it's been more like testing and duct-taping than a "simple adaption of free software")
>
> I find these Avro problems interesting for other reasons as well:
>
> - They raise the question of the commitment behind accepting a plugin like this (and not marking it experimental)
>
> - There are design decisions that I think are very wrong
>   - enforcing schema looks to me like a serious violation of the, nowhere to be found, "Drill Manifesto" that I have asked about
>   - see the original entry
>
> - The level of noise required to get feedback on a topic like this
>   - I apologize to everyone but ask them to appreciate that this provocative approach was by no means the first option
>
> As a "user" I'm obviously not a person that can call for or insist on having these things addressed but perhaps that changes with time.
>
> Now on towards fixing the outstanding bugs. If someone can point us in the right direction and discuss the best approach to fixing each bug then we can at least try to help (and we do so gladly).
>
> It's at least clear to me that many users of Drill, those working on streaming data, need the support for a schema capable format to store their inbound/fresh data before it's converted into Parquet.
> Currently there seems to be no real alternative.
>
> So, if we can help then we are willing and I suggest that, if you want, we take this to Jira and try to work it from there.
>
> Regards,
> -Stefán
>
> On Fri, Apr 1, 2016 at 9:56 PM, Jason Altekruse <[email protected]> wrote:
>
> > I take some responsibility for your lack of response on this, because I had said I would try to take a look at the dirN issue that has been outstanding for some time with Avro.
> > This might have prevented others from jumping in to help and I will work on communicating when I don't have time to work on something that I raise my hand for.
> >
> > That being said, there are lots of parts of Drill that still need attention. I do think that you are the only active user of the Avro support that I know of. Even though that is the case, I have been trying to make the feature useable for you and other possible users, like John.
> >
> > One thing that would likely be worth discussing as a follow up to this is our expectations for code quality we accept from contributors. There were several issues with Avro when it was merged, and no one ever really took on the task of fully testing it.
> > I do know there is another issue around a lack of responses to recent requests, but I'm tabling that for a little bit. I would like to see it discussed, but I want to scope this discussion for now.
> >
> > I don't think the plugin is far from fully complete, and I have been working to improve the tests each time I fix an issue with it. I think it would be very useful for us to define a clear set of criteria for a feature like a format plugin to be considered fully tested and ready for inclusion in the core project. I think this would have the benefit of both helping users to avoid issues, as well as give a clearer definition of the task of writing a format plugin. This is a community contribution that should be easier and more strongly encouraged than it is today, and could really help new users adopt Drill if they are using other data formats.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Fri, Apr 1, 2016 at 1:42 PM, Stefán Baxter <[email protected]> wrote:
> >
> > > Yes Parth, you are 100% right and we are willing to help.
> > >
> > > The relationship one builds with a community also depends on the "vibe/feeling" of the community and I know it reflects on me here, as well as the community, that many of my attempts to help and get help have not been fruitful.
> > >
> > > I also acknowledge that this topic gets me frustrated and that my manners could easily improve but it's not as if that is a "first response" but an eventual state caused by indifference on one side and the determination to get some response on the other.
> > >
> > > Marking Avro as experimental is considerate towards new users and something I wish was in place before we decided to depend on it and spend all this time on trying to make it work.
> > >
> > > Ideally, for us, the decision would be to support Avro properly.
> > >
> > > My +1 for improving Avro support so that it can truly be used as an interim file format before data is converted to Parquet. (I see no real alternative here)
> > >
> > > - Stefán
> > >
> > > On Fri, Apr 1, 2016 at 8:25 PM, Parth Chandra <[email protected]> wrote:
> > >
> > > > +1 on marking Avro experimental.
> > > >
> > > > @Stefan, we have been trying to help you as much as our time permits. I know that I held up the 1.6 release while Jason fixed the issues that you brought up. As was said earlier, this is personal time we are spending to help users in the community, so providing an immediate response to everyone is difficult. Ultimately, it boils down to the relationships one builds within the community. Folks with shared goals help each other and everyone benefits.
> > > >
> > > > On Fri, Apr 1, 2016 at 11:10 AM, Jacques Nadeau <[email protected]> wrote:
> > > >
> > > > > Stefan,
> > > > >
> > > > > It makes sense to me to mark the Avro plugin experimental. Clearly, there are bugs.
> > > > > I also want to note your requirements and expectations haven't always been in alignment with what the Avro plugin developers built/envisioned (especially around schemas). As part of trying to address these gaps, I'd like to ask again for you to provide actual data and test cases so we make sure that the Avro plugin includes those as future test cases. (This is absolutely the best way to ensure that the project continues to work for your use case.)
> > > > >
> > > > > The bigger issue I see here is that you expect the community to spend time doing what you want. You have already received a lot of that via free support and numerous bug fixes by myself, Jason and others. You need to remember: this community is run by a bunch of volunteers. Everybody here has a day job. A lot of time I spend in the community is at the cost of my personal life. For others, it is the same.
> > > > >
> > > > > This is a good place to ask for help but you should never demand it. If you want paid support, I know Ted offered this from MapR and I'm sure if you went that route, your issues would get addressed very quickly. If you don't want to go that route, then I suggest that you help by creating more example data and test cases and focusing on what are the most important issues that you need to solve. From there, you can continue to expect that people will help you--as they can. There are no guarantees in open source. Everything comes through the kindness and shared goals of those in the community.
> > > > >
> > > > > thanks,
> > > > > Jacques
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Fri, Apr 1, 2016 at 5:43 AM, Stefán Baxter <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Is it at all possible that we are the only company trying to use Avro with Drill to some serious extent?
> > > > > >
> > > > > > We continue to come across all sorts of embarrassing shortcomings like the one we are dealing with now where a schema change exception is thrown even when working with a single Avro file (that has the same schema).
> > > > > >
> > > > > > Can a non project member call for a discussion on this topic and the level of support that is offered for Avro in Drill?
> > > > > >
> > > > > > My discussion topics would be:
> > > > > >
> > > > > > - Strange schema validation that ... :
> > > > > >   ... currently fails on single file
> > > > > >   ... prevents dirX variables from working
> > > > > >   ... would require Drill to scan all Avro files to establish schema (even when pruning would be used)
> > > > > >   ... would ALWAYS fail for old queries if an old Avro file, containing the original fields, was removed and could not be scanned
> > > > > >   ... does not rhyme with the "eliminate ETL" and "Evolving Schema" goals of Drill
> > > > > >
> > > > > > - Simple union types do not work to declare nullable fields
> > > > > >
> > > > > > - Drill can not read Parquet that is created by parquet-mr-avro
> > > > > >
> > > > > > - What is the intention for Avro in Drill
> > > > > >   - Should we select to use some other format to buffer/batch data before creating a Parquet file for it?
> > > > > >
> > > > > > - The culture here regarding talking about boring/hard topics like this
> > > > > >   - Where serious complaints/issues are met with silence
> > > > > >   - I know full well that my frustration shines through here and that it is not helping but this Drill+Avro mess is really getting too much for us to handle
> > > > > >
> > > > > > Look forward to discussing this here or during the next hangout.
> > > > > >
> > > > > > Regards,
> > > > > > -Stefán (or ... mr. old & frustrated)
