This has been an interesting topic, and I am sorry I could not participate
more since my original post due to traveling.  Stefán is obviously
frustrated, and I can empathize with him. Being in a position of making
architectural decisions as well, it can be difficult to help define a
strategy for your org based on available documentation, be willing to
work through problems (these are "new" projects), and feel like you are
yelling in a canyon.  The level of frustration there is real.  I do think,
as mentioned, the documentation for Avro should be updated ASAP.

To that end, here is a recommendation:  Avro needs to be called out as
experimental.  On the documentation page, under "Query Data -> Querying a
File System", let's add "Querying Avro Files".  On this page, I think we
should, in the first paragraph, state that Avro support has been moved to
experimental, and as of now the Drill project is working through the
following problems with Avro files. Basically, let's take Stefán's list,
and outline the problems, the JIRAs, and the errors that are coming up, as well
as outline what works and how it works. I will be willing to work on this
with Stefán. My reasoning is this:  obviously Avro support has been implied
in the docs thus far, others who may have chosen Avro may be going down a
path like Stefán based on the documentation, and may end up in a similar
frustrated state. I want to avoid that. This situation has caused community
tension, and does nothing for the project if we don't look to fix it. Yes,
this is a different approach than other "experimental" type features in
Drill, but I feel in order to avoid this situation particularly on Avro, it
makes sense to call this out.

Now, this does not fix Stefán's current problem.  As a user and community
member who doesn't code Java, I often struggle to balance asking for
help/changes with the fact that I personally can't force that change or
write the change myself, and thus am looking for ways to contribute in other
ways.  Stefán has been contributing, and I do think we need to acknowledge
that. We are all busy, we all have commitments, from the developer side, to
those with day jobs, and even Stefán in his job.  We all do; in this
situation it's easy to point fingers and send the blame around, and yet I
don't think any individual is completely to blame; there is a confluence of
situations that has contributed here.  Frustrations are high, but we can
handle this, and I think we should be able to handle it in a way that ends
positively for Drill, for the community, and for Stefán.  To that end, here
are my suggestions for discussion:


   1. My documentation suggestion above. It puts it clearly out there that
   Avro is experimental, and lets users know the risks of Avro. As the issues
   get knocked off the list, we can track them there as well as in the JIRAs. While this
   is "extra" work, and one may ask "why can't we just use JIRA?", I think
   since the documentation in the past has been wrong on this, in response we
   should use the documentation in this special case to pull out of the
   situation.  I commit to helping with this by facilitating the Avro page; I just
   need discussion and approval to go this route, and someone who has access
   to change the pages to work with me. In addition, it may help pull in others
   who have Java/Avro knowledge to contribute to some of the fixes.
   2. Let's ensure going forward we consider the challenges of new features
   like this and marking them as experimental for a while.  I think for new
   plugins/readers we could develop a process where we mark as experimental
   for a number of releases to help work out test cases from users.   The
   issues that are brought up by users will help identify bugs as well as test
   cases we can use in the code to not only ensure solid interfaces, but help
   prevent regressions in future releases.
   3. I know this one will be asking a lot, but if 1 and 2 seem reasonable,
   let's roll up our sleeves on the Avro stuff.  Identify those "I can't use
   this" issues and separate them from "I really want this" issues for
   prioritization, and work to resolve the issues starting with the blockers.
   For the Drill project, "our bad" on the supported nature of Avro in the
   docs, and instead of pulling back and forth on resources, commitments, etc,
   on user lists (which in my opinion really hurts community), we say, "this
   sucks, it puts everyone in a bad position, let's steer out of this and get
   on track".  Based on some of the responses, I don't think this is
   unreasonable thus far. I think Stefán, while I don't speak for him,
   understands the nature of what the community can provide to "him" and that
   the community doesn't work for him. At the same time, this is a really good
   opportunity for us to band together and right the course here.

I welcome discussion here. Jacques and Julian, I know that there are some
challenges around topics like this, and you've outlined them, and I can't
disagree with your points. At the same time, I don't think anyone is saying
the project path, the project itself, or anything Dremio, MapR, or
individual committers are doing is at fault or should be responsible for
fixing stuff on their own.  I think, as I've stated before, we have a
confluence of little things that have added up, and in the end looking for
a community solution is our best path.

Cheers,

John




On Sat, Apr 2, 2016 at 1:37 AM, Stefán Baxter <[email protected]>
wrote:

> Hi Jason,
>
> Thank you for writing this up, it's appreciated.
>
> First things first. We would be more than happy to help on these Avro
> related issues but the Drill code base is quite complex, with a fairly
> steep learning curve, and lately a lot of my time has been spent on dealing
> with the repercussions of having decided to use Avro for fresh/inbound
> data.  (I realize some here might not see this a contribution but I beg to
> differ. Any project requires regular users to put in the time to adapt
> new/unhardened projects to their solutions and in the case of using Avro
> with Drill it's been more like testing and duck-taping than a "simple
> adaption of free software")
>
> I find these Avro problems interesting for other reasons as well:
>
>    - They raise the question of the commitment behind accepting a plugin
>    like this (and not marking it experimental)
>
>    - There are design decisions that I think are very wrong
>    - enforcing schema looks to me like a serious violation of the, no where
>    to be found, "Drill Manifesto" that I have asked about
>    - see the original entry
>
>    - The level of noise required to get feedback on a topic like this
>    - I apologize to everyone but ask them to appreciate that this
>    provocative approach was by no means the first option
>
> As a "user" I'm obviously not a person that can call for or insist on
> having these things addressed but perhaps that changes with time.
>
> Now on towards fixing the outstanding bugs.  If someone can point us in the
> right direction and discuss the best approach to fixing each bug then we
> can at least try to help (and we do so gladly).
>
> It's at least clear to me that many users of Drill, those working on
> streaming data, need the support for a schema capable format to store their
> inbound/fresh data before it's converted into Parquet.
> Currently there seems to be no real alternative.
>
> So, If we can help then we are willing and I suggest that, if you want, we
> take this to Jira and try to work it from there.
>
> Regards,
>  -Stefán
>
>
>
>
>
> On Fri, Apr 1, 2016 at 9:56 PM, Jason Altekruse <[email protected]> wrote:
>
> > I take some responsibility for your lack of response on this, because I
> had
> > said I would try to take a look at the dirN issue that has been
> outstanding
> > for some time with Avro. This might have prevented others from jumping in
> > to help and I will work on communicating when I don't have time to work
> on
> > something that I raise my hand for.
> >
> > That being said, there are lots of parts of Drill that still need
> > attention. I do think that you are the only active user of the Avro
> support
> > that I know of. Even though that is the case, I have been trying to make
> > the feature useable for you and other possible users, like John.
> >
> > One thing that would likely be worth discussing as a follow up to this is
> > our expectations for code quality we accept from contributors. There were
> > several issues with Avro when it was merged, and no one ever really took
> on
> > the task of fully testing it.
> > I do know there is another issue around a lack of responses of recent
> > requests, but I'm tabling that for a little bit. I would like to see it
> > discussed, but I want to scope this discussion for now.
> >
> > I don't think the plugin is far from fully complete, and I have been
> > working to improve the tests each time I fix an issue with it. I think it
> > would be very useful for us to define a clear set of criteria for a
> feature
> > like a format plugin to be considered fully tested and ready for
> inclusion
> > in the core project. I think this would have the benefit of both helping
> > users to avoid issues, as well as give a clearer definition of the task
> of
> > writing a format plugin. This is a community contribution that should be
> > easier and more strongly encouraged than it is today, and could really
> help
> > new users adopt Drill if they are using other data formats.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Fri, Apr 1, 2016 at 1:42 PM, Stefán Baxter <[email protected]
> >
> > wrote:
> >
> > > Yes Parth, you are 100% right and we are willing to help.
> > >
> > > The relationship one builds with a community also depends on the
> > > "wipe/feeling" of the community and I know it reflects on me here, as
> > well
> > > as the community, that many of my attempts to help and get help have
> not
> > > been fruitful.
> > >
> > > I also acknowledge that this topic gets me frustrated and that my
> > > manners could easily improve but it's not as if that is a "first
> > response"
> > > but an eventual state caused by indifference on one side and the
> > > determination to get some response on the other.
> > >
> > > Marking Avro as experimental is a considered towards new users and
> > > something I wish was in place before we decided to depend on it and
> spend
> > > all this time on trying to make it work.
> > >
> > > Ideally, for us, the decision would be to support Avro properly.
> > >
> > > My +1 for improving Avro support so that it can truly be used as an
> > interim
> > > file format before data is converted to Parquet. (I see no real
> > alternative
> > > here)
> > >
> > > - Stefán
> > >
> > >
> > > On Fri, Apr 1, 2016 at 8:25 PM, Parth Chandra <[email protected]>
> wrote:
> > >
> > > > +1 on marking Avro experimental.
> > > >
> > > > @Stefan, we have been trying to help you as much as our time
> permits. I
> > > > know that I held up the 1.6 release while Jason fixed the issues that
> > you
> > > > brought up. As was said earlier, this is personal time we are
> spending
> > to
> > > > help users in the community, so providing an immediate response to
> > > everyone
> > > > is difficult. Ultimately, it boils down to the relationships one
> builds
> > > > within the community. Folks with shared goals help each other and
> > > everyone
> > > > benefits.
> > > >
> > > >
> > > >
> > > > On Fri, Apr 1, 2016 at 11:10 AM, Jacques Nadeau <[email protected]>
> > > > wrote:
> > > >
> > > > > Stefan,
> > > > >
> > > > > It makes sense to me to mark the Avro plugin experimental. Clearly,
> > > there
> > > > > are bugs. I also want to note your requirements and expectations
> > > haven't
> > > > > always been in alignment with what the Avro plugin developers
> > > > > built/envisioned (especially around schemas). As part of trying to
> > > > address
> > > > > these gaps, I'd like to ask again for you to provide actual data
> and
> > > > tests
> > > > > cases so we make sure that the Avro plugin includes those as future
> > > test
> > > > > cases. (This is absolutely the best way to ensure that the project
> > > > > continues to work for your use case.)
> > > > >
> > > > > The bigger issue I see here is that you expect the community to
> spend
> > > > time
> > > > > doing what you want. You have already received a lot of that via
> free
> > > > > support and numerous bug fixes by myself, Jason and others. You
> need
> > to
> > > > > remember: this community is run by a bunch of volunteers. Everybody
> > > here
> > > > > has a day job. A lot of time I spend in the community is at the
> cost
> > of
> > > > my
> > > > > personal life. For others, it is the same.
> > > > >
> > > > > This is a good place to ask for help but you should never demand
> it.
> > If
> > > > you
> > > > > want paid support, I know Ted offered this from MapR and I'm sure
> if
> > > you
> > > > > went that route, your issues would get addressed very quickly. If
> you
> > > > don't
> > > > > want to go that route, then I suggest that you help by creating
> more
> > > > > example data and test cases and focusing on what are the most
> > important
> > > > > issues that you need to solve. From there, you can continue to
> expect
> > > > that
> > > > > people will help you--as they can. There are no guarantees in open
> > > > source.
> > > > > Everything comes through the kindness and shared goals of those in
> > the
> > > > > community.
> > > > >
> > > > > thanks,
> > > > > Jacques
> > > > >
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Fri, Apr 1, 2016 at 5:43 AM, Stefán Baxter <
> > > [email protected]
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Is it at all possible that we are the only company trying to use
> > Avro
> > > > > with
> > > > > > Drill to some serious extent?
> > > > > >
> > > > > > > We continue to come across all sorts of embarrassing shortcomings
> > > like
> > > > > the
> > > > > > one we are dealing with now where a schema change exception is
> > thrown
> > > > > even
> > > > > > when working with a single Avro file (that has the same schema).
> > > > > >
> > > > > > Can a non project member call for a discussion on this topic and
> > the
> > > > > level
> > > > > > of support that is offered for Avro in Drill?
> > > > > >
> > > > > > My discussion topics would be:
> > > > > >
> > > > > >    - Strange schema validation that ... :
> > > > > >    ... currently fails on single file
> > > > > >    ... prevents dirX variables to work
> > > > > >    ... would require Drill to scan all Avro files to establish
> > schema
> > > > > (even
> > > > > >    when pruning would be used)
> > > > > > >    ... would ALWAYS fail for old queries if an old Avro file,
> > > > > containing
> > > > > >    the original fields, was removed and could not be scanned
> > > > > >    ... does not rhyme with the "eliminate ETL" and "Evolving
> > Schema"
> > > > > goals
> > > > > >    of Drill
> > > > > >
> > > > > >    - Simple union types do not work to declare nullable fields
> > > > > >
> > > > > >    - Drill can not read Parquet that is created by
> parquet-mr-avro
> > > > > >
> > > > > >    - What is the intention for Avro in Drill
> > > > > >    - Should we select to use some other format to buffer/badge
> data
> > > > > before
> > > > > >    creating a Parquet file for it?
> > > > > >
> > > > > >    - The culture here regarding talking about boring/hard topics
> > like
> > > > > this
> > > > > >    - Where serious complaints/issues are met with silence
> > > > > >    - I know full well that my frustration shines through here and
> > > that
> > > > it
> > > > > >    not helping but this Drill+Avro mess is really getting too
> much
> > > for
> > > > us
> > > > > > to
> > > > > >    handle
> > > > > >
> > > > > > > Look forward to discussing this here or during the next hangout.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán (or ... mr. old & frustrated)
> > > > > >
> > > > >
> > > >
> > >
> >
>
