John,

Some MapR folk have collected some UDFs here: https://github.com/mapr-demos/simple-drill-functions
Additions via pull request are welcome, I'm sure.

On Wednesday, May 25, 2016, John Omernik <[email protected]> wrote:

So Charles, here's how I'd set it up. (I am not tied to this, as I would love to have others from the Drill community feel like it's an open community a la Apache; however, I am not sure of the best way to approach it.)

So, up to me, and this is just spitballing:

1. Create a GitHub repo (I'd use my account just because, but if it makes sense under the Apache one I am not tied to it).
2. Create a README that describes what we have.
   - I think UDFs should be grouped into folders under the repo; think of these as "groups of UDFs." This is a human-based grouping that makes it easy to organize by some general types, say string or language processing. Not sure of the best way to approach this, but I want it somewhat grouped, rather than flat, to keep it easy to navigate.
   - Each UDF would have its own folder.
   - We could create basic requirements for UDFs to be accepted: perhaps certain tests, a README.md, a LICENSE (we'd need people to submit under the Apache license), and a package.info (explained below; a hypothetical sketch follows this message). The README would carry certain data about how to use the UDF, etc.
   - package.info would be a file holding a JSON record with name, description, how to use, and tags. It's kind of like the grouping by folders, but it's used from an install perspective and from a package-management perspective (see below).
3. We won't keep jars in the repo, only source. But we will include a Dockerfile that will be as small as possible, and this will be used to build, on demand, any UDF that someone wants to install. Thus we can ensure the UDFs build well on anyone's system AND that people who want to use UDFs don't have to be Java experts.
4. The package manager could have settings, like your Drill install directory, and basically it would build and install any UDF you want. To keep things simple, the package manager can use the tags on the UDFs to determine which UDFs to build and deploy, so you could build and install UDFs by tag (say, "build all with tag X" so you don't have to pick individual ones) or by name.
5. The package manager would have list and search features that use the description, name, and tags to help you search through the packages and provide a list of them. This could be a "pre" step prior to installing, allowing you to search and only install what's needed based on what you want.
6. We can remove packages based on the install dir.
7. How do we handle deployment across nodes? Shared locations are great, or we could create "install packages," i.e., after the build we can bundle all jars into a tgz that can be deployed, etc.

Shrug, perhaps it's a bit verbose, but the idea here is that we want to encourage people to submit here, we want issues to be tracked, and we want to have one place to send folks.

I would still like to use the Drill user list for discussion (at first), but if the UDF discussion grows to be too much noise, we'd need a new list. All UDFs would have to be Apache licensed, and like I said, maybe we prove this out with the idea that we can get it moved to Apache. I am not sure, does Apache do "related projects"? I.e., this on its own may not be an Apache project, but keeping it within the Apache guidelines would be really cool.

So, that's a lot of stuff, but I am trying to toss out ideas more for critique/discussion.

So please, critique/discuss :)

John
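The package.info record John describes in item 2 might look something like the following hypothetical sketch. The UDF name, field names, and tag values here are invented for illustration; nothing in the thread fixes an actual format:

~~~
{
  "name": "validate_date",
  "description": "Returns the input if it parses as yyyy-MM-dd, otherwise NULL.",
  "how_to_use": "SELECT validate_date(columns[0]) FROM `data.csv`;",
  "tags": ["string", "date", "validation"]
}
~~~

A package manager could then list, search, and filter on "name" and "tags", as described in items 4 and 5, without having to inspect the Java source.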
On Wed, May 25, 2016 at 12:11 PM, MattK <[email protected]> wrote:

> UDFs scare me in that the only Java I've conquered is evident from my empty french press...

Same issue here. I have solved this in other platforms by pre-processing the data with a set of regex replacements in Awk:

~~~
# "Repair" invalid dates as stored in MySQL (3 replacements for readability, no slower than one nested)
$0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
$0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
$0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
~~~

But of course this adds another step in the pipeline. Perhaps something similar could be implemented via https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
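For illustration, MattK's three Awk repairs could plausibly be written with the regexp_replace function he links to, along the lines of the sketch below. This is untested, and `data.csv`/`columns[0]` are placeholders echoing the example later in the thread:

~~~
-- Sketch: repair year 0000, month 00, and day 00 before converting to a date
SELECT TO_DATE(
         REGEXP_REPLACE(
           REGEXP_REPLACE(
             REGEXP_REPLACE(columns[0], '^0000-', '0001-'),  -- 0000-xx-xx -> 0001-xx-xx
             '-00-', '-01-'),                                -- xxxx-00-xx -> xxxx-01-xx
           '-00$', '-01'),                                   -- xxxx-xx-00 -> xxxx-xx-01
         'yyyy-MM-dd') AS repaired_date
FROM `data.csv`;
~~~

Note that this only repairs those specific patterns; a value that still fails TO_DATE would still abort the query, which is the gap the try/catch UDF sketch at the end of the thread is aimed at.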
On 25 May 2016, at 12:55, John Omernik wrote:

Cool, I wasn't aware of SIMILAR TO (I learned something). However, that doesn't work here because my data already matches the pattern '____-__-__': 2015-04-02 is fine, but 2015-00-23 matches too, and the 00 doesn't work (bad data).

UDFs scare me in that the only Java I've conquered is evident from my empty french press...

I know I've brought it up in the past, but has anyone seen any community around UDFs start? I'd love to have a community that follows Apache-like rules and allows us to create and track UDFs to share... that would be pretty neat. I guess if we were to do something like that, should one of us (I can volunteer) just start a GitHub project and encourage folks to come to the table, or is there a better way via Apache to do something like that?

On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <[email protected]> wrote:

You could write a UDF. Or you could do something like this:

~~~
cat data.csv
05/25/2016
20160525
May 25th 2016

0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
+-------------+
|   EXPR$0    |
+-------------+
| 2016-05-25  |
| 2016-05-25  |
| null        |
+-------------+
3 rows selected (0.4 seconds)
0: jdbc:drill:>
~~~

-Veera

On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <[email protected]> wrote:

Sounds like a job for a UDF?

You could do the try/catch inside the UDF.

----
Vince Gonzalez
Systems Engineer
212.694.3879

mapr.com

On Wed, May 25, 2016 at 11:05 AM, John Omernik <[email protected]> wrote:

I have some DOBs, and some fields are empty; others apparently were filled by trained monkeys, but while most data is accurate, some data is not. As you saw from my other post, I am trying to get the age for those DOBs that are valid...

My function works until I get to a record that is not valid, and I get something like this:

Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]

Is there a good "Try -> Except" type solution that will grant me the valid data if things worked, and just return 0 or whatever I specify if there is an error?

I could try casting the data, but if it fails, won't it kill my query? Basically I want it to keep going if it fails... not sure if Drill has this ability, but thought I would ask.

--
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
MapR Technologies (www.mapr.com)
(Email) [email protected]
(Mobile) 917 683 8116 - can text
Timezone: ET (UTC -5:00 / -4:00)

--
----
Vince Gonzalez
Systems Engineer
212.694.3879

mapr.com
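To make Vince's "do the try/catch inside the UDF" suggestion concrete, here is a minimal sketch of what such a Drill simple function might look like. It assumes Drill's Java UDF interfaces and the Joda-Time classes Drill already uses internally; the function name safe_age, the 0 fallback, and the yyyy-MM-dd format are illustrative choices, not an existing UDF:

~~~
package com.example.drill.udf;  // hypothetical package name

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.IntHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

@FunctionTemplate(
    name = "safe_age",                                   // hypothetical function name
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class SafeAge implements DrillSimpleFunc {

  @Param  VarCharHolder dob;   // DOB string, e.g. '2015-04-02'
  @Output IntHolder age;       // age in years, or 0 if the DOB won't parse

  public void setup() { }

  public void eval() {
    // Drill inlines eval() during code generation, so classes other than the
    // holders are referenced by their fully qualified names.
    String s = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
        .toStringFromUTF8(dob.start, dob.end, dob.buffer);
    try {
      org.joda.time.LocalDate d = org.joda.time.format.DateTimeFormat
          .forPattern("yyyy-MM-dd").parseLocalDate(s);
      age.value = org.joda.time.Years
          .yearsBetween(d, new org.joda.time.LocalDate()).getYears();
    } catch (Exception e) {
      // Bad data (e.g. month '00') lands here instead of failing the whole query.
      age.value = 0;
    }
  }
}
~~~

Used as `SELECT safe_age(columns[2]) FROM ...`, a row with a DOB like 2015-00-23 would come back as 0 rather than aborting the query with the IllegalFieldValueException above. A real UDF would also need the usual packaging (class and source jars plus a drill-module.conf listing the package) before Drill picks it up.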
