John,

Some MapR folk have collected some UDFs here: https://github.com/mapr-demos/simple-drill-functions
Additions via pull request are welcome, I'm sure.

On Wednesday, May 25, 2016, John Omernik <[email protected]> wrote:

So Charles, here's how I'd set it up. (I am not tied to this, as I would love to have others from the Drill community feel like it's an open community a la Apache; however, I am not sure of the best way to approach it.)

So, up to me, and this is just spitballing:

1. Create a GitHub repo (I'd use my account just because, but if it makes sense under the Apache one I am not tied to it).
2. Create a README that describes what we have.
   - I think UDFs should be grouped into folders under the repo; think of these as "groups of UDFs." This is a human-based grouping that makes it easy to organize by some general types, say string or language processing. Not sure of the best way to approach this, but I want it somewhat grouped, rather than flat, to keep it easy to navigate.
   - Each UDF would have its own folder.
   - We could create basic requirements for UDFs to be accepted: perhaps certain tests, a README.md, a LICENSE (we'd need people to submit under the Apache license), and a package.info (explained below; a hypothetical sketch follows this message). The README would carry certain data about how to use the UDF, etc.
   - package.info would be a file holding a JSON record with name, description, how to use, and tags. It's kind of like the grouping by folders, but it's used from an install perspective and from a package-management perspective (see below).
3. We won't keep jars in the repo, only source. But we will include a Dockerfile that will be as small as possible, and this will be used to build, on demand, any UDF that someone wants to install. Thus we can ensure the UDFs build well on anyone's system AND that people who want to use UDFs don't have to be Java experts.
4. The package manager could have settings, like your Drill install directory, and basically it would build and install any UDF you want. To keep things simple, the package manager can use the tags on the UDFs to determine which UDFs to build and deploy, so you could build and install UDFs by tag (say, "build all with tag X" so you don't have to pick individual ones) or by name.
5. The package manager would have list and search features that use the description, name, and tags to help you search through the packages and provide a list of them. This could be a "pre" step prior to installing, allowing you to search and only install what's needed based on what you want.
6. We can remove packages based on the install dir.
7. How do we handle deployment across nodes? Shared locations are great, or we could create "install packages," i.e., after the build we can bundle all jars into a tgz that can be deployed, etc.

Shrug, perhaps it's a bit verbose, but the idea here is that we want to encourage people to submit here, we want issues to be tracked, and we want to have one place to send folks.

I would still like to use the Drill user list for discussion (at first), but if the UDF discussion grows to be too much noise, we'd need a new list. All UDFs would have to be Apache licensed, and like I said, maybe we prove this out with the idea that we can get it moved to Apache. I am not sure, does Apache do "related projects"? I.e., this on its own may not be an Apache project, but keeping it within the Apache guidelines would be really cool.

So, that's a lot of stuff, but I am trying to toss out ideas more for critique/discussion.

So please, critique/discuss :)

John
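The package.info record John describes in item 2 might look something like the following hypothetical sketch. The UDF name, field names, and tag values here are invented for illustration; nothing in the thread fixes an actual format:

~~~
{
  "name": "validate_date",
  "description": "Returns the input if it parses as yyyy-MM-dd, otherwise NULL.",
  "how_to_use": "SELECT validate_date(columns[0]) FROM `data.csv`;",
  "tags": ["string", "date", "validation"]
}
~~~

A package manager could then list, search, and filter on "name" and "tags", as described in items 4 and 5, without having to inspect the Java source.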
On Wed, May 25, 2016 at 12:11 PM, MattK <[email protected]> wrote:

> UDFs scare me in that the only Java I've conquered is evident from my empty french press...

Same issue here. I have solved this in other platforms by pre-processing the data with a set of regex replacements in Awk:

~~~
# "Repair" invalid dates as stored in MySQL (3 replacements for readability, no slower than one nested)
$0 = gensub(/0000-([0-9]{2}-[0-9]{2})/, "0001-\\1", "g", $0)
$0 = gensub(/([0-9]{4})-00-([0-9]{2})/, "\\1-01-\\2", "g", $0)
$0 = gensub(/([0-9]{4}-[0-9]{2})-00/, "\\1-01", "g", $0)
~~~

But of course this adds another step in the pipeline. Perhaps something similar could be implemented via https://drill.apache.org/docs/string-manipulation/#regexp_replace ?
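For illustration, MattK's three Awk repairs could plausibly be written with the regexp_replace function he links to, along the lines of the sketch below. This is untested, and `data.csv`/`columns[0]` are placeholders echoing the example later in the thread:

~~~
-- Sketch: repair year 0000, month 00, and day 00 before converting to a date
SELECT TO_DATE(
         REGEXP_REPLACE(
           REGEXP_REPLACE(
             REGEXP_REPLACE(columns[0], '^0000-', '0001-'),  -- 0000-xx-xx -> 0001-xx-xx
             '-00-', '-01-'),                                -- xxxx-00-xx -> xxxx-01-xx
           '-00$', '-01'),                                   -- xxxx-xx-00 -> xxxx-xx-01
         'yyyy-MM-dd') AS repaired_date
FROM `data.csv`;
~~~

Note that this only repairs those specific patterns; a value that still fails TO_DATE would still abort the query, which is the gap the try/catch UDF sketch at the end of the thread is aimed at.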
On 25 May 2016, at 12:55, John Omernik wrote:

Cool, I wasn't aware of SIMILAR TO (I learned something). However, that doesn't work here because my data already matches the pattern '____-__-__': 2015-04-02 is fine, but 2015-00-23 matches too, and the 00 doesn't work (bad data).

UDFs scare me in that the only Java I've conquered is evident from my empty french press...

I know I've brought it up in the past, but has anyone seen any community around UDFs start? I'd love to have a community that follows Apache-like rules and allows us to create and track UDFs to share... that would be pretty neat. I guess if we were to do something like that, should one of us (I can volunteer) just start a GitHub project and encourage folks to come to the table, or is there a better way via Apache to do something like that?

On Wed, May 25, 2016 at 10:27 AM, Veera Naranammalpuram <[email protected]> wrote:

You could write a UDF. Or you could do something like this:

~~~
cat data.csv
05/25/2016
20160525
May 25th 2016

0: jdbc:drill:> select case when columns[0] similar to '__/__/____' then
to_date(columns[0],'MM/dd/yyyy') when columns[0] similar to '________' then
to_date(columns[0],'yyyyMMdd') else NULL end from `data.csv`;
+-------------+
|   EXPR$0    |
+-------------+
| 2016-05-25  |
| 2016-05-25  |
| null        |
+-------------+
3 rows selected (0.4 seconds)
0: jdbc:drill:>
~~~

-Veera

On Wed, May 25, 2016 at 11:12 AM, Vince Gonzalez <[email protected]> wrote:

Sounds like a job for a UDF?

You could do the try/catch inside the UDF.

----
Vince Gonzalez
Systems Engineer
212.694.3879

mapr.com

On Wed, May 25, 2016 at 11:05 AM, John Omernik <[email protected]> wrote:

I have some DOBs, and some fields are empty; others apparently were filled by trained monkeys, but while most data is accurate, some data is not. As you saw from my other post, I am trying to get the age for those DOBs that are valid...

My function works until I get to a record that is not valid, and I get something like this:

Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]

Is there a good "Try -> Except" type solution that will grant me the valid data if things worked, and just return 0 or whatever I specify if there is an error?

I could try casting the data, but if it fails, won't it kill my query? Basically I want it to keep going if it fails... not sure if Drill has this ability, but thought I would ask.

--
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
MapR Technologies (www.mapr.com)
(Email) [email protected]
(Mobile) 917 683 8116 - can text
Timezone: ET (UTC -5:00 / -4:00)

--
----
Vince Gonzalez
Systems Engineer
212.694.3879

mapr.com
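To make Vince's "do the try/catch inside the UDF" suggestion concrete, here is a minimal sketch of what such a Drill simple function might look like. It assumes Drill's Java UDF interfaces and the Joda-Time classes Drill already uses internally; the function name safe_age, the 0 fallback, and the yyyy-MM-dd format are illustrative choices, not an existing UDF:

~~~
package com.example.drill.udf;  // hypothetical package name

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.IntHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

@FunctionTemplate(
    name = "safe_age",                                   // hypothetical function name
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class SafeAge implements DrillSimpleFunc {

  @Param  VarCharHolder dob;   // DOB string, e.g. '2015-04-02'
  @Output IntHolder age;       // age in years, or 0 if the DOB won't parse

  public void setup() { }

  public void eval() {
    // Drill inlines eval() during code generation, so classes other than the
    // holders are referenced by their fully qualified names.
    String s = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
        .toStringFromUTF8(dob.start, dob.end, dob.buffer);
    try {
      org.joda.time.LocalDate d = org.joda.time.format.DateTimeFormat
          .forPattern("yyyy-MM-dd").parseLocalDate(s);
      age.value = org.joda.time.Years
          .yearsBetween(d, new org.joda.time.LocalDate()).getYears();
    } catch (Exception e) {
      // Bad data (e.g. month '00') lands here instead of failing the whole query.
      age.value = 0;
    }
  }
}
~~~

Used as `SELECT safe_age(columns[2]) FROM ...`, a row with a DOB like 2015-00-23 would come back as 0 rather than aborting the query with the IllegalFieldValueException above. A real UDF would also need the usual packaging (class and source jars plus a drill-module.conf listing the package) before Drill picks it up.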
