Sorry guys, I accidentally answered the question here; I did not realize this was another topic.
On 22 October 2015 at 13:29, Mustafa Engin Sözer <[email protected]> wrote:

> Hi,
>
> If you only need to show the data, and not some sort of report (e.g. as in MicroStrategy or Tableau), then you can basically use most SQL clients, for instance SQuirreL, HeidiSQL, etc., via a JDBC or ODBC connection. The following links should help you manage that:
>
> https://drill.apache.org/docs/odbc-jdbc-interfaces/
> https://drill.apache.org/docs/using-the-jdbc-driver/
>
> For the Hadoop data problem, can you share your storage plugin settings? Normally you shouldn't have any problems seeing or querying HDFS data.
>
> On 22 October 2015 at 13:06, ganesh <[email protected]> wrote:
>
>> Hello,
>>
>> John, you seem to be quite impressed with Apache Drill .. nice.
>>
>> I am new to the unstructured world and started on Apache Drill just one week ago, after a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns.
>>
>> I heard that Apache Drill is a schema-free application and that, with its support for the JSON format, it allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.
>>
>> Here I hit a couple of issues:
>>
>> 1) My columns were like "3100.2.1.2" and values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name cannot start with a numeric value, so I had to change the key to "t3100.2.1.2". After that things were quite OK.
>>
>> Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console doesn't seem an attractive way to present things.
>>
>> I tried Drill Explorer too, but didn't find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.
>>
>> To make it look better, I started with Qlik Sense .. but was unable to connect it to the Hadoop file system.
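(Editor's aside on the key rewrite Ganesh describes above: prefixing numeric-leading keys while converting CSV-like data to JSON is easy to automate. This is a hedged Python sketch, not anything Drill ships; the "t" prefix rule and the sample key come from the message above, everything else is illustrative.)

```python
def fix_keys(record, prefix="t"):
    """Prefix any key that starts with a digit, since a Drill column
    name cannot start with a numeric value (per the message above)."""
    return {
        (prefix + key) if key[:1].isdigit() else key: value
        for key, value in record.items()
    }

# Sample record mirroring the keys/values described in the email.
print(fix_keys({"3100.2.1.2": "-2303", "ts": "01/01/2015 02:02:00"}))
# {'t3100.2.1.2': '-2303', 'ts': '01/01/2015 02:02:00'}
```

Applied per line while streaming the converted JSON, this avoids having to hand-edit keys after the fact.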
>> It only showed me Hive files. Then I downloaded a Tableau trial version ... but I am unable to get at the Hadoop data there too...
>>
>> Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready .. I just need to show the results in visualization form ........ using OPEN SOURCE applications only.
>>
>> Guys, please help me.
>>
>> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:
>>
>>> AWESOME!
>>>
>>> I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:
>>>
>>> To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.
>>>
>>> Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront with my experience so we can look at what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.
>>>
>>> John
>>>
>>> *User Story*
>>>
>>> As I have been working with Drill for data exploration, I came across multiple "things" that were just hard.
>>> In dealing with some data, especially JSON data, it can be ugly, and ugly at scale is even worse!
>>>
>>> For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes that were made (I will go into that in a moment), but in general Drill handles this well. With this data set, there are a few main challenges I am seeing:
>>>
>>> 1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.
>>>
>>> 2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.
>>>
>>> 3. Select * where there is a JSON field with a . in the name. I won't go into details here, but I feel this factors into data exploration, because it affects the ability to "stay in Drill" to explore the data (https://issues.apache.org/jira/browse/DRILL-3922).
>>>
>>> 4. Error-reporting challenges.
>>>
>>> With the problem summary laid out, I want to walk through my process in working with this data and point out where, for a user, Drill could have been much more helpful.
>>>
>>> Here is a description of the process I went through:
>>>
>>> 1. Copy the data into the filesystem.
>>>
>>> 2. Use Drill to run "select * from `path_to/dump.json` limit 1" (I just want to see what it looks like!).
>>>
>>> Here I get this error:
>>>
>>> > select * from `path_to/dump.json` limit 1;
>>> Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
>>> File /data/dev/path_to/dump.json
>>> Record 1
>>> Line 1
>>> Column 9054
>>> Field entropy
>>> Fragment 0:0
>>>
>>> This isn't incredibly helpful from a user perspective. When I Google around, I realize that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing examples of the data Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!
>>>
>>> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>>>
>>> This is a great example of a case where, since we have a known failure mode (numbers are doubles, but someone stores a 0, which reads as an INT) and the developers have added a setting to let a user get past it, the error message could be more helpful. Showing two record (line) numbers with different types, the field values with their implied types, and perhaps a suggestion to use the setting to address the problem would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.
>>>
>>> As a corollary, the user documentation does not show this error in connection with the schema-change problem. That would be a great place to state: "if you see an error that looks like X, this is what is happening and what you can do about it."
>>>
>>> *Side note on documentation*
>>>
>>> We should look to make the documentation role-based.
>>> In this case the documentation says to use ALTER SYSTEM. I would argue, and I am guessing others would concur, that for this use case ALTER SESSION may be a better suggestion, as this is a specific alteration to address loading or querying a particular data set and is likely done by an ordinary user of the system.
>>>
>>> If a user is doing self-serve data work in an enterprise environment, they may not have the ability to use ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about something like a system-wide change.
>>>
>>> *Back to the story*
>>>
>>> OK, so now I do an ALTER SESSION SET on the read_numbers_as_double setting and run the query again.
>>>
>>> > select * from `path_to/dump.json` limit 1;
>>> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
>>>
>>> File /data/dev/path_to/dump.json
>>> Record 4009
>>> Fragment 0:0
>>>
>>> Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it is similar to the schema-change issue, I can tell that that is what we are looking at here: instead of double-to-int, we have a field that is a map most of the time but a string in some records.
>>>
>>> But this doesn't really help me as a user. To troubleshoot this, Drill doesn't offer any options.
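(Editor's aside: the map-vs-string collision John hits here is the kind of thing he later cleaned up with a Python script. A hedged sketch of that approach, not a Drill feature: it assumes newline-delimited JSON, one record per line, and reuses the field name "entropy" from the earlier error purely as a stand-in.)

```python
import json

def find_non_map_records(lines, field):
    """Scan newline-delimited JSON and report (line_number, python_type)
    for every record whose `field` is present but is not a map/object."""
    offenders = []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        value = record.get(field)
        if value is not None and not isinstance(value, dict):
            offenders.append((lineno, type(value).__name__))
    return offenders

# Tiny example: record 2 holds a string where a map is expected.
sample = [
    '{"entropy": {"score": 7.2}}',
    '{"entropy": "n/a"}',
    '{"entropy": {"score": 1.5}}',
]
print(find_non_map_records(sample, "entropy"))  # [(2, 'str')]
```

This works on a 500 MB dump but, as the story notes below, it is exactly the "leave Drill to fix the data" pattern that does not scale to 500 GB.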
>>> This file is 500 MB of dense and nested JSON data with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools alone).
>>>
>>> My command: $ head -4009 dump.json | tail -1
>>>
>>> That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to work that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.
>>>
>>> Could there have been a way to identify that within Drill? My solution was a Python script that read through the file and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data…. But could Drill be given some tools to help with this situation?
>>>
>>> For example, the first thing I asked was: which field is at issue? I had no way to see what was going on in there. I had to use other tools to see the data so I could understand the problem, and then I had to use Python to produce data that was queryable.
>>>
>>> Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, without claiming to understand the "how or why"):
>>>
>>> *1. Error messages that are more verbose in explaining the problem*
>>>
>>> a. Filename, row number, column number or name.
>>> b. An option to output the "offending row".
>>> c. Showing the data that is causing the error WITH the type Drill inferred.
>>> d. If there are options to help work through dirty data, the error message could suggest them: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column "myfloatingdata"; consider using store.json.read_numbers_as_double to address the issue."
>>>
>>> *2. A way to determine how common the exception is*
>>>
>>> a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
>>>    i. Help users understand how Drill is seeing that particular column.
>>>    ii. Help users make decisions about excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
>>> b. Perhaps there could be a "stats" function that only works if it is the only selected item (or if everything selected is a stats function):
>>>    i. Select type_stats(fieldname) from data
>>>    ii. (which wouldn't error on different types)
>>>
>>> *3. An ability to return null on a field if it errors or is not castable to type X, especially in a view, perhaps via a function*
>>>
>>> a. Lets users avoid reparsing data outside Drill.
>>> b. Lets them load it into a sane format (one-time loads/ETL to clean data).
>>> c. Is not a system- or session-wide exception.
>>>    i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM- or SESSION-level variable takes away that granularity.
>>> d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
>>> e. That's an example in the select, but I could see a where clause too:
>>> f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)
>>>
>>> *4. Updating the documentation on ALTER SESSION vs. ALTER SYSTEM with an eye to the majority use case of the documented feature*
>>>
>>> a. For data loads, the documentation uses ALTER SYSTEM, and that is problematic because:
>>>    i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
>>>    ii. ALTER SYSTEM on data-loading options, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin who is a good systems admin, helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
>>> b. I found a few cases, but I would suggest a thorough review of the use cases throughout the documentation; where it really could be either command, add a small paragraph on the ramifications of each.
>>>
>>> *5. A philosophy within the Drill community to "stay in Drill" for data exploration*
>>>
>>> a. This is obviously a mindset more than a development item. If someone says "I tried to do X and got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
>>> b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ on arrays at the root level, or on reading some empty arrays. In those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the root-level-array limitation: what if some process outside the user's control produces 1000 100 MB JSON files and we want to read them? To fix it, we have to rewrite those files, which is a lot of work, manual or automated.
>>> c. Once again, I know this isn't easy, but if at all possible we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data".
>>>
>>> I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.
>>>
>>> John Omernik
>>
>> --
>> *Name: Ganesh Semalty*
>> *Location: Gurgaon, Haryana (India)*
>> *Email Id: [email protected] <[email protected]>*
>>
>> *Please consider the environment before printing this e-mail - SAVE TREE.*
>
> --
> *M. Engin Sözer*
> Junior Datawarehouse Manager
> [email protected]
>
> Goodgame Studios
> Theodorstr. 42-90, House 9
> 22761 Hamburg, Germany
> Phone: +49 (0)40 219 880-0
> *www.goodgamestudios.com <http://www.goodgamestudios.com>*
>
> Goodgame Studios is a brand of Altigi GmbH
> Altigi GmbH, District court Hamburg, HRB 99869
> Board of directors: Dr. Kai Wawrzinek, Dr. Christian Wawrzinek, Fabian Ritter
