Hello John, you seem to be quite impressed with Apache Drill, nice. I am new to the unstructured-data world and started with Apache Drill just one week back, on a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns in advance.
I heard that Apache Drill is a schema-free application and, with its JSON support, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me. Here I hit a couple of issues:

1) My columns were like 3100.2.1.2 and values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name cannot start with a numeric value, so I had to change the key to "t3100.2.1.2". After that, things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console does not seem an attractive way to present things. I tried Drill Explorer too, but did not find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make it look better, I started with Qlik Sense, but was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded a Tableau trial version, but I am unable to get at the Hadoop data there either. Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show the results in visual form, using open-source applications only. Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

> AWESOME!
>
> I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:
>
> To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday.
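One open-source way to get query results out of Drill for a presentation, without wrestling with ODBC drivers for Qlik or Tableau: Drill's web server exposes a REST endpoint (by default on port 8047) that returns query results as JSON, which any Python plotting tool can then consume. A minimal sketch, assuming a local Drillbit; the table path and column names below are placeholders, not from this thread:

```python
import json
import urllib.request

# Default Drill web UI / REST port; adjust the host for your setup.
DRILL_URL = "http://localhost:8047/query.json"

def drill_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql):
    """POST a query to the Drill REST endpoint and return the result rows."""
    req = urllib.request.Request(
        DRILL_URL,
        data=json.dumps(drill_payload(sql)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["rows"]

# Hypothetical usage -- the path and field name are illustrative only:
# rows = run_query(
#     "SELECT `t3100.2.1.2` AS val FROM dfs.`/path_to/dump.json` LIMIT 100")
```

From there the rows can be fed to matplotlib or exported to CSV for any other open-source charting tool, which may be a quicker path to a Monday demo than getting a BI tool connected to a standalone Hadoop install.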
> If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.
>
> Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront with my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.
>
> John
>
> *User Story*
>
> As I have been working with Drill for data exploration, I came across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and ugly at scale is even worse!
>
> For this story, I am working with a JSON dump from MongoDB, and you would think it would be well structured; for the most part it is. There are some application-level mistakes that were made (I will go into that in a moment), but in general Drill handles this well. So with this data set, there are a few main challenges I am seeing:
>
> 1. When there is a field that has a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.
>
> 2. When a field is of one type (a map) and then a later record has a string in it. No easy solution here.
>
> 3. Select * where there is a JSON field with a . in the name. I won't go into details here, but I feel this factors into data exploration, because it changes the ability to "stay in Drill" to explore the data (https://issues.apache.org/jira/browse/DRILL-3922).
>
> 4.
> Error reporting challenges
>
> With the problem summary laid out, I wanted to walk through my process in working with this data, and where, if I were a user, Drill could have been much more helpful.
>
> Here is a description of the process I went through:
>
> 1. Copy the data into the filesystem.
> 2. Use Drill to "Select * from `path_to/dump.json` limit 1".
> 3. (I just want to see what it looks like!)
>
> Here I get this error:
>
> > select * from `path_to/dump.json` limit 1;
> Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
>
> File /data/dev/path_to/dump.json
> Record 1
> Line 1
> Column 9054
> Field entropy
> Fragment 0:0
>
> This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "schema changes" and that one possible fix is the setting below. However, examples of the data that was trying to be displayed (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!
>
> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>
> This is a great example of a case where the error message could be more helpful: we have a known use case (numbers are doubles, but someone stored 0, an INT) that fails, and the devs have added a setting to let a user get through it. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion about using the setting to address the problem could make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.
>
> Also, as a corollary here, the user documentation does not show this error alongside the schema change problem.
> The docs would be a great place to state, "if you see an error that looks like X, this is what is happening and what you can do about it."
>
> *Side note on documentation*
>
> We should look to make the documentation role based. In this case, the documentation says to use "ALTER SYSTEM". I would argue, and I am guessing others would concur, that for this use case "ALTER SESSION" may be the better suggestion, as this is a specific alteration to address loading/querying a specific data set, and is likely done by a user of the system.
>
> If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM, will get an error, and thus may be confused about how to proceed. In addition, an ALTER SYSTEM by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about the ramifications of a system-wide change.
>
> *Back to the story*
>
> OK, so now I do ALTER SESSION SET on the read_numbers_as_double setting.
>
> I run the query again:
>
> > select * from `path_to/dump.json` limit 1;
> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
>
> File /data/dev/path_to/dump.json
> Record 4009
> Fragment 0:0
>
> Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that is what we are looking at here. Instead of double to int, we have one field that is a map most of the time and in some cases is a string.
>
> But this doesn't really help me as a user. To troubleshoot this, Drill doesn't offer any options.
> This file is 500 MB of dense and nested JSON data with 51k records. My solution? I took the record number, then went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this without POSIX tools).
>
> My command: $ head -4009 dump.json | tail -1
>
> That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to work that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.
>
> Could there have been a way to identify that within Drill? My solution was a Python script that read through the data and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation?
>
> For example, the first thing I asked was: what field is at issue? I had no way to see what was going on there. I had to use other tools to see the data so I could understand the problem. Then, once I understood the problem, I had to use Python to produce data that was queryable.
>
> Given the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is just a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, not prescribing "how or why"):
>
> *1. Error messages that are more verbose in explaining the problem*
>
> a. Filename, row number, column number or name.
> b. Option to output the "offending row".
> c. Showing the data that is causing the error WITH the type Drill inferred.
> d.
> If there are options to help work through dirty data, perhaps the error message could include them: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."
>
> *2. A way to determine how common this exception is*
>
> a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that information would:
>
> i. Help users understand how Drill is seeing that particular column.
> ii. Support decisions on excluding data rather than just removing it. What if the first 10 records were errors, and then you excluded the remaining 10 million because they were correct yet different from the first 10?
>
> b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions:
>
> i. Select type_stats(fieldname) from data
> ii. (that wouldn't error on different types)
>
> *3. An ability to say "return null on this field if there is an error, or if it is non-castable to type X", especially in a view, perhaps via a function*
>
> a. Allow users to avoid reparsing data outside Drill.
> b. Load it into a sane format (one-time loads/ETL to clean data).
> c. Not be a system- or session-wide exception.
>
> i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM or SESSION level variable takes away that granularity.
>
> d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
> e. That's an example when it's in the select, but I could see a where clause too:
> f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)
>
> 4.
> *Updating of the documentation related to ALTER SESSION vs ALTER SYSTEM, with an eye to the majority use case of the documented feature*
>
> a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
>
> i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
> ii. ALTER SYSTEM on data-loading items, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin who is a good systems admin, helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
>
> b. I found a few such cases, but I would suggest a thorough review of the use cases throughout the documentation; in areas where it really could be either command, add a small paragraph indicating the ramifications of each.
>
> *5. A philosophy within the Drill community to "stay in Drill" for data exploration*
>
> a. This is obviously not so much a development thing as a mindset. If someone says "I tried to do X and got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
>
> b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level, and about reading some empty arrays. In those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 100 MB JSON files and we want to read them?
> To fix it, we have to address those files. Lots of work there, either manual or automated.
>
> c. Once again, I know this isn't easy, but wherever possible we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data".
>
> I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.
>
> John Omernik

--
*Name: Ganesh Semalty*
*Location: Gurgaon, Haryana (India)*
*Email Id: [email protected]*
*Please consider the environment before printing this e-mail - SAVE TREE.*
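The type_stats and ISTYPE ideas from the wishlist above can be approximated outside Drill today for newline-delimited JSON. A minimal sketch of that kind of per-field type census and map-only filter; the function names and sample fields are illustrative, not an existing Drill or library API:

```python
import json
from collections import Counter, defaultdict

def type_stats(lines):
    """Count, per top-level field, how many records hold each JSON type.

    Accepts any iterable of JSON lines (e.g. an open file handle), so it
    streams and works on files far larger than memory.
    """
    stats = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field, value in record.items():
            stats[field][type(value).__name__] += 1
    return stats

def keep_if_map(records, field):
    """Rough ISTYPE/CASTORNULL stand-in: keep records where field is a map."""
    return [r for r in records if isinstance(r.get(field), dict)]

# Usage sketch: type_stats(open("dump.json")) on the story's MongoDB dump
# would show something like {'dict': 51000, 'str': 3} for the bad field,
# answering "does it happen on 1 record or 5000?" before any cleanup.
```

This is the same record-discarding approach the story describes doing by hand, just packaged so the type census comes first and the filter is explicit.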
