See if the videos on this page help you: https://www.mapr.com/products/apache-drill
—Andries

On Oct 22, 2015, at 7:38 AM, ganesh <[email protected]> wrote:

Hello,

Are there any other tutorial links for Tableau vs. Hive? I have already gone through the one on the Apache Drill site and am not able to proceed with it.

On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <[email protected]> wrote:

Hive should be visible and usable in Tableau. You can use Drill views for dfs data, or you can use Tableau Custom SQL to access the data.

Make sure to install the Tableau TDC file that comes with the ODBC driver:
https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/
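For example, a view over a raw file in dfs might look like the sketch below. The workspace, path, and column names are placeholders, not from your environment; casting each column gives the ODBC driver concrete types to hand to Tableau instead of everything coming through as ANY:

CREATE VIEW dfs.tmp.`sales_vw` AS
SELECT CAST(t.`region` AS VARCHAR(50)) AS region,
       CAST(t.`amount` AS DOUBLE)      AS amount
FROM dfs.`/path/to/sales.json` t;

The view should then show up as a table under the dfs.tmp schema in Tableau.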
—Andries

On Oct 22, 2015, at 6:40 AM, ganesh <[email protected]> wrote:

Hi John,

Thanks for suggesting a new name: Apache Zeppelin.

I am currently trying the 14-day trial version of Tableau, without much success. Only today did I learn that for files in Hadoop or the local file system I would need to create a view.

Still, though I can see my Hive tables in Tableau, I cannot see any data. Nor am I able to use Tableau with the help currently given in the Apache Drill docs (http://drill.apache.org/docs/tableau-examples/).

Snapshot attached, in case you have worked with Tableau.

I will look into Zeppelin as well.

On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

I separated my response from the original topic to keep any responses there focused on the design document.

As for ways to use Drill, I have been working with SQuirreL SQL quite successfully. Also on my list of things to try, if you want to stay in the Apache world, is Apache Zeppelin. In the Git repo there is a Drill plugin, so you can run SQL against Drill, look at results, and do basic visualizations. I have been waiting until the PR is merged into Zeppelin, but for your use case you may want to grab the plugin code and run with it.
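If you go the SQuirreL route, the connection is just Drill's JDBC driver with a URL along these lines (the ZooKeeper host here is a placeholder, and drillbits1 is the default cluster id; adjust both for your cluster):

jdbc:drill:zk=<zk-host>:2181/drill/drillbits1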
On Thu, Oct 22, 2015 at 6:06 AM, ganesh <[email protected]> wrote:

Hello,

John, you seem to be quite impressed with Apache Drill .. nice.
I am new to the unstructured world and started on Apache Drill just one week back, after a suggestion from my colleagues. We have semi-structured data, with the constraint that we do not know the number of columns.

I heard that Apache Drill is a schema-free application and, with its support for the JSON format, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.

Here I hit two issues:

1) My columns were named like 3100.2.1.2, with values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name can't start with a numeric character, so I had to change the key to "t3100.2.1.2". After that, things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console does not seem an attractive way to present things.

I tried Drill Explorer too, but did not find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make things look better I started with Qlik Sense, but was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded the Tableau trial version, but I am unable to get at the Hadoop data there either.

Please help me with how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show the results in visualization form, using open-source applications only.

Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

AWESOME!

I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:

To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.

Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look at what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I have come across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse!

For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes in it (I will go into that in a moment), but in general Drill handles it well. With this data set, there are a few main challenges:

1. A field holds a float, and a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.

2. A field is of one type (a map), and a later record has a string in it. No easy solution here.

3. SELECT * where there is a JSON field with a "." in the name. I won't go into details here, but I feel this factors into data exploration because it affects the user's ability to "stay in Drill" while exploring their data (https://issues.apache.org/jira/browse/DRILL-3922).

4. Error-reporting challenges.

With the problem summary laid out, I want to walk through my process of working with this data and show where Drill could have been much more helpful to a user.

Here is the process I went through:

1. Copy the data into the file system.
2. Use Drill to SELECT * FROM `path_to/dump.json` LIMIT 1.
3. (I just want to see what it looks like!)

Here I get this error:

select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.

File /data/dev/path_to/dump.json
Record 1
Line 1
Column 9054
Field entropy
Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing examples of the data Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!

ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of where the error message could be more helpful: we have a known use case that fails (the numbers are doubles, but someone stores a 0, which Drill reads as an INT), and the devs have added a setting to let a user get through it. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion to use the setting would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in connection with the schema-change problem. It would be a great place to state: "If you see an error that looks like X, this is what is happening and this is what you can do about it."

*Side note on documentation*

We should try to make the documentation role-based. In this case the documentation says to use ALTER SYSTEM. I would argue, and I am guessing others would concur, that ALTER SESSION may be the better suggestion here: this is a specific alteration to address loading/querying a specific data set, and is likely done by an ordinary user of the system.

If a user is doing self-serve data work in an enterprise environment, they may not have the privileges for ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, but the documentation should also aim to be explicit about a system-wide change.
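For reference, the session-scoped form of the same setting, which is what I ended up using, is simply:

ALTER SESSION SET `store.json.read_numbers_as_double` = true;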
*Back to the story*

OK, so now I apply ALTER SESSION SET to the read_numbers_as_double setting and run the query again:

select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.

File /data/dev/path_to/dump.json
Record 4009
Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it is similar to the schema-change issue, I can tell that is what we are looking at here. Instead of double vs. int, we have a field that is a map most of the time and a string in some cases.

But this doesn't really help me as a user, and Drill offers no options for troubleshooting it. This file is 500 MB of dense, nested JSON with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools otherwise).

My command: $ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill did not tell me which field was at fault, so I had to visually align things to work that out. I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.

Could there have been a way to identify that within Drill? My solution was a Python script that read through the dump and discarded the records where the field was not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation?

For example, the first thing I asked was: which field is at issue? I had no way to see that from within Drill. I had to use other tools to see the data so I could understand the problem, and then, once I understood it, I had to use Python to produce data that was queryable.

Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I would like to see included in the design document (from a user perspective, without claiming to understand the "how or why"):

1. *Error messages that are more verbose in explaining the problem*

a. Filename, row number, column number or name.

b. An option to output the "offending row".

c. Showing the data that is causing the error WITH the type Drill inferred.

d. If there are options that help work through dirty data, the error message could suggest them: "Data was a double; then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common the exception is*

a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:

i. Help users understand how Drill is seeing that particular column.

ii. Inform decisions about excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?

b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions (see the sketch after this list):

i. SELECT type_stats(fieldname) FROM data

ii. (which would not error on different types)

3. *An ability to say "return null for this field on error, or if it is not castable to type X", especially in a view, perhaps as a function* (also sketched after this list). This would let users:

a. Avoid having to re-parse data outside Drill.

b. Load it into a sane format (one-time loads/ETL to clean data).

c. Avoid a system- or session-wide exception.

i. I think this is important because I may have one field where I want the numbers read as double, and another field in the same data set where I do not. A SYSTEM- or SESSION-level variable takes away that granularity.

d. SELECT field1, CASTORNULL(field2, int) AS field2, CASTORNULL(field3, double) AS field3 FROM ugly_data

e. That is an example in the select list, but I could also see a where clause:

f. SELECT field1, field2, field3 FROM ugly_data WHERE ISTYPE(field2, int) AND ISTYPE(field3, double)

4. *Updating the documentation on ALTER SESSION vs. ALTER SYSTEM, with an eye to the majority use case of the documented feature*

a. For data loads, the documentation uses ALTER SYSTEM, and that is problematic because:

i. Not all users have the privileges to issue ALTER SYSTEM, so a new user trying to figure things out may not realize they can simply ALTER SESSION after getting an ALTER SYSTEM error.

ii. ALTER SYSTEM on data-loading options, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin, helping a data user troubleshoot an error may issue an ALTER SYSTEM without realizing it changes all future data imports.

b. Note: I found a few such cases, but I would suggest a thorough review of the use cases throughout the documentation; in areas where it really could be either command, add a small paragraph on the ramifications of each.

5. *A philosophy within the Drill community to "stay in Drill" for data exploration*

a. This is not so much a development item as a mindset. If someone says "I tried to do X and got an error", and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill that handles the edge case intuitively. This is difficult.

b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level and about reading certain empty arrays. In these cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 JSON files of 100 MB each that we want to read? To fix that, we have to rewrite those files. That is a lot of work, manual or automated.

c. Once again, I know this is not easy, but if at all possible we should not answer questions about how to do something with "fix this outside of Drill so Drill can read your data".
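To make 2b concrete, here is a sketch of how the hypothetical stats function might be used. type_stats() is the function proposed above, not something that exists in Drill today, and the output is imagined; the field name and path are the ones from the errors earlier in this story:

-- Hypothetical: type_stats() is the function proposed in 2b, not a real Drill function.
SELECT type_stats(`entropy`) FROM dfs.`/data/dev/path_to/dump.json`;
-- Imagined result: a per-type record count for the column, e.g.
-- FLOAT8: 50990, BIGINT: 10, telling me whether to cast or exclude.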
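Similarly, a sketch of the cast-or-null behavior from 3d wrapped in a view. CASTORNULL() and ISTYPE() are the proposed functions, not existing Drill functions; the view and field names are placeholders:

-- Hypothetical: CASTORNULL() and ISTYPE() do not exist in Drill today.
CREATE VIEW dfs.tmp.`clean_dump` AS
SELECT field1,
       CASTORNULL(field2, INT)    AS field2,  -- null instead of a DATA_READ error
       CASTORNULL(field3, DOUBLE) AS field3
FROM dfs.`/data/dev/path_to/dump.json`;
-- Or filter the offending records instead of nulling the field:
-- SELECT field1 FROM dfs.`/data/dev/path_to/dump.json` WHERE ISTYPE(field2, INT);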
I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik

--
Name: Ganesh Semalty
Location: Gurgaon, Haryana (India)
Email Id: [email protected]

Please consider the environment before printing this e-mail - SAVE TREE.
