Hello, are there any other tutorial links for using Tableau with Hive?
I have already gone through the ones on the apache-drill site, but I am not able to proceed with them.

On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <[email protected]> wrote:

Hive should be visible and usable in Tableau. You can use Drill views for dfs data, or you can use Tableau Custom SQL to access the data.

Make sure to install the Tableau TDC file that comes with the ODBC driver.
https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/

—Andries
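As a minimal sketch of the view approach described above (the workspace, view name, and path are illustrative, not taken from the thread):

    CREATE OR REPLACE VIEW dfs.tmp.`my_json_view` AS
    SELECT * FROM dfs.`/path/to/data.json`;

Tableau can then query the view through the Drill ODBC driver like an ordinary table.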
On Oct 22, 2015, at 6:40 AM, ganesh <[email protected]> wrote:

Hi John,

Thanks for suggesting the new name: Apache Zeppelin.

I am currently trying the 14-day trial version of Tableau without much success. Only today did I learn that for files in Hadoop or the local file system, I would need to create a view.

Still, though I can see my Hive tables in Tableau, I cannot see any data. Nor am I able to use Tableau with the help links given on the apache-drill site (http://drill.apache.org/docs/tableau-examples/).

Snapshot attached, in case you have worked with Tableau.

I will look into Zeppelin as well.

On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

I separated my response from the original topic to keep any responses there focused on the design document.

As to ways to use Drill, I have been working with SQuirreL SQL quite successfully. Also on my list of things to do, if you want to stay in the Apache world, is Apache Zeppelin. In the Git repo there is a Drill plugin, so you can run SQL against Drill, look at results, and do basic visualizations. I have been trying to wait until the PR is merged into Zeppelin, but for your use case you may want to grab the plugin code and run with it.

On Thu, Oct 22, 2015 at 6:06 AM, ganesh <[email protected]> wrote:

Hello,

John, you seem to be quite impressed with Apache Drill .. nice.
I am new to the unstructured world and started on Apache Drill just one week back, after a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns.

I heard that Apache Drill is a schema-free application and, with its support for the JSON format, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.

Here I hit two issues:
1) My columns were named like 3100.2.1.2, with values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name can't start with a numeric value, so I had to change the key to "t3100.2.1.2". After that things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console doesn't seem an attractive way to present things.

I tried Drill Explorer too, but didn't find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make it look better, I started with Qlik Sense, but I was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded the Tableau trial version, but I am unable to get at the Hadoop data there too.

Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show them in visualization form, using open-source applications only.

Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

AWESOME!

I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here.

To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd...", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.

Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I came across multiple "things" that were just plain hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse!

For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes in it (I will go into that in a moment), but in general Drill handles it well. With this data set, there are a few main challenges I am seeing:

1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.

2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.

3. SELECT * where there is a JSON field with a "." in the name. I won't go into details here, but I feel this factors into data exploration because it changes the user's ability to "stay in Drill" to explore their data (https://issues.apache.org/jira/browse/DRILL-3922). See the illustrative example after this list.

4. Error reporting challenges.
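To make challenge 3 concrete, here is an illustrative sketch (the key echoes the renamed "t3100.2.1.2" field from earlier in the thread; the path is made up). Given a record such as {"t3100.2.1.2": "-2303"}, even a plain wildcard query over the file can trip on the dots in the key, since Drill also uses dots for map access; this is the subject of DRILL-3922:

    SELECT * FROM dfs.`/path/to/converted.json`;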
With the problem summary laid out, I wanted to walk through my process of working with this data, and where, if I were a user, Drill could have been much more helpful.

Here is a description of the process I went through:

1. Copy the data into the filesystem.
2. Use Drill to SELECT * FROM `path_to/dump.json` LIMIT 1.
3. (I just want to see what it looks like!)

Here I get this error:

    > select * from `path_to/dump.json` limit 1;
    Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.

    File /data/dev/path_to/dump.json
    Record 1
    Line 1
    Column 9054
    Field entropy
    Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Google around, I realize that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing an example of the data that Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!

    ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of a known failing use case (numbers are doubles, but someone stores 0 as an INT) for which the devs have added a setting to let a user get through it, yet where the error message could be more helpful. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion to use the setting would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in relation to the schema change problem. That would be a great place to state: "if you see an error that looks like X, this is what is happening and this is what you can do about it."

*Side note on documentation*

We should look to make the documentation role based. In this case, the documentation says use ALTER SYSTEM. I would argue, and I am guessing others would concur, that for this use case ALTER SESSION may be the better suggestion, since this is a specific alteration to address loading/querying a specific data set and is likely done by a user of the system.

If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about something like a system-wide change.
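For reference, the session-scoped form of the same setting is a one-line change:

    ALTER SESSION SET `store.json.read_numbers_as_double` = true;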
*Back to the story*

OK, so now I set read_numbers_as_double at the session level and run the query again.

    > select * from `path_to/dump.json` limit 1;
    Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.

    File /data/dev/path_to/dump.json
    Record 4009
    Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that that is what we are looking at here. Instead of double-to-int, we have one field that is a map most of the time and, in some cases, a string.

But this doesn't really help me as a user, and Drill doesn't offer any options for troubleshooting it. This file is 500 MB of dense, nested JSON with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools otherwise).

My command:

    $ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to figure that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.

Could there have been a way to identify that within Drill? My solution was a Python script that read through the dump and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data... but could Drill be given some tools to help with this situation?

For example, the first thing I asked was: which field is at issue? I had no way to see what was up there. I had to use other tools to see the data so I could understand the problem, and then, once I understood the problem, I had to use Python to produce data that was queryable.
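Worth noting: for the map-versus-string case above, one blunt session-level workaround that does exist is Drill's all-text mode, which reads every JSON value as VarChar, at the cost of losing all of the types; it is a trade-off rather than a fix:

    ALTER SESSION SET `store.json.all_text_mode` = true;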
Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, without claiming to understand "how or why"):

1. *Error messages that are more verbose in explaining the problem*
   a. Filename, row number, column number or name.
   b. An option to output the "offending row".
   c. Showing the data that is causing the error WITH the type Drill inferred.
   d. If there are options to help work through dirty data, the error message could include those: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common the exception is*
   a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
      i. Help users understand how Drill is seeing that particular column.
      ii. Inform decisions on excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
   b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions:
      i. SELECT type_stats(fieldname) FROM data
      ii. (which wouldn't error on different types)

3. *An ability to return null on a field if there is an error, or if it is non-castable to type X, especially in a view, perhaps via a function* (a sketch of what this might look like follows the list)
   a. Lets users avoid reparsing data outside Drill.
   b. Lets them load it into a sane format (one-time loads/ETL to clean data).
   c. Is not a system- or session-wide exception.
      i. I think this is important because I may have one field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM- or SESSION-level variable takes away that granularity.
   d. SELECT field1, CASTORNULL(field2, int) AS field2, CASTORNULL(field3, double) AS field3 FROM ugly_data
   e. That's an example in the select, but I could see a where clause too:
   f. SELECT field1, field2, field3 FROM ugly_data WHERE ISTYPE(field2, int) AND ISTYPE(field3, double)

4. *Updating the documentation related to ALTER SESSION vs. ALTER SYSTEM, with an eye to the majority use case of the documented feature*
   a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
      i. Not all users have the privileges to issue an ALTER SYSTEM, so a new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
      ii. ALTER SYSTEM on data-loading items, especially in areas that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
   b. Note: I found a few cases, but I would suggest a thorough review of the various use cases throughout the documentation; in areas where it really could be either, add a small paragraph indicating the ramifications of each command.

5. *A philosophy within the Drill community to "stay in Drill" for data exploration*
   a. This is obviously not so much a development thing as a mindset. If someone says "I tried to do X and I got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to intuitively handle the edge case. This is difficult.
   b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level, or reading some empty arrays, and in those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 100 MB JSON files that we want to read? To fix it, we have to address those files; that is a lot of work, either manual or automated.
   c. Once again, I know this isn't easy, but we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data" if at all possible.
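To sketch how point 3 might read in practice (CASTORNULL is the hypothetical function proposed above; nothing like it exists in Drill, and the names and path are illustrative), a cleaning view could look like:

    CREATE VIEW dfs.tmp.`clean_data` AS
    SELECT field1,
           CASTORNULL(field2, INT) AS field2,     -- hypothetical: NULL instead of a DATA_READ error
           CASTORNULL(field3, DOUBLE) AS field3   -- hypothetical
    FROM dfs.`/path/to/ugly_data.json`;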
I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik
