Hello, are there any other tutorial links for using Tableau with Hive?
I have already gone through the ones on the apache-drill site, but I am not able to proceed with them.

On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <[email protected]> wrote:

Hive should be visible and usable in Tableau. You can use Drill views for dfs data, or you can use Tableau Custom SQL to access the data.

Make sure to install the Tableau TDC file that comes with the ODBC driver.
https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/

—Andries
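As a minimal sketch of the view approach described above (the workspace, view name, and path are illustrative, not taken from the thread):

    CREATE OR REPLACE VIEW dfs.tmp.`my_json_view` AS
    SELECT * FROM dfs.`/path/to/data.json`;

Tableau can then query the view through the Drill ODBC driver like an ordinary table.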
On Oct 22, 2015, at 6:40 AM, ganesh <[email protected]> wrote:

Hi John,

Thanks for suggesting the new name: Apache Zeppelin.

I am currently trying the 14-day trial version of Tableau without much success. Only today did I learn that for files in Hadoop or the local file system, I would need to create a view.

Still, though I can see my Hive tables in Tableau, I cannot see any data. Nor am I able to use Tableau with the help links given on the apache-drill site (http://drill.apache.org/docs/tableau-examples/).

Snapshot attached, in case you have worked with Tableau.

I will look into Zeppelin as well.

On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

I separated my response from the original topic to keep any responses there focused on the design document.

As to ways to use Drill, I have been working with SQuirreL SQL quite successfully. Also on my list of things to do, if you want to stay in the Apache world, is Apache Zeppelin. In the Git repo there is a Drill plugin, so you can run SQL against Drill, look at results, and do basic visualizations. I have been trying to wait until the PR is merged into Zeppelin, but for your use case you may want to grab the plugin code and run with it.

On Thu, Oct 22, 2015 at 6:06 AM, ganesh <[email protected]> wrote:

Hello,

John, you seem to be quite impressed with Apache Drill .. nice.
I am new to the unstructured world and started on Apache Drill just one week back, after a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns.

I heard that Apache Drill is a schema-free application and, with its support for the JSON format, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.

Here I hit two issues:
1) My columns were named like 3100.2.1.2, with values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name can't start with a numeric value, so I had to change the key to "t3100.2.1.2". After that things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console doesn't seem an attractive way to present things.

I tried Drill Explorer too, but didn't find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make it look better, I started with Qlik Sense, but I was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded the Tableau trial version, but I am unable to get at the Hadoop data there too.

Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show them in visualization form, using open-source applications only.

Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

AWESOME!

I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here.

To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd...", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.

Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I came across multiple "things" that were just plain hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse!

For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes in it (I will go into that in a moment), but in general Drill handles it well. With this data set, there are a few main challenges I am seeing:

1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.

2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.

3. SELECT * where there is a JSON field with a "." in the name. I won't go into details here, but I feel this factors into data exploration because it changes the user's ability to "stay in Drill" to explore their data (https://issues.apache.org/jira/browse/DRILL-3922). See the illustrative example after this list.

4. Error reporting challenges.
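To make challenge 3 concrete, here is an illustrative sketch (the key echoes the renamed "t3100.2.1.2" field from earlier in the thread; the path is made up). Given a record such as {"t3100.2.1.2": "-2303"}, even a plain wildcard query over the file can trip on the dots in the key, since Drill also uses dots for map access; this is the subject of DRILL-3922:

    SELECT * FROM dfs.`/path/to/converted.json`;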
With the problem summary laid out, I wanted to walk through my process of working with this data, and where, if I were a user, Drill could have been much more helpful.

Here is a description of the process I went through:

1. Copy the data into the filesystem.
2. Use Drill to SELECT * FROM `path_to/dump.json` LIMIT 1.
3. (I just want to see what it looks like!)

Here I get this error:

    > select * from `path_to/dump.json` limit 1;
    Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.

    File /data/dev/path_to/dump.json
    Record 1
    Line 1
    Column 9054
    Field entropy
    Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Google around, I realize that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing an example of the data that Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!

    ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of a known failing use case (numbers are doubles, but someone stores 0 as an INT) for which the devs have added a setting to let a user get through it, yet where the error message could be more helpful. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion to use the setting would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in relation to the schema change problem. That would be a great place to state: "if you see an error that looks like X, this is what is happening and this is what you can do about it."

*Side note on documentation*

We should look to make the documentation role based. In this case, the documentation says use ALTER SYSTEM. I would argue, and I am guessing others would concur, that for this use case ALTER SESSION may be the better suggestion, since this is a specific alteration to address loading/querying a specific data set and is likely done by a user of the system.

If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about something like a system-wide change.
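For reference, the session-scoped form of the same setting is a one-line change:

    ALTER SESSION SET `store.json.read_numbers_as_double` = true;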
*Back to the story*

OK, so now I set read_numbers_as_double at the session level and run the query again.

    > select * from `path_to/dump.json` limit 1;
    Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.

    File /data/dev/path_to/dump.json
    Record 4009
    Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that that is what we are looking at here. Instead of double-to-int, we have one field that is a map most of the time and, in some cases, a string.

But this doesn't really help me as a user, and Drill doesn't offer any options for troubleshooting it. This file is 500 MB of dense, nested JSON with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools otherwise).

My command:

    $ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to figure that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.

Could there have been a way to identify that within Drill? My solution was a Python script that read through the dump and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data... but could Drill be given some tools to help with this situation?

For example, the first thing I asked was: which field is at issue? I had no way to see what was up there. I had to use other tools to see the data so I could understand the problem, and then, once I understood the problem, I had to use Python to produce data that was queryable.
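Worth noting: for the map-versus-string case above, one blunt session-level workaround that does exist is Drill's all-text mode, which reads every JSON value as VarChar, at the cost of losing all of the types; it is a trade-off rather than a fix:

    ALTER SESSION SET `store.json.all_text_mode` = true;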
Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, without claiming to understand "how or why"):

1. *Error messages that are more verbose in explaining the problem*
   a. Filename, row number, column number or name.
   b. An option to output the "offending row".
   c. Showing the data that is causing the error WITH the type Drill inferred.
   d. If there are options to help work through dirty data, the error message could include those: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common the exception is*
   a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
      i. Help users understand how Drill is seeing that particular column.
      ii. Inform decisions on excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
   b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions:
      i. SELECT type_stats(fieldname) FROM data
      ii. (which wouldn't error on different types)

3. *An ability to return null on a field if there is an error, or if it is non-castable to type X, especially in a view, perhaps via a function* (a sketch of what this might look like follows the list)
   a. Lets users avoid reparsing data outside Drill.
   b. Lets them load it into a sane format (one-time loads/ETL to clean data).
   c. Is not a system- or session-wide exception.
      i. I think this is important because I may have one field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM- or SESSION-level variable takes away that granularity.
   d. SELECT field1, CASTORNULL(field2, int) AS field2, CASTORNULL(field3, double) AS field3 FROM ugly_data
   e. That's an example in the select, but I could see a where clause too:
   f. SELECT field1, field2, field3 FROM ugly_data WHERE ISTYPE(field2, int) AND ISTYPE(field3, double)

4. *Updating the documentation related to ALTER SESSION vs. ALTER SYSTEM, with an eye to the majority use case of the documented feature*
   a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
      i. Not all users have the privileges to issue an ALTER SYSTEM, so a new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
      ii. ALTER SYSTEM on data-loading items, especially in areas that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
   b. Note: I found a few cases, but I would suggest a thorough review of the various use cases throughout the documentation; in areas where it really could be either, add a small paragraph indicating the ramifications of each command.

5. *A philosophy within the Drill community to "stay in Drill" for data exploration*
   a. This is obviously not so much a development thing as a mindset. If someone says "I tried to do X and I got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to intuitively handle the edge case. This is difficult.
   b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level, or reading some empty arrays, and in those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 100 MB JSON files that we want to read? To fix it, we have to address those files; that is a lot of work, either manual or automated.
   c. Once again, I know this isn't easy, but we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data" if at all possible.
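To sketch how point 3 might read in practice (CASTORNULL is the hypothetical function proposed above; nothing like it exists in Drill, and the names and path are illustrative), a cleaning view could look like:

    CREATE VIEW dfs.tmp.`clean_data` AS
    SELECT field1,
           CASTORNULL(field2, INT) AS field2,     -- hypothetical: NULL instead of a DATA_READ error
           CASTORNULL(field3, DOUBLE) AS field3   -- hypothetical
    FROM dfs.`/path/to/ugly_data.json`;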
I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik
