See if the videos on this page help you: https://www.mapr.com/products/apache-drill
—Andries

On Oct 22, 2015, at 7:38 AM, ganesh <[email protected]> wrote:

Hello,

Are there any other tutorial links for Tableau vs. Hive? I have already gone through the one on the Apache Drill site and am not able to proceed with it.

On Thu, Oct 22, 2015 at 8:03 PM, Andries Engelbrecht <[email protected]> wrote:

Hive should be visible and usable in Tableau. You can use Drill views for dfs data, or you can use Tableau Custom SQL to access the data.

Make sure to install the Tableau TDC file that comes with the ODBC driver:
https://drill.apache.org/docs/installing-the-tdc-file-on-windows/
https://drill.apache.org/docs/using-apache-drill-with-tableau-9-desktop/
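For example, a view over a raw file in dfs might look like the sketch below. The workspace, path, and column names are placeholders, not from your environment; casting each column gives the ODBC driver concrete types to hand to Tableau instead of everything coming through as ANY:

CREATE VIEW dfs.tmp.`sales_vw` AS
SELECT CAST(t.`region` AS VARCHAR(50)) AS region,
       CAST(t.`amount` AS DOUBLE)      AS amount
FROM dfs.`/path/to/sales.json` t;

The view should then show up as a table under the dfs.tmp schema in Tableau.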
—Andries

On Oct 22, 2015, at 6:40 AM, ganesh <[email protected]> wrote:

Hi John,

Thanks for suggesting a new name: Apache Zeppelin.

I am currently trying the 14-day trial version of Tableau, without much success. Only today did I learn that for files in Hadoop or the local file system I would need to create a view.

Still, though I can see my Hive tables in Tableau, I cannot see any data. Nor am I able to use Tableau with the help currently given in the Apache Drill docs (http://drill.apache.org/docs/tableau-examples/).

Snapshot attached, in case you have worked with Tableau.

I will look into Zeppelin as well.

On Thu, Oct 22, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

I separated my response from the original topic to keep any responses there focused on the design document.

As for ways to use Drill, I have been working with SQuirreL SQL quite successfully. Also on my list of things to try, if you want to stay in the Apache world, is Apache Zeppelin. In the Git repo there is a Drill plugin, so you can run SQL against Drill, look at results, and do basic visualizations. I have been waiting until the PR is merged into Zeppelin, but for your use case you may want to grab the plugin code and run with it.
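If you go the SQuirreL route, the connection is just Drill's JDBC driver with a URL along these lines (the ZooKeeper host here is a placeholder, and drillbits1 is the default cluster id; adjust both for your cluster):

jdbc:drill:zk=<zk-host>:2181/drill/drillbits1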
On Thu, Oct 22, 2015 at 6:06 AM, ganesh <[email protected]> wrote:

Hello,

John, you seem to be quite impressed with Apache Drill .. nice.
I am new to the unstructured world and started on Apache Drill just one week back, after a suggestion from my colleagues. We have semi-structured data, with the constraint that we do not know the number of columns.

I heard that Apache Drill is a schema-free application and, with its support for the JSON format, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.

Here I hit two issues:

1) My columns were named like 3100.2.1.2, with values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name can't start with a numeric character, so I had to change the key to "t3100.2.1.2". After that, things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console does not seem an attractive way to present things.

I tried Drill Explorer too, but did not find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make things look better I started with Qlik Sense, but was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded the Tableau trial version, but I am unable to get at the Hadoop data there either.

Please help me with how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show the results in visualization form, using open-source applications only.

Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

AWESOME!

I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:

To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.

Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront about my experience so we can look at what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.

John

*User Story*

As I have been working with Drill for data exploration, I have come across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and scaled ugly is even worse!

For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes in it (I will go into that in a moment), but in general Drill handles it well. With this data set, there are a few main challenges:

1. A field holds a float, and a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.

2. A field is of one type (a map), and a later record has a string in it. No easy solution here.

3. SELECT * where there is a JSON field with a "." in the name. I won't go into details here, but I feel this factors into data exploration because it affects the user's ability to "stay in Drill" while exploring their data (https://issues.apache.org/jira/browse/DRILL-3922).

4. Error-reporting challenges.

With the problem summary laid out, I want to walk through my process of working with this data and show where Drill could have been much more helpful to a user.

Here is the process I went through:

1. Copy the data into the file system.
2. Use Drill to SELECT * FROM `path_to/dump.json` LIMIT 1.
3. (I just want to see what it looks like!)

Here I get this error:

select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.

File /data/dev/path_to/dump.json
Record 1
Line 1
Column 9054
Field entropy
Fragment 0:0

This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing examples of the data Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!

ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;

This is a great example of where the error message could be more helpful: we have a known use case that fails (the numbers are doubles, but someone stores a 0, which Drill reads as an INT), and the devs have added a setting to let a user get through it. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion to use the setting would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.

Also, as a corollary, the user documentation does not show this error in connection with the schema-change problem. It would be a great place to state: "If you see an error that looks like X, this is what is happening and this is what you can do about it."

*Side note on documentation*

We should try to make the documentation role-based. In this case the documentation says to use ALTER SYSTEM. I would argue, and I am guessing others would concur, that ALTER SESSION may be the better suggestion here: this is a specific alteration to address loading/querying a specific data set, and is likely done by an ordinary user of the system.

If a user is doing self-serve data work in an enterprise environment, they may not have the privileges for ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, but the documentation should also aim to be explicit about a system-wide change.
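For reference, the session-scoped form of the same setting, which is what I ended up using, is simply:

ALTER SESSION SET `store.json.read_numbers_as_double` = true;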
*Back to the story*

OK, so now I apply ALTER SESSION SET to the read_numbers_as_double setting and run the query again:

select * from `path_to/dump.json` limit 1;
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.

File /data/dev/path_to/dump.json
Record 4009
Fragment 0:0

Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it is similar to the schema-change issue, I can tell that is what we are looking at here. Instead of double vs. int, we have a field that is a map most of the time and a string in some cases.

But this doesn't really help me as a user, and Drill offers no options for troubleshooting it. This file is 500 MB of dense, nested JSON with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools otherwise).

My command: $ head -4009 dump.json | tail -1

That (I hoped) showed me the record in question. Note that the error from Drill did not tell me which field was at fault, so I had to visually align things to work that out. I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.

Could there have been a way to identify that within Drill? My solution was a Python script that read through the dump and discarded the records where the field was not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation?

For example, the first thing I asked was: which field is at issue? I had no way to see that from within Drill. I had to use other tools to see the data so I could understand the problem, and then, once I understood it, I had to use Python to produce data that was queryable.

Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I would like to see included in the design document (from a user perspective, without claiming to understand the "how or why"):

1. *Error messages that are more verbose in explaining the problem*

a. Filename, row number, column number or name.

b. An option to output the "offending row".

c. Showing the data that is causing the error WITH the type Drill inferred.

d. If there are options that help work through dirty data, the error message could suggest them: "Data was a double; then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."

2. *A way to determine how common the exception is*

a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:

i. Help users understand how Drill is seeing that particular column.

ii. Inform decisions about excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?

b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions (see the sketch after this list):

i. SELECT type_stats(fieldname) FROM data

ii. (which would not error on different types)

3. *An ability to say "return null for this field on error, or if it is not castable to type X", especially in a view, perhaps as a function* (also sketched after this list). This would let users:

a. Avoid having to re-parse data outside Drill.

b. Load it into a sane format (one-time loads/ETL to clean data).

c. Avoid a system- or session-wide exception.

i. I think this is important because I may have one field where I want the numbers read as double, and another field in the same data set where I do not. A SYSTEM- or SESSION-level variable takes away that granularity.

d. SELECT field1, CASTORNULL(field2, int) AS field2, CASTORNULL(field3, double) AS field3 FROM ugly_data

e. That is an example in the select list, but I could also see a where clause:

f. SELECT field1, field2, field3 FROM ugly_data WHERE ISTYPE(field2, int) AND ISTYPE(field3, double)

4. *Updating the documentation on ALTER SESSION vs. ALTER SYSTEM, with an eye to the majority use case of the documented feature*

a. For data loads, the documentation uses ALTER SYSTEM, and that is problematic because:

i. Not all users have the privileges to issue ALTER SYSTEM, so a new user trying to figure things out may not realize they can simply ALTER SESSION after getting an ALTER SYSTEM error.

ii. ALTER SYSTEM on data-loading options, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin, who may be a good systems admin, helping a data user troubleshoot an error may issue an ALTER SYSTEM without realizing it changes all future data imports.

b. Note: I found a few such cases, but I would suggest a thorough review of the use cases throughout the documentation; in areas where it really could be either command, add a small paragraph on the ramifications of each.

5. *A philosophy within the Drill community to "stay in Drill" for data exploration*

a. This is not so much a development item as a mindset. If someone says "I tried to do X and got an error", and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill that handles the edge case intuitively. This is difficult.

b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level and about reading certain empty arrays. In these cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 JSON files of 100 MB each that we want to read? To fix that, we have to rewrite those files. That is a lot of work, manual or automated.

c. Once again, I know this is not easy, but if at all possible we should not answer questions about how to do something with "fix this outside of Drill so Drill can read your data".
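To make 2b concrete, here is a sketch of how the hypothetical stats function might be used. type_stats() is the function proposed above, not something that exists in Drill today, and the output is imagined; the field name and path are the ones from the errors earlier in this story:

-- Hypothetical: type_stats() is the function proposed in 2b, not a real Drill function.
SELECT type_stats(`entropy`) FROM dfs.`/data/dev/path_to/dump.json`;
-- Imagined result: a per-type record count for the column, e.g.
-- FLOAT8: 50990, BIGINT: 10, telling me whether to cast or exclude.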
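Similarly, a sketch of the cast-or-null behavior from 3d wrapped in a view. CASTORNULL() and ISTYPE() are the proposed functions, not existing Drill functions; the view and field names are placeholders:

-- Hypothetical: CASTORNULL() and ISTYPE() do not exist in Drill today.
CREATE VIEW dfs.tmp.`clean_dump` AS
SELECT field1,
       CASTORNULL(field2, INT)    AS field2,  -- null instead of a DATA_READ error
       CASTORNULL(field3, DOUBLE) AS field3
FROM dfs.`/data/dev/path_to/dump.json`;
-- Or filter the offending records instead of nulling the field:
-- SELECT field1 FROM dfs.`/data/dev/path_to/dump.json` WHERE ISTYPE(field2, INT);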
I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.

John Omernik

--
Name: Ganesh Semalty
Location: Gurgaon, Haryana (India)
Email Id: [email protected]

Please consider the environment before printing this e-mail - SAVE TREE.
