Sorry guys, I accidentally answered the question here; I did not realize this was another topic.
On 22 October 2015 at 13:29, Mustafa Engin Sözer <[email protected]> wrote:

> Hi,
>
> If you only need to show the data, and not some sort of report (e.g. as in MicroStrategy or Tableau), then you can basically use most SQL clients, for instance SQuirreL, HeidiSQL, etc., via a JDBC or ODBC connection. The following links should help you manage that:
>
> https://drill.apache.org/docs/odbc-jdbc-interfaces/
> https://drill.apache.org/docs/using-the-jdbc-driver/
>
> For the Hadoop data problem, can you share your storage plugin settings? Normally you shouldn't have any problems seeing or querying HDFS data.
>
> On 22 October 2015 at 13:06, ganesh <[email protected]> wrote:
>
>> Hello,
>>
>> John, you seem to be quite impressed with Apache Drill .. nice.
>>
>> I am new to the unstructured world and started on Apache Drill just one week ago, after a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns.
>>
>> I heard that Apache Drill is a schema-free application and that, with its support for the JSON format, it allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me.
>>
>> Here I hit a couple of issues:
>>
>> 1) My columns were like "3100.2.1.2" and values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name cannot start with a numeric value, so I had to change the key to "t3100.2.1.2". After that things were quite OK.
>>
>> Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console doesn't seem an attractive way to present things.
>>
>> I tried Drill Explorer too, but didn't find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.
>>
>> To make it look better, I started with Qlik Sense .. but was unable to connect it to the Hadoop file system.
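(Editor's aside on the key rewrite Ganesh describes above: prefixing numeric-leading keys while converting CSV-like data to JSON is easy to automate. This is a hedged Python sketch, not anything Drill ships; the "t" prefix rule and the sample key come from the message above, everything else is illustrative.)

```python
def fix_keys(record, prefix="t"):
    """Prefix any key that starts with a digit, since a Drill column
    name cannot start with a numeric value (per the message above)."""
    return {
        (prefix + key) if key[:1].isdigit() else key: value
        for key, value in record.items()
    }

# Sample record mirroring the keys/values described in the email.
print(fix_keys({"3100.2.1.2": "-2303", "ts": "01/01/2015 02:02:00"}))
# {'t3100.2.1.2': '-2303', 'ts': '01/01/2015 02:02:00'}
```

Applied per line while streaming the converted JSON, this avoids having to hand-edit keys after the fact.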
>> It only showed me Hive files. Then I downloaded a Tableau trial version ... but I am unable to get at the Hadoop data there too...
>>
>> Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready .. I just need to show the results in visualization form ........ using OPEN SOURCE applications only.
>>
>> Guys, please help me.
>>
>> On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:
>>
>>> AWESOME!
>>>
>>> I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:
>>>
>>> To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday. If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.
>>>
>>> Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront with my experience so we can look at what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.
>>>
>>> John
>>>
>>> *User Story*
>>>
>>> As I have been working with Drill for data exploration, I came across multiple "things" that were just hard.
>>> In dealing with some data, especially JSON data, it can be ugly, and ugly at scale is even worse!
>>>
>>> For this story, I am working with a JSON dump from MongoDB. You would think it would be well structured, and for the most part it is. There are some application-level mistakes that were made (I will go into that in a moment), but in general Drill handles this well. With this data set, there are a few main challenges I am seeing:
>>>
>>> 1. A field holds a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.
>>>
>>> 2. A field is of one type (a map), and then a later record has a string in it. No easy solution here.
>>>
>>> 3. Select * where there is a JSON field with a . in the name. I won't go into details here, but I feel this factors into data exploration, because it affects the ability to "stay in Drill" to explore the data (https://issues.apache.org/jira/browse/DRILL-3922).
>>>
>>> 4. Error-reporting challenges.
>>>
>>> With the problem summary laid out, I want to walk through my process in working with this data and point out where, for a user, Drill could have been much more helpful.
>>>
>>> Here is a description of the process I went through:
>>>
>>> 1. Copy the data into the filesystem.
>>>
>>> 2. Use Drill to run "select * from `path_to/dump.json` limit 1" (I just want to see what it looks like!).
>>>
>>> Here I get this error:
>>>
>>> > select * from `path_to/dump.json` limit 1;
>>> Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
>>> File /data/dev/path_to/dump.json
>>> Record 1
>>> Line 1
>>> Column 9054
>>> Field entropy
>>> Fragment 0:0
>>>
>>> This isn't incredibly helpful from a user perspective. When I Google around, I realize that the docs talk about "schema changes" and that one possible fix is the setting below. However, showing examples of the data Drill was trying to read (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!
>>>
>>> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>>>
>>> This is a great example of a case where, since we have a known failure mode (numbers are doubles, but someone stores a 0, which reads as an INT) and the developers have added a setting to let a user get past it, the error message could be more helpful. Showing two record (line) numbers with different types, the field values with their implied types, and perhaps a suggestion to use the setting to address the problem would make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.
>>>
>>> As a corollary, the user documentation does not show this error in connection with the schema-change problem. That would be a great place to state: "if you see an error that looks like X, this is what is happening and what you can do about it."
>>>
>>> *Side note on documentation*
>>>
>>> We should look to make the documentation role-based.
>>> In this case the documentation says to use ALTER SYSTEM. I would argue, and I am guessing others would concur, that for this use case ALTER SESSION may be a better suggestion, as this is a specific alteration to address loading or querying a particular data set and is likely done by an ordinary user of the system.
>>>
>>> If a user is doing self-serve data work in an enterprise environment, they may not have the ability to use ALTER SYSTEM; they will get an error and may be confused about how to proceed. In addition, an ALTER SYSTEM issued by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about something like a system-wide change.
>>>
>>> *Back to the story*
>>>
>>> OK, so now I do an ALTER SESSION SET on the read_numbers_as_double setting and run the query again.
>>>
>>> > select * from `path_to/dump.json` limit 1;
>>> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
>>>
>>> File /data/dev/path_to/dump.json
>>> Record 4009
>>> Fragment 0:0
>>>
>>> Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it is similar to the schema-change issue, I can tell that that is what we are looking at here: instead of double-to-int, we have a field that is a map most of the time but a string in some records.
>>>
>>> But this doesn't really help me as a user. To troubleshoot this, Drill doesn't offer any options.
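(Editor's aside: the map-vs-string collision John hits here is the kind of thing he later cleaned up with a Python script. A hedged sketch of that approach, not a Drill feature: it assumes newline-delimited JSON, one record per line, and reuses the field name "entropy" from the earlier error purely as a stand-in.)

```python
import json

def find_non_map_records(lines, field):
    """Scan newline-delimited JSON and report (line_number, python_type)
    for every record whose `field` is present but is not a map/object."""
    offenders = []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        value = record.get(field)
        if value is not None and not isinstance(value, dict):
            offenders.append((lineno, type(value).__name__))
    return offenders

# Tiny example: record 2 holds a string where a map is expected.
sample = [
    '{"entropy": {"score": 7.2}}',
    '{"entropy": "n/a"}',
    '{"entropy": {"score": 1.5}}',
]
print(find_non_map_records(sample, "entropy"))  # [(2, 'str')]
```

This works on a 500 MB dump but, as the story notes below, it is exactly the "leave Drill to fix the data" pattern that does not scale to 500 GB.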
>>> This file is 500 MB of dense and nested JSON data with 51k records. My solution? I took the record number and went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this with POSIX tools alone).
>>>
>>> My command: $ head -4009 dump.json | tail -1
>>>
>>> That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to work that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.
>>>
>>> Could there have been a way to identify that within Drill? My solution was a Python script that read through the file and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data…. But could Drill be given some tools to help with this situation?
>>>
>>> For example, the first thing I asked was: which field is at issue? I had no way to see what was going on in there. I had to use other tools to see the data so I could understand the problem, and then I had to use Python to produce data that was queryable.
>>>
>>> Based on the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is simply a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, without claiming to understand the "how or why"):
>>>
>>> *1. Error messages that are more verbose in explaining the problem*
>>>
>>> a. Filename, row number, column number or name.
>>> b. An option to output the "offending row".
>>> c. Showing the data that is causing the error WITH the type Drill inferred.
>>> d. If there are options to help work through dirty data, the error message could suggest them: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column "myfloatingdata"; consider using store.json.read_numbers_as_double to address the issue."
>>>
>>> *2. A way to determine how common the exception is*
>>>
>>> a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that would:
>>>    i. Help users understand how Drill is seeing that particular column.
>>>    ii. Help users make decisions about excluding data rather than just removing it. What if the first 10 records were errors, and you then excluded the remaining 10 million because they were correct yet different from the first 10?
>>> b. Perhaps there could be a "stats" function that only works if it is the only selected item (or if everything selected is a stats function):
>>>    i. Select type_stats(fieldname) from data
>>>    ii. (which wouldn't error on different types)
>>>
>>> *3. An ability to return null on a field if it errors or is not castable to type X, especially in a view, perhaps via a function*
>>>
>>> a. Lets users avoid reparsing data outside Drill.
>>> b. Lets them load it into a sane format (one-time loads/ETL to clean data).
>>> c. Is not a system- or session-wide exception.
>>>    i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM- or SESSION-level variable takes away that granularity.
>>> d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
>>> e. That's an example in the select, but I could see a where clause too:
>>> f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)
>>>
>>> *4. Updating the documentation on ALTER SESSION vs. ALTER SYSTEM with an eye to the majority use case of the documented feature*
>>>
>>> a. For data loads, the documentation uses ALTER SYSTEM, and that is problematic because:
>>>    i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
>>>    ii. ALTER SYSTEM on data-loading options, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin who is a good systems admin, helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
>>> b. I found a few cases, but I would suggest a thorough review of the use cases throughout the documentation; where it really could be either command, add a small paragraph on the ramifications of each.
>>>
>>> *5. A philosophy within the Drill community to "stay in Drill" for data exploration*
>>>
>>> a. This is obviously a mindset more than a development item. If someone says "I tried to do X and got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
>>> b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ on arrays at the root level, or on reading some empty arrays. In those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the root-level-array limitation: what if some process outside the user's control produces 1000 100 MB JSON files and we want to read them? To fix it, we have to rewrite those files, which is a lot of work, manual or automated.
>>> c. Once again, I know this isn't easy, but if at all possible we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data".
>>>
>>> I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.
>>>
>>> John Omernik
>>
>> --
>> *Name: Ganesh Semalty*
>> *Location: Gurgaon, Haryana (India)*
>> *Email Id: [email protected] <[email protected]>*
>>
>> *Please consider the environment before printing this e-mail - SAVE TREE.*
>
> --
> *M. Engin Sözer*
> Junior Datawarehouse Manager
> [email protected]
>
> Goodgame Studios
> Theodorstr. 42-90, House 9
> 22761 Hamburg, Germany
> Phone: +49 (0)40 219 880-0
> *www.goodgamestudios.com <http://www.goodgamestudios.com>*
>
> Goodgame Studios is a brand of Altigi GmbH
> Altigi GmbH, District court Hamburg, HRB 99869
> Board of directors: Dr. Kai Wawrzinek, Dr. Christian Wawrzinek, Fabian Ritter
