Hello John, you seem to be quite impressed with Apache Drill, nice. I am new to the unstructured-data world and started with Apache Drill just one week back, on a suggestion from my colleagues. We have semi-structured data with the constraint that we do not know the number of columns in advance.
I heard that Apache Drill is a schema-free application and, with its JSON support, allows columns to be created on the fly. I converted my data from a CSV-like format to JSON and am trying to figure out whether it will work for me. Here I hit a couple of issues:

1) My columns were like 3100.2.1.2 and values like "-2303" or "01/01/2015 02:02:00". The challenge was that a column name cannot start with a numeric value, so I had to change the key to "t3100.2.1.2". After that, things were quite OK.

Now I need some help from you guys. To proceed, I have to present my work to management as an example, but querying on the Apache Drill console does not seem an attractive way to present things. I tried Drill Explorer too, but did not find it that good. One thing to note: I am playing with files on Hadoop in standalone mode on Ubuntu.

To make it look better, I started with Qlik Sense, but was unable to connect it to the Hadoop file system; it only showed me Hive files. Then I downloaded a Tableau trial version, but I am unable to get at the Hadoop data there either. Please help me figure out how to proceed. I have a presentation this coming Monday. The queries are quite ready; I just need to show the results in visual form, using open-source applications only. Guys, please help me.

On Wed, Oct 21, 2015 at 6:43 PM, John Omernik <[email protected]> wrote:

> AWESOME!
>
> I had just been in the process of writing up a long user story to ask for and support exactly this. I modified it and included it here:
>
> To start out, I want to say how much I love the Drill project and the potential it has. I've put this together based on my experiences and want to contribute a perspective as a user, not just put a bunch of critiques in an email. I hope it's all taken in that spirit. An additional note: I wrote this prior to seeing the design document shared by Hsuan Yi Chu yesterday.
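One open-source way to get query results out of Drill for a presentation, without wrestling with ODBC drivers for Qlik or Tableau: Drill's web server exposes a REST endpoint (by default on port 8047) that returns query results as JSON, which any Python plotting tool can then consume. A minimal sketch, assuming a local Drillbit; the table path and column names below are placeholders, not from this thread:

```python
import json
import urllib.request

# Default Drill web UI / REST port; adjust the host for your setup.
DRILL_URL = "http://localhost:8047/query.json"

def drill_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql):
    """POST a query to the Drill REST endpoint and return the result rows."""
    req = urllib.request.Request(
        DRILL_URL,
        data=json.dumps(drill_payload(sql)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["rows"]

# Hypothetical usage -- the path and field name are illustrative only:
# rows = run_query(
#     "SELECT `t3100.2.1.2` AS val FROM dfs.`/path_to/dump.json` LIMIT 100")
```

From there the rows can be fed to matplotlib or exported to CSV for any other open-source charting tool, which may be a quicker path to a Monday demo than getting a BI tool connected to a standalone Hadoop install.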
> If you are reading it and think to yourself "that wording is odd…", please consider it from the "I didn't want to throw away the user story" perspective and the "I wrote it before the design doc" perspective.
>
> Additionally, I understand that some of what I am suggesting may not be easy from a development perspective. I am just being upfront with my experience so we can look to determine what can be done; I am not looking for a silver bullet here, just looking for improvement. Some of it may be as simple as better documentation; other suggestions may be harder to implement. Either way, I thought a verbose user story might be useful to the community as a whole.
>
> John
>
> *User Story*
>
> As I have been working with Drill for data exploration, I came across multiple "things" that were just hard. Some data, especially JSON data, can be ugly, and ugly at scale is even worse!
>
> For this story, I am working with a JSON dump from MongoDB, and you would think it would be well structured; for the most part it is. There are some application-level mistakes that were made (I will go into that in a moment), but in general Drill handles this well. So with this data set, there are a few main challenges I am seeing:
>
> 1. When there is a field that has a float, and then a later record has the number 0 in it (which Drill takes as an INT). This is a known problem, and one that Drill has a solution for.
>
> 2. When a field is of one type (a map) and then a later record has a string in it. No easy solution here.
>
> 3. Select * where there is a JSON field with a . in the name. I won't go into details here, but I feel this factors into data exploration, because it changes the ability to "stay in Drill" to explore the data (https://issues.apache.org/jira/browse/DRILL-3922).
>
> 4.
> Error reporting challenges
>
> With the problem summary laid out, I wanted to walk through my process in working with this data, and where, if I were a user, Drill could have been much more helpful.
>
> Here is a description of the process I went through:
>
> 1. Copy the data into the filesystem.
> 2. Use Drill to "Select * from `path_to/dump.json` limit 1".
> 3. (I just want to see what it looks like!)
>
> Here I get this error:
>
> > select * from `path_to/dump.json` limit 1;
> Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableFloat8WriterImpl.
>
> File /data/dev/path_to/dump.json
> Record 1
> Line 1
> Column 9054
> Field entropy
> Fragment 0:0
>
> This isn't incredibly helpful from a user perspective. When I Googled around, I realized that the docs talk about "schema changes" and that one possible fix is the setting below. However, examples of the data that was trying to be displayed (with its implied type) might help users grok what is happening. At least in this case it showed me the field name!
>
> ALTER SYSTEM SET `store.json.read_numbers_as_double` = true;
>
> This is a great example of a case where the error message could be more helpful: we have a known use case (numbers are doubles, but someone stored 0, an INT) that fails, and the devs have added a setting to let a user get through it. Showing two record numbers (line numbers) with different types, the field values with their implied types, and perhaps a suggestion about using the setting to address the problem could make it more intuitive for the user to stay in Drill, and stay in the data. In this case, I looked at the head of the file, saw the issue, and was able to proceed.
>
> Also, as a corollary here, the user documentation does not show this error alongside the schema change problem.
> The docs would be a great place to state, "if you see an error that looks like X, this is what is happening and what you can do about it."
>
> *Side note on documentation*
>
> We should look to make the documentation role based. In this case, the documentation says to use "ALTER SYSTEM". I would argue, and I am guessing others would concur, that for this use case "ALTER SESSION" may be the better suggestion, as this is a specific alteration to address loading/querying a specific data set, and is likely done by a user of the system.
>
> If a user is doing self-serve data in an enterprise environment, they may not have the ability to use ALTER SYSTEM, will get an error, and thus may be confused about how to proceed. In addition, an ALTER SYSTEM by a user who doesn't understand what they are changing, yet has the rights to change it, may introduce future data problems they didn't expect. I like that the default is the more restrictive method, because it makes people be explicit about data, yet the documentation should also aim to be explicit about the ramifications of a system-wide change.
>
> *Back to the story*
>
> OK, so now I do ALTER SESSION SET on the read_numbers_as_double setting.
>
> I run the query again:
>
> > select * from `path_to/dump.json` limit 1;
> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when you are using a ValueWriter of type SingleMapWriter.
>
> File /data/dev/path_to/dump.json
> Record 4009
> Fragment 0:0
>
> Another error. But what does this one mean? Now that I have been living in the docs and on the Drill user list, and because it's similar to the schema change issue, I know that is what we are looking at here. Instead of double to int, we have one field that is a map most of the time and in some cases is a string.
>
> But this doesn't really help me as a user. To troubleshoot this, Drill doesn't offer any options.
> This file is 500 MB of dense and nested JSON data with 51k records. My solution? I took the record number, then went to my NFS-mounted clustered file system (thank goodness I had MapR here; I am not sure how I would have done this without POSIX tools).
>
> My command: $ head -4009 dump.json | tail -1
>
> That (I hoped) showed me the record in question. Note that the error from Drill didn't tell me which field was at fault, so I had to visually align things to work that out. However, I was able to spot the difference and work with the dev to understand why it happened. I removed those records, and things worked correctly.
>
> Could there have been a way to identify that within Drill? My solution was a Python script that read through the data and discarded the records that were not a map. On 500 MB that can work, but what about 500 GB? I guess a Spark job could clean the data… but could Drill be given some tools to help with this situation?
>
> For example, the first thing I asked was: what field is at issue? I had no way to see what was going on there. I had to use other tools to see the data so I could understand the problem. Then, once I understood the problem, I had to use Python to produce data that was queryable.
>
> Given the design document Hsuan Yi Chu just posted to the mailing list, at this point my post is just a user story to support it. To summarize the points I'd like to see included in the design document (from a user perspective, not prescribing "how or why"):
>
> *1. Error messages that are more verbose in explaining the problem*
>
> a. Filename, row number, column number or name.
> b. Option to output the "offending row".
> c. Showing the data that is causing the error WITH the type Drill inferred.
> d.
> If there are options to help work through dirty data, perhaps the error message could include them: "Data was a double, then Drill found this data: 0, which was an int, in file X at row 24 in column 'myfloatingdata'; consider using store.json.read_numbers_as_double to address the issue."
>
> *2. A way to determine how common this exception is*
>
> a. If I am playing with a messy data set and this error happens, does it happen on 1 record? 2? 5000? Knowing that information would:
>
> i. Help users understand how Drill is seeing that particular column.
> ii. Support decisions on excluding data rather than just removing it. What if the first 10 records were errors, and then you excluded the remaining 10 million because they were correct yet different from the first 10?
>
> b. Perhaps there could be a "stats" function that only works if it is the only selected item, or if the select consists entirely of such stats functions:
>
> i. Select type_stats(fieldname) from data
> ii. (that wouldn't error on different types)
>
> *3. An ability to say "return null on this field if there is an error, or if it is non-castable to type X", especially in a view, perhaps via a function*
>
> a. Allow users to avoid reparsing data outside Drill.
> b. Load it into a sane format (one-time loads/ETL to clean data).
> c. Not be a system- or session-wide exception.
>
> i. I think this is important because I may have a field where I want the numbers read as double, but another field in the same dataset where I don't. A SYSTEM or SESSION level variable takes away that granularity.
>
> d. Select field1, CASTORNULL(field2, int) as field2, CASTORNULL(field3, double) as field3 from ugly_data
> e. That's an example when it's in the select, but I could see a where clause too:
> f. Select field1, field2, field3 from ugly_data where ISTYPE(field2, int) and ISTYPE(field3, double)
>
> 4.
> *Updating of the documentation related to ALTER SESSION vs ALTER SYSTEM, with an eye to the majority use case of the documented feature*
>
> a. For data loads, the documentation uses ALTER SYSTEM, and that's problematic because:
>
> i. Not all users have the privileges to issue an ALTER SYSTEM. A new user trying to figure things out may not realize they can just ALTER SESSION after getting an ALTER SYSTEM error.
> ii. ALTER SYSTEM on data-loading items, especially ones that make Drill's data interpretation more permissive, can lead to unintended consequences later. An admin who is a good systems admin, helping a data user troubleshoot an error, may issue an ALTER SYSTEM without realizing it changes all future data imports.
>
> b. I found a few such cases, but I would suggest a thorough review of the use cases throughout the documentation; in areas where it really could be either command, add a small paragraph indicating the ramifications of each.
>
> *5. A philosophy within the Drill community to "stay in Drill" for data exploration*
>
> a. This is obviously not so much a development thing as a mindset. If someone says "I tried to do X and got an error" and the community's response is "look through your data and do Z to it so Drill can read it", then we should reconsider that scenario and try to provide an option within Drill to handle the edge case intuitively. This is difficult.
>
> b. There are cases even in the documentation where this happens: https://drill.apache.org/docs/json-data-model/ talks about arrays at the root level, and about reading some empty arrays. In those cases we have to leave Drill to fix the problem. That works on small data, but may not work on large or wide data. Consider the array-at-root-level limitation: what if some process outside the user's control produces 1000 100 MB JSON files and we want to read them?
> To fix it, we have to address those files. Lots of work there, either manual or automated.
>
> c. Once again, I know this isn't easy, but wherever possible we shouldn't answer questions about how to do something by saying "fix this outside of Drill so Drill can read your data".
>
> I hope this story helps support the design document presented. I am happy to participate in more discussion around these topics, as I have enjoyed digging into the internals of Drill.
>
> John Omernik

--
*Name: Ganesh Semalty*
*Location: Gurgaon, Haryana (India)*
*Email Id: [email protected]*
*Please consider the environment before printing this e-mail - SAVE TREE.*
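The type_stats and ISTYPE ideas from the wishlist above can be approximated outside Drill today for newline-delimited JSON. A minimal sketch of that kind of per-field type census and map-only filter; the function names and sample fields are illustrative, not an existing Drill or library API:

```python
import json
from collections import Counter, defaultdict

def type_stats(lines):
    """Count, per top-level field, how many records hold each JSON type.

    Accepts any iterable of JSON lines (e.g. an open file handle), so it
    streams and works on files far larger than memory.
    """
    stats = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for field, value in record.items():
            stats[field][type(value).__name__] += 1
    return stats

def keep_if_map(records, field):
    """Rough ISTYPE/CASTORNULL stand-in: keep records where field is a map."""
    return [r for r in records if isinstance(r.get(field), dict)]

# Usage sketch: type_stats(open("dump.json")) on the story's MongoDB dump
# would show something like {'dict': 51000, 'str': 3} for the bad field,
# answering "does it happen on 1 record or 5000?" before any cleanup.
```

This is the same record-discarding approach the story describes doing by hand, just packaged so the type census comes first and the filter is explicit.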
