I know I have a few threads going here on my trials and tribulations, but I wanted to wrap up a summary of what I am seeing and where I am with support. First of all, thanks to all who have been pointing me in the right direction on things; it's greatly appreciated.
Quick summary: I have Parquet files in directories by day, created on a Cloudera cluster running parquet-mr 1.5-cdh, written with snappy compression, dictionary encoding, and Parquet format version 1_0. The sizes of three days of data are in the table below. As also noted there, I query this data through a view, because string columns show up as binary in Drill; the view wraps each string column in CONVERT_FROM(field, 'UTF8') to get proper strings (sketch below). My goal was to CTAS this data into Drill-created Parquet files for optimal performance (rough sketch of the CTAS below as well).

Problem 1: The array-index-out-of-bounds error happens on a particular field. It does not happen in Impala on the exact same files. (See my thread "Reading and converting Parquet files intended for Impala".)

Problem 2: While experimenting, I found that if I set store.parquet.use_new_reader the CTAS would work. However, that setting added 155 seconds to a day that was already working, which is a lot of added time.

Problem 3: All CTAS variations (with or without the new reader) created much larger files than the MapReduce job did. My guess is the lack of dictionary encoding.

Problem 4: When I enabled dictionary encoding, the array-out-of-bounds issue still occurred on the days with the troubled data, but the other day did eventually work. The query took a LONG time, but produced files similar in size to the originals.

Problem 5: When I used the new reader and the dictionary setting together, I put my cluster into a nasty state; it appears one of the drillbits hit a SIGSEGV (see below). I have more information there, but what is interesting is that instead of failing the query, everything just hung. (Theory: I run my drillbits under supervision. Could it be that the timeout that would normally fail everything when a drillbit goes down was never reached, because my supervision restarted the drillbit as soon as it crashed, and that is what put the cluster into a bad state? Something to explore.)

These are only the "unresolved" problems; with Paul's help we already identified that my GC logging wasn't actually happening due to a bug in the Drill startup scripts, and I have a workaround in place for that. MapR Support also pointed out that, for the hung-cluster issue, my use of the two options together isn't actually supported per https://drill.apache.org/docs/configuration-options-introduction/. That leaves my array-out-of-bounds issue still open, because my only workaround for it is unsupported.

Question 1: Isn't dictionary encoding a standard part of Parquet? Why doesn't Drill support it? If support is planned, what is the timeline for a "supported" use of this feature (vs. "For internal use. Do not change.")?

Question 2: Similarly (though this one is less related to the standard Parquet project), is there a timeline/roadmap for planned (or not planned) support of the new reader?

Question 3: I am working to get more detail about the Parquet files using parquet-tools (example below); what other approaches might I take here?

Question 4: Am I missing anything crazy here?
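For context, here is roughly what the view looks like. This is only a sketch: the workspace, path, and most column names are made up (the real view covers 103 columns), but it shows the CONVERT_FROM(field, 'UTF8') pattern I am using. threat_name and parent_observation_event_id are real column names taken from the log further down.

    -- Sketch only: workspace, path, and most column names are hypothetical.
    -- Every string (BINARY) column in the real view gets the same CONVERT_FROM treatment.
    CREATE OR REPLACE VIEW dfs.tmp.`events_v` AS
    SELECT
      CONVERT_FROM(threat_name, 'UTF8')     AS threat_name,
      CONVERT_FROM(some_string_col, 'UTF8') AS some_string_col,
      parent_observation_event_id           -- INT64 columns pass through untouched
    FROM dfs.`/data/parquet/2016-05-13`;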
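And the CTAS itself, again with hypothetical workspace/table names. The two session options are the ones varied in the test matrix below; per the docs they are "for internal use", so this is what I tried, not a recommendation:

    -- Options from the test matrix below; not a supported combination per the docs.
    ALTER SESSION SET `store.parquet.use_new_reader` = true;
    ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;

    CREATE TABLE dfs.tmp.`events_20160513_drill` AS
    SELECT * FROM dfs.tmp.`events_v`;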
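For Question 3, this is the kind of inspection I have been doing with parquet-tools (file path hypothetical): "meta" shows the per-column-chunk encodings and compressed sizes, "schema" the declared types.

    parquet-tools meta   /data/parquet/2016-05-13/part-m-00000.parquet
    parquet-tools schema /data/parquet/2016-05-13/part-m-00000.parquet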
John

Input data:

                           2016-05-11              2016-05-12              2016-05-13
    Size of Input Parquet  129G                    131G                    124G
    Number of Rows         293145837               302062341               281465763

Test matrix (result cells are Query Status/Cluster Status/Size/Time):

    use_new_reader  enable_dictionary_encoding  2016-05-11              2016-05-12              2016-05-13
    FALSE           FALSE                       Array out of bounds/OK  Array out of bounds/OK  Success/OK/139G/440.5s
    TRUE            FALSE                       Success/OK/144G/664.8s  Success/OK/147G/589.2s  Success/OK/139G/595.0s
    FALSE           TRUE                        Array out of bounds/OK  Array out of bounds/OK  Success/OK/125G/869.9s
    TRUE            TRUE                        Hung/BadState           Did not test            Did not test

103 columns; all string columns are run through CONVERT_FROM(field, 'UTF8') in the view.

Timeline during the hung query: 14:34 query start; 14:50 web server unresponsive, sqlline hung, no errors in the logs, query profile gone.

Error in the .out file on the drillbit that crashed and was restarted:

Jun 1, 2016 2:34:16 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 233,314B for [threat_name] BINARY: 635,320 values, 364,413B raw, 232,979B comp, 5 pages, encodings: [BIT_PACKED, PLAIN_DICTIONARY, RLE], dic { 1,481 entries, 76,680B raw, 1,481B comp}
Jun 1, 2016 2:34:16 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 296B for [parent_observation_event_id] INT64: 635,320 values, 71B raw, 81B comp, 5 pages, encodings: [BIT_PACKED, PLAIN_DICTIONARY,
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fc3380520d0, pid=115847, tid=140474354837248
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /opt/mapr/mesos/tmp/slave/slaves/2c6e4a98-1407-487a-b767-2e180b2147d4-S4/frameworks/2c6e4a98-1407-487a-b767-2e180b2147d4-0000/executors/drillprod.28450c22-2805-11e6-8dd0-0242a8b995fe/runs/17e487cf-e1ed-4961-9a4b-2055a6644313/hs_err_pid115847.log
[thread 140474362205952 also had an error]
[thread 140474370627328 also had an error]
[thread 140474384312064 also had an error]
[thread 140474381154048 also had an error]
[thread 140474367469312 also had an error]
[thread 140474369574656 also had an error]
[thread 140474366416640 also had an error]
[thread 140474382206720 also had an error]
[thread 140474375890688 also had an error]
[thread 140474387470080 also had an error]
[thread 140474360100608 also had an error]
[thread 140474374838016 also had an error]
