Hello,

Some time has passed since I wrote the email below.

Meanwhile I have taken a closer look at the Apache Drill sources and I can 
answer a few questions myself:


* For each query, the HBase plugin performs a Scan over all columns
  requested by the query (see class
  org.apache.drill.exec.store.hbase.HBaseRecordReader).

  o The scan is split over the available regions. In my case, with small
    row sizes, a large number of regions (>100) in HBase therefore improves
    performance significantly (for me, a factor of 0.22).

  o The query is performed on the data loaded into memory during the scan.
    Hence the RAM should be large enough for the whole data set to fit
    into it.

  o Decompression of data stored in HBase is transparent to Drill; it is
    done within the HBase code every time. On my machine this had a small
    but not significant impact on execution time.

  o When HBase can cache the data in memory, the next query does not need
    disk I/O, so all subsequent queries on the same dataset are served
    from memory.
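
To make the region split more concrete, here is a minimal sketch in plain Java. It is not code from Drill or HBase: the class RegionScanSketch, its methods, and the use of an in-memory TreeMap in place of a table are all assumptions made for illustration. It only shows the general idea of turning sorted region start keys into half-open row-key ranges and scanning those ranges in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

// Hypothetical illustration only: these names are not part of Drill or HBase.
final class RegionScanSketch {

    // Half-open row-key range [start, stop); a null stop means "to the end of the table".
    static final class KeyRange {
        final String start;
        final String stop;

        KeyRange(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    // Turn the sorted region start keys into one key range per region.
    static List<KeyRange> toRanges(List<String> sortedStartKeys) {
        List<KeyRange> ranges = new ArrayList<>();
        for (int i = 0; i < sortedStartKeys.size(); i++) {
            String stop = (i + 1 < sortedStartKeys.size()) ? sortedStartKeys.get(i + 1) : null;
            ranges.add(new KeyRange(sortedStartKeys.get(i), stop));
        }
        return ranges;
    }

    // Return the rows of the (simulated) table that fall into one range.
    static NavigableMap<String, String> slice(NavigableMap<String, String> table, KeyRange r) {
        return (r.stop == null)
                ? table.tailMap(r.start, true)
                : table.subMap(r.start, true, r.stop, false);
    }

    // Scan all ranges in parallel and count the rows seen in total.
    // The table is only read here, never modified, while the scan runs.
    static long parallelCount(NavigableMap<String, String> table, List<KeyRange> ranges) {
        return ranges.parallelStream()
                .mapToLong(r -> slice(table, r).size())
                .sum();
    }
}
```

With more regions there are more ranges that can be scanned concurrently, which matches the observation above that a larger number of regions helped in my setup.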

But one question is still open: will a future version of Apache Drill
support queries on JSON data nested in HBase columns?

    select t.json.dateOfBirth
    from (select convert_from(p.filterable.filterable, 'JSON') json
          from hbase.im_t_person p) t
    where t.json.dateOfBirth = '2014-09-07';

Best Regards,
Martin Mois


From: MOIS Martin (MORPHO)
Sent: Wednesday, 13 May 2015 11:14
To: '[email protected]'
Subject: Columnar data model for JSON stored in HBase column?

Hello,

I am currently evaluating Apache Drill and have a few questions regarding the
implementation details of the HBase Storage Plugin.

The documentation explains that Drill optimizes storage and execution by using 
an in-memory data model that is hierarchical and columnar 
(http://drill.apache.org/docs/performance/). I understand the term "columnar" 
as it is described in the "Dremel" paper 
(http://research.google.com/pubs/pub36632.html).

In my use case I have an HBase table that stores JSON-formatted data in one
column:

Put put = new Put(Bytes.toBytes("my-rowkey..."));
put.add(Bytes.toBytes("filterable"), Bytes.toBytes("filterable"), 
Bytes.toBytes("{\"firstName\": \"Martin\", \"lastName\": \"Mois\", ...}"));

As far as I understand, I have to convert the data in the column to JSON
in order to query it:

0: jdbc:drill:> select t.json.dateOfBirth from (select 
convert_from(p.filterable.filterable, 'JSON') json from hbase.person p);
+------------+
|   EXPR$0   |
+------------+
| 2007-02-04 |
...

If I now append a condition, I get the following error message:

select t.json.dateOfBirth from (select convert_from(p.filterable.filterable, 
'JSON') json from hbase.im_t_person p) t where t.json.dateOfBirth = 
'2014-09-07';
Query failed: SYSTEM ERROR: Unexpected exception during fragment 
initialization: null


[a2c6cdd8-e5bb-45ab-bd2a-39e728492e58 on trafodion.local:31010]
Error: exception while executing query: Failure while executing query. 
(state=,code=0)

The same happens when I create a view for the query above and set filter 
conditions on this view.

With the above use case in mind, I have the following questions:

1. Is it possible to query the JSON data inside a column of an HBase
   table with conditions?

2. When I query an HBase table, does Apache Drill create a columnar data
   structure in memory for the JSON data contained in the HBase column? Is
   this in-memory structure re-used by similar queries on the view?

3. If the column family "person" has been created with compression
   enabled, when does decompression happen? Once, while the in-memory
   structure is built, or again and again for each query?

4. Assuming that another process updates a row in my HBase table while the
   query is running, how does Apache Drill sync the in-memory structure
   with updates made to the underlying HBase storage?

Please note that data conversion using the option 'store.format' as explained 
in the section "Data Type Conversion" 
(http://drill.apache.org/docs/data-type-conversion/) is not an option, as I 
want to use Apache Drill as some kind of OLAP system where I can query the data 
ad-hoc without any further data conversions.

Is there any documentation (other than the source code itself) that
explains these kinds of implementation details?

Best Regards,
Martin Mois