Hello,
some time has passed since I wrote the email below. Meanwhile I have taken a
closer look at the Apache Drill sources and can answer a few of the questions
myself:
* For each query, the HBase plugin performs a Scan over all columns requested
by the query (refer to class
org.apache.drill.exec.store.hbase.HBaseRecordReader); a minimal sketch of such
a per-region scan follows after this list.
o The scan is split across the available regions. With my small row sizes, a
large number of regions (>100) in HBase therefore improves performance
significantly (a factor of 0.22 for me).
o The query is executed on the data loaded into memory during the scan. Hence
the RAM should be large enough that the whole data set fits into it.
o Decompression of data stored in HBase is transparent to Drill, so it is
performed inside the HBase code every time. On my machine this had a small but
not significant impact on execution time.
o When the data can be cached in memory by HBase, the next query needs no disk
I/O, so all subsequent queries against the same data set are served from
memory.
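For illustration, here is a minimal sketch of such a per-region scan. The
HBase client calls are the real 0.98/1.0-era API, but the table name, row
keys, and the surrounding class are placeholders of mine, not Drill's actual
code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionScanSketch {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "im_t_person");
        // One scan per region split; the start/stop keys are placeholders.
        Scan scan = new Scan(Bytes.toBytes("regionStartKey"),
                Bytes.toBytes("regionStopKey"));
        // Restrict the scan to the columns the query actually requests.
        scan.addColumn(Bytes.toBytes("filterable"), Bytes.toBytes("filterable"));
        scan.setCaching(1024); // fetch rows in batches to reduce RPC round trips
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            byte[] json = row.getValue(Bytes.toBytes("filterable"),
                    Bytes.toBytes("filterable"));
            // ... the reader would feed this value into the in-memory
            // columnar structure ...
        }
        scanner.close();
        table.close();
    }
}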
But one question is still open: will a future version of Apache Drill support
queries on JSON data nested in HBase columns, such as the following?
select t.json.dateOfBirth
from (select convert_from(p.filterable.filterable, 'JSON') json
      from hbase.im_t_person p) t
where t.json.dateOfBirth = '2014-09-07';
Best Regards,
Martin Mois
From: MOIS Martin (MORPHO)
Sent: Wednesday, 13 May 2015 11:14
To: '[email protected]'
Subject: Columnar data model for JSON stored in HBase column?
Hello,
I am currently evaluating Apache Drill and have a few questions regarding the
implementation details of the HBase Storage Plugin.
The documentation explains that Drill optimizes storage and execution by using
an in-memory data model that is hierarchical and columnar
(http://drill.apache.org/docs/performance/). I understand the term "columnar"
as it is described in the "Dremel" paper
(http://research.google.com/pubs/pub36632.html).
In my use case I have an HBase table that stores JSON data in one column:
Put put = new Put(Bytes.toBytes("my-rowkey..."));
// column family "filterable", qualifier "filterable", value: the JSON document
put.add(Bytes.toBytes("filterable"), Bytes.toBytes("filterable"),
        Bytes.toBytes("{\"firstName\": \"Martin\", \"lastName\": \"Mois\", ...}"));
As far as I understand, I have to convert the data in the column to JSON in
order to query it:
0: jdbc:drill:> select t.json.dateOfBirth from (select
convert_from(p.filterable.filterable, 'JSON') json from hbase.person p) t;
+------------+
| EXPR$0 |
+------------+
| 2007-02-04 |
...
If I now append a condition, I get the following error message:
select t.json.dateOfBirth
from (select convert_from(p.filterable.filterable, 'JSON') json
      from hbase.im_t_person p) t
where t.json.dateOfBirth = '2014-09-07';
Query failed: SYSTEM ERROR: Unexpected exception during fragment
initialization: null
[a2c6cdd8-e5bb-45ab-bd2a-39e728492e58 on trafodion.local:31010]
Error: exception while executing query: Failure while executing query.
(state=,code=0)
The same happens when I create a view for the query above and set filter
conditions on this view (see the sketch below for how I create the view).
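For reference, I create the view via JDBC roughly as follows; CREATE VIEW is
standard Drill syntax, but the view name person_json and the workspace dfs.tmp
are just examples of mine:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateViewSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a Drillbit reachable via a local ZooKeeper.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE VIEW dfs.tmp.person_json AS "
                    + "SELECT CONVERT_FROM(p.filterable.filterable, 'JSON') json "
                    + "FROM hbase.im_t_person p");
        }
    }
}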
With the above use case in mind, I have the following questions:
1. Is it possible to query the JSON data inside a column of an HBase
table with conditions?
2. When I query an HBase table, does Apache Drill create a columnar data
structure in memory for the JSON data contained in the HBase column? Is this
in-memory structure re-used by similar queries on the view?
3. If the column family "person" has been created with compression
enabled, when does decompression happen? Once, while the in-memory structure
is built, or repeatedly for each query?
4. Assuming another process updates a row in my HBase table while a
query is running, how does Apache Drill keep the in-memory structure in sync
with the updates made to the underlying HBase storage?
Please note that converting the data via the option 'store.format', as
explained in the section "Data Type Conversion"
(http://drill.apache.org/docs/data-type-conversion/), is not an option for me,
as I want to use Apache Drill as a kind of OLAP system where I can query the
data ad hoc without any prior conversion.
Is there any documentation (apart from the source code itself) that explains
these kinds of implementation details?
Best Regards,
Martin Mois