Hi Andy and Ted!
Thanks for your reply. Basically, I'm currently trying a range scan and a regex
row filter on a very small table (~115K rows), just to get used to them.
Hadoop/HBase is running in the available Cloudera VM.
I have the following row key, as already discussed in other threads.
vehicle_id: up to 16 characters
device_id: up to 16 characters
timestamp: YYYYMMDDhhmmss
Pretty much one row every 5 minutes for a particular vehicle and device.
Now I want to get the rows for an entire day for a particular vehicle and
device.
The following range scan implementation:

Scan scan = new Scan();
String startKey =
    String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
    + "-"
    + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
    + "-"
    + "20110808000000";
String endKey =
    String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "57").replace(' ', '0') // Vehicle ID
    + "-"
    + String.format(HBASE_ROWKEY_DATASOURCEID_FORMAT, "1").replace(' ', '0') // Device ID
    + "-"
    + "20110808235959";
scan.setStartRow(Bytes.toBytes(startKey));
scan.setStopRow(Bytes.toBytes(endKey));
scan.addColumn(Bytes.toBytes("data_details"), Bytes.toBytes("temperature1_value"));
Takes < 1 sec.
Whereas the following regex-based row filter implementation:

List<Filter> filters = new ArrayList<Filter>();
RowFilter rf = new RowFilter(
    CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator(".{14}57\\-.{15}1\\-20110808.{6}"));
filters.add(rf);
QualifierFilter qf = new QualifierFilter(
    CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator("temperature1_value"));
filters.add(qf);
FilterList filterList1 = new FilterList(filters);
scan.setFilter(filterList1);
Takes around 6 sec on a very small table.
We aren't sure whether we need the regex row filter capabilities at all or
whether range scans are sufficient for our access pattern, but a better
understanding of how to optimize the regex filters would still be helpful.
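FWIW, the gap is expected: a RowFilter still makes the region server read every row in the scanned range and run the regex against each key, while start/stop rows let it seek straight to the first matching key and stop at the last. If the vehicle/device/day prefix is always known, the range scan alone should cover this access pattern. Here is a small plain-Java sketch of the key construction; the real HBASE_ROWKEY_DATASOURCEID_FORMAT constant isn't shown in the thread, so the "%16s" width is an assumption matching the replace(' ', '0') padding trick above:

```java
// Sketch of composing the three-part row key described in the thread.
public final class RowKeys {
    // Assumption: 16-character, right-aligned field, zero-padded via replace().
    private static final String ID_FORMAT = "%16s";

    /** Zero-pads an id to the fixed 16-character field width. */
    static String pad(String id) {
        return String.format(ID_FORMAT, id).replace(' ', '0');
    }

    /** First possible key of the given day for a vehicle/device pair. */
    static String dayStart(String vehicleId, String deviceId, String yyyymmdd) {
        return pad(vehicleId) + "-" + pad(deviceId) + "-" + yyyymmdd + "000000";
    }

    /** Last in-day key for the same pair. */
    static String dayEnd(String vehicleId, String deviceId, String yyyymmdd) {
        return pad(vehicleId) + "-" + pad(deviceId) + "-" + yyyymmdd + "235959";
    }

    public static void main(String[] args) {
        System.out.println(dayStart("57", "1", "20110808"));
        // → 0000000000000057-0000000000000001-20110808000000
    }
}
```

One caveat on the scan above: HBase's stop row is exclusive, so a row keyed at exactly ...235959 would not be returned; appending a trailing byte to the stop key is the usual workaround.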
Thanks!
Thomas
-----Original Message-----
From: Andrew Purtell [mailto:[email protected]]
Sent: Wednesday, 27 July 2011 08:25
To: [email protected]
Subject: Re: Something like Execution Plan as in the RDBMS world?
> Or is this a completely different way of thinking?
Yes.
There isn't an "execution plan" when using HBase, as that term is commonly
understood in the RDBMS world. The commands you issue against HBase using the
client API are executed in the order you issue them.
> Depending on the access pattern, we might be in a situation to use
>e.g. RegEx filters on rowkeys. I wonder if there is some kind of an
>execution plan when running an HBase query to better understand
Exposing filter statistics (hit/skip ratio etc.) and other per-query metrics
like number of store files read, how many keys examined, etc. is an interesting
idea perhaps along the lines of what you ask, but HBase does not have support
for that level of query performance introspection at the moment.
What people do is measure the application metrics of interest and try different
approaches to optimize them.
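That measure-and-compare loop can be as simple as a wall-clock timer around each scan variant; a minimal sketch (the Runnable body is a placeholder, in practice it would iterate the ResultScanner of the Scan being compared):

```java
// Minimal timing harness for comparing scan variants by hand,
// since HBase exposes no execution plan or per-query statistics.
public final class ScanTimer {
    /** Runs the action and returns elapsed wall-clock milliseconds. */
    static long timeMillis(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for iterating a ResultScanner.
        long elapsed = timeMillis(new Runnable() {
            public void run() {
                long sum = 0;
                for (int i = 0; i < 1000000; i++) sum += i;
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```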
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
Tom White)
>________________________________
>From: Steinmaurer Thomas <[email protected]>
>To: [email protected]
>Sent: Tuesday, July 26, 2011 11:10 PM
>Subject: Something like Execution Plan as in the RDBMS world?
>
>Hello,
>
>
>
>we have a three-part row key, taking into account that the first part
>is important for distribution/partitioning when the system grows.
>Depending on the access pattern, we might be in a situation to use e.g.
>RegEx filters on rowkeys. I wonder if there is some kind of an
>execution plan (as known in RDBMS) when running an HBase query to
>better understand how HBase processes the query and what execution
>path it takes to generate the result set.
>
>
>
>Or is this a completely different way of thinking?
>
>
>
>Thanks,
>
>Thomas