I have a table whose row keys are file names, with a single column family 
and the file creation time as the column qualifier. The value of each column is 
a serialized JSON representation of an object. My program goes through the 
records, performs an operation on the file, and modifies the JSON object to 
indicate that the file has been processed. On each run of the program I only 
want to grab up to a specified number of records that have yet to be processed. 
Previously I was grabbing all of the records and filtering at the client side. 
I am now attempting to move the filtering to the server side to reduce network 
traffic and hopefully streamline the process a bit.
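
For reference, the client-side selection described above can be sketched 
roughly as follows. This is a minimal illustration, not the actual program: 
the class and method names are hypothetical, and a plain Map of row key to the 
newest JSON value stands in for the results of a full-table scan.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a Map stands in for the newest cell value of each row.
class UnprocessedSelector {
    // Same substring the server-side filter looks for.
    static final String NEW_MARKER = "\"jobState\":\"new\"";

    // Returns up to `limit` row keys whose JSON value still contains the marker.
    static List<String> selectUnprocessed(Map<String, String> latestValueByRow, int limit) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : latestValueByRow.entrySet()) {
            if (selected.size() >= limit) {
                break; // stop once the requested number of records is reached
            }
            if (entry.getValue().contains(NEW_MARKER)) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("a.txt", "{\"jobState\":\"processed\"}");
        rows.put("b.txt", "{\"jobState\":\"new\"}");
        rows.put("c.txt", "{\"jobState\":\"new\"}");
        rows.put("d.txt", "{\"jobState\":\"new\"}");
        System.out.println(selectUnprocessed(rows, 2)); // prints [b.txt, c.txt]
    }
}
```

The drawback is the one already noted: every row crosses the network before 
the check runs, which is what moving the filter to the server is meant to 
avoid.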

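I am using a ValueFilter with a SubstringComparator to get the rows that meet 
my conditions:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;

Scan scan = new Scan();
// Match cells whose serialized JSON contains "jobState":"new"
String filterString = "\"jobState\":\"new\"";
scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
    new SubstringComparator(filterString)));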

When records are added they have a jobState of "new". Once they have been 
processed, the jobState is set to "processed" and the record in HBase is 
updated. If I scan from the HBase shell, or scan the full table from Java 
without a filter, I get the most recent version (maximum versions for this 
table is set to 1). When I scan using the filter I still get the original 
version of the row, and if I change the filter to use "processed" I get the 
updated version.

The end result of this is that I process the same files several times. The 
process repeats itself until HBase performs a flush or compaction, verified by 
flushing manually from HBase shell.
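
One possible explanation, offered as an assumption rather than something I 
have confirmed for this HBase version: if the server evaluates the filter on 
each stored cell version before it enforces the maximum-versions limit, the 
newest "processed" cell is filtered out and the older "new" cell, which still 
exists until a flush or compaction removes it, is returned instead. The toy 
model below is plain Java and uses no HBase API; all names in it are 
hypothetical. It only shows how the order of the two steps changes the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: none of these names are HBase APIs.
// Each list holds the stored versions of one column, newest first.
class FilterOrderModel {

    // Apply the filter first, then keep at most maxVersions of what passed.
    static List<String> filterThenLimit(List<String> newestFirst,
                                        Predicate<String> filter, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : newestFirst) {
            if (out.size() >= maxVersions) {
                break;
            }
            if (filter.test(value)) {
                out.add(value);
            }
        }
        return out;
    }

    // Keep only the newest maxVersions first, then apply the filter to those.
    static List<String> limitThenFilter(List<String> newestFirst,
                                        Predicate<String> filter, int maxVersions) {
        List<String> out = new ArrayList<>();
        int seen = 0;
        for (String value : newestFirst) {
            if (seen >= maxVersions) {
                break;
            }
            seen++;
            if (filter.test(value)) {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Before a flush: the updated cell and the original cell both exist.
        List<String> beforeFlush = Arrays.asList(
                "{\"jobState\":\"processed\"}", "{\"jobState\":\"new\"}");
        Predicate<String> isNew = v -> v.contains("\"jobState\":\"new\"");

        System.out.println(filterThenLimit(beforeFlush, isNew, 1)); // prints [{"jobState":"new"}]
        System.out.println(limitThenFilter(beforeFlush, isNew, 1)); // prints []
    }
}
```

In this model, once only the newest version remains (as after a flush with 
maximum versions set to 1), both orderings return nothing for the "new" 
filter, which would be consistent with what I see after flushing manually.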

I am currently using hbase-shaded-client v1.1.2 for my Java API and I have 
HBase v1.0.0-cdh5.4.8 running on my cluster under Cloudera Manager v5.4.8. I 
believe I found a similar issue posted in December 2013 
(http://mail-archives.apache.org/mod_mbox/hbase-user/201312.mbox/%3ccadoizqpxq64l75v3t3rgsks-82krymfmnynys-+2u0-f2a0...@mail.gmail.com%3E), 
but there didn't appear to be any resolution other than creating a custom 
filter.

Is there a newer version of HBase that doesn't have this issue? Is there a 
better way for me to do the filtering that I need to do?

If there is any further information I can provide please let me know. Any 
recommendations/help would be greatly appreciated.
