Hey guys,

In Accumulo 1.3.5, I wrote a "Top N" table structure, services, and a FilteringIterator that let us drop in several keys/values associated with a UUID (similar to a document id). Each UUID was further associated with an "index" (or type). The purpose of the TopN table was to keep the keys/values separated so that they could still be queried back with cell-level tagging, while a query for an index would return the last N UUIDs and let me query the keys/values for each of those UUIDs.
This problem seemed simple to solve in Accumulo 1.3.5: I was able to provide two FilteringIterators at compaction time to clean up the table, so that any keys/values kept around were guaranteed to belong to the keys still being managed by the versioning iterator.

Just to recap, I have the following table structure. I also hash the keys/values and run a filter before the versioning iterator to clean up any duplicates. There are two types of columns: index and key/value.

Index:
  R: index (or "type" of data)
  F: '\x00index'
  Q: empty
  V: uuid\x00hashOfKeys&Values

Key/Value:
  R: index (or "type" of data)
  F: uuid
  Q: key\x00value
  V: empty

The filtering iterator that makes sure any key/value entries are in the index manages a hashset internally. The index entries are purposely sorted before the key/value entries (via the '\x00index' column family) so that the filter can build up the hashset containing the uuids in the index. As the filter iterates into the key/value entries, it returns true only if the uuid of the key/value exists in that hashset.

This worked with older versions of Accumulo, but I'm now getting a weird artifact where init() is called on my Filter in the middle of iterating through an index row. More specifically, the Filter will iterate through the index entries of a specific "index" and build up the hashset, then init() is called, which wipes away the hashset of uuids, and then the filter goes on to iterate through the key/value entries with an empty set. Keep in mind, we are talking about maybe 400k entries, not enough to have more than one tablet.

Any idea why this may have worked on 1.3.5 but doesn't work any longer? I know it has got to be a huge no-no to store state inside of a filter, but I haven't had any issues until trying to update my code for the new version. If I'm doing this completely wrong, any ideas on how to make this better?

Thanks!

-- 
Corey Nolet
Senior Software Engineer
TexelTek, inc.
[Office] 301.880.7123
[Cell] 410-903-2110
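
P.S. In case the description is hard to follow, here is a stripped-down, standalone model of the filter logic. It is only a sketch, not the actual code, and it uses none of the Accumulo classes: IndexMembershipFilter, Entry, and the init()/accept() shapes below are made up for illustration. It just shows why losing the hashset in the middle of a row makes every later key/value entry get dropped.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the stateful filter (no Accumulo dependencies).
final class IndexMembershipFilter {

    // Hypothetical stand-in for one table entry: row, column family, qualifier, value.
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Column family of index entries; the leading NUL makes it sort before any uuid family.
    static final String INDEX_FAMILY = "\u0000index";

    // State built up while passing over the index entries.
    private final Set<String> indexedUuids = new HashSet<>();

    // Models the reset described above: everything collected so far is lost.
    void init() {
        indexedUuids.clear();
    }

    // Called once per entry, in sorted order. Returns true if the entry is kept.
    boolean accept(Entry e) {
        if (INDEX_FAMILY.equals(e.family)) {
            // Index value has the form uuid NUL hash; remember only the uuid part.
            int sep = e.value.indexOf('\u0000');
            indexedUuids.add(sep < 0 ? e.value : e.value.substring(0, sep));
            return true;
        }
        // Key/value entry: its column family is the uuid, kept only if the index listed it.
        return indexedUuids.contains(e.family);
    }

    public static void main(String[] args) {
        IndexMembershipFilter filter = new IndexMembershipFilter();
        Entry index = new Entry("typeA", INDEX_FAMILY, "", "uuid1\u0000hash1");
        Entry keyValue = new Entry("typeA", "uuid1", "color\u0000red", "");

        System.out.println(filter.accept(index));    // index entry is always kept
        System.out.println(filter.accept(keyValue)); // kept: uuid1 was seen in the index
        filter.init();                               // state wiped mid-row
        System.out.println(filter.accept(keyValue)); // dropped: the set is now empty
    }
}
```

The last call is the behavior I'm seeing: once the set is cleared between the index entries and the key/value entries, nothing after that point can pass the membership check.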
