Are you testing this at scan time or via actual minor/major compactions? At scan time, there is no guarantee that the same iterator instance survives the entire scan; the iterator stack may be torn down and reconstructed partway through, which loses any state held in the iterator. I don't think this is the case for compaction-time iterators, but I'm not positive.
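
To make the failure mode concrete, here is a minimal sketch of what I think is happening. It does not use the Accumulo API at all: `Entry`, `IndexedUuidFilter`, and `scan` are hypothetical stand-ins I made up for illustration, and the `reinitAfterIndex` flag simply simulates the iterator being re-initialized between the index entries and the key/value entries, as described in the message below.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StatefulFilterDemo {

    /** Simplified stand-in for an Accumulo key/value pair (hypothetical, not the real API). */
    static final class Entry {
        final String row, family, qualifier, value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    static final String INDEX_FAMILY = "\u0000index";

    /** Hypothetical filter that mirrors the design in the thread: state lives in a HashSet. */
    static final class IndexedUuidFilter {
        private Set<String> uuids = new HashSet<>();

        /** Re-initialization discards everything collected so far. */
        void init() {
            uuids = new HashSet<>();
        }

        boolean accept(Entry e) {
            if (e.family.equals(INDEX_FAMILY)) {
                // Index entry: the value is uuid, a NUL separator, then the hash.
                uuids.add(e.value.substring(0, e.value.indexOf('\u0000')));
                return true;
            }
            // Key/value entry: the column family is the uuid itself.
            return uuids.contains(e.family);
        }
    }

    /** Entries in sorted order: index entries first, then key/value entries. */
    static List<Entry> sampleRow() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("typeA", INDEX_FAMILY, "", "uuid-1\u0000hash1"));
        entries.add(new Entry("typeA", INDEX_FAMILY, "", "uuid-2\u0000hash2"));
        entries.add(new Entry("typeA", "uuid-1", "color\u0000red", ""));
        entries.add(new Entry("typeA", "uuid-2", "color\u0000blue", ""));
        entries.add(new Entry("typeA", "uuid-3", "color\u0000green", "")); // not in the index
        return entries;
    }

    /** Runs the filter over the row, optionally re-initializing it after the index entries. */
    static List<Entry> scan(boolean reinitAfterIndex) {
        IndexedUuidFilter filter = new IndexedUuidFilter();
        filter.init();
        List<Entry> kept = new ArrayList<>();
        boolean reinitialized = false;
        for (Entry e : sampleRow()) {
            boolean isIndex = e.family.equals(INDEX_FAMILY);
            if (reinitAfterIndex && !isIndex && !reinitialized) {
                filter.init(); // simulates the iterator being rebuilt mid-row
                reinitialized = true;
            }
            if (filter.accept(e)) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(scan(false).size()); // 4: two index entries plus uuid-1 and uuid-2
        System.out.println(scan(true).size());  // 2: only the index entries survive
    }
}
```

With the state intact, the orphaned `uuid-3` entry is dropped and everything else is kept. Once the filter is re-initialized after the index entries, the set is empty and every key/value entry is rejected, which matches the symptom you describe.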

On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <[email protected]> wrote:

> Hey Guys,
>
> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
> FilteringIterator that would allow us to drop in several keys/values
> associated with a UUID (similar to a document id). The UUID was further
> associated with an "index" (or type). The purpose of the TopN table was to
> keep the keys/values separated so that they could still be queried back
> with cell-level tagging, but when I performed a query for an index, I would
> get the last N UUIDs and further be able to query the keys/values for each
> of those UUIDs.
>
> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
> provide 2 FilteringIterators for compaction time to perform data cleanup of
> the table so that any keys/values kept around were guaranteed to be inside
> of the range of those keys being managed by the versioning iterator.
>
> Just to recap, I have the following table structure. I also hash the
> keys/values and run a filter before the versioning iterator to clean up any
> duplicates. There are two types of columns: index & key/value.
>
> Index:
>
> R: index (or "type" of data)
> F: '\x00index'
> Q: empty
> V: uuid\x00hashOfKeys&Values
>
> Key/Value:
>
> R: index (or "type" of data)
> F: uuid
> Q: key\x00value
> V: empty
>
> The filtering iterator that makes sure any key/value rows are in the index
> manages a hashset internally. The index rows are purposefully sorted
> before the key/value rows so that the filter can build up the hashset
> containing those uuids in the index. As the filter iterates into the
> key/value rows, it will return true only if the uuid of the key/value
> exists inside of the hashset containing the uuids in the index. This worked
> with older versions of Accumulo but I'm now getting a weird artifact where
> init() is called on my Filter in the middle of iterating through an index
> row.
>
> More specifically, the Filter will iterate through the index rows of a
> specific "index" and build up a hashset, then init() will be called which
> wipes away the hashset of uuids, then the filter goes on to iterate
> through the key/value rows. Keep in mind, we are talking about maybe 400k
> entries, not enough to have more than 1 tablet.
>
> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
> know it has got to be a huge no-no to be storing state inside of a filter,
> but I haven't had any issues until trying to update my code for the new
> version. If I'm doing this completely wrong, any ideas on how to make this
> better?
>
> Thanks!
>
> --
> Corey Nolet
> Senior Software Engineer
> TexelTek, inc.
> [Office] 301.880.7123
> [Cell] 410-903-2110
