John, thanks for the quick response! Crazy enough, I'm not doing much differently than the VersioningIterator, since it keeps track of the max number of versions that it should be returning, right? And that's a scan-time iterator (as well as majc/minc).
I am testing it as a scan-time iterator (set on the table, but using the Accumulo shell to scan). Perhaps I should force a couple of compactions and see what's left afterwards.

On Jan 3, 2013, at 5:53 PM, John Vines wrote:

> Are you testing this in scan time or via actual minor/major compactions? I
> know at scan time, there is no guarantee that the iterator remains intact
> through the entire scan, and it instead may be reconstructed, causing state
> to be lost. I don't think this is the case for compaction time iterators, but
> I'm not positive.
>
>
> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <[email protected]> wrote:
> Hey Guys,
>
> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
> FilteringIterator that would allow us to drop in several keys/values
> associated with a UUID (similar to a document id). The UUID was further
> associated with an "index" (or type). The purpose of the TopN table was to
> keep the keys/values separated so that they could still be queried back with
> cell-level tagging, but when I performed a query for an index, I would get
> the last N UUIDs and further be able to query the keys/values for each of
> those UUIDs.
>
> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
> provide 2 FilteringIterators for compaction time to perform data cleanup of
> the table so that any keys/values kept around were guaranteed to be inside of
> the range of those keys being managed by the versioning iterator.
>
> Just to recap, I have the following table structure. I also hash the
> keys/values and run a filter before the versioning iterator to clean up any
> duplicates. There are two types of columns: index & key/value.
>
>
> Index:
>
> R: index (or "type" of data)
> F: '\x00index'
> Q: empty
> V: uuid\x00hashOfKeys&Values
>
>
> Key/Value:
>
> R: index (or "type" of data)
> F: uuid
> Q: key\x00value
> V: empty
>
>
> The filtering iterator that makes sure any key/value rows are in the index
> manages a hashset internally. The index rows are purposefully sorted before
> the key/value rows so that the filter can build up the hashset containing
> those uuids in the index. As the filter iterates into the key/value rows, it
> will return true only if the uuid of the key/value exists inside of the
> hashset containing the uuids in the index. This worked with older versions of
> Accumulo, but I'm now getting a weird artifact where init() is called on my
> Filter in the middle of iterating through an index row.
>
> More specifically, the Filter will iterate through the index rows of a
> specific "index" and build up a hashset, then init() will be called, which
> wipes away the hashset of uuids, and then the filter goes on to iterate
> through the key/value rows. Keep in mind, we are talking about maybe 400k
> entries, not enough to have more than 1 tablet.
>
> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
> know it has got to be a huge no-no to be storing state inside of a filter, but
> I haven't had any issues until trying to update my code for the new version.
> If I'm doing this completely wrong, any ideas on how to make this better?
>
>
> Thanks!
>
>
> --
> Corey Nolet
> Senior Software Engineer
> TexelTek, inc.
> [Office] 301.880.7123
> [Cell] 410-903-2110
>
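
To make the behavior described in the quoted message concrete, here is a minimal plain-Java sketch of a filter that builds a set of uuids from the index entries and then uses it to accept or drop the key/value entries. It deliberately does not use the Accumulo API: `Entry`, `UuidFilter`, `sampleRow`, and `scan` are hypothetical stand-ins, and the `reinitAt` parameter only simulates an iterator being rebuilt mid-scan. It is an illustration under those assumptions, not the actual filter from the thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TopNFilterSketch {
    // Column family used by index entries, per the table layout in the thread.
    static final String INDEX_FAMILY = "\u0000index";

    // Simplified stand-in for an Accumulo key/value pair (hypothetical, not the real API).
    static final class Entry {
        final String row, family, qualifier, value;
        Entry(String row, String family, String qualifier, String value) {
            this.row = row; this.family = family; this.qualifier = qualifier; this.value = value;
        }
    }

    // Stateful filter: remembers uuids seen in index entries, then checks key/value entries.
    static final class UuidFilter {
        private final Set<String> indexedUuids = new HashSet<>();

        // Assumed to mirror what happens when the iterator is rebuilt: all state is lost.
        void init() { indexedUuids.clear(); }

        boolean accept(Entry e) {
            if (INDEX_FAMILY.equals(e.family)) {
                // Index value layout: uuid \x00 hashOfKeys&Values
                int sep = e.value.indexOf('\u0000');
                indexedUuids.add(sep < 0 ? e.value : e.value.substring(0, sep));
                return true;
            }
            // Key/value entries carry the uuid in the column family.
            return indexedUuids.contains(e.family);
        }
    }

    // Entries in sorted order: index entries first, then key/value entries.
    static List<Entry> sampleRow() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("typeA", INDEX_FAMILY, "", "uuid1\u0000h1"));
        entries.add(new Entry("typeA", INDEX_FAMILY, "", "uuid2\u0000h2"));
        entries.add(new Entry("typeA", "uuid1", "color\u0000red", ""));
        entries.add(new Entry("typeA", "uuid2", "size\u0000large", ""));
        entries.add(new Entry("typeA", "uuid3", "color\u0000blue", "")); // not in the index
        return entries;
    }

    // Runs the filter over the entries; reinitAt >= 0 simulates a rebuild before that position.
    static List<Entry> scan(List<Entry> entries, int reinitAt) {
        UuidFilter filter = new UuidFilter();
        filter.init();
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            if (i == reinitAt) {
                filter.init(); // state built from the index entries is wiped here
            }
            if (filter.accept(entries.get(i))) {
                out.add(entries.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> row = sampleRow();
        System.out.println("kept without re-init: " + scan(row, -1).size());
        System.out.println("kept with re-init after index entries: " + scan(row, 2).size());
    }
}
```

With the sample data, an uninterrupted pass keeps both index entries plus the key/value entries for uuid1 and uuid2. A re-init right after the index entries leaves the set empty, so every key/value entry is dropped. That matches the symptom described in the thread.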
