I mean, if a user asked for all terms that start with "term", is there a
way to get term1 and term2 just once while scanning, or would I get each
twice (once for each doc ID) and need to filter client side?

On Jan 26, 2014 1:33 AM, "Christopher" <[email protected]> wrote:
> If you use the Range constructor that takes two arguments, then yes,
> you'd get two entries. However, "count" would come before "doc_id",
> though, because the qualifier is part of the Key, and therefore, part
> of the sort order. There's also a Range constructor that allows you to
> specify whether you want the startKey and endKey to be inclusive or
> exclusive.
>
> I don't know of a specific document that outlines various strategies
> that I can link to. Perhaps I'll put one together, when I get some
> spare time, if nobody else does. I think most people do a lot of
> experimentation to figure out which strategies work best.
>
> I'm not entirely sure what you mean about "getting an iterator over
> all terms without duplicates". I'm assuming you don't mean duplicate
> versions of a single entry, which is handled by the
> VersioningIterator, which should be on new tables by default, and set
> to retain the recent 1 version, to support updates. With the scheme I
> suggested, your table would look something like the following,
> instead:
>
> RowID              ColumnFamily  Column Qualifier  Value
> <term1>=<doc_id1>  index         count             10
> <term1>=<doc_id2>  index         count             5
> <term2>=<doc_id3>  index         count             3
> <term3>=<doc_id1>  index         count             12
>
> With this scheme, you'd have only a single entry (a count) for each
> row, and a single row for each term/document combination, so you
> wouldn't have any duplicate counts for any given term/document. If
> that's what you mean by duplicates...
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <[email protected]> wrote:
> > Thanks for the reply Chris.
> > Say I had the following:
> >
> > RowID  ColumnFamily  Column Qualifier  Value
> > term   Occurrence~1  doc_id            1
> > term   Occurrence~1  count             10
> > term2  Occurrence~2  doc_id            2
> > term2  Occurrence~2  count             1
> >
> > Creating a scanner with start key new Key(new Text("term"), new
> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
> > Text("Occurrence~1")), I would get an iterator with two entries; the
> > first key would be doc_id and the second would be count. Is that
> > accurate?
> >
> > Regarding the other strategies, is there anywhere that some of these
> > are captured? Also, in your example, how would you go about getting
> > an iterator over all terms without duplicates? Again, thanks.
> >
> >
> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <[email protected]> wrote:
> >>
> >> It's not quite clear what you mean by "load", but I think you mean
> >> "iterate over"?
> >>
> >> A simplified explanation is this:
> >>
> >> When you scan an Accumulo table, you are streaming each entry
> >> (Key/Value pair), one at a time, through your client code. They are
> >> only held in memory if you do that yourself in your client code. A row
> >> in Accumulo is the set of entries that share a particular value of the
> >> Row portion of the Key. They are logically grouped, but are not
> >> grouped in memory unless you do that.
> >>
> >> One additional note is regarding your index schema of a row being a
> >> search term and columns being documents. You will likely have issues
> >> with this strategy, as the number of documents for high frequency
> >> terms grows, because tablets do not split in the middle of a row. With
> >> your schema, a row could get too large to manage on a single tablet
> >> server. A slight variation, like concatenating the search term with a
> >> document identifier in the row (term=doc1, term=doc2, ....) would
> >> allow the high frequency terms to split into multiple tablets if they
> >> get too large.
> >> There are better strategies, but that's just one simple
> >> option.
> >>
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[email protected]> wrote:
> >> > If I have a row whose key is a particular term, with a set of
> >> > columns that store the documents the term appears in, are the
> >> > contents of all of the columns also loaded when I load the row?
> >> > Is there a way to page over the columns such that only N columns
> >> > are in memory at any point? In this particular case the documents
> >> > are all in a particular column family (say docs) and the column
> >> > qualifier is created dynamically; for argument's sake we can say
> >> > they are UUIDs.
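
For the question at the top of the thread (seeing term1 and term2 only once when asking for everything that starts with "term"), here is a minimal sketch of the client-side filtering option. It is not Accumulo code and does not say whether the same thing can be done on the server side. It assumes the <term>=<doc_id> row scheme suggested above, models the table's sorted row order with a java.util.TreeMap, and assumes terms are ASCII and never contain '='. The names DistinctTermScan, termOf and distinctTerms are made up for illustration. Because all rows for one term sit next to each other in sorted order, the filter only has to remember the previous term rather than keep a set of everything seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DistinctTermScan {

    // Hypothetical helper: pull the term out of a row ID shaped like "<term>=<doc_id>".
    // Assumes the term itself never contains '='.
    static String termOf(String rowId) {
        int sep = rowId.indexOf('=');
        return sep < 0 ? rowId : rowId.substring(0, sep);
    }

    // Walk the sorted rows starting at the prefix and emit each term once.
    // The TreeMap stands in for a sorted table; for ASCII row IDs, String
    // ordering matches byte ordering.
    static List<String> distinctTerms(TreeMap<String, Integer> table, String prefix) {
        List<String> terms = new ArrayList<>();
        String last = null;
        for (Map.Entry<String, Integer> entry : table.tailMap(prefix, true).entrySet()) {
            String row = entry.getKey();
            if (!row.startsWith(prefix)) {
                break; // sorted order: once the prefix stops matching, we are past the range
            }
            String term = termOf(row);
            if (!term.equals(last)) { // rows of one term are adjacent, so one comparison is enough
                terms.add(term);
                last = term;
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> table = new TreeMap<>();
        table.put("term1=doc_id1", 10);
        table.put("term1=doc_id2", 5);
        table.put("term2=doc_id3", 3);
        table.put("term3=doc_id1", 12);
        table.put("zebra=doc_id9", 1);
        System.out.println(distinctTerms(table, "term")); // prints [term1, term2, term3]
    }
}
```

With the example table from the thread, term1 has two rows (doc_id1 and doc_id2) but is reported once. The entries are still streamed one at a time, so only the previous term is held in memory.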
