Thanks for the ideas. Filters are client side, right? I need to read the documentation more, as I don't know how to query just a column family. Would it be possible to get all terms that start with a particular value? I was thinking we would need a special prefix for this, but if it can be done without one, that would work well. (A sketch of this kind of scan follows the quoted thread below.)

On Jan 26, 2014 5:44 PM, "Christopher" <[email protected]> wrote:
> Ah, I see. Well, you could do that with a custom filter (iterator),
> but otherwise, no, not unless you had some other special per-term
> entry to query (rather than per-term/document pair). The design of
> this kind of table, though, seems focused on finding documents which
> contain the given terms, not listing all terms seen. If you need that
> additional feature and don't want to write a custom filter, you could
> achieve that by putting a special entry in its own row for each term,
> in addition to the entries per term/document pair, as in:
>
> RowID              ColumnFamily  Column Qualifier  Value
> <term1>            term          -                 -
> <term1>=<doc_id2>  index         count             5
>
> Then, you could list terms by querying the "term" column family
> without getting duplicates. And, you could get decent performance with
> this scan if you put the "term" column family and the "index" column
> family in separate locality groups. You could even make this entry an
> aggregated count for all documents (see documentation for combiners),
> in case you want corpus-wide term frequencies (for something like
> TF-IDF computations).
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[email protected]> wrote:
> > I mean if a user asked for all terms that started with "term", is
> > there a way to get term1 and term2 just once while scanning, or would
> > I get each twice, once for each docid, and need to filter client side?
> >
> > On Jan 26, 2014 1:33 AM, "Christopher" <[email protected]> wrote:
> >>
> >> If you use the Range constructor that takes two arguments, then yes,
> >> you'd get two entries. However, "count" would come before "doc_id",
> >> because the qualifier is part of the Key, and therefore part of the
> >> sort order. There's also a Range constructor that allows you to
> >> specify whether you want the startKey and endKey to be inclusive or
> >> exclusive.
> >>
> >> I don't know of a specific document that outlines various strategies
> >> that I can link to. Perhaps I'll put one together, when I get some
> >> spare time, if nobody else does. I think most people do a lot of
> >> experimentation to figure out which strategies work best.
> >>
> >> I'm not entirely sure what you mean about "getting an iterator over
> >> all terms without duplicates". I'm assuming you don't mean duplicate
> >> versions of a single entry, which is handled by the
> >> VersioningIterator, which should be on new tables by default, and set
> >> to retain the most recent version, to support updates. With the
> >> scheme I suggested, your table would look something like the
> >> following, instead:
> >>
> >> RowID              ColumnFamily  Column Qualifier  Value
> >> <term1>=<doc_id1>  index         count             10
> >> <term1>=<doc_id2>  index         count             5
> >> <term2>=<doc_id3>  index         count             3
> >> <term3>=<doc_id1>  index         count             12
> >>
> >> With this scheme, you'd have only a single entry (a count) for each
> >> row, and a single row for each term/document combination, so you
> >> wouldn't have any duplicate counts for any given term/document. If
> >> that's what you mean by duplicates...
> >>
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <[email protected]> wrote:
> >> > Thanks for the reply Chris. Say I had the following
> >> >
> >> > RowID  ColumnFamily  Column Qualifier  Value
> >> > term   Occurrence~1  doc_id            1
> >> > term   Occurrence~1  count             10
> >> > term2  Occurrence~2  doc_id            2
> >> > term2  Occurrence~2  count             1
> >> >
> >> > Creating a scanner with start key new Key(new Text("term"), new
> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
> >> > Text("Occurrence~1")), I would get an iterator with two entries: the
> >> > first key would be doc_id and the second would be count. Is that
> >> > accurate?
> >> >
> >> > In regards to the other strategies, is there anywhere that some of
> >> > these are captured? Also, in your example, how would you go about
> >> > getting an iterator over all terms without duplicates? Again, thanks.
> >> >
> >> >
> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <[email protected]> wrote:
> >> >>
> >> >> It's not quite clear what you mean by "load", but I think you mean
> >> >> "iterate over"?
> >> >>
> >> >> A simplified explanation is this:
> >> >>
> >> >> When you scan an Accumulo table, you are streaming each entry
> >> >> (Key/Value pair), one at a time, through your client code. They are
> >> >> only held in memory if you do that yourself in your client code. A
> >> >> row in Accumulo is the set of entries that share a particular value
> >> >> of the Row portion of the Key. They are logically grouped, but are
> >> >> not grouped in memory unless you do that.
> >> >>
> >> >> One additional note is regarding your index schema of a row being a
> >> >> search term and columns being documents. You will likely have
> >> >> issues with this strategy, as the number of documents for high
> >> >> frequency terms grows, because tablets do not split in the middle
> >> >> of a row. With your schema, a row could get too large to manage on
> >> >> a single tablet server. A slight variation, like concatenating the
> >> >> search term with a document identifier in the row (term=doc1,
> >> >> term=doc2, ...) would allow the high frequency terms to split into
> >> >> multiple tablets if they get too large. There are better
> >> >> strategies, but that's just one simple option.
> >> >>
> >> >>
> >> >> --
> >> >> Christopher L Tubbs II
> >> >> http://gravatar.com/ctubbsii
> >> >>
> >> >>
> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[email protected]> wrote:
> >> >> > If I have a row that has as its key a particular term and a set
> >> >> > of columns that store the documents that the term appears in,
> >> >> > and I load the row, are the contents of all of the columns also
> >> >> > loaded? Is there a way to page over the columns such that only N
> >> >> > columns are in memory at any point? In this particular case the
> >> >> > documents are all in a particular column family (say docs) and
> >> >> > the column qualifier is created dynamically; for argument's sake
> >> >> > we can say they are UUIDs.
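For reference, a minimal sketch (Java, 1.5-era client API) of the prefix scan discussed above, using the suggested per-term rows: restricting the scan to the "term" column family means each term comes back once, no matter how many <term>=<doc_id> index entries exist. The table name "termIndex", the connection details, and the use of Range.prefix are illustrative assumptions, not something from the thread.

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class ListTermsByPrefix {
      public static void main(String[] args) throws Exception {
        // Made-up connection details -- substitute your own instance/user.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
            .getConnector("user", new PasswordToken("secret"));

        Scanner scanner = conn.createScanner("termIndex", new Authorizations());
        // Limit the scan to rows starting with the requested value...
        scanner.setRange(Range.prefix("term"));
        // ...and to the special "term" column family, so each term appears
        // once rather than once per <term>=<doc_id> index entry.
        scanner.fetchColumnFamily(new Text("term"));

        for (Entry<Key, Value> entry : scanner) {
          System.out.println(entry.getKey().getRow()); // term1, term2, ...
        }
      }
    }

Because the per-term entries sit in their own rows, no special prefix in the data itself is needed; the column family restriction does the de-duplication server side.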
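A companion sketch for the table setup described above: separate locality groups for the "term" and "index" families, plus a SummingCombiner so the per-term entries can hold an aggregated, corpus-wide count. The table name, group names, iterator priority, and STRING encoding are assumptions for illustration.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;
    import org.apache.hadoop.io.Text;

    public class ConfigureTermIndex {
      // Assumes an existing Connector and a table named "termIndex".
      static void configure(Connector conn) throws Exception {
        // Separate locality groups so a scan over the "term" family does not
        // have to read past all of the "index" entries.
        Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();
        groups.put("terms", Collections.singleton(new Text("term")));
        groups.put("index", Collections.singleton(new Text("index")));
        conn.tableOperations().setLocalityGroups("termIndex", groups);

        // A SummingCombiner on the "term" family: each document's ingest adds
        // its own count, and Accumulo combines them into a corpus-wide total
        // (useful for TF-IDF style computations).
        IteratorSetting setting = new IteratorSetting(10, "termTotals", SummingCombiner.class);
        SummingCombiner.setColumns(setting,
            Collections.singletonList(new IteratorSetting.Column(new Text("term"))));
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        conn.tableOperations().attachIterator("termIndex", setting);
      }
    }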

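On the Range question earlier in the thread, a rough illustration of the two-argument constructor (start and end Key, both inclusive) versus the four-argument form that controls inclusivity. The end key is varied slightly from the original example (Occurrence~2 instead of Occurrence~1) just so the inclusive/exclusive difference has something to act on; the Scanner is assumed to come from a setup like the earlier sketch.

    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.hadoop.io.Text;

    public class RangeExamples {
      // Assumes a Scanner over the table from the original example
      // (row = term, column family = Occurrence~1).
      static void setRanges(Scanner scanner) {
        Key start = new Key(new Text("term"), new Text("Occurrence~1"));
        Key end = new Key(new Text("term"), new Text("Occurrence~2"));

        // Two-argument form: both keys are inclusive. Within the range,
        // entries come back in Key order, so the "count" qualifier sorts
        // before "doc_id".
        scanner.setRange(new Range(start, end));

        // Four-argument form: choose inclusivity explicitly, for example
        // include the start key but exclude the end key.
        scanner.setRange(new Range(start, true, end, false));
      }
    }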
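Lastly, a sketch of writing the <term>=<doc_id> layout suggested in the thread and of scanning it back, to show that entries stream through the client one Key/Value pair at a time rather than whole rows being held in memory. The table name, the empty qualifier on the per-term row, and the batch size are made-up details.

    import java.nio.charset.Charset;
    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    public class WriteAndScanIndex {
      private static final Charset UTF8 = Charset.forName("UTF-8");

      // Assumes an existing Connector and a table named "termIndex".
      static void index(Connector conn, String term, String docId, long count) throws Exception {
        BatchWriter writer = conn.createBatchWriter("termIndex", new BatchWriterConfig());

        // One row per term/document pair, so high-frequency terms can split
        // across tablets instead of growing a single giant row.
        Mutation pair = new Mutation(new Text(term + "=" + docId));
        pair.put(new Text("index"), new Text("count"), new Value(Long.toString(count).getBytes(UTF8)));
        writer.addMutation(pair);

        // Optional per-term row; with the SummingCombiner configured, these
        // values roll up into a corpus-wide total for the term.
        Mutation termEntry = new Mutation(new Text(term));
        termEntry.put(new Text("term"), new Text(""), new Value(Long.toString(count).getBytes(UTF8)));
        writer.addMutation(termEntry);

        writer.close();
      }

      // Entries stream through the client one Key/Value pair at a time; only
      // the current batch is buffered, not the whole row or table.
      static void scanAll(Connector conn) throws Exception {
        Scanner scanner = conn.createScanner("termIndex", new Authorizations());
        scanner.setRange(new Range());       // whole table
        scanner.setBatchSize(100);           // entries fetched per round trip
        for (Entry<Key, Value> entry : scanner) {
          System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
      }
    }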