After a little reading... if I use fetchColumnFamily, does that skip any rows that do not have the column family?

On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[email protected]> wrote:
> Thanks for the ideas. Filters are client side, right?
>
> I need to read the documentation more, as I don't know how to query just
> a column family. Would it be possible to get all terms that start with a
> particular value? I was thinking that we would need a special prefix for
> this, but if something could be done without needing it, that would work
> well.
> On Jan 26, 2014 5:44 PM, "Christopher" <[email protected]> wrote:
>
>> Ah, I see. Well, you could do that with a custom filter (iterator),
>> but otherwise, no, not unless you had some other special per-term
>> entry to query (rather than per-term/document pair). The design of
>> this kind of table, though, seems focused on finding documents which
>> contain the given terms, not listing all terms seen. If you
>> need that additional feature and don't want to write a custom filter,
>> you could achieve that by putting a special entry in its own row for
>> each term, in addition to the entries per-term/document pair, as in:
>>
>> RowID               ColumnFamily   Column Qualifier   Value
>> <term1>             term           -                  -
>> <term1>=<doc_id2>   index          count              5
>>
>> Then, you could list terms by querying the "term" column family
>> without getting duplicates. And, you could get decent performance with
>> this scan if you put the "term" column family and the "index" column
>> family in separate locality groups. You could even make this entry an
>> aggregated count for all documents (see documentation for combiners),
>> in case you want corpus-wide term frequencies (for something like
>> TF-IDF computations).
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[email protected]> wrote:
>> > I mean, if a user asked for all terms that started with "term", is
>> > there a way to get term1 and term2 just once while scanning, or would
>> > I get each twice, once for each docid, and need to filter client side?
>> >
>> > On Jan 26, 2014 1:33 AM, "Christopher" <[email protected]> wrote:
>> >>
>> >> If you use the Range constructor that takes two arguments, then yes,
>> >> you'd get two entries. However, "count" would come before "doc_id",
>> >> because the qualifier is part of the Key, and therefore, part
>> >> of the sort order. There's also a Range constructor that allows you to
>> >> specify whether you want the startKey and endKey to be inclusive or
>> >> exclusive.
>> >>
>> >> I don't know of a specific document that outlines various strategies
>> >> that I can link to. Perhaps I'll put one together, when I get some
>> >> spare time, if nobody else does. I think most people do a lot of
>> >> experimentation to figure out which strategies work best.
>> >>
>> >> I'm not entirely sure what you mean about "getting an iterator over
>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>> >> versions of a single entry, which is handled by the
>> >> VersioningIterator, which should be on new tables by default, and set
>> >> to retain only the most recent version, to support updates. With the
>> >> scheme I suggested, your table would look something like the
>> >> following, instead:
>> >>
>> >> RowID               ColumnFamily   Column Qualifier   Value
>> >> <term1>=<doc_id1>   index          count              10
>> >> <term1>=<doc_id2>   index          count              5
>> >> <term2>=<doc_id3>   index          count              3
>> >> <term3>=<doc_id1>   index          count              12
>> >>
>> >> With this scheme, you'd have only a single entry (a count) for each
>> >> row, and a single row for each term/document combination, so you
>> >> wouldn't have any duplicate counts for any given term/document. If
>> >> that's what you mean by duplicates...
>> >>
>> >>
>> >> --
>> >> Christopher L Tubbs II
>> >> http://gravatar.com/ctubbsii
>> >>
>> >>
>> >> On Sat, Jan 25, 2014 at 12:19 AM, Jamie Johnson <[email protected]> wrote:
>> >> > Thanks for the reply, Chris. Say I had the following:
>> >> >
>> >> > RowID   ColumnFamily   Column Qualifier   Value
>> >> > term    Occurrence~1   doc_id             1
>> >> > term    Occurrence~1   count              10
>> >> > term2   Occurrence~2   doc_id             2
>> >> > term2   Occurrence~2   count              1
>> >> >
>> >> > Creating a scanner with start key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")) and end key new Key(new Text("term"), new
>> >> > Text("Occurrence~1")), I would get an iterator with two entries:
>> >> > the first key would be doc_id and the second would be count. Is
>> >> > that accurate?
>> >> >
>> >> > In regard to the other strategies, is there anywhere that some of
>> >> > these are captured? Also, in your example, how would you go about
>> >> > getting an iterator over all terms without duplicates? Again, thanks.
>> >> >
>> >> >
>> >> > On Fri, Jan 24, 2014 at 11:34 PM, Christopher <[email protected]> wrote:
>> >> >>
>> >> >> It's not quite clear what you mean by "load", but I think you mean
>> >> >> "iterate over"?
>> >> >>
>> >> >> A simplified explanation is this:
>> >> >>
>> >> >> When you scan an Accumulo table, you are streaming each entry
>> >> >> (Key/Value pair), one at a time, through your client code. They are
>> >> >> only held in memory if you do that yourself in your client code. A
>> >> >> row in Accumulo is the set of entries that share a particular value
>> >> >> of the Row portion of the Key. They are logically grouped, but are
>> >> >> not grouped in memory unless you do that.
>> >> >>
>> >> >> One additional note is regarding your index schema of a row being a
>> >> >> search term and columns being documents. You will likely have issues
>> >> >> with this strategy, as the number of documents for high frequency
>> >> >> terms grows, because tablets do not split in the middle of a row.
>> >> >> With your schema, a row could get too large to manage on a single
>> >> >> tablet server. A slight variation, like concatenating the search
>> >> >> term with a document identifier in the row (term=doc1, term=doc2,
>> >> >> ....) would allow the high frequency terms to split into multiple
>> >> >> tablets if they get too large. There are better strategies, but
>> >> >> that's just one simple option.
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Christopher L Tubbs II
>> >> >> http://gravatar.com/ctubbsii
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 24, 2014 at 10:23 PM, Jamie Johnson <[email protected]> wrote:
>> >> >> > If I have a row whose key is a particular term and a set of
>> >> >> > columns that stores the documents that the term appears in, if
>> >> >> > I load the row, are the contents of all of the columns also
>> >> >> > loaded? Is there a way to page over the columns such that only
>> >> >> > N columns are in memory at any point? In this particular case
>> >> >> > the documents are all in a particular column family (say docs)
>> >> >> > and the column qualifier is created dynamically; for argument's
>> >> >> > sake we can say they are UUIDs.
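
To make the sort-order points in the thread concrete, here is a minimal sketch in plain Java that stands in for the table with a TreeMap. It is not the Accumulo client API: TermIndexSketch, EntryKey, index and rows are hypothetical names made up for illustration. It assumes entries sort by row, then column family, then column qualifier, and that each term gets one marker entry in a "term" column family next to its per-document "index" entries. With the real client, the equivalent scan would presumably set a prefix range on the Scanner and call fetchColumnFamily for "term", so that only entries in that family come back and rows without it contribute nothing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** In-memory stand-in for a sorted table; NOT the Accumulo client API. */
final class TermIndexSketch {

    /** Hypothetical stand-in for the row / family / qualifier parts of a Key. */
    static final class EntryKey {
        final String row, family, qualifier;
        EntryKey(String row, String family, String qualifier) {
            this.row = row; this.family = family; this.qualifier = qualifier;
        }
    }

    // Assumed ordering: row first, then column family, then column qualifier.
    static final Comparator<EntryKey> KEY_ORDER =
        Comparator.comparing((EntryKey k) -> k.row)
                  .thenComparing(k -> k.family)
                  .thenComparing(k -> k.qualifier);

    final TreeMap<EntryKey, String> table = new TreeMap<>(KEY_ORDER);

    void put(String row, String family, String qualifier, String value) {
        table.put(new EntryKey(row, family, qualifier), value);
    }

    /** Writes the per-term marker entry plus one term=doc entry with its count. */
    void index(String term, String docId, int count) {
        put(term, "term", "", "");  // same key each time, so it is stored once
        put(term + "=" + docId, "index", "count", Integer.toString(count));
    }

    /** Row IDs that start with prefix and have an entry in the given family. */
    List<String> rows(String prefix, String family) {
        List<String> result = new ArrayList<>();
        // Seek to the first possible key for the prefix, then walk in order.
        for (EntryKey k : table.tailMap(new EntryKey(prefix, "", ""), true).keySet()) {
            if (!k.row.startsWith(prefix)) break;      // walked past the prefix range
            if (k.family.equals(family)) result.add(k.row);
        }
        return result;
    }

    public static void main(String[] args) {
        TermIndexSketch t = new TermIndexSketch();
        t.index("term1", "doc_id1", 10);
        t.index("term1", "doc_id2", 5);
        t.index("term2", "doc_id3", 3);
        t.index("term3", "doc_id1", 12);
        System.out.println(t.rows("term", "term"));    // [term1, term2, term3]
        System.out.println(t.rows("term1=", "index")); // [term1=doc_id1, term1=doc_id2]
    }
}
```

Listing the "term" family returns each term once even though term1 appears in two documents, and the "term1=" prefix picks out just that term's documents. The same ordering is why "count" sorts ahead of "doc_id" within a single row and column family.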
