Re: Using CouchDB to represent the tokenized text of a book

Freddy Bowen Wed, 03 Nov 2010 07:24:54 -0700

CouchDB views have a feature called linked documents:
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents

You could store each token as a doc.  Then store the order of tokens in
a separate doc.  To change the order of the tokens you'd update the "order"
doc.

Consider this position doc:
{ _id:"Genesis-1:1", type:"position",
position:["token987","token123","token456"]}

And these token docs:
[
  { _id:"token123", type:"token", word:"the"},
  { _id:"token987", type:"token", word:"In"},
  { _id:"token456", type:"token", word:"beginning"}
]

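Then a view like this (the emitted value has to be an object with an _id
for the linked-document lookup to happen):
function(doc) {
  if (doc.type=="position") {
    var token=doc.position;
    for (var i=0; i<token.length; i++) {
      // the {_id: ...} value tells CouchDB which doc to return for this row
      emit([doc._id, i], {_id: token[i]});
    }
  }
}

Queried with include_docs=true, it returns this (each row's "value" and
the docs' _rev fields left out for brevity):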
{"total_rows":3,"offset":0,"rows":[
{"id":"Genesis-1:1","key":["Genesis-1:1",0],"doc":{"_id":"token987",
"type":"token","word":"In"}},
{"id":"Genesis-1:1","key":["Genesis-1:1",1],"doc":{"_id":"token123",
"type":"token","word":"the"}},
{"id":"Genesis-1:1","key":["Genesis-1:1",2],"doc":{"_id":"token456",
"type":"token","word":"beginning"}}
]}
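
To make the join concrete, here is a rough sketch that imitates the view
and the include_docs lookup in plain JavaScript, with no CouchDB running.
The helper names queryLinked and verseText are made up for illustration,
and the sort is a simplification of CouchDB's real key collation.

```javascript
// Sketch only: an in-memory stand-in for the database, keyed by _id.
const docs = {
  "Genesis-1:1": { _id: "Genesis-1:1", type: "position",
                   position: ["token987", "token123", "token456"] },
  "token123": { _id: "token123", type: "token", word: "the" },
  "token987": { _id: "token987", type: "token", word: "In" },
  "token456": { _id: "token456", type: "token", word: "beginning" }
};

// Same logic as the map function, with emit passed in explicitly.
function map(doc, emit) {
  if (doc.type == "position") {
    for (var i = 0; i < doc.position.length; i++) {
      emit([doc._id, i], { _id: doc.position[i] });
    }
  }
}

// Hypothetical helper: run the map over every doc, then resolve each
// value._id to a doc, which is what include_docs=true does for linked docs.
function queryLinked(db) {
  const rows = [];
  Object.values(db).forEach(function (doc) {
    map(doc, function (key, value) {
      rows.push({ id: doc._id, key: key, value: value, doc: db[value._id] });
    });
  });
  // Order by [id, index]; enough for this sketch, not full CouchDB collation.
  rows.sort(function (a, b) {
    if (a.key[0] !== b.key[0]) return a.key[0] < b.key[0] ? -1 : 1;
    return a.key[1] - b.key[1];
  });
  return rows;
}

// Hypothetical helper: join the words of one position doc in order.
function verseText(db, verseId) {
  return queryLinked(db)
    .filter(function (row) { return row.key[0] === verseId; })
    .map(function (row) { return row.doc.word; })
    .join(" ");
}

console.log(verseText(docs, "Genesis-1:1")); // In the beginning

// Reordering touches only the position doc; the token docs stay as they are.
docs["Genesis-1:1"].position = ["token123", "token456", "token987"];
console.log(verseText(docs, "Genesis-1:1")); // the beginning In
```

Against a real server the same rows would come from querying the view with
include_docs=true and a startkey/endkey range on the position doc's _id.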

Maybe you can make an approach like this work for you?

FB


On Wed, Nov 3, 2010 at 9:16 AM, Dirkjan Ochtman <[email protected]> wrote:

> On Wed, Nov 3, 2010 at 14:04, Weston Ruter <[email protected]> wrote:
> > That is a good idea, but the problem with Bible translations in
> > particular is the issue of overlapping hierarchies: like chapter and
> > verse don't always fall along same divisions as section and paragraph.
> > So the data model I've been moving toward is standoff markup, where
> > there is a set of tokens (words, punctuation) for the entire book and
> > then a set of structures (paragraphs, verses, etc) that refer to the
> > start token and end token, so when getting a structure it needs to
> > retrieve all tokens from start to end. The use of standoff markup and
> > overlapping hierarchies makes your idea of using sorting buckets not
> > feasible, I don't think. Thanks for the idea though!
>
> Not sure I agree. My "buckets" are somewhat arbitrary and don't
> actually have to be mapped to any real structure. The trick is just
> that by prefixing with a bucket index, you don't have to update all
> tokens anymore, you only have to update tokens inside the bucket (or
> the next bucket if you happened to be moving a token to the next
> bucket). Your standoff thing (I'm not really used to that term, so no
> clue if I'm using it correctly) would still work, only you now
> reference tokens by bucket and token index, not just token index.
>
> Cheers,
>
> Dirkjan
>
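
As a rough illustration of the bucket idea quoted above, the sketch below
addresses each token by [bucket, index] rather than by one global index.
The pos field and the helpers insertToken and inOrder are assumptions made
up for this example, not anything from the thread or from CouchDB.

```javascript
// Sketch only: tokens carry a [bucket, index] position.
const tokens = [
  { word: "In",        pos: [0, 0] },
  { word: "the",       pos: [0, 1] },
  { word: "beginning", pos: [0, 2] },
  { word: "God",       pos: [1, 0] },
  { word: "created",   pos: [1, 1] }
];

// Hypothetical helper: insert a word at [bucket, index] and return how many
// existing tokens had to be renumbered. Only the same bucket is touched.
function insertToken(list, word, bucket, index) {
  let updated = 0;
  list.forEach(function (t) {
    if (t.pos[0] === bucket && t.pos[1] >= index) {
      t.pos[1] += 1;
      updated += 1;
    }
  });
  list.push({ word: word, pos: [bucket, index] });
  return updated;
}

// Hypothetical helper: read the words back, sorted by bucket, then index.
function inOrder(list) {
  return list.slice().sort(function (a, b) {
    return a.pos[0] - b.pos[0] || a.pos[1] - b.pos[1];
  }).map(function (t) { return t.word; });
}

const touched = insertToken(tokens, "very", 0, 2);
console.log(touched);                   // 1
console.log(inOrder(tokens).join(" ")); // In the very beginning God created
```

With a single global index the same insert would have renumbered three
tokens ("beginning", "God", "created"); here the tokens in bucket 1 are
left alone.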
