Re: Large document design question (updated)

Nicolas Peeters Fri, 25 Feb 2011 08:41:39 -0800

Thanks.

The update scenarios are as follows:
- Either we "crawl" a whole website and add a new document for each page 
(Alt. 1)
- Or we crawl the whole site and add one document that represents the "crawl" 
(with some metadata), whose references to the page documents are updated for 
every new page document that is added. This would be very similar to the 
traditional http://wiki.apache.org/couchdb/EntityRelationship model... 
(probably not ideal)


If I think about it, I still think that Alternative 1 might be best. Any 
thoughts?
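
To make Alt. 1 concrete, here is a minimal sketch of what a "page" document and 
the "assembling" view could look like. The "type" field, the function name and 
the emitted value are assumptions for illustration only; emit() is normally 
provided by CouchDB and is stubbed here so the snippet runs on its own in Node.

```javascript
// Hypothetical "page" document for Alt. 1. Field names follow the thread;
// the "type" marker is an assumption, not something CouchDB requires.
const pageDoc = {
  _id: "http://www.website.com/contact/",
  type: "page",
  parent_url: "http://www.website.com/",
  title: "Contact Page",
  text: "A lot of text comes here..."
};

// Local stand-in for CouchDB's emit(): collects rows so the map function
// can be exercised outside the view server.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function as it might appear in a design document. It keys every page
// by [parent_url, page id] and emits only the title, not the full text.
function mapPagesBySite(doc) {
  if (doc.type === "page" && doc.parent_url) {
    emit([doc.parent_url, doc._id], { title: doc.title });
  }
}

mapPagesBySite(pageDoc);
// A non-page document is ignored by the map function.
mapPagesBySite({ _id: "http://www.website.com/", type: "website" });

console.log(JSON.stringify(rows));
```

With keys shaped like this, all pages of one site could be fetched with a key 
range such as startkey=["http://www.website.com/"] and 
endkey=["http://www.website.com/",{}], which is the usual way to "assemble" the 
children of a parent in a view.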

Also, storing the text as attachments seems a bit cumbersome... (If we use 
the model described in Alt. 1, there's not so much text after all in the "page" 
doc.)
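
For comparison, this is roughly what the attachment variant would involve. It is 
only a sketch: the helper name, the "type" field and the attachment name 
"text.txt" are made up for illustration. The inline "_attachments" form with 
base64 "data" is the part that comes from CouchDB itself.

```javascript
// Hypothetical helper: builds a "page" document whose text is stored as an
// inline attachment instead of a regular JSON field.
function buildPageDocWithAttachment(url, parentUrl, title, text) {
  return {
    _id: url,
    type: "page",              // assumed type marker
    parent_url: parentUrl,
    title: title,
    _attachments: {
      "text.txt": {            // attachment name is arbitrary
        content_type: "text/plain",
        // Inline attachments carry their content base64-encoded.
        data: Buffer.from(text, "utf8").toString("base64")
      }
    }
  };
}

const attachmentDoc = buildPageDocWithAttachment(
  "http://www.website.com/contact/",
  "http://www.website.com/",
  "Contact Page",
  "A lot of text comes here..."
);

console.log(JSON.stringify(attachmentDoc, null, 2));
```

As far as I understand, a view's map function would then see only an attachment 
stub for such a document rather than the text itself, which is what keeps the 
JSON passed to the view server small.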

On Thursday, February 24, 2011 at 4:51 PM, Zachary Zolton wrote:
> Nicolas,
> 
> Storing that much text in your documents will add a lot of overhead to
> your view functions—or any of the other JavaScript design doc
> functions you may want to use.
> 
> Therefore, if you don't need to access the raw text of each page to
> create your views, you may want to try storing them as attachments to
> your web site document. This will result in smaller JSON strings
> being marshalled over to the JavaScript view server and parsed
> there.
> 
> As for what the "best practice" is for modelling a one-to-many
> relationship, it totally depends on what kinds of update
> scenarios and methods of access your application requires.
> 
> 
> Cheers,
> 
> Zach
> 
> On Thu, Feb 24, 2011 at 2:27 AM, Nicolas Peeters wrote:
> > Thanks for your reply. Actually it's either Alt 1. or Alt 2., I guess. I
> > don't see why I should be combining. I'm really wondering what the best
> > practice is (I'm leaning toward Alt. 1, by the way). It seems like Alt 2. is
> > more like hacking the document model to make it look and behave like a
> > relational model!
> > 
> > Hoping to get some more advice from the experts out there!
> > 
> > Cheers,
> > 
> > Nicolas
> > 
> > On Wed, Feb 23, 2011 at 7:14 PM, Javier Julio wrote:
> > 
> > > Nicolas,
> > > 
> > > Great question. I think what you want here, from what I've learned from
> > > reading the guide and wiki, is a combination of Alternative 1 and 2. While
> > > it is suggested to do what you have done, there are limits, and since you
> > > are hitting those limits, that's when the alternative approaches come in
> > > and are usually best, I would think. You might not know whether at a later
> > > point what you're storing will get too big, or whether multiple users can
> > > work with it (think comments for a blog post). So Alternative 1 and 2 would
> > > be great to start with.
> > > 
> > > So basically you can break it down into 2 different document "types". One
> > > document with a type of, say, "website" that just contains the general site
> > > info, and then a second document with a type of "page" that has the page
> > > content as well as the website id, whether that's a URL or you just use
> > > the generated ids CouchDB creates.
> > > 
> > > Interesting considering storing the pages as attachments (Alternative 3).
> > > No idea if this is beneficial to you in any way so will let others comment
> > > on that.
> > > 
> > > Hope this helps.
> > > 
> > > Ciao!
> > > Javi
> > > 
> > > On Feb 23, 2011, at 12:25 PM, Nicolas Peeters wrote:
> > > 
> > > > Hi CouchDB community,
> > > > 
> > > > //Sorry, the previous email was sent too quickly...
> > > > 
> > > > I basically have a design "best practices" question. We are using CouchDB
> > > > to store crawled web content. The document is pretty self-explanatory: the
> > > > id is the URL and there's a "pages" array that contains all the text from
> > > > the web pages.
> > > > Potentially, this document can grow very quickly to a large size (> 20 MB).
> > > > It seems that we run into issues
> > > > (https://issues.apache.org/jira/browse/COUCHDB-893) when creating a view
> > > > with objects that are larger than 9 MB (in our case).
> > > > 
> > > > {
> > > >   "_id": "http://www.website.com/",
> > > >   "_rev": "1-33c75795126ff81b0125156b88593cc0",
> > > >   "metadata1": "blabla",
> > > >   "metadata2": "blabla",
> > > >   "pages": [
> > > >     {
> > > >       "description": "",
> > > >       "text": "A lot of text comes here....:",
> > > >       "url": "http://www.website.com/",
> > > >       "title": "The title of this website /",
> > > >       "keywords": ""
> > > >     },
> > > >     {
> > > >       "description": "",
> > > >       "text": "A lot of text comes here....:",
> > > >       "url": "http://www.website.com/contact/",
> > > >       "title": "Contact Page",
> > > >       "keywords": ""
> > > >     }
> > > > 
> > > >     // MANY other pages here
> > > >   ],
> > > >   "crawlDate": "2011-02-10T12:30:07.416+01:00"
> > > > }
> > > > 
> > > > This document structure is not working very well for us. We are thinking
> > > > about the following alternatives. We would really appreciate it if you
> > > > could give expert modelling advice.
> > > > 
> > > > - Alternative 1) Create a "page" document where we would have 1 page
> > > > (description, text, parent_url (which would be the _id of the original
> > > > doc), url, ...) per document. The rest of the data contained in the
> > > > original doc would be duplicated/denormalized. We could then create a view
> > > > that "assembles" all the pages for a given parent_url (which in essence
> > > > would have the same effect as the original implementation).
> > > > 
> > > > - Alternative 2) Model in One to Many fashion as described here:
> > > > http://wiki.apache.org/couchdb/EntityRelationship
> > > > 
> > > > - Alternative 3) Keep the design as is, but store the "page" content as
> > > > an attachment when we store the object. (Subquestion: would that influence
> > > > the size of the doc?)
> > > > 
> > > > - Alternative 4) Keep the design as is and change some settings in the
> > > > configuration that I don't know about.
> > > > 
> > > > Subquestion: any particular design reason why this issue
> > > > (https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Any good
> > > > workaround (apart from recompilation!)? Any ETC when this will be fixed
> > > > in a release version?
> > > > 
> > > > Thank you for your help and advice.
> > > > 
> > > > Nicolas
> > > > 
> > > > PS: The reason that we need a view is that we are using a Document Update
> > > > handler <http://wiki.apache.org/couchdb/Document_Update_Handlers> to do
> > > > incremental updates, which requires some kind of view. The incremental
> > > > updates work fine for normal-sized documents.
>
