Thanks. The update scenarios are as follows:
- Either we crawl a whole website and add a new document for each page (Alt. 1),
- or we crawl the whole site and add one document that represents the "crawl" (with some metadata and references to the page documents), which has to be updated for every new page document that is added. This would be very similar to the traditional http://wiki.apache.org/couchdb/EntityRelationship model (probably not ideal).
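
To make the two scenarios concrete, here is a minimal sketch of the documents each one might write. Apart from `_id`, every field name (`type`, `parent_url`, `page_ids`), the `crawl:` id prefix, and both helper names are assumptions made up for illustration; CouchDB does not prescribe any of them.

```javascript
// Hypothetical helpers; only _id is a real CouchDB field here.

// Scenario A (Alt. 1): one self-contained document per crawled page.
function pageDocsForCrawl(siteUrl, crawlDate, pages) {
  return pages.map(function (page) {
    return {
      _id: page.url,         // the page URL doubles as the document id
      type: "page",
      parent_url: siteUrl,   // points back at the crawled site
      crawlDate: crawlDate,  // site-level data duplicated on every page
      url: page.url,
      title: page.title,
      text: page.text
    };
  });
}

// Scenario B: the same page documents plus one "crawl" document that lists
// them and must be rewritten whenever a page document is added.
function crawlDocForCrawl(siteUrl, crawlDate, pages) {
  return {
    _id: "crawl:" + siteUrl, // prefixed so it cannot clash with a page id
    type: "crawl",
    crawlDate: crawlDate,
    page_ids: pages.map(function (page) { return page.url; })
  };
}

var crawledPages = [
  { url: "http://www.website.com/", title: "The title of this website /", text: "A lot of text comes here" },
  { url: "http://www.website.com/contact/", title: "Contact Page", text: "A lot of text comes here" }
];
var crawlDate = "2011-02-10T12:30:07.416+01:00";
var pageDocs = pageDocsForCrawl("http://www.website.com/", crawlDate, crawledPages);
var crawlDoc = crawlDocForCrawl("http://www.website.com/", crawlDate, crawledPages);
console.log(pageDocs.length, crawlDoc.page_ids.length);
```

In scenario A, adding a page touches only the new document. In scenario B it also means a second write to the crawl document, which is the part that starts to look like maintaining a relational join by hand.
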
If I think about it, I still think that Alternative 1 might be best. Any thoughts?

Also, storing the text as attachments seems a bit cumbersome... (If we use the model described in Alt. 1, there's not so much text after all in the "page" doc.)

On Thursday, February 24, 2011 at 4:51 PM, Zachary Zolton wrote:

> Nicolas,
>
> Storing that much text in your documents will add a lot of overhead to
> your view functions, or any of the other JavaScript design doc
> functions you may want to use.
>
> Therefore, if you don't need to access the raw text of each page to
> create your views, you may want to try storing them as attachments to
> your web site document. This will result in smaller JSON strings
> getting marshalled over to the JavaScript view server, needing to be
> parsed.
>
> As for answering what the "best practice" is for how to model
> one-to-many relationships, it totally depends on what kind of update
> scenarios and methods of access your application requires.
>
> Cheers,
>
> Zach
>
> On Thu, Feb 24, 2011 at 2:27 AM, Nicolas Peeters wrote:
> > Thanks for your reply. Actually it's either Alt. 1 or Alt. 2, I guess. I
> > don't see why I should be combining them. I'm really wondering what the
> > best practice is (I'm leaning toward Alt. 1, by the way). It seems like
> > Alt. 2 is more like hacking the document model to make it look and behave
> > like a relational model!
> >
> > Hoping to get some more advice from the experts out there!
> >
> > Cheers,
> >
> > Nicolas
> >
> > On Wed, Feb 23, 2011 at 7:14 PM, Javier Julio wrote:
> >
> > > Nicolas,
> > >
> > > Great question. I think what you want here, from what I've learned from
> > > reading the guide and wiki, is a combination of Alternative 1 and 2.
> > > While it is suggested to do what you have done, there are limits, and
> > > since you are hitting those limits, that's when the alternative
> > > approaches come in and are usually best, I would think.
> > > You might not know whether, at a later point, what you're storing will
> > > get too big or whether multiple users can work with it (think comments
> > > for a blog post). So Alternative 1 and 2 would be great to start with.
> > >
> > > So basically you can break it down into 2 different document "types":
> > > one document with a type of, say, "website" that just contains the
> > > general site info, and then a second document with a type of "page" that
> > > has the page content as well as the website id, whether that's a URL or
> > > you just use the generated ids CouchDB creates.
> > >
> > > Interesting to consider storing the pages as attachments (Alternative 3).
> > > No idea if this is beneficial to you in any way, so I will let others
> > > comment on that.
> > >
> > > Hope this helps.
> > >
> > > Ciao!
> > > Javi
> > >
> > > On Feb 23, 2011, at 12:25 PM, Nicolas Peeters wrote:
> > >
> > > > Hi CouchDB community,
> > > >
> > > > //Sorry, the previous email was sent too quickly...
> > > >
> > > > I have basically a design "best practices" question. We are using
> > > > CouchDB to store crawled web content. The document is pretty
> > > > self-explanatory: the id is the URL and there's a "pages" array that
> > > > contains all the text from the web pages.
> > > > Potentially, this document can grow very quickly to a large size
> > > > (> 20 MB). It seems that we run into issues
> > > > (https://issues.apache.org/jira/browse/COUCHDB-893) when creating a
> > > > view with objects that are larger than 9 MB (in our case).
> > > >
> > > > {
> > > >   "_id": "http://www.website.com/",
> > > >   "_rev": "1-33c75795126ff81b0125156b88593cc0",
> > > >   "metadata1": "blabla",
> > > >   "metadata2": "blabla",
> > > >   "pages": [
> > > >     {
> > > >       "description": "",
> > > >       "text": "A lot of text comes here....:",
> > > >       "url": "http://www.website.com/",
> > > >       "title": "The title of this website /",
> > > >       "keywords": ""
> > > >     },
> > > >     {
> > > >       "description": "",
> > > >       "text": "A lot of text comes here....:",
> > > >       "url": "http://www.website.com/contact/",
> > > >       "title": "Contact Page",
> > > >       "keywords": ""
> > > >     }
> > > >
> > > >     // MANY other pages here
> > > >   ],
> > > >   "crawlDate": "2011-02-10T12:30:07.416+01:00"
> > > > }
> > > >
> > > > This document structure is not working very well for us. We are
> > > > thinking about the following alternatives. We would really appreciate
> > > > it if you could give expert modelling advice.
> > > >
> > > > - *Alternative 1)* Create a "page" document where we would have 1 page
> > > >   (description, text, parent_url (which would be the _id of the
> > > >   original doc), url, ...) per document. The rest of the data contained
> > > >   in the original doc would be duplicated/denormalized. We could then
> > > >   create a view that "assembles" all the pages for a given parent_url
> > > >   (which in essence would have the same effect as the original
> > > >   implementation).
> > > >
> > > > - *Alternative 2)* Model in one-to-many fashion as described here:
> > > >   http://wiki.apache.org/couchdb/EntityRelationship
> > > >
> > > > - *Alternative 3)* Keep the design as is, but store the "page" content
> > > >   as an attachment when we store the object. (Subquestion: would that
> > > >   influence the size of the doc?)
> > > >
> > > > - *Alternative 4)* Keep the design as is and change some settings in
> > > >   the configuration that I don't know about.
> > > >
> > > > Subquestion: any particular design reason why this issue
> > > > (https://issues.apache.org/jira/browse/COUCHDB-893) is occurring? Any
> > > > good workaround (apart from recompilation!)? Any estimate of when this
> > > > will be fixed in a release version?
> > > >
> > > > Thank you for your help and advice.
> > > >
> > > > Nicolas
> > > >
> > > > PS: The reason that we need a view is that we are using a Document
> > > > Update handler <http://wiki.apache.org/couchdb/Document_Update_Handlers>
> > > > to do incremental updates, and that requires some kind of view. The
> > > > incremental updates work fine for normal-sized documents.
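
For reference, the view that "assembles" all the pages for a given parent_url (Alternative 1 in the original message) could look roughly like the sketch below. It assumes each page document carries a `type` of "page" plus `parent_url`, `url` and `title` fields; the `type` field, the `[parent_url, url]` key layout, and the local `emit` stub are illustrative choices, not something taken from the thread. In a real design document the map function would be stored as a string and CouchDB would supply `emit` itself.

```javascript
// Map function, named here only so it can be tried locally.
// Assumes page docs have type "page", parent_url, url and title.
function map(doc) {
  if (doc.type === "page" && doc.parent_url) {
    // Key on [parent_url, url] so one site's pages sort next to each other.
    // Emit only a small value; the large text stays in the document.
    emit([doc.parent_url, doc.url], { title: doc.title });
  }
}

// Local stand-in for CouchDB's emit(), just for running the function
// outside the view server.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

var sampleDocs = [
  { _id: "http://www.website.com/", type: "page",
    parent_url: "http://www.website.com/", url: "http://www.website.com/",
    title: "The title of this website /", text: "A lot of text comes here" },
  { _id: "http://www.website.com/contact/", type: "page",
    parent_url: "http://www.website.com/", url: "http://www.website.com/contact/",
    title: "Contact Page", text: "A lot of text comes here" },
  { _id: "some-other-doc", type: "website" } // ignored by the map function
];

sampleDocs.forEach(map);
console.log(JSON.stringify(rows, null, 2));
```

Querying such a view with startkey=["http://www.website.com/"] and endkey=["http://www.website.com/", {}] should return just that site's pages, and include_docs=true would pull in the full text only when it is actually needed. Because the map function never emits `text`, the index itself stays small, though each page document is still serialized and handed to the view server when it is indexed.
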
