Re: [CODE4LIB] Running a repository on Debian Stable
Thanks to all who responded to this. I went with EPrints, using the Debian/Ubuntu package pointed out by Thomas and others, and it seems to be working OK. On 8 April 2010 16:25, Thomas Krichel kric...@openlib.org wrote: Mike Taylor writes I was surprised to find that there seems to be no package for DSpace, EPrints, http://wiki.eprints.org/w/Installing_EPrints_3_via_apt_%28Debian/Ubuntu%29 Fedora, The problem there, as I understand it, is that Fedora expects everything to be in one directory. This setup is inimical to the Debian setup. Most of all, I want something that I can install from the standard operating system packages, using apt-get. I suggest you use aptitude instead. It has superior dependency resolution. Cheers, Thomas Krichel http://openlib.org/home/krichel http://authorclaim.org/profile/pkr1 skype: thomaskrichel
[CODE4LIB] code4lib.hu workshop
Dear code4lib-ers, last week (Wednesday afternoon) we held the first code4lib.hu workshop in Debrecen, at the University Library. The purpose of the meeting was for library developers and power users of library information systems to meet and talk to each other, so that in the future different systems can communicate over standard protocols, which is the basic condition of any mashupable, shareable service. Beforehand only 9 people said they would be there for sure, but in the end 28 developers participated, from libraries and developer companies. The result was not a workshop for hardcore coders, but an interesting and (more importantly) productive conversation. Since participants were not tied to any concrete project, we could discuss a kind of 'ideal' state of the art: how to get there, and what development and library policy steps would be involved. The discussion focused on uniform library authentication (one entry point for all Hungarian libraries) and inter-library loan. Some important statements:
- the services should be based on standards, either international ones or, if we cannot find a suitable one, a domestic (Hungarian) standard that we form ourselves
- the authentication system provided by the National Infrastructure Agency does not fit all libraries, since even university libraries have users who are not university citizens and therefore lack university identifiers
- bilateral agreements between libraries are a must for unified authentication: library A accepts the authentication system of library B and provides services to the users of library B
- the current statistical measurements are outdated and cannot reflect such shared services; but since statistics are the most important measuring tool for the owners of libraries, libraries tend not to develop shared services, because they could lose some of their resources (they would spend on things that are not reflected in the statistics...)
- inter-library loans could be initiated by the users, which takes some burden off the librarians. The librarians could still control the whole process, but not as the only player.
The meeting was not meant to reach agreement on anything, so we did not create any document or manifesto, but there were some ideas about the continuation. Since then, one of the participants bought the code4lib.hu domain and offered it to the community for free. We restarted an older listserv (at http://groups.google.com/group/ikr-fejlesztok), and we decided that we will continue the meetings in the near future with lightning talks and discussions on library standards (like NCIP, inter-library loans, etc.); personally I hope that we could hold a mashaton-like meeting. Final note: somebody said on the code4lib IRC that we would miss bbq. Well, we didn't have bbq, but as I promised we had slambuc, a traditional shepherds' dish from around Debrecen. Thank you for your support! Király Péter http://eXtensibleCatalog.org
[CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
So let's say (hypothetically, of course) that a colleague tells you he's considering a NoSQL database like MongoDB or CouchDB, to store a couple tens of millions of documents, where a document is pretty much an article citation, abstract, and the location of full text (not the full text itself). Would your reaction be:
- That's a sensible, forward-looking approach. Lots of sites are putting lots of data into these databases and they'll only get better.
- This guy's on the bleeding edge. Personally, I'd hold off, but it could work.
- Schedule that 2012 re-migration to Oracle or Postgres now.
- Bwahahahah!!!
Or something else? (http://en.wikipedia.org/wiki/NoSQL is a good jumping-in point.) -- Thomas Dowling tdowl...@ohiolink.edu
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
I personally would vote for: This guy's on the bleeding edge. Personally, I'd hold off, but it could work. However, I attended a webinar on MongoDB and apparently the representative stated that SourceForge has moved to a NoSQL platform using MongoDB and tested their load with 100x growth and visits of what they are already seeing and had zero issues with scalability. That's pretty impressive. Oh, it also managed to be more efficient than a traditional RDBMS. Brendon Kozlowski Web Administrator Saratoga Springs Public Library 49 Henry Street Saratoga Springs, NY, 12866 [518] 584-7860 x217 From: Code for Libraries on behalf of Thomas Dowling Sent: Mon 4/12/2010 10:55 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan? So let's say (hypothetically, of course) that a colleague tells you he's considering a NoSQL database like MongoDB or CouchDB, to store a couple tens of millions of documents, where a document is pretty much an article citation, abstract, and the location of full text (not the full text itself). Would your reaction be: That's a sensible, forward-looking approach. Lots of sites are putting lots of data into these databases and they'll only get better. This guy's on the bleeding edge. Personally, I'd hold off, but it could work. Schedule that 2012 re-migration to Oracle or Postgres now. Bwahahahah!!! Or something else? (http://en.wikipedia.org/wiki/NoSQL is a good jumping-in point.) -- Thomas Dowling tdowl...@ohiolink.edu
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
Depends on the sort of features required, in particular the access patterns, and the hardware it's going to run on. In my experience, NoSQL systems (for example Apache's Cassandra) have extremely good distribution properties over multiple machines, much better than SQL databases. Essentially, it's easier to store a bunch of key/values in a distributed fashion, as you don't need to do joins across tables (there aren't any), and eventually consistent systems (such as Cassandra) don't even need to always be internally consistent between nodes. If many concurrent write accesses are required, then NoSQL can also be a good choice, for the same reason that it's easily distributed. And for the same reason, it can be much faster than SQL systems with the same data, given a data model that fits the access patterns. The flip side is that if later you want to do something that requires the equivalent of table joins, it has to be done at the application level. This is going to be MUCH MUCH slower and harder than if there was SQL underneath. Rob
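To make the trade-off Rob describes concrete, here is a minimal sketch in Python of a join done at the application level over two key/value collections. Everything in it is an assumption made for illustration: the names `citations`, `journals`, `journal_id` and `join_citations_with_journals` are invented, and plain dicts stand in for the store rather than any particular NoSQL product.

```python
# Hypothetical data: two key/value "buckets", held here as plain dicts.
citations = {
    "c1": {"title": "Paper A", "journal_id": "j1"},
    "c2": {"title": "Paper B", "journal_id": "j2"},
    "c3": {"title": "Paper C", "journal_id": "j1"},
}
journals = {
    "j1": {"name": "Journal One"},
    "j2": {"name": "Journal Two"},
}


def join_citations_with_journals(citations, journals):
    """Emulate `citations JOIN journals` in application code.

    Every citation costs one extra key lookup. Against a remote store,
    each of those lookups would typically be a separate round trip.
    """
    joined = []
    for citation_id in sorted(citations):
        citation = citations[citation_id]
        journal = journals.get(citation["journal_id"], {})
        joined.append((citation_id, citation["title"], journal.get("name")))
    return joined


rows = join_citations_with_journals(citations, journals)
for row in rows:
    print(row)
```

The loop is the whole "query planner" here: the application decides the lookup order and pays for each lookup itself, which is the extra work an SQL engine would otherwise do.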
[CODE4LIB] Job Posting: Associate Vice President for Library and Information Services at Wheaton College in Norton, MA
Please excuse cross-postings. -- Associate Vice President for Library and Information Services at Wheaton College in Norton, MA Located between Boston and Providence, Wheaton College is a four-year, private liberal arts college with 1,550 students. The College invites applications and nominations for the Associate Vice President for Library and Information Services. This person provides leadership for Library and Information Services in developing innovative strategies and cultivating strong partnerships in the delivery and use of academic information and technologies to support the mission and priorities of the college. The Wheaton Curriculum offers more than 600 courses in 40 majors and 50 minors. Interdisciplinarity, which lies at the heart of our curriculum, is implemented through connected courses. Our student-faculty ratio of 10-1 and average class size of 15-20 students help foster the close collaborative relationships that develop between our undergraduates and faculty. With a nationally recognized record of achievement in using technology and information resources to enhance teaching and learning, Wheaton College considers a unified vision of library and information technology critical to fulfilling its liberal arts mission. In 2004, the College merged the Library, Academic Computing, and Information and Technology Services to create Library and Information Services (LIS), which encompasses the functions of research and instruction, collections and public access, technology support and infrastructure. The Associate Vice President for Library and Information Services will lead a team of five individuals who oversee these areas, to continue the development of current programs, provide support for the research and teaching activities of faculty and students, and raise funds by seeking further grant support. 
In addition, the successful candidate will chair the Administrative Technology Committee and work with the faculty's Educational Policy Committee and the Library, Technology, and Learning Committee, to develop new initiatives that fulfill curricular goals, including integrating information fluency, new media, and digital scholarship in the educational experience of Wheaton College students. The successful candidate will create a shared vision through leading collaborative, team-based processes; manage external relationships; and work strategically with college leaders for both library and college-wide interests. This person will
- implement technology and strategic plans to deliver comprehensive, integrated library and information services for the college
- manage resources, facilities, and services that respond to the needs of students, faculty, and staff
- oversee personnel and resource administration, budget planning and allocation, and overall project management
- practice outreach and communication with students, faculty members and administrative staff
- set and maintain standards of service and quality
- and establish instruments for benchmarking and continuous assessment.
The successful candidate for this position will report directly to the Provost (Chief Academic Officer), work closely with the Vice President for Finance and Operations, and regularly consult with the President's Council of senior advisors. Where appropriate, the position will carry faculty status. Minimum Qualifications: Wheaton College seeks a collaborative and visionary leader with extensive experience in one or more areas of information technology and service in an academic setting. The successful candidate has demonstrated the ability to foster teamwork and work effectively with faculty members and staff at all levels.
The new Associate Vice President for Library and Information Services should possess a graduate degree in a relevant field, such as librarianship, information science, computer science, or related field, or have equivalent experience or certification.
[CODE4LIB] Job Posting: Senior Programmer Analyst - Office of Digital Assets and Infrastructure, Yale University
Senior Programmer Analyst Office of Digital Assets and Infrastructure, Yale University New Haven, CT ( http://tinyurl.com/yyn7dgz ) ODAI is charged with developing a digital information management strategy for Yale and building digital collections and technical infrastructure in a coordinated and collaborative manner across the entire campus. Programs include the development and deployment of large-scale digital asset management systems, long-term preservation repositories for Yale digital content in all formats, cross-collection search capabilities to enable discovery of collections hosted by numerous departments and many other innovative initiatives. The Senior Programmer Analyst will lead the planning, development, implementation, maintenance, and support of software applications that stand alone, extend functionality of existing systems, bridge systems through interoperability, and provide end-user functionality to the academic community. The software development includes but is not limited to digital asset management systems, digital library systems, knowledge management systems, media processing systems, storage systems, and related ancillary products and services. -- Michael Appleby Senior Software Developer Office of Digital Assets and Infrastructure Yale University e michael.appl...@yale.edu
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
The advantage of the NoSQL DBs is that they're schema-less which allows much more flexibility in your data going in. However, it sounds like your schema may be pretty standardized -- I'm not sure of a huge advantage (outside the aforementioned replication functionality) you'd get. -Ross.
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
I'd opt for the first response. I hope NoSQL is not a flash in the pan. It makes eminent sense to me. SQL is just one way of looking at data. A level of abstraction. What authority says that SQL is the only or the best way of looking at a dataset? Or the MARC record format for that matter? They certainly weren't inscribed on stone tablets. These things can become mind prisons. I think it's refreshing that there are those willing to look at databases beyond SQL. Peter Schlumpf www.avantilibrarysystems.com
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
I'd actually vote for the sensible, forward-looking approach. The BBC (for one) is already using CouchDB in production: http://damienkatz.net/2010/03/bbc_and_couchdb.html That said, NoSQL as a movement is as wide and varied as the RDBMS world, and there are pros and cons to each. I'm personally a proponent of CouchDB because of its RESTful API, JSON storage system, and JavaScript (or Erlang, PHP, Python, Ruby, etc) map/reduce view engine. If your project needs replication at all (whether for scaling, data sharing, etc), I'd take a good hard look at CouchDB, as that's its core distinction among the other NoSQL databases. Hope that helps, Benjamin -- President BigBlueHat P: 864.232.9553 W: http://www.bigbluehat.com/ http://www.linkedin.com/in/benjaminyoung
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
SQL-style JOINs can be done in CouchDB (can't speak for the other NoSQL DBs). In CouchDB, it's called view collation: http://chrischandler.name/couchdb/view-collation-for-join-like-behavior-in-couchdb/ It's a different way of thinking (as there are no tables, and map/reduce goes through every document to generate its output), but it is possible to get interestingly combined data out of the whole database. Later, Benjamin -- President BigBlueHat P: 864.232.9553 W: http://www.bigbluehat.com/ http://www.linkedin.com/in/benjaminyoung On 4/12/10 11:08 AM, Robert Sanderson wrote: [trimmed]
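The idea behind view collation can be imitated in a few lines of Python. This is only a sketch of the sorting trick, not CouchDB itself: CouchDB map functions are normally written in JavaScript and run inside the database, and the document fields used here (`type`, `article_id`, `text`) and the helper names `map_doc` and `build_view` are invented for the example. Each document emits a composite key, and sorting on that key places every article directly ahead of its related notes.

```python
# Hypothetical documents of two kinds living side by side in one database.
docs = [
    {"_id": "a1", "type": "article", "title": "Paper A"},
    {"_id": "n1", "type": "note", "article_id": "a1", "text": "Cited by B"},
    {"_id": "a2", "type": "article", "title": "Paper B"},
    {"_id": "n2", "type": "note", "article_id": "a2", "text": "Has erratum"},
    {"_id": "n3", "type": "note", "article_id": "a1", "text": "Open access"},
]


def map_doc(doc):
    """Yield (key, value) pairs, playing the role of emit() in a map function."""
    if doc["type"] == "article":
        # Second key element 0 sorts the article ahead of its notes.
        yield [doc["_id"], 0], doc["title"]
    elif doc["type"] == "note":
        yield [doc["article_id"], 1], doc["text"]


def build_view(docs):
    """Run the map step over every document and sort the rows by key."""
    rows = [row for doc in docs for row in map_doc(doc)]
    # Simplification: rows with equal keys keep their input order.
    return sorted(rows, key=lambda row: row[0])


view = build_view(docs)
for key, value in view:
    print(key, value)
```

Reading the sorted rows from `["a1", 0]` up to the last `["a1", 1]` row returns the article followed by its notes in one pass, which is the join-like behaviour the linked post describes.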
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
The thing is, the NoSQL stuff is pretty much just a key-value store. There's generally no way to query the store; instead you can simply look up a document by ID. If this meets the needs of your application, and all you need is a key-value store and not any kind of query, then it's definitely going to be a lot less overhead than an actual SQL rdbms, and simpler to manage, with advantages for scalability and replication etc. The reason it's simpler and more performant is, well, because it's _simpler_: you don't actually have querying or joining abilities. But if you are actually going to need querying on values other than ID... SQL rdbms is a pretty standardized, well understood way to do this. There are certainly other ways -- you could combine a noSQL key-value store with Solr/Lucene, for instance. Which in some cases may get you even better performance and more flexibility than an rdbms solution. But it's (IMO) going to be a bit harder to set up and manage and use in your favorite development environment, precisely because rdbms is such a time-tested, standardized, mature approach. So, as usual, the right tool for the job. If all you really need is a key-value store on ID, then a NoSQL solution may be the right thing. But if you need actual querying and joining, then personally I'd stick with rdbms unless I had some concrete reason to think a more complicated nosql+solr solution was required. Certainly if you are planning on using Solr _anyway_ because your application is a search engine of some type, that would lessen the incremental 'cost' of a nosql+solr solution. [ Note that if all you want is schemaless storage, you CAN just stick large chunks of binary or text in an rdbms 'blob' or 'text' column. You won't be able to efficiently search on these -- but you aren't able to efficiently search in a 'nosql' solution either. So you _can_ use an rdbms like a nosql solution to store arbitrary data, no problem. If you're using an rdbms, you can have _other_ columns in addition to your blob/text one, that you can populate for select and join. If you _aren't_ going to need those -- then there'd be no reason to do it in an rdbms (even though you could); you would indeed then just want to use a 'nosql' key-value store solution, which will be higher performance. So the conclusion again, I think, is that rdbms is _more powerful_ than nosql, but that power comes with a performance cost. If you don't need it, nosql. If you do need it -- there's no reason you can't store structureless units of data in text/blob in an rdbms too. ]
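The bracketed note about keeping schemaless data in a text/blob column, with extra columns populated for select and join, can be sketched with Python's standard `sqlite3` module. The table and column names (`citation`, `journal`, `body`) and the sample records are assumptions made up for this example; SQLite is used only because it needs no server, not because anyone in the thread proposed it.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE citation (
        id      TEXT PRIMARY KEY,
        journal TEXT,            -- extracted column, usable in WHERE and JOIN
        body    TEXT NOT NULL    -- opaque, schemaless payload (JSON here)
    )
    """
)
conn.execute("CREATE INDEX citation_journal ON citation (journal)")

# Hypothetical records; the payloads do not need to share a structure.
records = [
    ("c1", "Journal One", {"title": "Paper A", "abstract": "..."}),
    ("c2", "Journal Two", {"title": "Paper B", "fulltext_url": "http://example.org/b"}),
    ("c3", "Journal One", {"title": "Paper C"}),
]
for record_id, journal, payload in records:
    conn.execute(
        "INSERT INTO citation (id, journal, body) VALUES (?, ?, ?)",
        (record_id, journal, json.dumps(payload)),
    )

# Key-value style access: fetch one payload by ID.
row = conn.execute("SELECT body FROM citation WHERE id = ?", ("c2",)).fetchone()
by_id = json.loads(row[0])

# Query on the extracted column, which the opaque payload alone cannot serve.
cursor = conn.execute(
    "SELECT body FROM citation WHERE journal = ? ORDER BY id", ("Journal One",)
)
titles = [json.loads(body)["title"] for (body,) in cursor]
print(by_id["title"], titles)
```

The cost of this arrangement is that the application must keep the extracted column in step with the payload whenever a record changes.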
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On Mon, 12 Apr 2010, Jonathan Rochkind wrote: So, as usual, the right tool for the job. If all you really need is a key-value store on ID, then a NoSQL solution may be the right thing. But if you need actual querying and joining, then personally I'd stick with rdbms unless I had some concrete reason to think a more complicated nosql+solr solution was required. Certainly if you are planning on using Solr _anyway_ because your application is a search engine of some type, that would lessen the incremental 'cost' of a nosql+solr solution. I'm surprised that I keep hearing so much about NoSQL for key-value stores, and everyone seems to forget the *old* key-value stores, such as directory services (X.500 and LDAP, although that's actually the protocol used to query them, not the storage implementation). Yes, there are things that LDAP doesn't do so well (relationships being one of them), but it supports querying, and you can adjust the matching by attribute (i.e., this one's matched as a number, this one's matched as a string, this one's a case-insensitive string ... I think some implementations have functionality to run the search term through a function for things like soundex, so it might be possible to add hooks for stemming and query expansion, etc.) I think that NoSQL got a lot of press because of Google having used it (and their having a *VERY* large data system -- but not everyone has that large a system; also, Google did it 10+ years ago -- you can now throw a lot more CPU and RAM at an RDBMS, so the point at which the database becomes a problem isn't the same as it was when Google first came out.) ... So, I think that there are cases where NoSQL is the right solution for the job, and I think there are times when an RDBMS is the right solution ... there are also plenty of times for flat file databases, XML, LDAP, and a slew of other storage standards. -Joe hmm ... now I'm going to have to try to bring back my attempt to put my catalogs into a directory service ... I have a feeling I'm going to run into issues with unit conversions when searching.
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On Mon, Apr 12, 2010 at 12:22 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The thing is, the NoSQL stuff is pretty much just a key-value store. There's generally no way to query the store, instead you can simply look up a document by ID. Actually, this depends largely on the NoSQL DBMS in question. Some are key value stores (Redis, Tokyo Cabinet, Cassandra), some are document-based (CouchDB, MongoDB), some are graph-based (Neo4J), so I think blanket statements like this are somewhat misleading. CouchDB and MongoDB (for example) have the capacity to index the values within the document - you don't just have to look up things by document ID. -Ross.
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On Mon, Apr 12, 2010 at 12:22 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The thing is, the NoSQL stuff is pretty much just a key-value store. There's generally no way to query the store, instead you can simply look up a document by ID. Schemaless != no way to query. Key-value stores, like memcache, are just one end of what most consider the nosql spectrum. For instance, I can query my CouchDB instances through the different views I create. I thought this blog post had an interesting take on NoSQL, although this guy, Mike Stonebraker of VoltDB, obviously has a horse in the race. http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext --jay
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
Yeah, I may have gotten it completely wrong. Okay, help this grasshopper (possibly by pointing me to relevant documentation), what's the difference between document-based and key-value store? When I've looked at CouchDB before, despite it describing itself as document based, I haven't been able to tell what the difference is between it and a key value store. It seemed to support storing a document by key, and retrieving it by key. It didn't seem to _do_ anything special with the document other than storing it there (maybe it DOES, but I missed it?). So you can call it a document instead of a value, but I couldn't figure out how that differed from a key-value store. I guess it's that CouchDB _does_ let you build indexes on values other than the key? Wacky, wonder how I missed that when I reviewed it last. Jonathan
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On Mon, Apr 12, 2010 at 10:55 AM, Thomas Dowling tdowl...@ohiolink.edu wrote: So let's say (hypothetically, of course) that a colleague tells you he's considering a NoSQL database like MongoDB or CouchDB, to store a couple tens of millions of documents, where a document is pretty much an article citation, abstract, and the location of full text (not the full text itself). Would your reaction be: There are really two reactions in here. One about NoSQL and the other about your colleague. As for NoSQL, I would be on the side that the ecosystem is here to stay, although individual projects may or may not take off/evolve. The best description I've seen of nosql as a whole is choice[1]: not having to shove everything into a similar style of database for every project, and making the database fit the data/use. There's a large number of projects now, each with its own priorities and the trade-offs made to reach them. Some care about consistency, for others eventual consistency is good enough, and others go as far as distributed transactions over nodes. Some do lazy writes to disk, others not. How you query your data also varies quite a bit, with sql-like, map/reduce, hadoop, etc. From your brief description it sounds like quite a few projects could fit the bill, including rdbms-types, and which one you want would probably depend on what you think you might do in the future. If you foresee yourself having lots of fields that might only cover certain subsets of the dataset, then couchdb or the like are probably worth looking at. As for the colleague, I guess the question is why? If it is because of trendiness then Bwahahahah!!! might be the best answer. But I'm guessing they've thought about the data and what benefits they would get out of the backend. [1] http://blog.couch.io/post/511008668/nosql-is-about
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On Mon, 12 Apr 2010, Ryan Eby wrote: [trimmed] But I'm guessing they've thought about the data and what benefits they would get out of the backend. Wow. You obviously don't work with the same folks that I do. I've been attached to one project for about 16 months now, while the rest of the team's been together for 4 years ... I've been trying to get a few changes made to better support my user community (basically, all of the people who don't have access to their system, or don't want to spend the 6 months using the system 'to be able to do something almost useful'). About 2-3 months ago, the main project team finally realized that they have *no*idea* what the user community wants or needs. Oh, and they have to go live on April 21st. I'm expecting a major 'wtf?' reaction from the majority of the community. -Joe
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
From my understanding of key/value stores, one can put documents on the other side of the key, but any and all parsing/processing of that value happens outside of the database. In CouchDB, the entire document is query-able from within map/reduce views. Once queried, those keys are indexed for faster future queries. So, in that way, CouchDB jumps over the key/value limitations and becomes a document database. In addition to map/reduce output, there's also a handy _update system that can be used to validate a JSON document prior to its insertion in the database--again, something not possible with key/value storage. You can, though, use CouchDB in a key/value fashion by storing binary data (or HTML, XML, RDF, etc) as attachments or JSON encoded strings (where possible). In that case, you would just be retrieving them by id (or URL), but you could store all kinds of ad hoc metadata about those attachments and use those to query with later. Also, the blog article Ryan Eby just posted is a great (and quick) overview of the varied noSQL ecosystem. In many ways, these systems are as different as they are similar. Hope your (re)search goes well, Benjamin -- President BigBlueHat P: 864.232.9553 W: http://www.bigbluehat.com/ http://www.linkedin.com/in/benjaminyoung On 4/12/10 2:42 PM, Jonathan Rochkind wrote: Yeah, I may have gotten it completely wrong. Okay, help this grasshopper (possibly by pointing me to relevant documentation), what's the difference between document-based and key-value store? When I've looked at CouchDB before, despite it describing itself as document based, I haven't been able to tell what the difference is between it and a key value store. It seemed to support storing a document by key, and retrieving it by key. It didn't seem to _do_ anything special with the document other than storing it there (maybe it DOES, but I missed it?). 
So you can call it a document instead of a value, but I couldn't figure out how that differed from a key-value store. I guess it's that CouchDB _does_ let you build indexes on values other than the key? Wacky, wonder how I missed that when I reviewed it last. Jonathan Ross Singer wrote: On Mon, Apr 12, 2010 at 12:22 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The thing is, the NoSQL stuff is pretty much just a key-value store. There's generally no way to query the store, instead you can simply look up a document by ID. Actually, this depends largely on the NoSQL DBMS in question. Some are key value stores (Redis, Tokyo Cabinet, Cassandra), some are document-based (CouchDB, MongoDB), some are graph-based (Neo4J), so I think blanket statements like this are somewhat misleading. CouchDB and MongoDB (for example) have the capacity to index the values within the document - you don't just have to look up things by document ID. -Ross.
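The key/value versus document-store distinction discussed above can be illustrated with a small in-memory sketch. This is not CouchDB's or MongoDB's API; `docs`, `kv_get`, `build_view`, and `by_year` are hypothetical names made up for the illustration, and the citation records are invented sample data. The point is only that a key/value lookup treats the value as opaque, while a view runs a map function over each document's contents and indexes what it emits.

```python
from collections import defaultdict

# Hypothetical citation documents, keyed by document ID (invented sample data).
docs = {
    "doc1": {"title": "On B-trees", "year": 1979, "journal": "Computing Surveys"},
    "doc2": {"title": "R-trees", "year": 1984, "journal": "SIGMOD Record"},
    "doc3": {"title": "Another survey", "year": 1984},  # no "journal" field
}


def kv_get(store, key):
    """Key/value access: return the value for a key without inspecting it."""
    return store[key]


def build_view(store, map_fn):
    """Document-store style index: apply map_fn to every document and
    group the document IDs under each key the function emits."""
    index = defaultdict(list)
    for doc_id, doc in store.items():
        for key in map_fn(doc):
            index[key].append(doc_id)
    return dict(index)


def by_year(doc):
    """Map function: emit the publication year if the document has one."""
    if "year" in doc:
        yield doc["year"]


year_view = build_view(docs, by_year)

print(kv_get(docs, "doc1")["title"])
print(year_view[1984])
```

Documents that lack a field simply emit nothing for that view, which is roughly why a schemaless store is comfortable with fields that only cover part of the dataset.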
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
Michael Stonebraker *is* the horse, and yet has pointed out that RDBMSs aren't always the hammer you're looking for. Next time you use a B-tree or R-tree (spatial search, anyone?), give him a toast with your favorite beverage. http://cacm.acm.org/blogs/blog-cacm/32212-the-end-of-a-dbms-era-might-be-upon-us/fulltext http://en.wikipedia.org/wiki/Michael_Stonebraker -Original Message- From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Jay Luker Sent: Monday, April 12, 2010 10:38 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan? On Mon, Apr 12, 2010 at 12:22 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The thing is, the NoSQL stuff is pretty much just a key-value store. There's generally no way to query the store, instead you can simply look up a document by ID. Schemaless != no way to query. Key-value stores, like memcache, are just one end of what most consider the nosql spectrum. For instance, I can query my CouchDB instances through the different views I create. I thought this blog post had an interesting take on NoSQL, although this guy, Mike Stonebraker of VoltDB, obviously has a horse in the race. http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext --jay
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On 04/12/2010 03:26 PM, Ryan Eby wrote: As for the colleague, I guess the question is why?... He's hoping it'll impress the babes. :-) Seriously (and not to draw the conversation to a close), thanks to all for their insights. -- Thomas Dowling tdowl...@ohiolink.edu
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
So let's say (hypothetically, of course) that a colleague tells you he's considering a NoSQL database like MongoDB or CouchDB, to store a couple tens of millions of documents, where a document is pretty much an article citation, abstract, and the location of full text (not the full text itself). Would your reaction be: Noo!!! NoSQL is terrible for startup projects ;) http://labs.mudynamics.com/2010/04/01/why-nosql-is-bad-for-startups/ But seriously, it depends. You know, a lotta ins, lotta outs, lotta what-have-yous. I sort of like MongoDB's characterization of the landscape as tradeoffs between scale and performance on the one hand and depth of functionality on the other: http://www.mongodb.org/display/DOCS/Philosophy I suspect we'll continue to see more hybrid systems for some time to come, with various data stores handling the pieces they do best.
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On 4/12/10 4:47 PM, Ryan Eby wrote: You could put your logs, marc records broken out by fields or arrays/hashes (types in couchdb) in any of them but the approach each takes would limit you (or empower you) differently. Once there's a good marc2json script (and format) out there, it'd be grand to see marc records dumped into CouchDB to allow them to be replicated between groups of librarians (and even up to OpenLibrary). I'm still up for helping make that possible if anyone's into that. :)
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
Couldn't you do MARC -> MARCXML -> JSON? -Andrew On 2010-04-12, at 5:00 PM, Benjamin Young wrote: On 4/12/10 4:47 PM, Ryan Eby wrote: You could put your logs, marc records broken out by fields or arrays/hashes (types in couchdb) in any of them but the approach each takes would limit you (or empower you) differently. Once there's a good marc2json script (and format) out there, it'd be grand to see marc records dumped into CouchDB to allow them to be replicated between groups of librarians (and even up to OpenLibrary). I'm still up for helping make that possible if anyone's into that. :)
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
There are at least TWO good marc2json formats, and several open source scripts at least for Bill Dueber's, no? Benjamin Young wrote: On 4/12/10 4:47 PM, Ryan Eby wrote: You could put your logs, marc records broken out by fields or arrays/hashes (types in couchdb) in any of them but the approach each takes would limit you (or empower you) differently. Once there's a good marc2json script (and format) out there, it'd be grand to see marc records dumped into CouchDB to allow them to be replicated between groups of librarians (and even up to OpenLibrary). I'm still up for helping make that possible if anyone's into that. :)
Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?
On 4/12/10 5:04 PM, Andrew Hankinson wrote: Couldn't you do MARC -> MARCXML -> JSON? -Andrew Certainly, but the hard part is knowing what you want MARC to look like once it's in JSON. XML 2 JSON conversions generally need some love to make the data meaningful on the JSON side (as attributes and such make a 1-to-1 conversion complicated--though there have been attempts at general conversion scripts). Once a JSON output format for MARC is done, then converting from MARCXML to marc.json (or whatever) would be an easy first step. On 2010-04-12, at 5:00 PM, Benjamin Young wrote: On 4/12/10 4:47 PM, Ryan Eby wrote: You could put your logs, marc records broken out by fields or arrays/hashes (types in couchdb) in any of them but the approach each takes would limit you (or empower you) differently. Once there's a good marc2json script (and format) out there, it'd be grand to see marc records dumped into CouchDB to allow them to be replicated between groups of librarians (and even up to OpenLibrary). I'm still up for helping make that possible if anyone's into that. :)
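To make the "what should MARC look like in JSON" question concrete, here is a rough sketch of one possible MARCXML-to-JSON step using only the Python standard library. The output shape (a `leader` string, a `controlfields` mapping, and `datafields` with subfields as ordered code/value pairs) is an assumption made up for this illustration, not one of the existing marc2json formats mentioned in the thread; `marcxml_to_dict` and `SAMPLE` are likewise hypothetical. Subfields are kept as an ordered list rather than a mapping because MARC subfield codes can repeat and their order matters.

```python
import json
import xml.etree.ElementTree as ET

# MARCXML namespace used by the Library of Congress "slim" schema.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">A. Author.</subfield>
  </datafield>
</record>"""


def marcxml_to_dict(xml_text):
    """Convert one MARCXML <record> into a plain dict (hypothetical shape)."""
    record = ET.fromstring(xml_text)
    return {
        "leader": record.findtext("marc:leader", namespaces=NS),
        "controlfields": {
            cf.get("tag"): cf.text
            for cf in record.findall("marc:controlfield", NS)
        },
        "datafields": [
            {
                "tag": df.get("tag"),
                "ind1": df.get("ind1"),
                "ind2": df.get("ind2"),
                # Ordered [code, value] pairs preserve repeats and ordering.
                "subfields": [
                    [sf.get("code"), sf.text]
                    for sf in df.findall("marc:subfield", NS)
                ],
            }
            for df in record.findall("marc:datafield", NS)
        ],
    }


marc_json = json.dumps(marcxml_to_dict(SAMPLE), indent=2)
print(marc_json)
```

The attributes (`tag`, `ind1`, `ind2`, `code`) are exactly the part a generic XML-to-JSON converter tends to handle awkwardly, which is why a format decision has to come before the script.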