Re: [CODE4LIB] code4lib.hu codesprint report
- Original Message - From: "Mark A. Matienzo"
> On Wed, Jun 16, 2010 at 11:13 AM, Karen Coyle wrote:
>> Would it be appropriate for the C4L site to link to Péter's group's page?
> Of course it would. I have done it.

Thanks! Péter eXtensible Catalog
[CODE4LIB] code4lib.hu codesprint report
Hi! I am glad to report that we had the first code4lib.hu codesprint yesterday. The purpose was to code together and to learn from each other. It was a 3.5-hour session at the National Széchényi Library, Budapest. We created a script which extracts ISBN numbers and book cover images from an OAI-PMH data provider, embedded in METS records. Hopefully this code will become part of two or three different library- or book-related services in the next months. We discussed the technical details, the advantages, and the real problems of uploading a local history photo collection to Flickr. Unfortunately we didn't have time to code the Flickr part. There were only a couple of coders, but we had a good talk and made new acquaintances. (For those in #code4lib: this time we had no bbq, nor 'slambuc', but lots of biscuits and mineral water. ;-)

If - for whatever reason - you want to follow or join us, see our group page: http://groups.google.com/group/ikr-fejlesztok/ The meeting was run as a section of the Library's K2 (library 2.0) task force's workshop about the usage of library 2.0 tools. http://blog.konyvtar.hu/k2/

Some technical details:
- we use PHP as the common language
- for OAI-PMH harvesting we use Omeka's OAI harvester plugin
- for Flickr communication we planned to use Phlickr, a PHP library
- the OAI server we harvested runs at the University of Debrecen and is based on DSpace
- we found a bug in the Ubuntu build of PHP 5.2.10 (SimpleXMLElement has a problem with the xpath() method) - but we found a workaround as well.

Regards, Péter
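The extraction step described above was written in PHP with SimpleXML; as a rough illustration of the idea only, here is a minimal sketch in JavaScript that pulls ISBN-looking strings out of record text with a regular expression. The pattern, function name, and sample input are my own, not taken from the actual codesprint script:

```javascript
// Minimal sketch of an ISBN-extraction step (illustration only; the real
// script was PHP and walked the METS/MARC XML rather than raw text).
function extractIsbns(text) {
  // Match 10- or 13-digit ISBNs, allowing hyphen/space separators and a
  // trailing X check digit on ISBN-10s.
  const pattern = /\b(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx]\b/g;
  const hits = text.match(pattern) || [];
  // Normalize: strip separators, uppercase the check digit, deduplicate.
  return [...new Set(hits.map(h => h.replace(/[-\s]/g, '').toUpperCase()))];
}

const sample = 'ISBN 978-963-200-450-2 and ISBN 963-9103-41-X';
console.log(extractIsbns(sample));
```

A real harvester would of course read the identifier elements of the METS record instead of scraping free text; the regex is just the fallback idea.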
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Hi Fernando, I have started experimenting with MARC in Mongo. I have imported ~6 million MARC records (auth and bib) into MongoDB. The steps I took:
1) the source was MARCXML I created with the XC OAI Toolkit.
2) I created an XSLT file which creates MARC-JSON from MARCXML. I followed the MARC-JSON draft (http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11) and not Bill Dueber's MARC-HASH. The conversion is not 100% perfect, but of the 6 million records only 20 were converted with errors, which is an acceptable error rate for a home-made project.
3) imported the files
4) indexed the files

Lessons learned:
- the import process is much quicker than any other part of the workflow. The 6 million records were imported in about 30 minutes, while indexing took 3 hours.
- count() is a very slow method for complex queries, even after intensive indexing, but iterating over the results is much quicker.
- there is no way to index part of a string (e.g. splitting the Leader or the 006/007/008 fields)
- full-text search is not too quick
- before indexing the size of the database was 9 GB; after full indexing it was 28 GB (I should note that on a 32-bit operating system the maximum size of a Mongo database is 2 GB).

Conclusions:
- the MARC-JSON format is good for data exchange, but it is not precise enough for searching, since - as a MARC heritage - distinct pieces of information are combined into single fields (Leader, 008 etc.). We should split them into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB

I can give you more technical details if you are interested. Péter eXtensible Catalog

- Original Message - From: "Fernando Gómez" To: Sent: Thursday, May 13, 2010 2:59 PM Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB? There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB.
From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of "custom indexing" functionality. According to the MongoDB docs [1], "you can create an index by calling the ensureIndex() function, and providing a document that specifies one or more keys to index." Examples of this are:

db.things.ensureIndex({"city": 1})
db.things.ensureIndex({"address.city": 1})

That is, you specify the keys giving a path from the root of the document to the data element you are interested in. Such a path acts both as the index's name and as a specification of how to get the keys' values. In the case of the two proposed MARC-JSON formats [2, 3], I can't see such a "path". For example, say you want an index on field 001. Simplifying, the JSON docs would look like this

{ "fields" : [ ["001", "001 value"], ... ] }

or this

{ "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB? It would be nice to have some kind of custom indexing, where one could provide an index name and, separately, a JavaScript function specifying how to obtain the keys' values for that index. Any suggestions? Do other document-oriented databases offer a better solution for this?

BTW, I fed MongoDB the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)

[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

-- Fernando Gómez Biblioteca "Antonio Monteiro" INMABB (Conicet / Universidad Nacional del Sur) Av. Alem 1253 B8000CPB Bahía Blanca, Argentina Tel. +54 (291) 459 5116 http://inmabb.criba.edu.ar/
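One pragmatic workaround for the field-001 question above (my own suggestion, not a MongoDB feature): derive a keyed sub-document from the array-based MARC-JSON at load time, store it alongside the record, and index that with ordinary dot notation, e.g. db.marc.ensureIndex({"idx.f001": 1}). A sketch in JavaScript, handling both input shapes from the two drafts; the idx name and f-prefix convention are hypothetical:

```javascript
// Build a keyed lookup structure from array-based MARC-JSON so that plain
// dot-notation indexes can reach field values. Handles both draft shapes.
function buildIndexKeys(rec) {
  const idx = {};
  if (Array.isArray(rec.fields)) {
    // MARC-HASH style: "fields": [["001", "001 value"], ...]
    for (const f of rec.fields) {
      if (Array.isArray(f) && f.length === 2) idx['f' + f[0]] = f[1];
    }
  }
  if (Array.isArray(rec.controlfield)) {
    // OCLC draft style: "controlfield": [{"tag": "001", "data": "..."}]
    for (const cf of rec.controlfield) idx['f' + cf.tag] = cf.data;
  }
  return idx;
}

console.log(buildIndexKeys({ fields: [['001', '001 value']] }));
console.log(buildIndexKeys({ controlfield: [{ tag: '001', data: 'fst01312614' }] }));
```

The derived document duplicates some data, but it trades a little disk space for indexes MongoDB can actually build.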
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Hi Fernando, Yesterday I switched Ubuntu to the 64-bit version, because I'd like to try out MongoDB for indexing library records, and the 32-bit version has a limitation (a database cannot exceed 2 GB). I haven't tried MARC yet, only XC records, which are a derivative of MARC, but from the documentation I read that the idea is absolutely possible. This is an example from Mongo's documentation [1]:

doc = { author: 'joe',
  created : new Date('03-28-2009'),
  title : 'Yet another blog post',
  text : 'Here is the text...',
  tags : [ 'example', 'joe' ],
  comments : [ { author: 'jim', comment: 'I disagree' },
               { author: 'nancy', comment: 'Good post' } ] }
db.posts.insert(doc)
db.posts.find( { "comments.author" : "jim" } )

The most exciting thing here - for me - is that it is not just a simple key-value storage (like Lucene/Solr), but provides embedded fields, so you can bravely insert subfields, indicators etc. They will remain compact and findable. So you can combine the relations known from traditional relational databases with the flexibility and speed known from Solr. I will let you know as soon as I have inserted the first MARC records into Mongo.

[1] http://www.mongodb.org/display/DOCS/Inserting

regards, Péter eXtensible Catalog

- Original Message - From: "Fernando Gómez" To: Sent: Thursday, May 13, 2010 2:59 PM Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB? There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB. From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of "custom indexing" functionality. According to the MongoDB docs [1], "you can create an index by calling the ensureIndex() function, and providing a document that specifies one or more keys to index."
Examples of this are:

db.things.ensureIndex({"city": 1})
db.things.ensureIndex({"address.city": 1})

That is, you specify the keys giving a path from the root of the document to the data element you are interested in. Such a path acts both as the index's name and as a specification of how to get the keys' values. In the case of the two proposed MARC-JSON formats [2, 3], I can't see such a "path". For example, say you want an index on field 001. Simplifying, the JSON docs would look like this

{ "fields" : [ ["001", "001 value"], ... ] }

or this

{ "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB? It would be nice to have some kind of custom indexing, where one could provide an index name and, separately, a JavaScript function specifying how to obtain the keys' values for that index. Any suggestions? Do other document-oriented databases offer a better solution for this?

BTW, I fed MongoDB the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)

[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

-- Fernando Gómez Biblioteca "Antonio Monteiro" INMABB (Conicet / Universidad Nacional del Sur) Av. Alem 1253 B8000CPB Bahía Blanca, Argentina Tel. +54 (291) 459 5116 http://inmabb.criba.edu.ar/
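The embedded-fields point in the reply above can be sketched for MARC data: store each data field as an embedded document with indicators and subfields, and dot-notation queries such as db.marc.find({"f700.sub.a": ...}) reach into them, just like the "comments.author" example. The field layout here is my own illustration, not a fixed MARC-JSON standard, and the matching step is shown in plain JavaScript rather than against a live Mongo server:

```javascript
// A MARC record with data fields as embedded documents. Repeatable fields
// (like 700 added entries) become arrays of embedded documents.
const record = {
  leader: '00000nam a2200000 a 4500',
  f245: { ind1: '1', ind2: '0', sub: { a: 'Yet another title', c: 'by Joe' } },
  f700: [
    { ind1: '1', ind2: ' ', sub: { a: 'Dueber, Bill' } },
    { ind1: '1', ind2: ' ', sub: { a: 'Kiraly, Peter' } },
  ],
};

// A Mongo query like db.marc.find({ "f700.sub.a": "Dueber, Bill" }) matches
// inside the embedded array; the same test written as plain JavaScript:
const matches = record.f700.some(f => f.sub.a === 'Dueber, Bill');
console.log(matches);
```

With this shape the subfields stay compact and findable, which is the point of the embedded-document model.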
[CODE4LIB] code4lib.hu workshop
Dear code4lib-ers, last week (Wednesday afternoon) we held the first code4lib.hu workshop in Debrecen, at the University Library. The purpose of the meeting was that library developers and the power users of library information systems meet and talk to each other, in order that, in the future, different systems could communicate over standard protocols, which is the base condition of any mashupable, shareable service. In advance only 9 people said that they would be there for sure, but in the end 28 developers participated, from libraries and developer companies. The result was not a workshop for hardcore coders, but an interesting and (more importantly) productive discussion. Since the participants were not tied to any concrete project, we could discuss a somewhat 'ideal' state of the art: how to get there, and what development and library policy steps would be involved. The discussion focused on uniform library authentication (one entry point for all Hungarian libraries) and on inter-library loans.

Some important statements:
- the services should be based on standards - either international ones or, if we can't find a proper one, a domestic (Hungarian) standard we form ourselves
- the authentication system provided by the National Infrastructure Agency does not fit all libraries, since even the university libraries have users who are not university citizens, so they lack university identifiers
- bilateral agreements between libraries are a must-have for unified authentication: library A accepts the authentication system of library B, and it will provide services for the users of library B
- the current statistical measurements are outdated and cannot reflect such shared services, but since statistics are the most important measuring tool for the owners of libraries, libraries tend not to develop shared services, because they could lose some of their resources (they spend on things which are not reflected in the statistics...)
- inter-library loans could be initiated by the users, and in that way some burden is lifted from the librarians. The librarians could control the whole process, but not as the only player.

The meeting was not aimed at agreeing on anything, so we did not create any document or manifesto, but there were some ideas about the continuation. Since then, one of the participants has bought the code4lib.hu domain and offered it for free community usage. We restarted an older listserv (at http://groups.google.com/group/ikr-fejlesztok), and we decided that we will continue the meeting in the near future with lightning talks and discussions on library standards (like NCIP, inter-library loans etc.), and personally I hope that we can have a mashathon-like meeting.

Final note: somebody said on the code4lib IRC that we would miss the bbq. Well, we didn't have bbq, but as I promised we had slambuc, a traditional shepherds' dish from the Debrecen area.

Thank you for your support! Király Péter http://eXtensibleCatalog.org
Re: [CODE4LIB] planet code4lib code (was: newbie)
- Original Message - From: "Aaron Rubinstein" I would like to see: 1. Code snippets/gists. For the interface I can imagine something similar to http://pastebin.com/, like http://drupal.pastebin.com/41WtCpTY, maybe with library-tech related categories (UI, search, circ, admin UI, DB, XML, ...) Péter http://eXtensibleCatalog.org
Re: [CODE4LIB] elag
Hi Eric, I see some information reported by other ELAG participants (mainly from Lukas Koster, Amsterdam) on Twitter: http://twitter.com/#search?q=%23elag2010 Maybe they have more information about the conf. Péter - Original Message - From: "Eric Lease Morgan" To: Sent: Friday, March 12, 2010 2:11 AM Subject: [CODE4LIB] elag Does anybody here know the status of the ELAG conference taking place in Helsinki this year? [1] I would like to attend, but I haven't seen anything but a call for papers. (I'm too lazy to submit a paper proposal.) [1] conference site - http://elag2010.nationallibrary.fi/ -- Eric Morgan University of Notre Dame
Re: [CODE4LIB] code4lib.hu meetup
Dear Jonathan and Edward, Thank you for your kindness. I will let you know if the initiative is successful. Regards, Péter

ps. Edward: if you come to Hungary and would like some advice about nice places here, drop me a private email - maybe I can help you.

- Original Message - From: "Edward M. Corrado" To: Sent: Wednesday, March 10, 2010 5:14 PM Subject: Re: [CODE4LIB] code4lib.hu meetup As Jonathan pointed out, there is nobody to ask formal permission - just go ahead and do it. Personally, I would love to see some of these regional code4lib conferences/meetups/symposium/whatever happen around the world. Who knows, I might even show up to one :-). Edward - who actually plans to be in Hungary for a day or two in late June on his way to Romania. Jonathan Rochkind wrote: There's nobody to ask formal permission for, but I think you've done the right thing by suggesting it on this listserv and seeing what "the community" thinks. As one member of the community, I think that's a great idea and an appropriate use of the code4Lib name, and I expect that everyone else will think so too. You are also welcome to use the Code4Lib wiki if it's useful for your local group/meeting. You can see that other local/regional/national Code4Lib meetups very similar to what you envision have already listed themselves on the wiki and make use of the wiki. Look under "Local / Regional Groups" on http://wiki.code4lib.org/index.php/Main_Page . You are welcome to list your group on the wiki and use the wiki if you like. Jonathan Király Péter wrote: Hi, I would like to ask you, whether is there somebody, from whom I can ask permissions, to use the name code4lib.hu for an unconference meetup, where Hungarian library coders could talk, and pair-program in a style of a Drupal codesprint or OCLC mashaton? Péter eXtensible Catalog
[CODE4LIB] code4lib.hu meetup
Hi, I would like to ask whether there is somebody from whom I can ask permission to use the name code4lib.hu for an unconference meetup, where Hungarian library coders could talk and pair-program in the style of a Drupal codesprint or an OCLC mashathon? Péter eXtensible Catalog
Re: [CODE4LIB] faceted browsing
Hi Jill, The eXtensible Catalog (http://eXtensibleCatalog.org) provides similar functionality. The user interface of XC is a set of Drupal modules, and it runs inside Drupal, which is probably the most popular PHP CMS. Our modules (called the Drupal Toolkit) are able to harvest metadata from OAI-PMH repositories, then process the XML and save the fields into MySQL and Solr. We provide administrator interfaces where you can decide how to index different fields and what kind of facets you want to build from the fields, and - still inside the admin interface - you can create search and browse interfaces, including search forms, navigable lists, and templates for results. You can interact with your ILS for circulation data or authentication. You can mash up the results with additional data from external sources, like tables of contents, cover images, and reviews. The Drupal Toolkit is still an alpha release; we plan to issue the first more stable release within weeks. You can see more in the eXtensible Catalog screencast: http://www.screencast.com/users/eXtensibleCatalog (the second part is about the Drupal Toolkit). You can download the software from here: http://drupal.org/project/xc. If you have any questions, don't hesitate to contact me or the leaders of the project. Regards, Péter Király http://eXtensibleCatalog.org

- Original Message - From: "Earles, Jill Denae" To: Sent: Monday, February 08, 2010 5:58 PM Subject: [CODE4LIB] faceted browsing I would like recommendations for faceted browsing systems that include authentication, and easily support multimedia content and metadata. The ability to add comments and tags to content, and browse by tag cloud is also desirable. My skills include ColdFusion, PHP, CakePHP, and XML/XSL. The only system I've worked with that includes faceted browsing is XTF, and I don't think it's well suited to this. I am willing to learn a new language/technology if there is a system that includes most of what I'm looking for.
Please let me know of any open-source systems you know of that might be suited to this. If you have time and interest, see the detailed description of the system below. Thank you, Jill Earles Detailed description: I am planning to build a system to manage a collection of multimedia artwork, to include audio, video, images, and text along with accompanying metadata. The system should allow for uploading the content and entering metadata, and discovery of content via searching and faceted browsing. Ideally it will also include a couple of ways of visually representing the relationships between items (for example, a video and the images and audio files that are included in the video, and notes about the creative process). The views we've conceived of at this point include a "flow" view that shows relationships with arrows between them (showing chronology or this begat that relationship), and a "constellation" view that shows all of the related items, with or without lines between them. It needs to have security built in so that only contributing members can search and browse the contributions by default. Ideally, there would be an approval process so that a contributor could propose making a work public, and if all contributors involved in the work (including any components of the work, i.e. the images and audio files included in the video) give their approval, the work would be made public. The public site would also have faceted browsing, searching by all metadata that we make public, and possibly tag clouds, and the ability to add tags and comments about the work.
Re: [CODE4LIB] solr - search query count | highlighting
Hi Eric, If you use the &debugQuery=on parameter, you'll receive the "explain" structure, which tells you about the factors in the score calculation. An example:

1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:
  1.4142135 = tf(termFreq(text:chant)=2)
  6.8230457 = idf(docFreq=1, numDocs=676)
  0.15625 = fieldNorm(field=text, doc=0)

Here tf(termFreq(text:chant)=2) tells you that the queried term was found two times in the document. You could apply a regex to extract this info from the explain string. Since this is an analyzed term, it is possible that it does not equal the user's input, but the debug output's 'parsedquery' parameter tells you the terms Solr searches for behind the scenes. In Lucene, if the field stores the term vector's positions, there are API calls with which you can get the exact place of the term within the field (as character positions, or as the n-th token), but I don't know how to extract this info through Solr. Hope this helps. Király Péter eXtensible Catalog http://xcproject.org

- Original Message - From: "Eric James" To: Sent: Friday, October 16, 2009 9:52 PM Subject: Re: [CODE4LIB] solr - search query count | highlighting Thanks for your response. But, yes I'm able to use facets in general, and yes I'm able to do highlighting on stored fields. But finding how many times the query appears in the full text is my question. For example, say you search on "Heisenberg". We'd like to see:

Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid
Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid
Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid
etc

Could there be a solr parameter that calculates this? Otherwise a klugey, not very scalable method could be that once you retrieve a solr result xml, find the fedora pid, retrieve the EAD full text, run a standard function to count how many times the query appears in the text for each hit, and add parameters back into the xml with these counts.
Date: Fri, 16 Oct 2009 15:27:42 -0400 From: ewg4x...@gmail.com Subject: Re: [CODE4LIB] solr - search query count | highlighting To: CODE4LIB@LISTSERV.ND.EDU Hi Eric, You do not have to store the entire text content of the EAD guide in order to enable facets. Here's an example: http://kittredgecollection.org/results?q=*:* . There are about 15 facets enabled on a collection of almost 1500 EAD documents (though quite small in filesize compared to traditional EAD finding aids), and there's no slowdown whatsoever. I don't believe you need to store the guides to enable highlighting either, though I have heard there is some dropoff in performance with highlighting enabled. I've never done benchmarking on highlighting enabled versus disabled, so I can't tell you how much of a dropoff there is. In an index of only several hundred documents, I would think that the dropoff with highlighting enabled would be fairly negligible. Ethan On Fri, Oct 16, 2009 at 3:12 PM, Eric James wrote: > For our finding aids, we are using fedoragenericsearch 2.2 with solr as > index. Because the EADs can be huge, the EADs are indexed but not stored > (with stored EADs, search time for ~500 objects = 20 min rather than < 1 > sec). > > > > However, we would like to have number of search terms found within each > hit. For example, CDL's collection: > > http://www.oac.cdlib.org/search?query=Donner > > > > Also we would like highlighting/snippets of the search term similar to > CDL's. > > > > Is it a lost cause to have this functionality without storing the EAD? > Is > there a way to store the EAD and have a reasonable response time? > > > > --- > > Eric James > > Yale University Libraries > > > > >
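The regex extraction suggested in the reply above (pulling the term frequency out of Solr's debugQuery "explain" text) might look like this in JavaScript. The explain text follows the example given in that reply; the function name is my own, and field/term are assumed to be regex-safe literals:

```javascript
// Pull the per-document term frequency out of a Solr "explain" string,
// matching fragments like "tf(termFreq(text:chant)=2)".
// field and term are assumed to contain no regex metacharacters.
function termFreqFromExplain(explain, field, term) {
  const re = new RegExp('tf\\(termFreq\\(' + field + ':' + term + '\\)=(\\d+)\\)');
  const m = explain.match(re);
  return m ? parseInt(m[1], 10) : null;
}

const explain =
  '1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:\n' +
  '  1.4142135 = tf(termFreq(text:chant)=2)\n' +
  '  6.8230457 = idf(docFreq=1, numDocs=676)\n' +
  '  0.15625 = fieldNorm(field=text, doc=0)';
console.log(termFreqFromExplain(explain, 'text', 'chant')); // 2
```

Since the frequency is reported against the analyzed term, you would look up the analyzed form in the 'parsedquery' debug output first and feed that form to the function.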