[CODE4LIB] code4lib.hu codesprint report

2010-06-15 Thread Király Péter

Hi!

I am glad to report that we held the first code4lib.hu codesprint yesterday.
The purpose was to code together and learn something from
each other. It was a 3.5-hour session at the National Széchényi Library,
Budapest. We created a script that extracts ISBN numbers and book
cover images from an OAI-PMH data provider, where they are embedded in METS
records. Hopefully this code will become part of two or three different
library or book-related services in the coming months. We also discussed the
technical details, the advantages, and the rights problems of uploading
a local history photo collection to Flickr. Unfortunately we didn't
have time to code the Flickr part.
There were only a couple of coders, but we had a good talk and made new
acquaintances.

(For those in #code4lib: this time we had neither bbq nor 'slambuc', but
lots of biscuits and mineral water. ;-)

If - for whatever reason - you want to follow or join us, see our group page:

http://groups.google.com/group/ikr-fejlesztok/

The meeting was run as a session of the workshop held by the Library's K2
(library 2.0) task force on the use of library 2.0 tools.
http://blog.konyvtar.hu/k2/

Some technical details:
- we use PHP as the common language
- for OAI-PMH harvesting we use Omeka's OAI harvester plugin
- for Flickr communication we planned to use Phlickr, a PHP library
- the OAI server we harvested runs at the University of Debrecen and is
based on DSpace
- we found a bug in the Ubuntu version of PHP 5.2.10 (SimpleXMLElement has
a problem with the xpath() method), but we found a workaround as well.
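
The ISBN part of such a script can be sketched roughly as follows. This is
not the PHP code written at the sprint: it is a minimal JavaScript
illustration, and the function names (isValidIsbn, extractIsbns) and the
idea of scanning the serialized record text are assumptions made for the
example. It collects ISBN-like tokens from a text and keeps only those whose
check digit is valid.

```javascript
// Hypothetical helper: true if `isbn` (digits, optional final X, no hyphens)
// has a valid ISBN-10 or ISBN-13 check digit.
function isValidIsbn(isbn) {
  if (/^\d{9}[\dX]$/.test(isbn)) {
    // ISBN-10: the weighted sum (weights 10, 9, ..., 1) must be divisible by 11.
    let sum = 0;
    for (let i = 0; i < 10; i++) {
      const value = isbn[i] === "X" ? 10 : Number(isbn[i]);
      sum += value * (10 - i);
    }
    return sum % 11 === 0;
  }
  if (/^\d{13}$/.test(isbn)) {
    // ISBN-13: alternating weights 1 and 3, the sum must be divisible by 10.
    let sum = 0;
    for (let i = 0; i < 13; i++) {
      sum += Number(isbn[i]) * (i % 2 === 0 ? 1 : 3);
    }
    return sum % 10 === 0;
  }
  return false;
}

// Hypothetical helper: collect the distinct valid ISBNs found in a text,
// e.g. in the serialized metadata of one harvested record.
function extractIsbns(text) {
  const candidates = text.match(/\b[\d][\d -]{8,15}[\dXx]\b/g) || [];
  const found = new Set();
  for (const candidate of candidates) {
    // Drop hyphens and spaces before checking the check digit.
    const normalized = candidate.replace(/[ -]/g, "").toUpperCase();
    if (isValidIsbn(normalized)) found.add(normalized);
  }
  return [...found];
}
```

Validating the check digit is a cheap way to filter out other identifiers
that merely look like ISBNs.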

Regards,
Péter


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-14 Thread Király Péter

Hi Fernando,

I have started experimenting with MARC in Mongo. I have imported ~6 million
MARC records (auth and bib) into MongoDB. The steps I took:

1) the source was MARCXML, which I created with the XC OAI Toolkit.
2) I created an XSLT file which creates MARC-JSON from MARCXML.
I followed the MARC-JSON draft and not Bill Dueber's MARC-HASH:
http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11. The
conversion is not 100% perfect, but out of the 6 million records only 20
were converted with errors, which is a good enough error rate for a
home-made project.
3) imported the files
4) indexed the files

Lessons learned:
- the import process is much quicker than any other part of the workflow.
The 6 million records were imported in about 30 minutes, while indexing took
3 hours.
- count() is a very slow method for complex queries, even after intensive
indexing, but iterating over the results is quicker.
- there is no way to index parts of strings (e.g. splitting the leader or
the 006/007/008 fields)
- full text search is not too quick
- before indexing the size of the index was 9 GB, after full indexing it
was 28 GB (I should note that on a 32-bit operating system the max size of
a Mongo index is 2 GB).

Conclusions:
- the MARC-JSON format is good for data exchange, but it is not precise
enough for searching, since (as a MARC heritage) distinct pieces of
information are combined into single fields (Leader, 008 etc). We should
split them into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB
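
To illustrate the splitting idea, here is a minimal sketch that cuts a MARC
leader into named parts before the record is stored. The positions follow
the MARC 21 leader layout; the function names, the property names and the
assumption that the record keeps its leader in a `leader` property are made
up for this example and are not taken from the MARC-JSON draft.

```javascript
// Split a 24-character MARC 21 leader into named parts.
// The property names are hypothetical; the positions are those of MARC 21.
function splitLeader(leader) {
  if (typeof leader !== "string" || leader.length !== 24) {
    throw new Error("A MARC leader must be exactly 24 characters long");
  }
  return {
    recordLength: leader.slice(0, 5),         // positions 00-04
    recordStatus: leader[5],                  // position 05
    typeOfRecord: leader[6],                  // position 06
    bibliographicLevel: leader[7],            // position 07
    typeOfControl: leader[8],                 // position 08
    characterCodingScheme: leader[9],         // position 09
    baseAddressOfData: leader.slice(12, 17),  // positions 12-16
    encodingLevel: leader[17],                // position 17
    descriptiveCatalogingForm: leader[18],    // position 18
  };
}

// Hypothetical helper: keep the original record and add the split leader
// next to it, so each part gets its own path (e.g. "leaderParts.typeOfRecord").
function withLeaderParts(record) {
  return { ...record, leaderParts: splitLeader(record.leader) };
}
```

The same approach could be applied to the fixed-length 006/007/008 fields,
although their layout depends on the type of material, so the mapping would
need one table per type.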

I can give you more technical details, if you are interested.

Péter
eXtensible Catalog


- Original Message - 
From: Fernando Gómez fjgo...@gmail.com

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?



There's been some talk in code4lib about using MongoDB to store MARC
records in some kind of JSON format. I'd like to know if you have
experimented with indexing those documents in MongoDB. From my limited
exposure to MongoDB, it seems difficult, unless MongoDB supports some
kind of custom indexing functionality.

According to the MongoDB docs [1], you can create an index by calling
the ensureIndex() function, and providing a document that specifies
one or more keys to index. Examples of this are:

   db.things.ensureIndex({city: 1})
   db.things.ensureIndex({"address.city": 1})

That is, you specify the keys giving a path from the root of the
document to the data element you are interested in. Such a path acts
both as the index's name, and as a specification of how to get the
keys' values.

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such
a path. For example, say you want an index on field 001. Simplifying,
the JSON docs would look like this

   { "fields" : [ ["001", "001 value"], ... ] }

or this

   { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could
provide an index name and separately a JavaScript function specifying
how to obtain the keys' values for that index.

Any suggestions? Do other document oriented databases offer a better
solution for this?


BTW, I fed MongoDB with the example MARC records in [2] and [3], and
it choked on them. Both are missing some commas :-)


[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11


--
Fernando Gómez
Biblioteca Antonio Monteiro
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
http://inmabb.criba.edu.ar/



Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Király Péter

Hi Fernando,

Yesterday I switched Ubuntu to the 64-bit version, because I'd like to try
out MongoDB for indexing library records, and the 32-bit version has some
limitations (a database cannot exceed 2 GB). I haven't tried MARC yet, only
XC records, which are a derivative of MARC, but from the documentation I
gather that the idea is absolutely possible.

This is an example from Mongo's documentation [1]:

doc = { author: 'joe',
 created : new Date('03-28-2009'),
 title : 'Yet another blog post',
 text : 'Here is the text...',
 tags : [ 'example', 'joe' ],
 comments : [ { author: 'jim', comment: 'I disagree' },
 { author: 'nancy', comment: 'Good post' }
 ]
}
db.posts.insert(doc)
db.posts.find( { 'comments.author' : 'jim' } )

The most exciting thing here, for me, is that it is not just a simple
key-value store (like Lucene/Solr), but provides embedded fields, so you can
safely insert subfields, indicators etc. They will remain compact and
findable. So you can combine the relations known from traditional relational
databases with the flexibility and speed known from Solr.
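
For the field 001 question, one possible approach is to reshape the record
before inserting it, so that each control field gets its own key path. The
sketch below assumes the { controlfield: [ { tag, data }, ... ] } shape
quoted from the MARC-JSON draft in this thread; the function name and the
`control` property are made up for the example, and I have not tried this
on real data.

```javascript
// Hypothetical helper: turn [{ tag, data }, ...] into { tag: data, ... }
// and store it under a new "control" property of the record.
// Repeated tags are collected into an array so that no value is lost.
function keyControlFieldsByTag(record) {
  const control = {};
  for (const { tag, data } of record.controlfield || []) {
    if (!Object.prototype.hasOwnProperty.call(control, tag)) {
      control[tag] = data;
    } else if (Array.isArray(control[tag])) {
      control[tag].push(data);
    } else {
      control[tag] = [control[tag], data];
    }
  }
  return { ...record, control };
}
```

After such a step the record has a path like control.001, which is the kind
of key path that ensureIndex() expects according to the docs quoted in the
original question.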

I will let you know as soon as I have inserted the first MARC records into
Mongo.

[1] http://www.mongodb.org/display/DOCS/Inserting

regards,
Péter
eXtensible Catalog


[CODE4LIB] code4lib.hu workshop

2010-04-12 Thread Király Péter

Dear code4lib-ers,

Last week (Wednesday afternoon) we held the first
code4lib.hu workshop in Debrecen, at the University Library.
The purpose of the meeting was for library developers
and power users of library information systems to meet and talk
to each other, so that in the future different systems
can communicate over standard protocols, which is the
basic condition of any mashupable, shareable service.

In advance only 9 people said that they would be there for
sure, but in the end 28 developers participated, from libraries
and developer companies. The result was not a workshop for
hardcore coders, but an interesting and (more importantly)
productive discussion. Since participants were not tied to any
concrete project, we could discuss a somewhat 'ideal' state of the art:
how to get there, and what development and library policy steps
would be involved. The discussion focused on uniform
library authentication (one entry point for all Hungarian libraries)
and inter-library loan. Some important statements:

- the services should be based on standards, either international ones
or, if we cannot find a suitable one, a domestic (Hungarian) standard
that we could create

- the authentication system provided by the National
Infrastructure Agency does not fit all libraries, since
even the university libraries have users who are not university
citizens, so they lack university identifiers

- bilateral agreements between libraries are a must for
unified authentication: library A accepts the authentication
system of library B, and provides services for the users
of library B

- the current statistical measurements are outdated and cannot
reflect such shared services, but since statistics are
the most important measuring tool for the owners of libraries,
libraries tend not to develop shared services, because they
could lose some of their resources (they would spend on things
which are not reflected in the statistics...)

- inter-library loans could be initiated by the users, which
would take some of the burden off the librarians. The librarians
could still control the whole process, but not as the only player.

The meeting was not aimed at agreeing on anything, so we did not create
any document or manifesto, but there were some ideas about the
continuation. Since then, one of the participants has bought the code4lib.hu
domain and offered it for free for community use. We restarted
an older listserv (at http://groups.google.com/group/ikr-fejlesztok),
and we decided that we will continue the meetings in the near
future with lightning talks and discussions on library standards
(like NCIP, inter-library loan etc.), and personally I hope
that we can hold a mashathon-like meeting.

Final note: somebody said on the code4lib IRC that we
would miss the bbq. Well, we didn't have bbq, but as I promised
we had slambuc, a traditional shepherds' dish from the Debrecen area.

Thank you for your support!

Király Péter
http://eXtensibleCatalog.org 


Re: [CODE4LIB] planet code4lib code (was: newbie)

2010-03-29 Thread Király Péter
- Original Message - 
From: Aaron Rubinstein arubi...@library.umass.edu

I would like to see:

1.  Code snippets/gists.


For the interface I can imagine something similar to http://pastebin.com/,
like http://drupal.pastebin.com/41WtCpTY, maybe with library-tech related
categories (UI, search, circ, admin UI, DB, XML, ...)

Péter
http://eXtensibleCatalog.org 


[CODE4LIB] code4lib.hu meetup

2010-03-10 Thread Király Péter

Hi,

I would like to ask whether there is somebody I can ask for
permission to use the name code4lib.hu for an unconference meetup, where
Hungarian library coders could talk and pair-program in the style of a Drupal
codesprint or OCLC mashathon?

Péter
eXtensible Catalog 


Re: [CODE4LIB] code4lib.hu meetup

2010-03-10 Thread Király Péter

Dear Jonathan and Edward,

Thank you for your kindness. I will let you know if the initiative
is successful.

Regards,
Péter

PS. Edward: if you come to Hungary and would like some
advice about nice places here, drop me a private email; maybe I can
help you.


- Original Message - 
From: Edward M. Corrado ecorr...@ecorrado.us

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Wednesday, March 10, 2010 5:14 PM
Subject: Re: [CODE4LIB] code4lib.hu meetup


As Jonathan pointed out, there is nobody to ask formal permission - just 
go ahead and do it. Personally, I would love to see some of these regional 
code4lib conferences/meetups/symposium/whatever happen around the world. 
Who knows, I might even show up to one :-).


Edward - who actually plans to be in Hungary for a day or two in late June 
on his way to Romania.




Jonathan Rochkind wrote:
There's nobody to ask formal permission from, but I think you've done the 
right thing by suggesting it on this listserv and seeing what the 
community thinks.


As one member of the community, I think that's a great idea and an 
appropriate use of the code4Lib name, and I expect that everyone else 
will think so too.
You are also welcome to use the Code4Lib wiki if it's useful for your 
local group/meeting.  You can see that other local/regional/national 
Code4Lib meetups very similar to what you envision have already listed 
themselves on the wiki and make use of the wiki. Look under Local / 
Regional Groups on http://wiki.code4lib.org/index.php/Main_Page  .   You 
are welcome to list your group on the wiki and use the wiki if you like.


Jonathan

Király Péter wrote:

Hi,

I would like to ask whether there is somebody I can ask for
permission to use the name code4lib.hu for an unconference meetup, where
Hungarian library coders could talk and pair-program in the style of a Drupal
codesprint or OCLC mashathon?

Péter
eXtensible Catalog





Re: [CODE4LIB] faceted browsing

2010-02-09 Thread Király Péter

Hi Jill,

The eXtensible Catalog (http://eXtensibleCatalog.org) provides similar
functionality. The user interface of XC is a set of Drupal modules, and
it runs inside Drupal, which is probably the most popular PHP CMS
application.

Our modules (called the Drupal Toolkit) are able to harvest metadata from
OAI-PMH repositories, then process the XML and save fields in MySQL and in
Solr.
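
For readers who have not used OAI-PMH: harvesting starts with plain HTTP
requests such as ListRecords. The sketch below only builds the request URL.
It is not the Drupal Toolkit's code (which is PHP); the function name and the
example endpoint are made up, while the verb and argument names (verb,
metadataPrefix, set, resumptionToken) are those of the OAI-PMH protocol.

```javascript
// Hypothetical helper: build an OAI-PMH ListRecords request URL.
// `baseUrl` is the repository endpoint, e.g. "http://example.org/oai".
function listRecordsUrl(baseUrl, { metadataPrefix, set, resumptionToken } = {}) {
  const url = new URL(baseUrl);
  url.searchParams.set("verb", "ListRecords");
  if (resumptionToken) {
    // A resumptionToken is an exclusive argument: it replaces all the others
    // when asking for the next chunk of a long result list.
    url.searchParams.set("resumptionToken", resumptionToken);
  } else {
    // oai_dc is the metadata format every OAI-PMH repository must support.
    url.searchParams.set("metadataPrefix", metadataPrefix || "oai_dc");
    if (set) url.searchParams.set("set", set);
  }
  return url.toString();
}
```

A harvester would fetch the first URL, process the returned records, and
keep requesting with the returned resumptionToken until the repository stops
sending one.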

We provide administrator interfaces where you can decide how to
index different fields and what kind of facets you want to build from
the fields, and -- still inside the admin interface -- you can create search
and browse interfaces, including search forms, navigable lists and
templates for results. You can interact with your ILS for circulation data
or authentication.

You can mash up the results with additional data from external sources, like
tables of contents, cover images and reviews.

The Drupal Toolkit is still in alpha release; we plan to issue the first
more stable release within weeks.

You can see more in the eXtensible Catalog screencast:
http://www.screencast.com/users/eXtensibleCatalog
(the second part is about the Drupal Toolkit).

You can download the software from here: http://drupal.org/project/xc.
If you have any questions, don't hesitate to contact me or the
leaders of the project.

Regards,

Péter Király
http://eXtensibleCatalog.org



- Original Message - 
From: Earles, Jill Denae jdear...@ku.edu

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Monday, February 08, 2010 5:58 PM
Subject: [CODE4LIB] faceted browsing



I would like recommendations for faceted browsing systems that include
authentication, and easily support multimedia content and metadata.  The
ability to add comments and tags to content, and browse by tag cloud is
also desirable.

My skills include ColdFusion, PHP, CakePHP, and XML/XSL.  The only
system I've worked with that includes faceted browsing is XTF, and I
don't think it's well suited to this.  I am willing to learn a new
language/technology if there is a system that includes most of what I'm
looking for.

Please let me know of any open-source systems you know of that might be
suited to this.  If you have time and interest, see the detailed
description of the system below.

Thank you,
Jill Earles

Detailed description:

I am planning to build a system to manage a collection of multimedia
artwork, to include audio, video, images, and text along with
accompanying metadata.  The system should allow for uploading the
content and entering metadata, and discovery of content via searching
and faceted browsing.  Ideally it will also include a couple of ways of
visually representing the relationships between items (for example, a
video and the images and audio files that are included in the video, and
notes about the creative process).  The views we've conceived of at this
point include a flow view that shows relationships with arrows between
them (showing chronology or this begat that relationship), and a
constellation view that shows all of the related items, with or
without lines between them.

It needs to have security built in so that only contributing members can
search and browse the contributions by default.  Ideally, there would be
an approval process so that a contributor could propose making a work
public, and if all contributors involved in the work (including any
components of the work, i.e. the images and audio files included in the
video) give their approval, the work would be made public.  The public
site would also have faceted browsing, searching by all metadata that we
make public, and possibly tag clouds, and the ability to add tags and
comments about the work.



Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Király Péter

Hi Eric,

If you use the debugQuery=on parameter, you'll receive the explain
structure, which tells you the factors used in calculating the score.
An example:

<str name="oai:URMST:Transformation_Service/1">
1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:
 1.4142135 = tf(termFreq(text:chant)=2)
 6.8230457 = idf(docFreq=1, numDocs=676)
 0.15625 = fieldNorm(field=text, doc=0)
</str>

Here tf(termFreq(text:chant)=2) tells you that the queried term was found
two times in the document. You can apply a regex to extract this info from
the explain string. Since this term is an analyzed term, it may not be
equal to the user input, but the debug output's 'parsedquery' parameter
tells you the terms Solr searches for behind the scenes.
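
A minimal sketch of such a regex, assuming the explain text has the same
shape as the example above (the function name is made up, and other Solr
versions may format the explanation differently):

```javascript
// Hypothetical helper: pull term frequencies out of a Solr explain string.
// Returns a map such as { "text:chant": 2 }. If a term occurs in several
// clauses of the explanation, the largest reported frequency is kept.
function termFrequencies(explain) {
  const pattern = /termFreq\(([^)]+)\)=(\d+(?:\.\d+)?)/g;
  const result = {};
  for (const match of explain.matchAll(pattern)) {
    const term = match[1];          // e.g. "text:chant"
    const freq = Number(match[2]);  // e.g. 2
    result[term] = Math.max(result[term] || 0, freq);
  }
  return result;
}
```

The keys are the analyzed terms (field:term), so they have to be matched
against 'parsedquery' rather than against the raw user input.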

In Lucene, if the field stores the term vector's positions, there are API
calls with which you can get the exact place of the term within the field
(as character positions, or as the n-th token), but I don't know how to
extract this info through Solr.


Hope this helps.

Király Péter
eXtensible Catalog
http://xcproject.org

- Original Message - 
From: Eric James cirese...@hotmail.com

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Friday, October 16, 2009 9:52 PM
Subject: Re: [CODE4LIB] solr - search query count | highlighting


Thanks for your response.  But, yes I'm able to use facets in general, and 
yes I'm able to do highlighting on stored fields.

But finding how many times the query appears in the full text is my 
question. For example, say you search on Heisenberg. We'd like to see:

Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid
Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid
Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid
etc.

Could there be a Solr parameter that calculates this? Otherwise a kludgy, 
not very scalable method could be that once you retrieve a Solr result XML, 
find the Fedora PID, retrieve the EAD full text, run a standard function to 
count how many times the query appears in the text for each hit, and add 
parameters back into the XML with these counts.


Date: Fri, 16 Oct 2009 15:27:42 -0400
From: ewg4x...@gmail.com
Subject: Re: [CODE4LIB] solr - search query count | highlighting
To: CODE4LIB@LISTSERV.ND.EDU

Hi Eric,

You do not have to store the entire text content of the EAD guide in order
to enable facets. Here's an example:
http://kittredgecollection.org/results?q=*:* . There are about 15 facets
enabled on a collection of almost 1500 EAD documents (though quite small in
filesize compared to traditional EAD finding aids), and there's no slowdown
whatsoever. I don't believe you need to store the guides to enable
highlighting either, though I have heard there is some dropoff in
performance with highlighting enabled. I've never done benchmarking on
highlighting enabled versus disabled, so I can't tell you how much of a
dropoff there is. In an index of only several hundred documents, I would
think that the dropoff with highlighting enabled would be fairly negligible.


Ethan

On Fri, Oct 16, 2009 at 3:12 PM, Eric James cirese...@hotmail.com wrote:

 For our finding aids, we are using fedoragenericsearch 2.2 with solr as
 index. Because the EADs can be huge, the EADs are indexed but not stored
 (with stored EADs, search time for ~500 objects = 20 min rather than  1
 sec).



 However, we would like to have number of search terms found within each
 hit. For example, CDL's collection:

 http://www.oac.cdlib.org/search?query=Donner



 Also we would like highlighting/snippets of the search term similar to
 CDL's.



 Is it a lost cause to have this functionality without storing the EAD?
 Is there a way to store the EAD and have a reasonable response time?



 ---

 Eric James

 Yale University Libraries