To add to the melting pot, we (Oxford) are working on the development of both a data repository (DataBank) and a data catalogue as part of the JISC-funded Damaro project.
http://damaro.oucs.ox.ac.uk/

DataBank data repository development also forms part of the DataFlow project: http://www.dataflow.ox.ac.uk/

Sally

--
Sally Rumsey
Digital Collections Development Manager
Bodleian Digital Library Systems and Services (BDLSS)
sally.rum...@bodleian.ox.ac.uk
Tel: 01865 283860


> From: Stuart Lewis <s.le...@auckland.ac.nz>
> Date: Mon, 5 Dec 2011 23:31:47 +0000
> To: David FLANDERS <d.fland...@jisc.ac.uk>
> Cc: <sword-app-tech@lists.sourceforge.net>, <oda...@gmail.com>, Rufus <rufus.poll...@okfn.org>, Kathi Fletcher <kathi.fletc...@gmail.com>
> Subject: Re: [sword-app-tech] How to send large files
>
> FWIW we've got a couple of students under New Zealand's "Summer of eResearch" scheme looking at implementing a university data catalogue, with CKAN being one of the candidate systems. Will let you know how we get on.
>
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
>
> On 6/12/2011, at 11:28 AM, David FLANDERS wrote:
>
>> I've bugged Rufus a fair amount on this; one of his project managers, Mark Macgillvray, has been thinking about this re 'Open Scholarship' a fair amount. I wish we could get a university to start playing around with this. Of course the Tardis folk down in Oz have been doing good things as well; Cc Steve Androulakis. /dff
>>
>> From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
>> Sent: 05 December 2011 14:50
>> To: David FLANDERS
>> Cc: <sword-app-tech@lists.sourceforge.net>; Rufus; Leggett, Pete
>> Subject: Re: [sword-app-tech] How to send large files
>>
>> Hi,
>>
>> I have CC'd Rufus Pollock of CKAN in case he has ideas about some sort of system where papers go in document repositories like DSpace and EPrints, and data goes in data repositories like CKAN etc.
>>
>> Kathi
>>
>> ---------- Forwarded message ----------
>> From: David FLANDERS <d.fland...@jisc.ac.uk>
>> Date: 2011/12/5
>> Subject: Re: [sword-app-tech] How to send large files
>> To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
>> Cc: <sword-app-tech@lists.sourceforge.net>, "Leggett, Pete" <p.f.legg...@exeter.ac.uk>
>>
>> +1
>>
>> Why not use systems *built for* data instead of a system built for research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple, graph, key-value)...?
>>
>> I'd like to hear a good reason not to use these systems and interoperate with repositories, rather than building the same functionality into repositories. /dff
>>
>> From: Ben O'Steen [mailto:bost...@gmail.com]
>> Sent: 05 December 2011 08:00
>> To: Stuart Lewis
>> Cc: <sword-app-tech@lists.sourceforge.net>; Leggett, Pete
>> Subject: Re: [sword-app-tech] How to send large files
>>
>> While I think I understand the drive to put these files within a repository, I would suggest caution. Just because it might be possible to put a file into the care of a repository doesn't make it a practical or useful thing to do.
>>
>> - What do you feel you might gain by placing 500GB+ files into a repository, compared with having them in an addressable filestore?
>> - Have people been able to download files of that size from DSpace, Fedora or EPrints?
>> - Has the repository been allocated space on a suitable filesystem? XFS, EBS, Thumper or similar?
>> - Once the file is ingested into DSpace or Fedora, for example, is there any other route to retrieve it aside from HTTP? (Coding your own servlet/add-on is not a real answer to this.) Is it easily accessible via GridFTP or HPN-SSH, for example?
>> - Can the workflows you wish to use handle the data you are giving them? Is any broad-stroke tool aside from fixity useful here?
>>
>> Again, I am advising caution here, not besmirching the name of repositories. They do a good job with what we might currently term "small files", but they were never developed with research data sizes in mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not uncommon).
>>
>> So, in short, weigh up the benefits against the downsides, and not in hypotheticals: actually do it, and get real researchers to try to use it. You'll soon have a metric to show what is useful and what isn't.
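Ben's last question (whether any broad-stroke tool aside from fixity is useful) deserves a concrete illustration. Fixity checking is the one repository-wide operation that scales to files of this size, because it streams: memory use is constant no matter how large the file. A minimal Java sketch, with MD5 chosen to match the checksum DSpace records for each bitstream; the chunk size is an arbitrary assumption:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    // Streams a file of any size through a digest in fixed-size chunks,
    // so memory use stays constant whether the file is 5MB or 500GB.
    public class FixityCheck {

        public static String md5(String path) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8 * 1024 * 1024]; // 8MB read buffer
            InputStream in = new FileInputStream(path);
            try {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            } finally {
                in.close();
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(args[0] + " " + md5(args[0]));
        }
    }

Even so, a streaming pass over a 500GB file is I/O-bound and can take hours, which is exactly the sort of cost Ben is asking people to weigh.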
>> On Monday, 5 December 2011, Stuart Lewis wrote:
>>
>> Hi Pete,
>>
>> Thanks for the information. I've attached a piece of code that we use locally as part of the curation framework (in DSpace 1.7 or above), written by a colleague, Kim Shepherd. The curation framework allows small jobs to be run on single items, collections, communities, or the whole repository. This particular job looks for a filename in a prescribed metadata field and, if there is no matching bitstream, ingests the file from disk.
>>
>> More details of the curation system can be seen at:
>>
>> - https://wiki.duraspace.org/display/DSPACE/CurationSystem
>> - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
>>
>> Some other curation tasks that Kim has written:
>>
>> - https://github.com/kshepherd/Curation
>>
>> This can be used by depositing the metadata via SWORD, with the filename in a metadata field. Optionally, the code could be changed to copy the file from another source (e.g. FTP, HTTP, Grid, etc.).
>>
>> Thanks,
>>
>> Stuart
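The attached task itself does not survive in the archive, but the pattern Stuart describes maps naturally onto the DSpace 1.7+ curation API. The sketch below is a hypothetical reconstruction, not Kim Shepherd's code: the metadata field (dc.source) and the staging directory are illustrative assumptions, and the real tasks are at the GitHub link above.

    import java.io.FileInputStream;
    import java.io.IOException;

    import org.dspace.content.Bitstream;
    import org.dspace.content.Bundle;
    import org.dspace.content.DCValue;
    import org.dspace.content.DSpaceObject;
    import org.dspace.content.Item;
    import org.dspace.curate.AbstractCurationTask;
    import org.dspace.curate.Curator;

    // Hypothetical curation task: if an item names a file in a metadata
    // field but has no matching bitstream, ingest the file from a staging
    // directory on local disk. Field and directory are assumptions.
    public class IngestFromDiskTask extends AbstractCurationTask {

        private static final String STAGING_DIR = "/data/staging/"; // assumed

        @Override
        public int perform(DSpaceObject dso) throws IOException {
            if (!(dso instanceof Item)) {
                return Curator.CURATE_SKIP;
            }
            Item item = (Item) dso;
            // Filename expected in dc.source (an assumption for this sketch).
            DCValue[] names = item.getMetadata("dc", "source", null, Item.ANY);
            if (names.length == 0) {
                return Curator.CURATE_SKIP;
            }
            String filename = names[0].value;
            try {
                // Skip items that already have the bitstream attached.
                for (Bundle bundle : item.getBundles("ORIGINAL")) {
                    for (Bitstream bs : bundle.getBitstreams()) {
                        if (filename.equals(bs.getName())) {
                            setResult("Already present: " + filename);
                            return Curator.CURATE_SKIP;
                        }
                    }
                }
                Bundle[] originals = item.getBundles("ORIGINAL");
                Bundle bundle = (originals.length > 0)
                        ? originals[0] : item.createBundle("ORIGINAL");
                FileInputStream in = new FileInputStream(STAGING_DIR + filename);
                try {
                    Bitstream bs = bundle.createBitstream(in);
                    bs.setName(filename);
                    bs.update();
                    item.update();
                } finally {
                    in.close();
                }
                setResult("Ingested " + filename);
                return Curator.CURATE_SUCCESS;
            } catch (Exception e) {
                setResult("Ingest failed: " + e.getMessage());
                return Curator.CURATE_FAIL;
            }
        }
    }

A task wired up this way is named in the curation configuration and can then be run against a single item, a collection, or the whole repository, e.g. via the 'dspace curate' command-line tool.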
>> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
>>
>>> Hi Stuart,
>>>
>>> You asked for more info. We are developing a Research Data Repository based on DSpace for storing the research data associated with Exeter University research publications. For some research fields, such as physics and biology, this data can be very large (TBs, it seems!), hence the need to consider large ingests that might take several days.
>>>
>>> The researcher has the data and would, I am guessing, create the metadata, though maybe in collaboration with a data curator. Ideally the researcher would perform the deposit with, for large data sets, an offline ingest of the data itself. The data can be on the researcher's server, workstation, laptop, DVD, USB hard drive, etc.
>>>
>>> There seem to be at least a couple of ways of approaching this, so what I was after was some references to what other people have done, and how, to give me a better handle on the best way forward, having very little DSpace or repository experience myself. Given the size of larger data sets, I do think the best solution will involve as little copying of the data as possible, the ultimate being just one copy process, of the data from source into the repository, with everything else done by reference if that is possible.
>>>
>>> Are you perhaps able to point me at some "code" examples for the SWORD deposit you talk about, where a second process ingests the files? Would this be coded in Java? Does the ingest process have to be Java-based, or can it be a Perl script, for example? Please forgive my DSpace ignorance!
>>>
>>> Best regards,
>>>
>>> Pete
>>>
>>>
>>> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
>>>
>>>> Hi Pete,
>>>>
>>>> 'Deposit by reference' would probably be used to 'pull' data from a remote server. If you already have the data on your DSpace server, as Mark points out, there might be better ways to perform the import, such as registering the bitstreams or just performing a local import.
>>>>
>>>> A SWORD deposit by reference might take place in two parts:
>>>>
>>>> - Deposit some metadata that includes a description of the file(s) to be ingested.
>>>> - A second process (perhaps triggered by the SWORD deposit, or undertaken later, such as via a DSpace curation task) ingests the file(s) into the DSpace object.
>>>>
>>>> Could you tell us a bit more about the process you want to implement? Who has the data and the metadata, who performs the deposit, etc.?
>>>>
>>>> Thanks,
>>>>
>>>> Stuart
>>>>
>>>>
>>>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
>>>>
>>>>> Stuart,
>>>>>
>>>>> Can you provide any links to examples of using 'deposit by reference'?
>>>>>
>>>>> I am looking at the feasibility of depositing very large items (tar.gz or zipped data files), say up to 16TB, into DSpace 1.6.x, with the obvious problems of doing this using a web interface. I am wondering whether EasyDeposit can be adapted to do 'deposit by reference', with either a utility of some kind on the DSpace server looking for large items to ingest, or a client pushing the data into a directory on the DSpace server from where it can be ingested. Ideally I want to minimise any copies of the data.
>>>>>
>>>>> I really want to avoid copying the item once it's on the DSpace server. Could the item be uploaded directly into the asset store, maybe? The other problem is how anyone could download the item once it's in DSpace.
>>>>>
>>>>> Is anyone else doing this sort of very large item (i.e. TBs) ingest?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Pete
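To make Stuart's two-part flow concrete: the first step can be a metadata-only SWORD deposit that carries the filename of the large file, which the second process (such as the curation task sketched above) later resolves against local storage. A rough Java sketch of step one, assuming a SWORDv2-style endpoint; the URL, credentials, and the dcterms:source convention are all placeholders, not anything prescribed by the thread:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Hypothetical step one of deposit-by-reference: POST a metadata-only
    // Atom entry to a SWORDv2 collection URI, carrying the filename of the
    // large file for a later, local ingest step. All names are placeholders.
    public class MetadataOnlyDeposit {

        public static void main(String[] args) throws Exception {
            URL collection = new URL("https://repo.example.ac.uk/swordv2/collection/123");

            String entry =
                "<?xml version='1.0' encoding='UTF-8'?>"
                + "<entry xmlns='http://www.w3.org/2005/Atom'"
                + "       xmlns:dcterms='http://purl.org/dc/terms/'>"
                + "  <title>Simulation output, run 42</title>"
                // The field carrying the filename must match whatever the
                // ingest process looks for (dc.source in the sketch above).
                + "  <dcterms:source>run-42.tar.gz</dcterms:source>"
                + "</entry>";

            HttpURLConnection conn = (HttpURLConnection) collection.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/atom+xml;type=entry");
            // SWORDv2: leave the deposit open so content can be added later.
            conn.setRequestProperty("In-Progress", "true");
            String auth = Base64.getEncoder()
                    .encodeToString("user:pass".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            OutputStream out = conn.getOutputStream();
            out.write(entry.getBytes(StandardCharsets.UTF_8));
            out.close();

            // 201 Created with a deposit receipt indicates success.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

Nothing in this step ties the second process to Java, either: any script that can see the staging area and drive a DSpace batch import (or trigger a curation task) could perform the ingest, which answers Pete's Perl question.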
>>
>> --
>> Katherine Fletcher, kathi.fletc...@gmail.com
>> Twitter: kefletcher Blog: kefletcher.blogspot.com