To add to the melting pot, we (Oxford) are working on the development of both a data repository (DataBank) and a data catalogue as part of the JISC-funded Damaro project.
http://damaro.oucs.ox.ac.uk/

DataBank data repository development also forms part of the DataFlow project: http://www.dataflow.ox.ac.uk/

Sally

--
Sally Rumsey
Digital Collections Development Manager
Bodleian Digital Library Systems and Services (BDLSS)
sally.rum...@bodleian.ox.ac.uk
Tel: 01865 283860


> From: Stuart Lewis <s.le...@auckland.ac.nz>
> Date: Mon, 5 Dec 2011 23:31:47 +0000
> To: David FLANDERS <d.fland...@jisc.ac.uk>
> Cc: <sword-app-tech@lists.sourceforge.net>, <oda...@gmail.com>, Rufus <rufus.poll...@okfn.org>, Kathi Fletcher <kathi.fletc...@gmail.com>
> Subject: Re: [sword-app-tech] How to send large files
>
> FWIW we've got a couple of students under New Zealand's "Summer of eResearch" scheme looking at implementing a university data catalogue, with CKAN being one of the candidate systems. Will let you know how we get on.
>
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
>
> On 6/12/2011, at 11:28 AM, David FLANDERS wrote:
>
>> I've bugged Rufus a fair amount on this; one of his project managers, Mark Macgillvray, has been thinking about this re 'Open Scholarship' a fair amount. I wish we could get a university to start playing around with this. Of course the Tardis folk down in Oz have been doing good things as well; Cc Steve Androulakis. /dff
>>
>> From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
>> Sent: 05 December 2011 14:50
>> To: David FLANDERS
>> Cc: <sword-app-tech@lists.sourceforge.net>; Rufus; Leggett, Pete
>> Subject: Re: [sword-app-tech] How to send large files
>>
>> Hi,
>>
>> I have CC'd Rufus Pollock of CKAN in case he has ideas about some sort of system where papers go in document repositories like DSpace and EPrints, and data goes in data repositories like CKAN etc.
>>
>> Kathi
>>
>> ---------- Forwarded message ----------
>> From: David FLANDERS <d.fland...@jisc.ac.uk>
>> Date: 2011/12/5
>> Subject: Re: [sword-app-tech] How to send large files
>> To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
>> Cc: <sword-app-tech@lists.sourceforge.net>, "Leggett, Pete" <p.f.legg...@exeter.ac.uk>
>>
>> +1
>>
>> Why not use systems *built for* data instead of a system built for research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple, graph, key-value)...?
>>
>> I'd like to hear a good reason not to use these systems and interoperate with repositories, rather than building the same functionality into repositories. /dff
>>
>> From: Ben O'Steen [mailto:bost...@gmail.com]
>> Sent: 05 December 2011 08:00
>> To: Stuart Lewis
>> Cc: <sword-app-tech@lists.sourceforge.net>; Leggett, Pete
>> Subject: Re: [sword-app-tech] How to send large files
>>
>> While I think I understand the drive to put these files within a repository, I would suggest caution. Just because it might be possible to put a file into the care of a repository doesn't make it a practical or useful thing to do.
>>
>> - What do you feel you might gain by placing 500GB+ files into a repository, compared with having them in an addressable filestore?
>> - Have people been able to download files of that size from DSpace, Fedora or EPrints?
>> - Has the repository been allocated space on a suitable filesystem? XFS, EBS, Thumper or similar?
>> - Once the file is ingested into DSpace or Fedora, for example, is there any other route to retrieve it aside from HTTP? (Coding your own servlet/add-on is not a real answer to this.) Is it easily accessible via GridFTP or HPN-SSH, for example?
>> - Can the workflows you wish to use handle the data you are giving them? Is any broad-stroke tool aside from fixity useful here?
>>
>> Again, I am advising caution here, not besmirching the name of repositories. They do a good job with what we might currently term "small files", but they were never developed with research data sizes in mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not uncommon).
>>
>> So, in short, weigh up the benefits against the downsides, and not in hypotheticals: actually do it, and get real researchers to try to use it. You'll soon have a metric to show what is useful and what isn't.
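Ben's last question (whether any broad-stroke tool aside from fixity is useful) deserves a concrete illustration. Fixity checking is the one repository-wide operation that scales to files of this size, because it streams: memory use is constant no matter how large the file. A minimal Java sketch, with MD5 chosen to match the checksum DSpace records for each bitstream; the chunk size is an arbitrary assumption:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    // Streams a file of any size through a digest in fixed-size chunks,
    // so memory use stays constant whether the file is 5MB or 500GB.
    public class FixityCheck {

        public static String md5(String path) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8 * 1024 * 1024]; // 8MB read buffer
            InputStream in = new FileInputStream(path);
            try {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            } finally {
                in.close();
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(args[0] + " " + md5(args[0]));
        }
    }

Even so, a streaming pass over a 500GB file is I/O-bound and can take hours, which is exactly the sort of cost Ben is asking people to weigh.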
>> On Monday, 5 December 2011, Stuart Lewis wrote:
>>
>> Hi Pete,
>>
>> Thanks for the information. I've attached a piece of code that we use locally as part of the curation framework (in DSpace 1.7 or above), written by a colleague, Kim Shepherd. The curation framework allows small jobs to be run on single items, collections, communities, or the whole repository. This particular job looks for a filename in a prescribed metadata field and, if there is no matching bitstream, ingests the file from disk.
>>
>> More details of the curation system can be seen at:
>>
>> - https://wiki.duraspace.org/display/DSPACE/CurationSystem
>> - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
>>
>> Some other curation tasks that Kim has written:
>>
>> - https://github.com/kshepherd/Curation
>>
>> This can be used by depositing the metadata via SWORD, with the filename in a metadata field. Optionally, the code could be changed to copy the file from another source (e.g. FTP, HTTP, Grid, etc.).
>>
>> Thanks,
>>
>> Stuart
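The attached task itself does not survive in the archive, but the pattern Stuart describes maps naturally onto the DSpace 1.7+ curation API. The sketch below is a hypothetical reconstruction, not Kim Shepherd's code: the metadata field (dc.source) and the staging directory are illustrative assumptions, and the real tasks are at the GitHub link above.

    import java.io.FileInputStream;
    import java.io.IOException;

    import org.dspace.content.Bitstream;
    import org.dspace.content.Bundle;
    import org.dspace.content.DCValue;
    import org.dspace.content.DSpaceObject;
    import org.dspace.content.Item;
    import org.dspace.curate.AbstractCurationTask;
    import org.dspace.curate.Curator;

    // Hypothetical curation task: if an item names a file in a metadata
    // field but has no matching bitstream, ingest the file from a staging
    // directory on local disk. Field and directory are assumptions.
    public class IngestFromDiskTask extends AbstractCurationTask {

        private static final String STAGING_DIR = "/data/staging/"; // assumed

        @Override
        public int perform(DSpaceObject dso) throws IOException {
            if (!(dso instanceof Item)) {
                return Curator.CURATE_SKIP;
            }
            Item item = (Item) dso;
            // Filename expected in dc.source (an assumption for this sketch).
            DCValue[] names = item.getMetadata("dc", "source", null, Item.ANY);
            if (names.length == 0) {
                return Curator.CURATE_SKIP;
            }
            String filename = names[0].value;
            try {
                // Skip items that already have the bitstream attached.
                for (Bundle bundle : item.getBundles("ORIGINAL")) {
                    for (Bitstream bs : bundle.getBitstreams()) {
                        if (filename.equals(bs.getName())) {
                            setResult("Already present: " + filename);
                            return Curator.CURATE_SKIP;
                        }
                    }
                }
                Bundle[] originals = item.getBundles("ORIGINAL");
                Bundle bundle = (originals.length > 0)
                        ? originals[0] : item.createBundle("ORIGINAL");
                FileInputStream in = new FileInputStream(STAGING_DIR + filename);
                try {
                    Bitstream bs = bundle.createBitstream(in);
                    bs.setName(filename);
                    bs.update();
                    item.update();
                } finally {
                    in.close();
                }
                setResult("Ingested " + filename);
                return Curator.CURATE_SUCCESS;
            } catch (Exception e) {
                setResult("Ingest failed: " + e.getMessage());
                return Curator.CURATE_FAIL;
            }
        }
    }

A task wired up this way is named in the curation configuration and can then be run against a single item, a collection, or the whole repository, e.g. via the 'dspace curate' command-line tool.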
>> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
>>
>>> Hi Stuart,
>>>
>>> You asked for more info. We are developing a Research Data Repository based on DSpace for storing the research data associated with Exeter University research publications. For some research fields, such as physics and biology, this data can be very large (TBs, it seems!), hence the need to consider large ingests that might take several days.
>>>
>>> The researcher has the data and would, I am guessing, create the metadata, though maybe in collaboration with a data curator. Ideally the researcher would perform the deposit with, for large data sets, an offline ingest of the data itself. The data can be on the researcher's server, workstation, laptop, DVD, USB hard drive, etc.
>>>
>>> There seem to be at least a couple of ways of approaching this, so what I was after was some references to what other people have done, and how, to give me a better handle on the best way forward, having very little DSpace or repository experience myself. Given the size of larger data sets, I do think the best solution will involve as little copying of the data as possible, the ultimate being just one copy process, of the data from source into the repository, with everything else done by reference if that is possible.
>>>
>>> Are you perhaps able to point me at some "code" examples for the SWORD deposit you talk about, where a second process ingests the files? Would this be coded in Java? Does the ingest process have to be Java-based, or can it be a Perl script, for example? Please forgive my DSpace ignorance!
>>>
>>> Best regards,
>>>
>>> Pete
>>>
>>>
>>> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
>>>
>>>> Hi Pete,
>>>>
>>>> 'Deposit by reference' would probably be used to 'pull' data from a remote server. If you already have the data on your DSpace server, as Mark points out, there might be better ways to perform the import, such as registering the bitstreams or just performing a local import.
>>>>
>>>> A SWORD deposit by reference might take place in two parts:
>>>>
>>>> - Deposit some metadata that includes a description of the file(s) to be ingested.
>>>> - A second process (perhaps triggered by the SWORD deposit, or undertaken later, such as via a DSpace curation task) ingests the file(s) into the DSpace object.
>>>>
>>>> Could you tell us a bit more about the process you want to implement? Who has the data and the metadata, who performs the deposit, etc.?
>>>>
>>>> Thanks,
>>>>
>>>> Stuart
>>>>
>>>>
>>>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
>>>>
>>>>> Stuart,
>>>>>
>>>>> Can you provide any links to examples of using 'deposit by reference'?
>>>>>
>>>>> I am looking at the feasibility of depositing very large items (tar.gz or zipped data files), say up to 16TB, into DSpace 1.6.x, with the obvious problems of doing this using a web interface. I am wondering whether EasyDeposit can be adapted to do 'deposit by reference', with either a utility of some kind on the DSpace server looking for large items to ingest, or a client pushing the data into a directory on the DSpace server from where it can be ingested. Ideally I want to minimise any copies of the data.
>>>>>
>>>>> I really want to avoid copying the item once it's on the DSpace server. Could the item be uploaded directly into the asset store, maybe? The other problem is how anyone could download the item once it's in DSpace.
>>>>>
>>>>> Is anyone else doing this sort of very large item (i.e. TBs) ingest?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Pete
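To make Stuart's two-part flow concrete: the first step can be a metadata-only SWORD deposit that carries the filename of the large file, which the second process (such as the curation task sketched above) later resolves against local storage. A rough Java sketch of step one, assuming a SWORDv2-style endpoint; the URL, credentials, and the dcterms:source convention are all placeholders, not anything prescribed by the thread:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    // Hypothetical step one of deposit-by-reference: POST a metadata-only
    // Atom entry to a SWORDv2 collection URI, carrying the filename of the
    // large file for a later, local ingest step. All names are placeholders.
    public class MetadataOnlyDeposit {

        public static void main(String[] args) throws Exception {
            URL collection = new URL("https://repo.example.ac.uk/swordv2/collection/123");

            String entry =
                "<?xml version='1.0' encoding='UTF-8'?>"
                + "<entry xmlns='http://www.w3.org/2005/Atom'"
                + "       xmlns:dcterms='http://purl.org/dc/terms/'>"
                + "  <title>Simulation output, run 42</title>"
                // The field carrying the filename must match whatever the
                // ingest process looks for (dc.source in the sketch above).
                + "  <dcterms:source>run-42.tar.gz</dcterms:source>"
                + "</entry>";

            HttpURLConnection conn = (HttpURLConnection) collection.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/atom+xml;type=entry");
            // SWORDv2: leave the deposit open so content can be added later.
            conn.setRequestProperty("In-Progress", "true");
            String auth = Base64.getEncoder()
                    .encodeToString("user:pass".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            OutputStream out = conn.getOutputStream();
            out.write(entry.getBytes(StandardCharsets.UTF_8));
            out.close();

            // 201 Created with a deposit receipt indicates success.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

Nothing in this step ties the second process to Java, either: any script that can see the staging area and drive a DSpace batch import (or trigger a curation task) could perform the ingest, which answers Pete's Perl question.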
>>
>> --
>> Katherine Fletcher, kathi.fletc...@gmail.com
>> Twitter: kefletcher Blog: kefletcher.blogspot.com