+1

Why not use systems *built for* data instead of a system built for research 
papers?  CKAN, Tardis, Kasabi, MongoDB, NoSQL stores (triple, graph, 
key-value)...?

I'd like to hear a good reason not to use these systems and then interoperate 
with repositories, rather than building the same functionality into 
repositories. /dff

From: Ben O'Steen [mailto:bost...@gmail.com]
Sent: 05 December 2011 08:00
To: Stuart Lewis
Cc: <sword-app-tech@lists.sourceforge.net>; Leggett, Pete
Subject: Re: [sword-app-tech] How to send large files

While I think I understand the drive to put these files within a repository, I 
would suggest caution. Just because it might be possible to put a file into the 
care of a repository doesn't make it a practical or useful thing to do.

- What do you feel you might gain by placing 500 GB+ files into a repository, 
compared with having them in an addressable filestore?
- Have people been able to download files of that size from DSpace, Fedora or 
EPrints?
- Has the repository been allocated space on a suitable filesystem? XFS, EBS, 
Thumper or similar?
- Once the file is ingested into DSpace or Fedora for example, is there any 
other route to retrieve this, aside from HTTP? (Coding your own servlet/addon 
is not a real answer to this.) Is it easily accessible via Grid-FTP or HPN-SSH 
for example?
- Can the workflows you wish to utilise handle the data you are giving them? 
Is any broad-stroke tool, aside from fixity checking, useful here?

Again, I am advising caution here, not besmirching the name of repositories. 
They do a good job with what we might currently term "small files", but they 
were never developed with research data sizes in mind (3-500 GB is a decent 
rough guide; 1 TB+ datasets are certainly not uncommon).

So, in short, weigh up the benefits against the downsides, and not in 
hypotheticals. Actually do it, and get real researchers to try to use it. 
You'll soon have a metric to show what is useful and what isn't.

On Monday, 5 December 2011, Stuart Lewis wrote:
Hi Pete,

Thanks for the information.  I've attached a piece of code that we use locally 
as part of the curation framework (in DSpace 1.7 or above), written by a 
colleague, Kim Shepherd.  The curation framework allows small jobs to be run on 
single items, collections, communities, or the whole repository.  This 
particular job looks to see if there is a filename in a prescribed metadata 
field and, if there is no matching bitstream, ingests the file from disk.
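
To give a flavour of the approach, here is a rough sketch only (not Kim's 
attached code) of what such a curation task might look like against the 
DSpace 1.7/1.8 API; the metadata field (dc.source), bundle name and staging 
directory are illustrative assumptions:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.sql.SQLException;

import org.dspace.authorize.AuthorizeException;
import org.dspace.content.Bitstream;
import org.dspace.content.Bundle;
import org.dspace.content.DCValue;
import org.dspace.content.DSpaceObject;
import org.dspace.content.Item;
import org.dspace.core.Constants;
import org.dspace.curate.AbstractCurationTask;
import org.dspace.curate.Curator;

public class IngestFromFilesystemTask extends AbstractCurationTask
{
    // Hypothetical staging area; in practice this would come from configuration
    private static final String STAGING_DIR = "/data/staging";

    @Override
    public int perform(DSpaceObject dso) throws IOException
    {
        if (dso.getType() != Constants.ITEM)
        {
            return Curator.CURATE_SKIP;
        }
        Item item = (Item) dso;
        try
        {
            // Filename recorded at deposit time, e.g. in dc.source (assumed field)
            DCValue[] names = item.getMetadata("dc", "source", null, Item.ANY);
            if (names.length == 0)
            {
                return Curator.CURATE_SKIP;
            }
            String filename = names[0].value;

            // Skip if a bitstream with this name has already been ingested
            for (Bundle bundle : item.getBundles("ORIGINAL"))
            {
                for (Bitstream existing : bundle.getBitstreams())
                {
                    if (filename.equals(existing.getName()))
                    {
                        setResult("Bitstream already present: " + filename);
                        return Curator.CURATE_SKIP;
                    }
                }
            }

            File file = new File(STAGING_DIR, filename);
            if (!file.exists())
            {
                setResult("File not found on disk: " + file.getAbsolutePath());
                return Curator.CURATE_FAIL;
            }

            // Stream the file into a (possibly new) ORIGINAL bundle
            Bundle[] bundles = item.getBundles("ORIGINAL");
            Bundle target = (bundles.length > 0) ? bundles[0]
                                                 : item.createBundle("ORIGINAL");
            Bitstream bs = target.createBitstream(new FileInputStream(file));
            bs.setName(filename);
            bs.update();
            item.update();

            setResult("Ingested " + filename + " into " + item.getHandle());
            return Curator.CURATE_SUCCESS;
        }
        catch (SQLException e)
        {
            throw new IOException(e);
        }
        catch (AuthorizeException e)
        {
            throw new IOException(e);
        }
    }
}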

More details of the curation system can be seen at:

 - https://wiki.duraspace.org/display/DSPACE/CurationSystem
 - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook

Some other curation tasks that Kim has written:

 - https://github.com/kshepherd/Curation

This can be used by depositing the metadata via SWORD, with the filename in a 
metadata field.  Optionally, the code could be changed to copy the file from 
another source (e.g. FTP, HTTP, Grid, etc.).
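
For that first step, a metadata-only SWORD deposit carrying the filename might 
look roughly like the sketch below. This assumes a SWORDv2 endpoint with Basic 
auth; the collection URI, the credentials and the dcterms:source element are 
placeholders and would need to match whatever field the ingest task reads.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class MetadataOnlyDeposit
{
    public static void main(String[] args) throws Exception
    {
        // Atom entry with the filename recorded in a metadata field (assumed mapping)
        String entry =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<entry xmlns=\"http://www.w3.org/2005/Atom\"\n"
          + "       xmlns:dcterms=\"http://purl.org/dc/terms/\">\n"
          + "  <title>Example large dataset</title>\n"
          + "  <dcterms:source>bigdataset.tar.gz</dcterms:source>\n"
          + "</entry>\n";

        // Placeholder collection URI and credentials
        URL collection = new URL("https://repository.example.org/swordv2/collection/123");
        HttpURLConnection conn = (HttpURLConnection) collection.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/atom+xml;type=entry");
        // Leave the item open so the file can be added (or ingested server-side) later
        conn.setRequestProperty("In-Progress", "true");
        String auth = DatatypeConverter.printBase64Binary("user:password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        OutputStream out = conn.getOutputStream();
        out.write(entry.getBytes("UTF-8"));
        out.close();

        System.out.println("Deposit response status: " + conn.getResponseCode());
    }
}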

Thanks,


Stuart Lewis
Digital Development Manager
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928



On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:

> Hi Stuart,
>
> You asked for more info. We are developing a Research Data Repository
> based on DSpace for storing the research data associated with Exeter
> University research publications.
> For some research fields, such as Physics and Biology, this data can be
> very large - TBs, it seems! - hence the need to consider large ingests
> that might take several days.
> The researcher has the data, and would, I am guessing, create the
> metadata, though maybe in collaboration with a data curator. Ideally the
> researcher would perform the deposit with, for large data sets, an
> offline ingest of the data itself. The data can be on the researcher's
> server/workstation/laptop/DVD/USB hard drive etc.
>
> There seem to be a couple of ways, at least, of approaching this, so what
> I was after was some references to what other people have done and how,
> to give me a better handle on the best way forward - having very little
> DSpace or repository experience myself. But given the size of larger data
> sets, I do think the best solution will involve as little copying of the
> data as possible - with the ultimate being just one copy process, of the
> data from source into the repository, and everything else being done by
> reference if that is possible.
>
> Are you perhaps able to point me at some "code" examples for the SWORD
> deposit you talk about, where a second process ingests the files? Would
> this be coded in Java?
> Does the ingest process have to be Java-based, or can it be a Perl script,
> for example? Please forgive my DSpace ignorance!
>
> Best regards,
>
> Pete
>
>
> On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
>
>> Hi Pete,
>>
>> 'Deposit by reference' would probably be used to 'pull' data from a
>> remote server.  If you already have the data on your DSpace server, as
>> Mark points out, there might be better ways to perform the import, such
>> as registering the bitstreams or just performing a local import.
>>
>> A SWORD deposit by reference might take place in two parts:
>>
>> - Deposit some metadata that includes a description of the file(s) to
>> be ingested
>>
>> - A second process (perhaps triggered by the SWORD deposit, or
>> undertaken later, such as via a DSpace curation task) that ingests the
>> file(s) into the DSpace object.
>>
>> Could you tell us a bit more about the process you want to implement?
>> Who has the data and the metadata, who performs the deposit, etc.?
>>
>> Thanks,
>>
>>
>> Stuart Lewis
>> Digital Development Manager
>> Te Tumu Herenga The University of Auckland Library
>> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
>> Ph: +64 (0)9 373 7599 x81928
>>
>>
>>
>> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
>>
>>> Stuart,
>>>
>>> Can you provide any links to examples of using 'deposit by reference'?
>>>
>>> I am looking at the feasibility of depositing very large items (tar.gz
>>> or zipped data files), say up to 16TB, into DSpace 1.6.x, with the
>>> obvious problems of doing this using a web interface.
>>> Wondering if EasyDeposit can be adapted to do 'deposit by reference',
>>> with either a utility of some kind on the DSpace server looking for
>>> large items to ingest, or a client pushing the data into a directory on
>>> the DSpace server from where it can be ingested. Ideally I want to
>>> minimise any copies of the data.
>>>
>>> I really want to avoid copying the item once it's on the DSpace server.
>>> Could the item be uploaded directly into the asset store, maybe?
>>> The other problem is how anyone could download the item once it's in
>>> DSpace.
>>>
>>> Anyone else doing this sort of very large item (i.e. TBs) ingest?
>>>
>>> Thank you,
>>>
>>> Pete
>>>
>>>
>>> From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
_______________________________________________
sword-app-tech mailing list
sword-app-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sword-app-tech
