That's great to hear, Stuart, and cheers for the update. Is there a project blog by which we can monitor progress? /dff
> -----Original Message-----
> From: Stuart Lewis [mailto:s.le...@auckland.ac.nz]
> Sent: 05 December 2011 23:32
> To: David FLANDERS
> Cc: Kathi Fletcher; sword-app-tech@lists.sourceforge.net; oda...@gmail.com; Rufus
> Subject: Re: [sword-app-tech] How to send large fiels
>
> FWIW we've got a couple of students under New Zealand's "Summer of
> eResearch" scheme looking at implementing a university data catalogue,
> with CKAN being one of the candidate systems. Will let you know how we
> get on.
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
> On 6/12/2011, at 11:28 AM, David FLANDERS wrote:
>
> > I've bugged Rufus a fair amount on this; one of his project managers,
> > Mark MacGillivray, has been thinking about this re 'Open Scholarship'
> > quite a bit. I wish we could get a university to start playing around
> > with this. Of course the Tardis folk down in Oz have been doing good
> > things as well; Cc Steve Androulakis. /dff
> >
> > From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
> > Sent: 05 December 2011 14:50
> > To: David FLANDERS
> > Cc: sword-app-tech@lists.sourceforge.net; Rufus; Leggett, Pete
> > Subject: Re: [sword-app-tech] How to send large fiels
> >
> > Hi,
> >
> > I have CC'd Rufus Pollock of CKAN in case he has ideas about some
> > sort of system where papers go in document repositories like DSpace
> > and EPrints, and data goes in data repositories like CKAN etc.
> >
> > Kathi
> >
> > ---------- Forwarded message ----------
> > From: David FLANDERS <d.fland...@jisc.ac.uk>
> > Date: 2011/12/5
> > Subject: Re: [sword-app-tech] How to send large fiels
> > To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
> > Cc: sword-app-tech@lists.sourceforge.net, "Leggett, Pete" <p.f.legg...@exeter.ac.uk>
> >
> > +1
> >
> > Why not use systems *built for* data instead of a system built for
> > research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple,
> > graph, key-value)...?
> >
> > I'd like to hear a good reason not to use these systems and have them
> > interoperate with repositories, rather than building the same
> > functionality into repositories. /dff
> >
> > From: Ben O'Steen [mailto:bost...@gmail.com]
> > Sent: 05 December 2011 08:00
> > To: Stuart Lewis
> > Cc: sword-app-tech@lists.sourceforge.net; Leggett, Pete
> > Subject: Re: [sword-app-tech] How to send large fiels
> >
> > While I think I understand the drive to put these files within a
> > repository, I would suggest caution. Just because it might be possible
> > to put a file into the care of a repository doesn't make it a
> > practical or useful thing to do.
> >
> > - What do you feel you might gain by placing 500GB+ files into a
> >   repository, compared with having them in an addressable filestore?
> > - Have people been able to download files of that size from DSpace,
> >   Fedora or EPrints?
> > - Has the repository been allocated space on a suitable filesystem?
> >   XFS, EBS, Thumper or similar?
> > - Once the file is ingested into DSpace or Fedora, for example, is
> >   there any other route to retrieve it aside from HTTP? (Coding your
> >   own servlet/add-on is not a real answer to this.) Is it easily
> >   accessible via GridFTP or HPN-SSH, for example?
> > - Can the workflows you wish to use handle the data you are giving
> >   them? Is any broad-stroke tool aside from fixity checking useful
> >   here? (See the sketch below.)
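> >
> > On fixity: at these sizes the one broad-stroke tool that does scale
> > is a streaming checksum. A minimal Java sketch, purely illustrative;
> > the file path comes from the command line and the chunk size is
> > arbitrary:
> >
> > import java.io.FileInputStream;
> > import java.io.InputStream;
> > import java.security.MessageDigest;
> >
> > // Streams the file in fixed-size chunks, so memory use stays
> > // constant no matter how large the bitstream is.
> > public class FixityCheck {
> >     public static void main(String[] args) throws Exception {
> >         MessageDigest md = MessageDigest.getInstance("MD5");
> >         byte[] buffer = new byte[8 * 1024 * 1024]; // 8 MB chunks
> >         try (InputStream in = new FileInputStream(args[0])) {
> >             int read;
> >             while ((read = in.read(buffer)) != -1) {
> >                 md.update(buffer, 0, read);
> >             }
> >         }
> >         StringBuilder hex = new StringBuilder();
> >         for (byte b : md.digest()) {
> >             hex.append(String.format("%02x", b));
> >         }
> >         // Compare against the checksum recorded at deposit time.
> >         System.out.println(hex);
> >     }
> > }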
> >
> > Again, I am advising caution here, not besmirching the name of
> > repositories. They do a good job with what we might currently term
> > "small files", but they were never developed with research data sizes
> > in mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not
> > uncommon).
> >
> > So, in short, weigh the benefits against the downsides, and not in
> > hypotheticals. Actually do it, and get real researchers to try and
> > use it. You'll soon have a metric to show what is useful and what
> > isn't.
> >
> > On Monday, 5 December 2011, Stuart Lewis wrote:
> >
> > Hi Pete,
> >
> > Thanks for the information. I've attached a piece of code that we use
> > locally as part of the curation framework (in DSpace 1.7 or above),
> > written by a colleague, Kim Shepherd. The curation framework allows
> > small jobs to be run on single items, collections, communities, or
> > the whole repository. This particular job looks to see if there is a
> > filename in a prescribed metadata field, and if there is no matching
> > bitstream, it will then ingest the file from disk.
> >
> > More details of the curation system can be seen at:
> >
> > - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> > - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
> >
> > Some other curation tasks that Kim has written:
> >
> > - https://github.com/kshepherd/Curation
> >
> > This can be used by depositing the metadata via SWORD, with the
> > filename in a metadata field. Optionally the code could be changed to
> > copy the file from another source (e.g. FTP, HTTP, Grid, etc).
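> >
> > In rough outline, such a task looks something like the sketch below.
> > This is a from-memory illustration against the DSpace 1.7 curation
> > API, not Kim's attached code; the metadata field, staging directory
> > and class name are assumptions, and Context/transaction handling is
> > elided:
> >
> > import java.io.File;
> > import java.io.FileInputStream;
> > import java.io.IOException;
> >
> > import org.dspace.content.Bitstream;
> > import org.dspace.content.Bundle;
> > import org.dspace.content.DCValue;
> > import org.dspace.content.DSpaceObject;
> > import org.dspace.content.Item;
> > import org.dspace.curate.AbstractCurationTask;
> > import org.dspace.curate.Curator;
> >
> > public class IngestFromDisk extends AbstractCurationTask {
> >     // Directory the large files are staged in (an assumption).
> >     private static final String BASE_DIR = "/data/staging";
> >
> >     @Override
> >     public int perform(DSpaceObject dso) throws IOException {
> >         if (!(dso instanceof Item)) {
> >             return Curator.CURATE_SKIP; // only items carry bitstreams
> >         }
> >         Item item = (Item) dso;
> >         try {
> >             // Filename recorded at deposit time in an agreed field
> >             // (dc.relation.requires is an illustrative choice).
> >             DCValue[] names =
> >                 item.getMetadata("dc", "relation", "requires", Item.ANY);
> >             if (names.length == 0) {
> >                 return Curator.CURATE_SKIP;
> >             }
> >             String filename = names[0].value;
> >
> >             // Skip items that already have a matching bitstream.
> >             for (Bundle bundle : item.getBundles("ORIGINAL")) {
> >                 for (Bitstream bs : bundle.getBitstreams()) {
> >                     if (filename.equals(bs.getName())) {
> >                         setResult("Already ingested: " + filename);
> >                         return Curator.CURATE_SKIP;
> >                     }
> >                 }
> >             }
> >
> >             // Stream the file from disk into a new bitstream.
> >             Bundle[] bundles = item.getBundles("ORIGINAL");
> >             Bundle target = (bundles.length > 0)
> >                     ? bundles[0] : item.createBundle("ORIGINAL");
> >             File source = new File(BASE_DIR, filename);
> >             Bitstream bs =
> >                 target.createBitstream(new FileInputStream(source));
> >             bs.setName(filename);
> >             bs.update();
> >             item.update();
> >             setResult("Ingested " + filename);
> >             return Curator.CURATE_SUCCESS;
> >         } catch (Exception e) {
> >             setResult("Ingest failed: " + e.getMessage());
> >             return Curator.CURATE_FAIL;
> >         }
> >     }
> > }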
> >
> > Thanks,
> >
> > Stuart Lewis
> > Digital Development Manager
> > Te Tumu Herenga The University of Auckland Library
> > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> > Ph: +64 (0)9 373 7599 x81928
> >
> > On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
> >
> > > Hi Stuart,
> > >
> > > You asked for more info. We are developing a Research Data
> > > Repository based on DSpace for storing the research data associated
> > > with Exeter University research publications. For some research
> > > fields, such as physics and biology, this data can be very large
> > > (TBs, it seems!), hence the need to consider large ingests that
> > > might run over several days.
> > >
> > > The researcher has the data, and would, I am guessing, create the
> > > metadata, though maybe in collaboration with a data curator.
> > > Ideally the researcher would perform the deposit, with an offline
> > > ingest of the data itself for large data sets. The data can be on
> > > the researcher's server/workstation/laptop/DVD/USB hard drive etc.
> > >
> > > There seem to be at least a couple of ways of approaching this, so
> > > what I was after were some references to what other people have
> > > done and how, to give me a better handle on the best way forward,
> > > having very little DSpace or repository experience myself. But
> > > given the size of larger data sets, I do think the best solution
> > > will involve as little copying of the data as possible, the ideal
> > > being just one copy process, of the data from source into the
> > > repository, with everything else done by reference if that is
> > > possible.
> > >
> > > Are you perhaps able to point me at some "code" examples for the
> > > SWORD deposit you talk about, where a second process ingests the
> > > files? Would this be coded in Java? Does the ingest process have to
> > > be Java-based, or can it be a Perl script for example?
> > > Please forgive my DSpace ignorance!
> > >
> > > Best regards,
> > >
> > > Pete
> > >
> > > On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
> > >
> > > > Hi Pete,
> > > >
> > > > 'Deposit by reference' would probably be used to 'pull' data from
> > > > a remote server. If you already have the data on your DSpace
> > > > server, as Mark points out there might be better ways to perform
> > > > the import, such as registering the bitstreams, or just
> > > > performing a local import.
> > > >
> > > > A SWORD deposit by reference might take place in two parts:
> > > >
> > > > - Deposit some metadata that includes a description of the
> > > >   file(s) to be ingested (this step is sketched below).
> > > > - A second process (perhaps triggered by the SWORD deposit, or
> > > >   undertaken later, such as via a DSpace curation task) ingests
> > > >   the file(s) into the DSpace object.
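> > > >
> > > > For the first part, a minimal sketch in Java, only to fix ideas,
> > > > not a prescription: the endpoint URL, the credentials, and the
> > > > use of dc:relation to carry the file pointer are all
> > > > illustrative, and whether a server accepts a metadata-only Atom
> > > > entry like this depends on the SWORD implementation.
> > > >
> > > > import java.io.OutputStream;
> > > > import java.net.HttpURLConnection;
> > > > import java.net.URL;
> > > > import java.util.Base64;
> > > >
> > > > public class DepositByReference {
> > > >     public static void main(String[] args) throws Exception {
> > > >         // Atom entry carrying only metadata plus a pointer to
> > > >         // the file; no bytes are transferred in this request.
> > > >         String entry =
> > > >             "<?xml version='1.0' encoding='UTF-8'?>"
> > > >             + "<entry xmlns='http://www.w3.org/2005/Atom'"
> > > >             + " xmlns:dc='http://purl.org/dc/elements/1.1/'>"
> > > >             + "<title>Large dataset, by reference</title>"
> > > >             + "<dc:relation>file:///staging/dataset-001.tar.gz"
> > > >             + "</dc:relation>"
> > > >             + "</entry>";
> > > >
> > > >         // Placeholder collection deposit URL and credentials.
> > > >         URL deposit = new URL(
> > > >             "https://repo.example.ac.uk/sword/deposit/123456789/1");
> > > >         HttpURLConnection conn =
> > > >             (HttpURLConnection) deposit.openConnection();
> > > >         conn.setRequestMethod("POST");
> > > >         conn.setDoOutput(true);
> > > >         conn.setRequestProperty("Content-Type",
> > > >             "application/atom+xml");
> > > >         String token = Base64.getEncoder()
> > > >             .encodeToString("user:pass".getBytes("UTF-8"));
> > > >         conn.setRequestProperty("Authorization", "Basic " + token);
> > > >
> > > >         try (OutputStream out = conn.getOutputStream()) {
> > > >             out.write(entry.getBytes("UTF-8"));
> > > >         }
> > > >         // Expect 201 Created plus a deposit receipt; the second
> > > >         // process (e.g. the curation task) then does the ingest.
> > > >         System.out.println("Response: " + conn.getResponseCode());
> > > >     }
> > > > }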
> > > >
> > > > Could you tell us a bit more about the process you want to
> > > > implement? Who has the data and the metadata, who performs the
> > > > deposit, etc.?
> > > >
> > > > Thanks,
> > > >
> > > > Stuart Lewis
> > > > Digital Development Manager
> > > > Te Tumu Herenga The University of Auckland Library
> > > > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> > > > Ph: +64 (0)9 373 7599 x81928
> > > >
> > > > On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
> > > >
> > > > > Stuart,
> > > > >
> > > > > Can you provide any links to examples of using 'deposit by
> > > > > reference'?
> > > > >
> > > > > I am looking at the feasibility of depositing very large items
> > > > > (tar.gz or zipped data files), say up to 16TB, into DSpace
> > > > > 1.6.x, with the obvious problems of doing this using a web
> > > > > interface. I am wondering if EasyDeposit can be adapted to do
> > > > > 'deposit by reference', with either a utility of some kind on
> > > > > the DSpace server looking for large items to ingest, or a
> > > > > client pushing the data into a directory on the DSpace server
> > > > > from where it can be ingested. Ideally I want to minimise any
> > > > > copies of the data.
> > > > >
> > > > > I really want to avoid copying the item once it's on the DSpace
> > > > > server. Could the item be uploaded directly into the asset
> > > > > store, maybe? The other problem is how anyone could download
> > > > > the item once it's in DSpace.
> > > > >
> > > > > Anyone else doing this sort of very large item (i.e. TBs)
> > > > > ingest?
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Pete
> > > > >
> > > > > From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
> >
> > --
> > Katherine Fletcher, kathi.fletc...@gmail.com
> > Twitter: kefletcher  Blog: kefletcher.blogspot.com