I believe I know what you are getting at re: NoSQL stores, but a better
choice might be a communal HBase, R server or Hadoop deployment - nothing
for heavy production use, but something capable of slicing and allowing
exploration of arbitrary data.
Ben
On Dec 5, 2011 9:10 AM, "David FLANDERS" <d.fland...@jisc.ac.uk> wrote:
> +1
>
> Why not use systems *built for* data instead of a system built for
> research papers? CKAN, Tardis, Kasabi, MongoDB, NoSQL stores (triple,
> graph, key-value)...?
>
> I’d like to hear a good reason not to use these systems and interoperate
> with repositories, rather than building the same functionality into
> repositories. /dff
>
> *From:* Ben O'Steen [mailto:bost...@gmail.com]
> *Sent:* 05 December 2011 08:00
> *To:* Stuart Lewis
> *Cc:* <sword-app-tech@lists.sourceforge.net>; Leggett, Pete
> *Subject:* Re: [sword-app-tech] How to send large files
>
> While I think I understand the drive to put these files within a
> repository, I would suggest caution. Just because it might be possible to
> put a file into the care of a repository doesn't make it a practical or
> useful thing to do.
>
> - What do you feel you might gain by placing 500GB+ files into a
> repository, compared with having them in an addressable filestore?
>
> - Have people been able to download files of that size from DSpace, Fedora
> or EPrints?
>
> - Has the repository been allocated space on a suitable filesystem - XFS,
> EBS, Thumper or similar?
>
> - Once a file is ingested into DSpace or Fedora, for example, is there any
> route to retrieve it other than HTTP? (Coding your own servlet/add-on is
> not a real answer to this.) Is it easily accessible via GridFTP or HPN-SSH,
> for example?
>
> - Can the workflows you wish to utilise handle the data you are giving
> them? Is any broad-stroke tool aside from fixity checking useful here?
> (A rough sketch follows below.)
>
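> On the fixity point, and only as a sketch (the path and algorithm are
> placeholders): about the only broad-stroke tool that stays practical at
> this scale is a streamed checksum, along these lines:
>
>   import java.io.FileInputStream;
>   import java.security.MessageDigest;
>
>   public class Fixity {
>       public static void main(String[] args) throws Exception {
>           // Stream the file through the digest in 8 MB chunks so memory
>           // use stays flat no matter how large the bitstream is.
>           MessageDigest md = MessageDigest.getInstance("MD5");
>           FileInputStream in = new FileInputStream(args[0]);
>           try {
>               byte[] buf = new byte[8 * 1024 * 1024];
>               int n;
>               while ((n = in.read(buf)) != -1) {
>                   md.update(buf, 0, n);
>               }
>           } finally {
>               in.close();
>           }
>           StringBuilder hex = new StringBuilder();
>           for (byte b : md.digest()) {
>               hex.append(String.format("%02x", b));
>           }
>           // Compare this against the checksum recorded at deposit time.
>           System.out.println(hex);
>       }
>   }
>
> Anything that needs the whole object in memory, or several passes over it,
> stops being useful well before 500GB.
>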
> Again, I am advising caution here, not besmirching the name of
> repositories. They do a good job with what we might currently term "small
> files", but they were never developed with research data sizes in mind
> (3-500GB is a decent rough guide; 1TB+ sets are certainly not uncommon).
>
> So, in short, weigh up the benefits against the downsides, and not in
> hypotheticals. Actually do it, and get real researchers to try and use it.
> You'll soon have a metric to show what is useful and what isn't.
>
> On Monday, 5 December 2011, Stuart Lewis wrote:
>
> Hi Pete,
>
> Thanks for the information. I've attached a piece of code that we use
> locally as part of the curation framework (in DSpace 1.7 or above), written
> by a colleague, Kim Shepherd. The curation framework allows small jobs to
> be run on single items, collections, communities, or the whole repository.
> This particular job looks to see if there is a filename in a prescribed
> metadata field and, if there is no matching bitstream, ingests the file
> from disk.
>
> More details of the curation system can be seen at:
>
> - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
>
> Some other curation tasks that Kim has written:
>
> - https://github.com/kshepherd/Curation
>
> This can be used by depositing the metadata via SWORD, with the filename
> in a metadata field. Optionally the code could be changed to copy the file
> from another source (e.g. FTP, HTTP, Grid, etc).
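>
> For anyone reading this in the archive without the attachment: the general
> shape of such a task is roughly as below. This is an untested sketch, not
> Kim's code - the metadata field name is invented and authorisation/error
> handling is stripped right down, so the tasks linked above are the thing
> to work from.
>
>   import java.io.FileInputStream;
>   import java.io.IOException;
>
>   import org.dspace.content.Bitstream;
>   import org.dspace.content.Bundle;
>   import org.dspace.content.DCValue;
>   import org.dspace.content.DSpaceObject;
>   import org.dspace.content.Item;
>   import org.dspace.curate.AbstractCurationTask;
>   import org.dspace.curate.Curator;
>
>   public class IngestFromDisk extends AbstractCurationTask {
>
>       @Override
>       public int perform(DSpaceObject dso) throws IOException {
>           if (!(dso instanceof Item)) {
>               return Curator.CURATE_SKIP;
>           }
>           Item item = (Item) dso;
>           try {
>               // Hypothetical field holding the on-disk path of the file.
>               DCValue[] paths = item.getMetadata("local", "ingest", "filename", Item.ANY);
>               if (paths.length == 0) {
>                   return Curator.CURATE_SKIP;
>               }
>               String path = paths[0].value;
>               String name = path.substring(path.lastIndexOf('/') + 1);
>
>               // If a bitstream with this name is already attached, do nothing.
>               for (Bundle bundle : item.getBundles("ORIGINAL")) {
>                   for (Bitstream bs : bundle.getBitstreams()) {
>                       if (name.equals(bs.getName())) {
>                           return Curator.CURATE_SKIP;
>                       }
>                   }
>               }
>
>               // Otherwise stream the file from local disk into the item.
>               FileInputStream in = new FileInputStream(path);
>               try {
>                   Bitstream bs = item.createSingleBitstream(in, "ORIGINAL");
>                   bs.setName(name);
>                   bs.update();
>                   item.update();
>               } finally {
>                   in.close();
>               }
>               setResult("Ingested " + name);
>               return Curator.CURATE_SUCCESS;
>           } catch (Exception e) {
>               report("Ingest failed: " + e.getMessage());
>               return Curator.CURATE_FAIL;
>           }
>       }
>   }
>
> Run over a collection, or triggered after a deposit, that gives you the
> "second pass" that attaches the data without the bytes ever having to
> travel through the SWORD/web layer.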
>
> Thanks,
>
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
>
>
> On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
>
> > Hi Stuart,
> >
> > You asked for more info. We are developing a Research Data Repository
> > based on DSpace for storing the research data associated with Exeter
> > University research publications.
> > For some research fields, such as physics and biology, this data can be
> > very large - TBs, it seems! - hence the need to consider large ingests
> > over what might be several days.
> > The researcher has the data and would, I am guessing, create the
> > metadata, maybe in collaboration with a data curator. Ideally the
> > researcher would perform the deposit with, for large data sets, an
> > offline ingest of the data itself. The data can be on the researcher's
> > server/workstation/laptop/DVD/USB hard drive etc.
> >
> > There seem to be at least a couple of ways of approaching this, so what
> > I was after was some references to what and how other people have done,
> > to give me a better handle on the best way forward - having very little
> > DSpace or repository experience myself. But given the size of larger data
> > sets, I do think the best solution will involve as little copying of the
> > data as possible - with the ultimate being just one copy process, of the
> > data from source into repository, and everything else being done by
> > reference if that is possible.
> >
> > Are you perhaps able to point me at some "code" examples for the SWORD
> > deposit you talk about, where a second process ingests the files? Would
> > this be coded in Java?
> > Does the ingest process have to be Java-based, or can it be a Perl
> > script, for example? Please forgive my DSpace ignorance!
> >
> > Best regards,
> >
> > Pete
> >
> >
> > On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
> >
> >> Hi Pete,
> >>
> >> 'Deposit by reference' would probably be used to 'pull' data from a
> >> remote server. If you already have the data on your DSpace server, as
> >> Mark points out there might be better ways to perform the import, such
> >> as registering the bitstreams, or just performing a local import.
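> >>
> >> (On the "registering the bitstreams" route: if the file is already
> >> sitting under one of the configured asset stores, the item importer's
> >> contents file can - if I remember the format correctly - point at it
> >> rather than copy it, along the lines of:
> >>
> >>   -r -s 1 -f pete/bigdataset.tar.gz
> >>
> >> where -r means "register", -s is the asset store number and -f is the
> >> path relative to that store's root, so the 16TB file never gets copied
> >> again. The filename here is just an example; do check the ItemImport /
> >> bitstream registration docs for your version before leaning on this.)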
> >>
> >> A SWORD deposit by reference might take place in two parts:
> >>
> >> - Deposit some metadata that includes a description of the file(s) to
> >> be ingested (a rough sketch of such a deposit follows below)
> >>
> >> - A second process (perhaps triggered by the SWORD deposit, or
> >> undertaken later, such as via a DSpace curation task) that ingests the
> >> file(s) into the DSpace object.
> >>
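> >> To make the first of those two steps concrete, here is a rough,
> >> untested sketch. The endpoint URL, credentials and the field carrying
> >> the file path are all made up, and it uses the SWORD v2 "Atom entry"
> >> style of deposit; the v1.3 interface in DSpace 1.6 would want a packaged
> >> deposit instead (e.g. a METS-based zip with an X-Packaging header), but
> >> the idea is the same - metadata plus a pointer now, bytes later.
> >>
> >>   import java.io.OutputStream;
> >>   import java.net.HttpURLConnection;
> >>   import java.net.URL;
> >>   import javax.xml.bind.DatatypeConverter;
> >>
> >>   public class MetadataFirstDeposit {
> >>       public static void main(String[] args) throws Exception {
> >>           // Atom entry carrying only metadata plus a (hypothetical)
> >>           // field that a later ingest step reads to find the file.
> >>           String entry =
> >>               "<entry xmlns=\"http://www.w3.org/2005/Atom\"\n" +
> >>               "       xmlns:dcterms=\"http://purl.org/dc/terms/\">\n" +
> >>               "  <title>Sequencing run, Nov 2011</title>\n" +
> >>               "  <dcterms:source>/data/incoming/run-2011-11.tar.gz</dcterms:source>\n" +
> >>               "</entry>\n";
> >>
> >>           URL collection = new URL("https://repo.example.ac.uk/sword/collection/123");
> >>           HttpURLConnection conn = (HttpURLConnection) collection.openConnection();
> >>           conn.setRequestMethod("POST");
> >>           conn.setDoOutput(true);
> >>           conn.setRequestProperty("Content-Type", "application/atom+xml;type=entry");
> >>           conn.setRequestProperty("Authorization", "Basic "
> >>                   + DatatypeConverter.printBase64Binary("user:password".getBytes("UTF-8")));
> >>
> >>           OutputStream out = conn.getOutputStream();
> >>           try {
> >>               out.write(entry.getBytes("UTF-8"));
> >>           } finally {
> >>               out.close();
> >>           }
> >>           // 201 Created plus a deposit receipt means the metadata-only
> >>           // item now exists; the large file follows later by whichever
> >>           // route (curation task, registration, GridFTP drop-off) suits.
> >>           System.out.println("Response: " + conn.getResponseCode());
> >>       }
> >>   }
> >>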
> >> Could you tell us a bit more about the process you want to implement?
> >> Who has the data and the metadata, and who performs the deposit, etc.?
> >>
> >> Thanks,
> >>
> >>
> >> Stuart Lewis
> >> Digital Development Manager
> >> Te Tumu Herenga The University of Auckland Library
> >> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> >> Ph: +64 (0)9 373 7599 x81928
> >>
> >>
> >>
> >> On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
> >>
> >>> Stuart,
> >>>
> >>> Can you provide any links to examples of using 'deposit by reference'?
> >>>
> >>> I am looking at the feasibility of depositing very large items (tar.gz
> >>> or zipped data files), say up to 16TB, into DSpace 1.6.x, with the
> >>> obvious problems of doing this using a web interface.
> >>> I am wondering if EasyDeposit can be adapted to do 'deposit by
> >>> reference', with either a utility of some kind on the DSpace server
> >>> looking for large items to ingest, or a client pushing the data onto a
> >>> directory on the DSpace server from where it can be ingested. Ideally I
> >>> want to minimise any copies of the data.
> >>>
> >>> I really want to avoid copying the item once it's on the DSpace server.
> >>> Could the item be uploaded directly into the asset store, maybe?
> >>> The other problem is how anyone could download the item once it's in
> >>> DSpace.
> >>>
> >>> Is anyone else doing this sort of very large item (i.e. TBs) ingest?
> >>>
> >>> Thank you,
> >>>
> >>> Pete
> >>>
> >>>
_______________________________________________
sword-app-tech mailing list
sword-app-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sword-app-tech