That's great to hear, Stuart, and cheers for the update. Is there a project blog by which we can monitor progress? /dff
> -----Original Message-----
> From: Stuart Lewis [mailto:s.le...@auckland.ac.nz]
> Sent: 05 December 2011 23:32
> To: David FLANDERS
> Cc: Kathi Fletcher; sword-app-tech@lists.sourceforge.net; oda...@gmail.com; Rufus
> Subject: Re: [sword-app-tech] How to send large fiels
>
> FWIW we've got a couple of students under New Zealand's "Summer of
> eResearch" scheme looking at implementing a university data catalogue,
> with CKAN being one of the candidate systems. Will let you know how we
> get on.
>
> Stuart Lewis
> Digital Development Manager
> Te Tumu Herenga The University of Auckland Library
> Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> Ph: +64 (0)9 373 7599 x81928
>
> On 6/12/2011, at 11:28 AM, David FLANDERS wrote:
>
> > I've bugged Rufus a fair amount on this; one of his project managers,
> > Mark MacGillivray, has been thinking about this re 'Open Scholarship'
> > quite a bit. I wish we could get a university to start playing around
> > with this. Of course the Tardis folk down in Oz have been doing good
> > things as well; Cc Steve Androulakis. /dff
> >
> > From: Kathi Fletcher [mailto:kathi.fletc...@gmail.com]
> > Sent: 05 December 2011 14:50
> > To: David FLANDERS
> > Cc: sword-app-tech@lists.sourceforge.net; Rufus; Leggett, Pete
> > Subject: Re: [sword-app-tech] How to send large fiels
> >
> > Hi,
> >
> > I have CC'd Rufus Pollock of CKAN in case he has ideas about some
> > sort of system where papers go in document repositories like DSpace
> > and EPrints, and data goes in data repositories like CKAN etc.
> >
> > Kathi
> >
> > ---------- Forwarded message ----------
> > From: David FLANDERS <d.fland...@jisc.ac.uk>
> > Date: 2011/12/5
> > Subject: Re: [sword-app-tech] How to send large fiels
> > To: Ben O'Steen <bost...@gmail.com>, Stuart Lewis <s.le...@auckland.ac.nz>
> > Cc: sword-app-tech@lists.sourceforge.net, "Leggett, Pete" <p.f.legg...@exeter.ac.uk>
> >
> > +1
> >
> > Why not use systems *built for* data instead of a system built for
> > research papers? CKAN, Tardis, Kasabi, MongoDB, a NoSQL store (triple,
> > graph, key-value)...?
> >
> > I'd like to hear a good reason not to use these systems and have them
> > interoperate with repositories, rather than building the same
> > functionality into repositories. /dff
> >
> > From: Ben O'Steen [mailto:bost...@gmail.com]
> > Sent: 05 December 2011 08:00
> > To: Stuart Lewis
> > Cc: sword-app-tech@lists.sourceforge.net; Leggett, Pete
> > Subject: Re: [sword-app-tech] How to send large fiels
> >
> > While I think I understand the drive to put these files within a
> > repository, I would suggest caution. Just because it might be possible
> > to put a file into the care of a repository doesn't make it a
> > practical or useful thing to do.
> >
> > - What do you feel you might gain by placing 500GB+ files into a
> >   repository, compared with having them in an addressable filestore?
> > - Have people been able to download files of that size from DSpace,
> >   Fedora or EPrints?
> > - Has the repository been allocated space on a suitable filesystem?
> >   XFS, EBS, Thumper or similar?
> > - Once the file is ingested into DSpace or Fedora, for example, is
> >   there any other route to retrieve it aside from HTTP? (Coding your
> >   own servlet/add-on is not a real answer to this.) Is it easily
> >   accessible via GridFTP or HPN-SSH, for example?
> > - Can the workflows you wish to use handle the data you are giving
> >   them? Is any broad-stroke tool aside from fixity checking useful
> >   here? (See the sketch below.)
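> >
> > On fixity: at these sizes the one broad-stroke tool that does scale
> > is a streaming checksum. A minimal Java sketch, purely illustrative;
> > the file path comes from the command line and the chunk size is
> > arbitrary:
> >
> > import java.io.FileInputStream;
> > import java.io.InputStream;
> > import java.security.MessageDigest;
> >
> > // Streams the file in fixed-size chunks, so memory use stays
> > // constant no matter how large the bitstream is.
> > public class FixityCheck {
> >     public static void main(String[] args) throws Exception {
> >         MessageDigest md = MessageDigest.getInstance("MD5");
> >         byte[] buffer = new byte[8 * 1024 * 1024]; // 8 MB chunks
> >         try (InputStream in = new FileInputStream(args[0])) {
> >             int read;
> >             while ((read = in.read(buffer)) != -1) {
> >                 md.update(buffer, 0, read);
> >             }
> >         }
> >         StringBuilder hex = new StringBuilder();
> >         for (byte b : md.digest()) {
> >             hex.append(String.format("%02x", b));
> >         }
> >         // Compare against the checksum recorded at deposit time.
> >         System.out.println(hex);
> >     }
> > }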
> >
> > Again, I am advising caution here, not besmirching the name of
> > repositories. They do a good job with what we might currently term
> > "small files", but they were never developed with research data sizes
> > in mind (3-500GB is a decent rough guide; 1TB+ sets are certainly not
> > uncommon).
> >
> > So, in short, weigh the benefits against the downsides, and not in
> > hypotheticals. Actually do it, and get real researchers to try and
> > use it. You'll soon have a metric to show what is useful and what
> > isn't.
> >
> > On Monday, 5 December 2011, Stuart Lewis wrote:
> >
> > Hi Pete,
> >
> > Thanks for the information. I've attached a piece of code that we use
> > locally as part of the curation framework (in DSpace 1.7 or above),
> > written by a colleague, Kim Shepherd. The curation framework allows
> > small jobs to be run on single items, collections, communities, or
> > the whole repository. This particular job looks to see if there is a
> > filename in a prescribed metadata field, and if there is no matching
> > bitstream, it will then ingest the file from disk.
> >
> > More details of the curation system can be seen at:
> >
> > - https://wiki.duraspace.org/display/DSPACE/CurationSystem
> > - https://wiki.duraspace.org/display/DSPACE/Curation+Task+Cookbook
> >
> > Some other curation tasks that Kim has written:
> >
> > - https://github.com/kshepherd/Curation
> >
> > This can be used by depositing the metadata via SWORD, with the
> > filename in a metadata field. Optionally the code could be changed to
> > copy the file from another source (e.g. FTP, HTTP, Grid, etc).
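> >
> > In rough outline, such a task looks something like the sketch below.
> > This is a from-memory illustration against the DSpace 1.7 curation
> > API, not Kim's attached code; the metadata field, staging directory
> > and class name are assumptions, and Context/transaction handling is
> > elided:
> >
> > import java.io.File;
> > import java.io.FileInputStream;
> > import java.io.IOException;
> >
> > import org.dspace.content.Bitstream;
> > import org.dspace.content.Bundle;
> > import org.dspace.content.DCValue;
> > import org.dspace.content.DSpaceObject;
> > import org.dspace.content.Item;
> > import org.dspace.curate.AbstractCurationTask;
> > import org.dspace.curate.Curator;
> >
> > public class IngestFromDisk extends AbstractCurationTask {
> >     // Directory the large files are staged in (an assumption).
> >     private static final String BASE_DIR = "/data/staging";
> >
> >     @Override
> >     public int perform(DSpaceObject dso) throws IOException {
> >         if (!(dso instanceof Item)) {
> >             return Curator.CURATE_SKIP; // only items carry bitstreams
> >         }
> >         Item item = (Item) dso;
> >         try {
> >             // Filename recorded at deposit time in an agreed field
> >             // (dc.relation.requires is an illustrative choice).
> >             DCValue[] names =
> >                 item.getMetadata("dc", "relation", "requires", Item.ANY);
> >             if (names.length == 0) {
> >                 return Curator.CURATE_SKIP;
> >             }
> >             String filename = names[0].value;
> >
> >             // Skip items that already have a matching bitstream.
> >             for (Bundle bundle : item.getBundles("ORIGINAL")) {
> >                 for (Bitstream bs : bundle.getBitstreams()) {
> >                     if (filename.equals(bs.getName())) {
> >                         setResult("Already ingested: " + filename);
> >                         return Curator.CURATE_SKIP;
> >                     }
> >                 }
> >             }
> >
> >             // Stream the file from disk into a new bitstream.
> >             Bundle[] bundles = item.getBundles("ORIGINAL");
> >             Bundle target = (bundles.length > 0)
> >                     ? bundles[0] : item.createBundle("ORIGINAL");
> >             File source = new File(BASE_DIR, filename);
> >             Bitstream bs =
> >                 target.createBitstream(new FileInputStream(source));
> >             bs.setName(filename);
> >             bs.update();
> >             item.update();
> >             setResult("Ingested " + filename);
> >             return Curator.CURATE_SUCCESS;
> >         } catch (Exception e) {
> >             setResult("Ingest failed: " + e.getMessage());
> >             return Curator.CURATE_FAIL;
> >         }
> >     }
> > }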
> >
> > Thanks,
> >
> > Stuart Lewis
> > Digital Development Manager
> > Te Tumu Herenga The University of Auckland Library
> > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> > Ph: +64 (0)9 373 7599 x81928
> >
> > On 29/11/2011, at 12:09 PM, Leggett, Pete wrote:
> >
> > > Hi Stuart,
> > >
> > > You asked for more info. We are developing a Research Data
> > > Repository based on DSpace for storing the research data associated
> > > with Exeter University research publications. For some research
> > > fields, such as physics and biology, this data can be very large
> > > (TBs, it seems!), hence the need to consider large ingests that
> > > might run over several days.
> > >
> > > The researcher has the data, and would, I am guessing, create the
> > > metadata, though maybe in collaboration with a data curator.
> > > Ideally the researcher would perform the deposit, with an offline
> > > ingest of the data itself for large data sets. The data can be on
> > > the researcher's server/workstation/laptop/DVD/USB hard drive etc.
> > >
> > > There seem to be at least a couple of ways of approaching this, so
> > > what I was after were some references to what other people have
> > > done and how, to give me a better handle on the best way forward,
> > > having very little DSpace or repository experience myself. But
> > > given the size of larger data sets, I do think the best solution
> > > will involve as little copying of the data as possible, the ideal
> > > being just one copy process, of the data from source into the
> > > repository, with everything else done by reference if that is
> > > possible.
> > >
> > > Are you perhaps able to point me at some "code" examples for the
> > > SWORD deposit you talk about, where a second process ingests the
> > > files? Would this be coded in Java? Does the ingest process have to
> > > be Java-based, or can it be a Perl script for example?
> > > Please forgive my DSpace ignorance!
> > >
> > > Best regards,
> > >
> > > Pete
> > >
> > > On 28/11/2011 20:26, "Stuart Lewis" <s.le...@auckland.ac.nz> wrote:
> > >
> > > > Hi Pete,
> > > >
> > > > 'Deposit by reference' would probably be used to 'pull' data from
> > > > a remote server. If you already have the data on your DSpace
> > > > server, as Mark points out there might be better ways to perform
> > > > the import, such as registering the bitstreams, or just
> > > > performing a local import.
> > > >
> > > > A SWORD deposit by reference might take place in two parts:
> > > >
> > > > - Deposit some metadata that includes a description of the
> > > >   file(s) to be ingested (this step is sketched below).
> > > > - A second process (perhaps triggered by the SWORD deposit, or
> > > >   undertaken later, such as via a DSpace curation task) ingests
> > > >   the file(s) into the DSpace object.
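> > > >
> > > > For the first part, a minimal sketch in Java, only to fix ideas,
> > > > not a prescription: the endpoint URL, the credentials, and the
> > > > use of dc:relation to carry the file pointer are all
> > > > illustrative, and whether a server accepts a metadata-only Atom
> > > > entry like this depends on the SWORD implementation.
> > > >
> > > > import java.io.OutputStream;
> > > > import java.net.HttpURLConnection;
> > > > import java.net.URL;
> > > > import java.util.Base64;
> > > >
> > > > public class DepositByReference {
> > > >     public static void main(String[] args) throws Exception {
> > > >         // Atom entry carrying only metadata plus a pointer to
> > > >         // the file; no bytes are transferred in this request.
> > > >         String entry =
> > > >             "<?xml version='1.0' encoding='UTF-8'?>"
> > > >             + "<entry xmlns='http://www.w3.org/2005/Atom'"
> > > >             + " xmlns:dc='http://purl.org/dc/elements/1.1/'>"
> > > >             + "<title>Large dataset, by reference</title>"
> > > >             + "<dc:relation>file:///staging/dataset-001.tar.gz"
> > > >             + "</dc:relation>"
> > > >             + "</entry>";
> > > >
> > > >         // Placeholder collection deposit URL and credentials.
> > > >         URL deposit = new URL(
> > > >             "https://repo.example.ac.uk/sword/deposit/123456789/1");
> > > >         HttpURLConnection conn =
> > > >             (HttpURLConnection) deposit.openConnection();
> > > >         conn.setRequestMethod("POST");
> > > >         conn.setDoOutput(true);
> > > >         conn.setRequestProperty("Content-Type",
> > > >             "application/atom+xml");
> > > >         String token = Base64.getEncoder()
> > > >             .encodeToString("user:pass".getBytes("UTF-8"));
> > > >         conn.setRequestProperty("Authorization", "Basic " + token);
> > > >
> > > >         try (OutputStream out = conn.getOutputStream()) {
> > > >             out.write(entry.getBytes("UTF-8"));
> > > >         }
> > > >         // Expect 201 Created plus a deposit receipt; the second
> > > >         // process (e.g. the curation task) then does the ingest.
> > > >         System.out.println("Response: " + conn.getResponseCode());
> > > >     }
> > > > }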
> > > >
> > > > Could you tell us a bit more about the process you want to
> > > > implement? Who has the data and the metadata, who performs the
> > > > deposit, etc.?
> > > >
> > > > Thanks,
> > > >
> > > > Stuart Lewis
> > > > Digital Development Manager
> > > > Te Tumu Herenga The University of Auckland Library
> > > > Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
> > > > Ph: +64 (0)9 373 7599 x81928
> > > >
> > > > On 29/11/2011, at 7:19 AM, Leggett, Pete wrote:
> > > >
> > > > > Stuart,
> > > > >
> > > > > Can you provide any links to examples of using 'deposit by
> > > > > reference'?
> > > > >
> > > > > I am looking at the feasibility of depositing very large items
> > > > > (tar.gz or zipped data files), say up to 16TB, into DSpace
> > > > > 1.6.x, with the obvious problems of doing this using a web
> > > > > interface. I am wondering if EasyDeposit can be adapted to do
> > > > > 'deposit by reference', with either a utility of some kind on
> > > > > the DSpace server looking for large items to ingest, or a
> > > > > client pushing the data into a directory on the DSpace server
> > > > > from where it can be ingested. Ideally I want to minimise any
> > > > > copies of the data.
> > > > >
> > > > > I really want to avoid copying the item once it's on the DSpace
> > > > > server. Could the item be uploaded directly into the asset
> > > > > store, maybe? The other problem is how anyone could download
> > > > > the item once it's in DSpace.
> > > > >
> > > > > Anyone else doing this sort of very large item (i.e. TBs)
> > > > > ingest?
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Pete
> > > > >
> > > > > From: David FLANDERS [mailto:d.fland...@jisc.ac.uk]
> >
> > --
> > Katherine Fletcher, kathi.fletc...@gmail.com
> > Twitter: kefletcher  Blog: kefletcher.blogspot.com