Brilliant, thank you, Pete, for posting your replies! It is obvious that
you've already considered this area in depth, so I am sure that many
people will find your responses very helpful indeed. My questions were
intended to provoke thought on the list: I know from my own experience
and failings that it can be easy to fall into the mindset of forcing every
solution to be a repository!
In my opinion, the main point about data repositories is that, for one to
be useful, it should be easy to get a portion of a dataset, a whole dataset,
or even multiple datasets into situations where the data can be worked on.
This implies a sliding scale of environments, from pushing the data into
HPC and compute clouds to providing downloads for personal Google
spreadsheets, Excel, Matlab or SPSS. My questions mostly arose because I
did not see how putting (large) datasets into the normal research paper
repositories can aid this. They were not meant to be critical of the plan,
but were intended to illustrate and share my concerns, concerns which I
believe you share.
Ben
On Tuesday, 6 December 2011, Leggett, Pete wrote:
> Hi Ben,
>
> In response to your questions (and apologies if I’m teaching anyone on
> this list to suck eggs):
>
> - What do you feel you might gain by placing 500Gb+ files into a
> repository, compared with having them in an addressable filestore?
>
> My understanding is that many funding bodies are now requiring that
> research data be made available along with the academic paper, allowing
> people to investigate and reproduce published research. The gain here in
> having the research data in a repository like DSpace would, I guess, be
> making it possible for people to easily find the data alongside the
> papers. That’s not to say that the actual data has to be in the
> traditional DSpace asset store. From what I’ve read so far on DSpace, the
> data can also be “referenced”, as long as the reference is to a file
> store accessible to DSpace. Further responses to this thread have talked
> about other ideas like CKAN etc. – of which I know near zero, so I will
> need to investigate further. At Exeter we are very much geared up for
> using DSpace since we already have a number of DSpace repositories
> running, so whatever solution we end up with, I think it will currently
> involve DSpace.
>
> The other gain of course is the proper and managed curation of research
> data through a workflow process, rather than the data ending up on a DVD
> on a professor’s shelf.
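[As a rough illustration of what "managed curation" can mean in practice: a workflow will typically record fixity information at ingest and re-check it in later audits. A minimal Python sketch, where the function names and the manifest format are made up for illustration (DSpace has its own internal checksum checker, which this does not represent):]

```python
# Sketch of ingest-time fixity recording and later audit. The manifest
# format ({path: hexdigest}) is an assumption for illustration only.
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Stream the file in chunks so multi-GB datasets fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """manifest maps path -> expected hexdigest; returns paths that fail."""
    return [path for path, expected in manifest.items()
            if file_checksum(path) != expected]
```

[The point of the chunked read is exactly the 500Gb+ case discussed above: the audit never needs more than one chunk in memory at a time.]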
>
>
> - Have people been able to download files of that size from DSpace, Fedora
> or EPrints?
>
> No idea!! But I take the point and it’s one I’ve already alluded to -
> it’s all very well getting this stuff into or referenced from DSpace, but
> how will Joe researcher download it easily? I suspect there is a
> requirement to provide other mechanisms for download from or via the
> DSpace repository rather than a normal web interface – isn’t this where
> SWORD comes in?
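[One download mechanism worth noting that needs no repository add-on at all is HTTP byte-range requests, which let a client resume or parallelise a large transfer. This assumes the server in front of the repository honours the Range header, which would need checking for any given DSpace installation. A small sketch of how a client might split a large bitstream into ranges; the names here are illustrative, not from any repository API:]

```python
# Sketch: split a large file into HTTP "Range" header values so a client
# can fetch (or resume) it piecewise. Assumes the server supports Range.
def byte_ranges(total_size, part_size):
    """Return Range header values covering total_size bytes in parts."""
    ranges = []
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1  # ranges are inclusive
        ranges.append("bytes=%d-%d" % (start, end))
    return ranges

# e.g. byte_ranges(10, 4) -> ["bytes=0-3", "bytes=4-7", "bytes=8-9"]
```

[Each value would go in a separate request's Range header; a 206 Partial Content response confirms the server cooperated.]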
>
>
> - Has the repository been allocated space on a suitable filesystem? XFS,
> EBS, Thumper or similar?
>
> Yes, I think so. We have an EMC Atmos currently providing 860TB of raw
> storage. This is basically object cloud storage in a similar vein to
> Amazon S3, but it also provides NFS/CIFS access via what EMC call an IFS
> server. We are currently running a DSpace asset store on our Atmos using
> an IFS server. Atmos also has a REST-based interface, and an Amazon S3
> proxy (i.e. making it work with many Amazon S3 clients) is in
> development, which we have been beta testing. We are also hoping to use
> the Atmos for backup of live research data. Atmos is good for archiving
> and serving up objects to the web and the sort of things people use S3
> for, but it’s not tier 1 storage – it’s not designed to be. The fit as a
> DSpace asset store seems to be a good one. Caveat - it still remains to
> be proved in production DSpace use.
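["Working with many Amazon S3 clients" largely comes down to the proxy accepting standard S3 request signing (signature version 2 at the time of writing). For flavour, a standard-library sketch of the client side of that handshake; the credentials and resource below are made up, and any real S3 client library does this for you internally:]

```python
# Sketch: AWS signature v2 as used by stock S3 clients, which an
# S3-compatible proxy would have to verify. Illustration only.
import base64
import hashlib
import hmac


def sign_s3_request(secret_key, method, resource, date,
                    content_md5="", content_type="", amz_headers=""):
    """Return the base64 HMAC-SHA1 signature for an S3 Authorization header."""
    string_to_sign = ("\n".join([method, content_md5, content_type, date])
                      + "\n" + amz_headers + resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")
```

[The client sends this as `Authorization: AWS <access-key>:<signature>`; the proxy recomputes it from the same request fields and compares.]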
>
>
> - Once the file is ingested into DSpace or Fedora for example, is there
> any other route to retrieve this, aside from HTTP? (Coding your own
> servlet/addon is not a real answer to this.) Is it easily accessible via
> Grid-FTP or HPN-SSH for example?
>
> Not yet for us, but I agree that another route such as HPN-SSH will be
> needed for large data sets.
>
>
> So, in short, weigh up the benefits against the downsides, and not in
> hypotheticals. Actually do it, and get real researchers to try and use it.
> You'll soon have a metric to show what is useful and what isn't.
>
> That’s what we are aiming to do as part of the JISC-funded OpenExeter
> project – we will be piloting with researchers and using this to develop
> procedures and workflows etc.
>
>
> Many thanks for all the comments and emails so far on this thread – very
> useful and interesting.
>
> Best regards,
>
> Pete
>
>
_______________________________________________
sword-app-tech mailing list
sword-app-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sword-app-tech