Re: [Reprozip-users] Web Archiving

2018-04-18 Thread Vicky Steeves
Hi Rasa,

Apologies, we were traveling and just got back to the office. We are very
glad to be of help!

We let users who pack experiments edit the yml file before the final packing
step, and we let secondary users who unpack download and view the yml file. We
certainly *could* automatically extract categories of information for the
user. It bears more thinking about, especially since there are a few ways that
unpacking users interface with ReproUnzip.
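
To make that concrete, here is a rough sketch (not part of ReproZip itself) of
what such an extraction could look like, reading a config.yml with PyYAML. The
key names used here (runs, packages, inputs_outputs) come from the sample
config.yml files linked further down this thread, so treat them as an
assumption rather than a stable schema:

import yaml  # pip install PyYAML

def summarize_config(path="config.yml"):
    """Print a short, human-friendly summary of a ReproZip config.yml."""
    with open(path) as f:
        config = yaml.safe_load(f)

    runs = config.get("runs", [])
    packages = config.get("packages", [])
    inputs_outputs = config.get("inputs_outputs", [])

    print("Runs traced:", len(runs))
    for run in runs:
        # Each run records the traced command line and basic system info.
        print("  command:", " ".join(run.get("argv", [])))
        print("  system :", run.get("distribution"))

    print("Software packages captured:", len(packages))
    for pkg in packages[:10]:  # just the first few; real lists are very long
        print("   -", pkg.get("name"), pkg.get("version"))

    print("Declared input/output files:", len(inputs_outputs))

if __name__ == "__main__":
    summarize_config()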

Best,
Vicky

Vicky Steeves
Research Data Management and Reproducibility Librarian
Phone: 1-212-992-6269
ORCID: orcid.org/0000-0003-4298-168X
vickysteeves.com | @VickySteeves <https://twitter.com/VickySteeves>
NYU Libraries Data Services | NYU Center for Data Science

On Tue, Apr 10, 2018 at 4:46 AM, Rasa Bočytė <rboc...@beeldengeluid.nl>
wrote:

> Hi Remi,
>
> In terms of migration, originally my institute planned to acquire files
> from the creators and then figure out what to do with them, most likely
> migrating individual files to updated versions when needed. I don't think
> that is a helpful approach, since you need to start at the server and
> capture the environment and software that manipulate those files to create
> the website, especially if you want to be able to reproduce it.
>
> I am definitely leaning towards the idea that virtualisation of a web
> server would be the best approach for us. I will try to test out the
> examples that you have on your website and see if I can run some tests with
> my own case studies (of course, it depends if the creators will allow us to
> do it).
>
> I promise I won't bother you too much, but my last question is about the
> metadata captured in the yml file. It is machine and human readable, but the
> question is what you do with it, and how you present it once you have it, so
> that it becomes a valuable resource for those using the preserved object.
> Have you thought about automatically extracting some categories of
> information from that file in a user-friendly format or do you think it is
> enough as it is?
>
> Just wanted to say a massive thank you for your feedback. It has been
> incredibly helpful!
>
> Rasa
>
> On 6 April 2018 at 19:53, Rémi Rampin <remi.ram...@nyu.edu> wrote:
>
>> Rasa,
>>
>> 2018-04-04 08:03 EDT, Rasa Bočytė <rboc...@beeldengeluid.nl>:
>>
>>> In our case, we are getting all the source files directly from content
>>> creators and we are looking for a way to record and store all the
>>> technical, administrative and descriptive metadata, and visualise
>>> dependencies on software/hardware/file formats, etc. (similar to what
>>> Binder does).
>>>
>>
>> I didn't think Binder did that (this binder?
>> <https://github.com/jupyterhub/binderhub>). It is certainly a good
>> resource for reproducing environments already described as a Docker image
>> or Conda YAML, but I am not aware of ways to use it to track or visualize
>> dependencies or any metadata.
>>
>> We have been mostly considering migration as it is a more scalable
>>> approach and less technically demanding. Do you find that virtualisation is
>>> a better strategy for website preservation? At least from the archival
>>> community, we have heard some reservations about using Docker since it is
>>> not considered a stable platform.
>>>
>>
>> When you talk of migration, do you mean to new hardware? What would you
>> be migrating to? Or do you mean upgrading underlying software/frameworks?
>> The way I see it, virtualization (sometimes referred to as "preserving
>> the mess") is definitely less technically demanding than migration. Could
>> you share a bit more about what you mean by this?
>>
>> Thanks
>>
>> PS: Please make sure you keep us...@reprozip.org in the recipients list.
>> --
>> Rémi Rampin
>> ReproZip Developer
>> Center for Data Science, New York University
>>
>
>
>
> --
>
> *Rasa Bocyte*
> Web Archiving Intern
>
> *Netherlands Institute for Sound and Vision*
> *Media Parkboulevard 1, 1217 WE Hilversum | Postbus 1060, 1200 BB Hilversum
> | beeldengeluid.nl <http://www.beeldengeluid.nl/>*
>
___
Reprozip-users mailing list
Reprozip-users@vgc.poly.edu
https://vgc.poly.edu/mailman/listinfo/reprozip-users


[Reprozip-users] Fwd: Web Archiving

2018-04-06 Thread Vicky Steeves
-- Forwarded message --
From: Rasa Bočytė <rboc...@beeldengeluid.nl>
Date: Fri, Apr 6, 2018 at 9:11 AM
Subject: Re: [Reprozip-users] Web Archiving
To: vicky.stee...@nyu.edu

Dear Vicky,

Thank you for your response! I have been searching, rather hopelessly, for
scalable approaches to web preservation for the last couple of months, so I am
very excited to have finally found your project, which deals with similar
issues and describes exactly what we need for our archiving purposes.
Automatically packaging and describing environments and dependencies is the
biggest challenge for us, and the ReproZip approach would be really useful.

If I understand correctly, ReproZip can describe the environment necessary to
run a particular piece of software or a web application used to create a
dynamic website (Nikola in the case of your website, or Django with StackedUp).
Would it work if there is no such software? With the websites I am working
with, we do have access to the web servers. If the content sits in a MySQL
database, I guess you would be able to package the environment with ReproZip,
but could it capture the environment if the content sits on the server simply
as files and folders, with no software involved?

Another question I have relates to the packaging process. How does it work for
dynamic websites? I had a look at the video you mentioned in your email, and
what I would like to know is how ReproZip deals with content that is generated
on the fly through user interaction. Is it similar to Webrecorder
<https://webrecorder.io/>, in that it only records the dependencies and
transactions triggered by what the user clicks on, or could it automatically
capture the whole environment without user interaction?

I am sorry for bothering you with these questions. I come from a
non-technical background, so it is difficult to understand all the technical
intricacies. But I would love to test your tools as part of my research, and
even if I do not manage to do that, I will definitely mention ReproZip as a
very promising approach to web preservation.

Regards,
Rasa

On 5 April 2018 at 22:35, Vicky Steeves <vicky.stee...@nyu.edu> wrote:

> Hello Rasa,
>
> As the resident librarian on the team, I am really happy to see this email
> on the ReproZip users list!
>
> We are mainly exploring the possibilities of packing with dynamic sites,
> but within the domain of data journalism. Once we have worked with those
> use cases, we can certainly go beyond to other dynamic sites. Data
> journalism is a good place to start because of the nature of the work and
> how data journalism applications are served to the web --- lots of
> containers, databases, interactive websites, etc. In order to pack anything
> with ReproZip, we (or anyone using ReproZip!) need access to the original
> environment and source files. We basically need access to the server, and
> then we can pack the dynamic site. We recorded the process of packing a
> dynamic website and put it on YouTube, which might be helpful:
> https://www.youtube.com/watch?v=SoE2nEJWylw&list=PLjgZ3v4gFxpXdPRBaFTh42w3HRMmX2WfD
>
> ReproZip automatically captures technical and administrative metadata. You
> can view the technical and administrative metadata collected in this sample
> config.yml file: https://gitlab.com/snippets/1686638. The config.yml has
> all the metadata from the .rpz package. That particular yml file I just
> linked comes from an experiment of mine: packing with ReproZip a website
> made with Nikola, a static site generator, and deployed on Firefox. This is
> the same website with Nikola, deployed on Google Chrome:
> https://gitlab.com/snippets/1686640. The config.yml file is human
> readable, but very long (lots of dependencies!). We still need to get the
> descriptive metadata from the users, though.
>
> ReproZip can visualize the provenance of the processes, dependencies, and
> input/output files via a graph. The documentation and examples of those can
> be found here: https://docs.reprozip.org/en/1.0.x/graph.html. We are in
> the process of integrating a patch for transforming this static graph into
> an interactive visualization using D3.
>
> As for Docker, I too do not trust it. However, ReproZip simply *uses*
> Docker, but does not rely on it. ReproZip works on a plugin model -- so the
> .rpz file is generalized and can be used by many virtualization and
> container software. We are in the process of adding an unpacker for
> Singularity, for example. If Docker goes out of business/ceases to exist
> tomorrow, we can still unpack and reuse the contents of .rpz files. We
> actually wrote a paper about how ReproZip could be used for digital
> preservation, available open access here: 
> https://osf.io/preprints/lissa/5tm8d/
>
>
> In regards to emulation vs. migration, this is a larger conversation in the
> digital preservation community, as you probably know.

Re: [Reprozip-users] Web Archiving

2018-04-05 Thread Vicky Steeves
Hello Rasa,

As the resident librarian on the team, I am really happy to see this email
on the ReproZip users list!

We are mainly exploring the possibilities of packing with dynamic sites,
but within the domain of data journalism. Once we have worked with those
use cases, we can certainly go beyond to other dynamic sites. Data
journalism is a good place to start because of the nature of the work and
how data journalism applications are served to the web --- lots of
containers, databases, interactive websites, etc. In order to pack anything
with ReproZip, we (or anyone using ReproZip!) need access to the original
environment and source files. We basically need access to the server, and
then we can pack the dynamic site. We recorded the process of packing a
dynamic website and put it on YouTube, which might be helpful:
https://www.youtube.com/watch?v=SoE2nEJWylw&list=PLjgZ3v4gFxpXdPRBaFTh42w3HRMmX2WfD
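
In rough outline, the pack-side workflow from that video looks like the sketch
below, driven through Python's subprocess just for concreteness; the server
command and file names are placeholders, not the ones used in the recording:

import subprocess

# Placeholder: whatever command actually serves the dynamic site on the
# server, e.g. a Django development server.
SERVER_CMD = ["python3", "manage.py", "runserver"]

def trace_site():
    # ReproZip records the files, packages, and system metadata the server
    # touches into .reprozip-trace/config.yml; the trace ends when the traced
    # process exits (for a server, when you stop it).
    subprocess.run(["reprozip", "trace"] + SERVER_CMD)

def pack_site():
    # Run after reviewing/editing .reprozip-trace/config.yml in a text editor.
    subprocess.run(["reprozip", "pack", "website.rpz"], check=True)

# Meant as separate steps, with the config.yml edit in between:
#   trace_site()  ->  edit .reprozip-trace/config.yml  ->  pack_site()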

ReproZip automatically captures technical and administrative metadata. You
can view the technical and administrative metadata collected in this sample
config.yml file: https://gitlab.com/snippets/1686638. The config.yml has
all the metadata from the .rpz package. That particular yml file I just
linked comes from an experiment of mine: packing with ReproZip a website
made with Nikola, a static site generator, and deployed on Firefox. This is
the same website with Nikola, deployed on Google Chrome:
https://gitlab.com/snippets/1686640. The config.yml file is human readable,
but very long (lots of dependencies!). We still need to get the descriptive
metadata from the users, though.

ReproZip can visualize the provenance of the processes, dependencies, and
input/output files via a graph. The documentation and examples of those can
be found here: https://docs.reprozip.org/en/1.0.x/graph.html. We are in the
process of integrating a patch for transforming this static graph into an
interactive visualization using D3.
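
If it helps, producing one of those graphs from an .rpz package looks roughly
like the following; the file names are placeholders, and the argument order
should be double-checked against the docs page above:

import subprocess

# Generate a Graphviz DOT file describing processes, dependencies, and files
# recorded in the package.
subprocess.run(["reprounzip", "graph", "website-graph.dot", "website.rpz"],
               check=True)

# Render it with Graphviz (installed separately).
subprocess.run(["dot", "-Tpng", "website-graph.dot", "-o", "website-graph.png"],
               check=True)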

As for Docker, I too do not trust it. However, ReproZip simply *uses*
Docker, but does not rely on it. ReproZip works on a plugin model -- so the
.rpz file is generalized and can be used by many virtualization and
container software. We are in the process of adding an unpacker for
Singularity, for example. If Docker goes out of business/ceases to exist
tomorrow, we can still unpack and reuse the contents of .rpz files. We
actually wrote a paper about how ReproZip could be used for digital
preservation, available open access here: https://osf.io/preprints/lissa/5tm8d/
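
To make the plugin idea concrete, the same .rpz can be replayed with different
unpackers. A rough sketch, with placeholder file and directory names:

import subprocess

RPZ = "website.rpz"  # placeholder package name

# Replay the package with the Docker unpacker...
subprocess.run(["reprounzip", "docker", "setup", RPZ, "site-docker"], check=True)
subprocess.run(["reprounzip", "docker", "run", "site-docker"], check=True)

# ...or with the Vagrant (virtual machine) unpacker, no repacking needed.
subprocess.run(["reprounzip", "vagrant", "setup", RPZ, "site-vm"], check=True)
subprocess.run(["reprounzip", "vagrant", "run", "site-vm"], check=True)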


In regards to emulation vs. migration, this is a larger conversation in the
digital preservation community as you probably know. There is a benefit to
serving a website (or any digital object really) in the original
environment, for serving it up to users later on. We've seen this in the
art conservation world, where time-based media preservation has forced
archivists to engage with emulation more seriously. Yale University offers
emulation as a service, where they provide access to various old operating
systems and software to allow folks to interact with digital objects in
their original environment. The video game community also has many
discussions about this, with emulators really rising out of that community.
A high-quality migration can be as effective as emulation in preserving the
original look and feel of complex digital objects -- but I am not sure how
well that approach scales.

Cheers,
Vicky
Vicky Steeves
Research Data Management and Reproducibility Librarian
Phone: 1-212-992-6269
ORCID: orcid.org/0000-0003-4298-168X
vickysteeves.com | @VickySteeves <https://twitter.com/VickySteeves>
NYU Libraries Data Services | NYU Center for Data Science

On Wed, Apr 4, 2018 at 3:28 PM, Rémi Rampin <remi.ram...@nyu.edu> wrote:

> -- Forwarded message --
> From: Rasa Bočytė <rboc...@beeldengeluid.nl>
> Date: Wed, Apr 4, 2018 at 8:03 AM
> Subject: Re: [Reprozip-users] Web Archiving
> To: Rémi Rampin <remi.ram...@nyu.edu>
>
>
> Dear Remi,
>
> thank you for your response! It is good to hear that other people are
> working on similar issues as well!
>
> Could you tell me a bit more about your work on trying to package
> dynamic websites? Are you working on specific cases or just exploring the
> possibilities? I would be very interested to hear how you approach this.
>
> In our case, we are getting all the source files directly from content
> creators and we are looking for a way to record and store all the
> technical, administrative and descriptive metadata, and visualise
> dependencies on software/hardware/file formats, etc. (similar to what
> Binder does). We would try to get as much information from the creators
> (probably via a questionnaire) about all the technical details as well as
> creative processes, and preferably record it in a machine and human
> readable format (XML) or README file.
>
> At the end of the day,