Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Nuria Ruiz
>I have one question for you: As you allow/encourage more copies of
>the files to exist
To be extra clear, we do not encourage keeping data on the notebook
hosts at all; they have no capacity to either process or host
large amounts of data. Data that you are working with is best placed in
your /user/your-username database in Hadoop, so far from encouraging
multiple copies, we are rather encouraging you to keep the data outside
the notebook machines.
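
A minimal sketch of moving a local file from a stat/notebook host into an
HDFS home directory, assuming the standard "hdfs dfs" command-line client is
available on the host. The helper names (hdfs_home, build_put_commands,
copy_to_hdfs) and the example file and folder names are hypothetical and only
for illustration; this is not an Analytics team tool.

```python
import getpass
import subprocess
from typing import List, Optional


def hdfs_home(username: Optional[str] = None) -> str:
    """Return the HDFS home directory path, /user/<username>."""
    return "/user/" + (username or getpass.getuser())


def build_put_commands(local_path: str, subdir: str,
                       username: Optional[str] = None) -> List[List[str]]:
    """Build the `hdfs dfs` commands that create a target folder and upload a file."""
    target = hdfs_home(username) + "/" + subdir.strip("/")
    return [
        # Create the target directory (and parents) if it does not exist yet.
        ["hdfs", "dfs", "-mkdir", "-p", target],
        # Upload the local file; without -f this fails if the file already exists.
        ["hdfs", "dfs", "-put", local_path, target],
    ]


def copy_to_hdfs(local_path: str, subdir: str,
                 username: Optional[str] = None) -> None:
    """Run the commands; raises CalledProcessError if any of them fails."""
    for cmd in build_put_commands(local_path, subdir, username):
        subprocess.run(cmd, check=True)
```

Calling copy_to_hdfs("results.parquet", "project_x") on a host with the HDFS
client would place the file under /user/<your-username>/project_x. Building
the commands separately from running them makes it easy to review what will
be executed first.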

Thanks,

Nuria

On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman wrote:

> I second Leila's question. The issue of how we flag PII data and ensure
> it's appropriately scrubbed came up in our team meeting yesterday. We're
> discussing team practices for data/project backups tomorrow and plan to
> come out with some proposals, at least for the short term.
>
> Are there any existing processes or guidelines I should be aware of?
>
> Thanks!
> Kate
>
> --
>
> Kate Zimmerman (she/they)
> Head of Product Analytics
> Wikimedia Foundation
>
>
> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia  wrote:
>
>> Hi Luca,
>>
>> Thanks for the heads up. Isaac is coordinating a response from the
>> Research side.
>>
>> I have one question for you: As you allow/encourage more copies of
>> the files to exist, what mechanism would you like to put in place
>> to reduce the chances of PII being copied into new folders that will
>> then be even harder (for your team) to keep track of? Having an
>> explicit process/understanding about this will be very helpful.
>>
>> Thanks,
>> Leila
>>
>>
>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano wrote:
>> >
>> > Hi everybody,
>> >
>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
>> > wanted to reach out to everybody to make it clear that none of the home
>> > directories on the stat/notebook nodes are backed up periodically.
>> > They run on a software RAID configuration spanning multiple disks, of
>> > course, so we are resilient to a disk failure, but, even if unlikely, it
>> > might happen that a host loses all its data. Please keep this in mind
>> > when working on important projects and/or handling important data that
>> > you care about.
>> >
>> > I just added a warning to
>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
>> > If you have really important data that is too big to back up, keep in
>> > mind that you can use your home directory (/user/your-username) on HDFS
>> > (which replicates data three times across multiple nodes).
>> >
>> > Please let us know if you have comments/suggestions/etc. in the
>> > aforementioned task.
>> >
>> > Thanks in advance!
>> >
>> > Luca (on behalf of the Analytics team)
>> > ___
>> > Wiki-research-l mailing list
>> > wiki-researc...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Kate Zimmerman
I second Leila's question. The issue of how we flag PII data and ensure
it's appropriately scrubbed came up in our team meeting yesterday. We're
discussing team practices for data/project backups tomorrow and plan to
come out with some proposals, at least for the short term.

Are there any existing processes or guidelines I should be aware of?

Thanks!
Kate

--

Kate Zimmerman (she/they)
Head of Product Analytics
Wikimedia Foundation


On Wed, Jul 10, 2019 at 9:00 AM Leila Zia  wrote:

> Hi Luca,
>
> Thanks for the heads up. Isaac is coordinating a response from the
> Research side.
>
> I have one question for you: As you allow/encourage more copies of
> the files to exist, what mechanism would you like to put in place
> to reduce the chances of PII being copied into new folders that will
> then be even harder (for your team) to keep track of? Having an
> explicit process/understanding about this will be very helpful.
>
> Thanks,
> Leila
>
>
> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano wrote:
> >
> > Hi everybody,
> >
> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> > wanted to reach out to everybody to make it clear that none of the home
> > directories on the stat/notebook nodes are backed up periodically.
> > They run on a software RAID configuration spanning multiple disks, of
> > course, so we are resilient to a disk failure, but, even if unlikely, it
> > might happen that a host loses all its data. Please keep this in mind
> > when working on important projects and/or handling important data that
> > you care about.
> >
> > I just added a warning to
> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
> > If you have really important data that is too big to back up, keep in
> > mind that you can use your home directory (/user/your-username) on HDFS
> > (which replicates data three times across multiple nodes).
> >
> > Please let us know if you have comments/suggestions/etc. in the
> > aforementioned task.
> >
> > Thanks in advance!
> >
> > Luca (on behalf of the Analytics team)


Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Leila Zia
Hi Luca,

Thanks for the heads up. Isaac is coordinating a response from the
Research side.

I have one question for you: As you allow/encourage more copies of
the files to exist, what mechanism would you like to put in place
to reduce the chances of PII being copied into new folders that will
then be even harder (for your team) to keep track of? Having an
explicit process/understanding about this will be very helpful.

Thanks,
Leila


On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano  wrote:
>
> Hi everybody,
>
> as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> wanted to reach out to everybody to make it clear that none of the home
> directories on the stat/notebook nodes are backed up periodically. They
> run on a software RAID configuration spanning multiple disks, of course, so
> we are resilient to a disk failure, but, even if unlikely, it might happen
> that a host loses all its data. Please keep this in mind when working
> on important projects and/or handling important data that you care about.
>
> I just added a warning to
> https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
> If you have really important data that is too big to back up, keep in mind
> that you can use your home directory (/user/your-username) on HDFS (which
> replicates data three times across multiple nodes).
>
> Please let us know if you have comments/suggestions/etc. in the
> aforementioned task.
>
> Thanks in advance!
>
> Luca (on behalf of the Analytics team)


Re: [Analytics] [Wiki-research-l] Firewall on stat100x and notebook100x hosts

2019-07-10 Thread Isaac Johnson
Sounds perfect Luca -- thanks for the clarification!

On Wed, Jul 10, 2019 at 9:20 AM Luca Toscano  wrote:

> Hi Isaac,
>
> On Wed, Jul 10, 2019 at 16:14, Isaac Johnson <is...@wikimedia.org>
> wrote:
>
> > Hey Luca,
> > We discussed this in Research and it all sounds good to us with one
> > question below. If something else arises, we'll ping you. Thanks for the
> > heads up!
> >
> > > We assumed that instructing Spark to use a predefined
> > > range of random ports was not possible, but in
> > > https://phabricator.wikimedia.org/T170826 we discovered that there is a
> > > way (that seems to work fine from our tests).
> >
> > Will we need to change anything in our configuration or will this be
> > automatic?
> >
>
> On the stat hosts the change is already live and your new Spark sessions
> will pick it up automatically. On the notebooks we'll need to restart the
> Spark sessions before enabling the firewall. I am planning to contact all
> the owners of a Spark session on notebook100[3,4], so if anybody sees an
> email from me then there will be an action to take, otherwise none :)
>
> Luca


-- 
Isaac Johnson -- Research Scientist -- Wikimedia Foundation


Re: [Analytics] [Wiki-research-l] Firewall on stat100x and notebook100x hosts

2019-07-10 Thread Luca Toscano
Hi Isaac,

On Wed, Jul 10, 2019 at 16:14, Isaac Johnson wrote:

> Hey Luca,
> We discussed this in Research and it all sounds good to us with one
> question below. If something else arises, we'll ping you. Thanks for the
> heads up!
>
> > We assumed that instructing Spark to use a predefined
> > range of random ports was not possible, but in
> > https://phabricator.wikimedia.org/T170826 we discovered that there is a
> > way (that seems to work fine from our tests).
>
> Will we need to change anything in our configuration or will this be
> automatic?
>

On the stat hosts the change is already live and your new Spark sessions
will pick it up automatically. On the notebooks we'll need to restart the
Spark sessions before enabling the firewall. I am planning to contact all
the owners of a Spark session on notebook100[3,4], so if anybody sees an
email from me then there will be an action to take, otherwise none :)
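
For anyone curious how a Spark driver can be kept within a known port range,
here is a minimal sketch. It assumes the range is expressed through Spark's
spark.driver.port, spark.driver.blockManager.port and spark.port.maxRetries
settings: when a specific start port is given, Spark retries on the next port
after a bind failure, up to spark.port.maxRetries times. The port numbers and
helper names below are illustrative assumptions, not necessarily what was
deployed in T170826.

```python
from typing import Dict, List


def driver_port_conf(driver_port: int, block_manager_port: int,
                     max_retries: int) -> Dict[str, str]:
    """Spark settings that pin the driver's listening ports to fixed start values."""
    return {
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
        # Each retry increments the previous port by 1, so the effective
        # range is [start, start + max_retries].
        "spark.port.maxRetries": str(max_retries),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a settings dict into repeated `--conf key=value` command-line arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


def port_range(start: int, max_retries: int) -> range:
    """Ports Spark may try for one service: the start port plus one per retry."""
    return range(start, start + max_retries + 1)


if __name__ == "__main__":
    conf = driver_port_conf(12000, 13000, 100)  # hypothetical values
    print(" ".join(["--master", "yarn"] + to_submit_args(conf)))
```

A firewall rule would then only need to allow the two resulting ranges
(here 12000-12100 and 13000-13100) from the Hadoop workers, instead of any
random port.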

Luca


Re: [Analytics] [Wiki-research-l] Firewall on stat100x and notebook100x hosts

2019-07-10 Thread Isaac Johnson
Hey Luca,
We discussed this in Research and it all sounds good to us with one
question below. If something else arises, we'll ping you. Thanks for the
heads up!

> We assumed that instructing Spark to use a predefined
> range of random ports was not possible, but in
> https://phabricator.wikimedia.org/T170826 we discovered that there is a way
> (that seems to work fine from our tests).

Will we need to change anything in our configuration or will this be
automatic?

Best,
Isaac

On Fri, Jul 5, 2019 at 4:36 AM Luca Toscano  wrote:

> TL;DR: In https://phabricator.wikimedia.org/T170826 the Analytics team
> wants to add base firewall rules to the stat100x and notebook100x hosts,
> which will cause any traffic that is not localhost or otherwise known to
> be blocked by default.
> Please let us know in the task if this is a problem for you.
>
> Hi everybody,
>
> the Analytics team has always left the stat100x and notebook100x hosts
> without a set of base firewall rules to avoid impacting any
> research/test/etc. activity on those hosts. This choice has a lot of
> downsides. One of the most problematic is that environments like Python
> venvs can install potentially any package, and if the owner does not pay
> attention to security upgrades then we may have a security problem if the
> environment happens to bind to a network port and accept traffic from
> anywhere.
>
> One of the biggest problems was Spark: when somebody launches a shell using
> Hadoop YARN (--master yarn), a Driver component is created that needs to
> bind to a random port to be able to communicate with the workers created on
> the Hadoop cluster. We assumed that instructing Spark to use a predefined
> range of random ports was not possible, but in
> https://phabricator.wikimedia.org/T170826 we discovered that there is a
> way (that seems to work fine from our tests). The other big use case that
> we know of, Jupyter notebooks, seems to require only that localhost traffic
> flows without restrictions.
>
> Please let us know in the task if you have a use case that requires your
> environment to bind to a network port on stat100x or notebook100x and
> accept traffic from other hosts. For example, having a python app that
> binds to port 33000 on stat1007 and listens/accepts traffic from other stat
> or notebook hosts.
>
> If we don't hear anything, we'll start adding base firewall rules to one
> host at a time over the upcoming weeks, tracking our work on the
> aforementioned task.
>
> Thanks!
>
> Luca (on behalf of the Analytics team)


-- 
Isaac Johnson -- Research Scientist -- Wikimedia Foundation