Re: [Wiki-research-l] Brief shutdown of stat1007 for maintenance - Thu Dec 12th 15:30 CET

2019-12-17 Thread Luca Toscano
Hi again,

the maintenance has been postponed to tomorrow (Dec 18th) around 15:00 CET.
Please let me know if it will be a problem for you. The whole maintenance
shouldn't last more than 30 mins :)

Thanks!

Luca

On Wed, Dec 11, 2019 at 08:36 Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> the Analytics team is going to shut down stat1007 for a few minutes on Thu
> Dec 12th at around 15:30 CET to check if there is space for a GPU in the
> server's chassis. Please let us know if this will impact your work (so we
> can arrange a different maintenance window).
>
> Thanks!
>
> Luca
>
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-12-16 Thread Luca Toscano
Hi everybody,

the day has finally come, we'll start the procedure at around 13:00 CET.

For any issue please reach out to the Analytics team on Freenode
(#wikimedia-analytics) or https://phabricator.wikimedia.org/T238560

We hope to have the cluster back running in a couple of hours, but it might
take more if we encounter unexpected issues.

Thanks for the patience!

Luca

On Tue, Nov 26, 2019 at 09:46 Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> to avoid any conflict with the start of the Fundraising season, we moved
> the Kerberos maintenance to Dec 16th at 10 AM CET (more info in
> https://phabricator.wikimedia.org/T238560).
>
> If you use Hadoop and you haven't requested credentials yet, please do so
> following
> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>
>
>
> On Mon, Nov 18, 2019 at 16:11 Luca Toscano <ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> the Analytics team is going to enable Kerberos authentication for Hadoop
>> on Monday December 2nd. The procedure will start around 10 AM CET and will
>> hopefully last 3-4 hours, but since this is an invasive change it might
>> take longer. If you have anything important that requires Hadoop on this
>> date please let us know in advance.
>>
>> The most visible change from the user's point of view is the introduction
>> of a new account/password to be able to use the Hadoop services (like
>> Hive/HDFS/Spark/Oozie). We created a user guide about what will change with
>> Kerberos in
>> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide.
>> There is also an open task to track any doubts, questions, or special use
>> cases during the next two weeks: https://phabricator.wikimedia.org/T238560.
>>
>> Feel free to reach out to IRC #wikimedia-analytics on Freenode too!
>>
>> Thanks!
>>
>> Luca (on behalf of the Analytics team)
>>
>


[Wiki-research-l] Brief shutdown of stat1007 for maintenance - Thu Dec 12th 15:30 CET

2019-12-10 Thread Luca Toscano
Hi everybody,

the Analytics team is going to shut down stat1007 for a few minutes on Thu
Dec 12th at around 15:30 CET to check if there is space for a GPU in the
server's chassis. Please let us know if this will impact your work (so we
can arrange a different maintenance window).

Thanks!

Luca


Re: [Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-11-26 Thread Luca Toscano
Hi everybody,

to avoid any conflict with the start of the Fundraising season, we moved
the Kerberos maintenance to Dec 16th at 10 AM CET (more info in
https://phabricator.wikimedia.org/T238560).

If you use Hadoop and you haven't requested credentials yet, please do so
following
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos

Thanks!

Luca (on behalf of the Analytics team)



On Mon, Nov 18, 2019 at 16:11 Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> the Analytics team is going to enable Kerberos authentication for Hadoop
> on Monday December 2nd. The procedure will start around 10 AM CET and will
> hopefully last 3-4 hours, but since this is an invasive change it might
> take longer. If you have anything important that requires Hadoop on this
> date please let us know in advance.
>
> The most visible change from the user's point of view is the introduction
> of a new account/password to be able to use the Hadoop services (like
> Hive/HDFS/Spark/Oozie). We created a user guide about what will change with
> Kerberos in
> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide.
> There is also an open task to track any doubts, questions, or special use
> cases during the next two weeks: https://phabricator.wikimedia.org/T238560.
>
> Feel free to reach out to IRC #wikimedia-analytics on Freenode too!
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>


[Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-11-18 Thread Luca Toscano
Hi everybody,

the Analytics team is going to enable Kerberos authentication for Hadoop on
Monday December 2nd. The procedure will start around 10 AM CET and will
hopefully last 3-4 hours, but since this is an invasive change it might
take longer. If you have anything important that requires Hadoop on this
date please let us know in advance.

The most visible change from the user's point of view is the introduction
of a new account/password to be able to use the Hadoop services (like
Hive/HDFS/Spark/Oozie). We created a user guide about what will change with
Kerberos in
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide.
There is also an open task to track any doubts, questions, or special use
cases during the next two weeks: https://phabricator.wikimedia.org/T238560.
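
As a rough illustration of what the new account means for scripted work, the
sketch below checks for a valid Kerberos ticket before a script calls a Hadoop
tool. It assumes the MIT Kerberos client tools are installed (`klist -s` exits
with status 0 only when a valid ticket is cached); the helper name
`has_kerberos_ticket` is hypothetical, and the user guide linked above remains
the reference for the actual procedure.

```python
import subprocess


def has_kerberos_ticket(klist_binary="klist"):
    """Return True if `klist -s` reports a valid cached Kerberos ticket.

    Assumption: MIT Kerberos client tools are installed. A missing or
    non-executable binary is treated as "no ticket" instead of raising.
    """
    try:
        result = subprocess.run([klist_binary, "-s"], check=False)
    except OSError:
        return False
    return result.returncode == 0


# Hypothetical usage at the top of a job script:
# if not has_kerberos_ticket():
#     raise SystemExit("No valid ticket: run `kinit` first.")
```

A check like this fails fast with a clear message instead of letting a Hive or
HDFS call fail later with an authentication error.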

Feel free to reach out to IRC #wikimedia-analytics on Freenode too!

Thanks!

Luca (on behalf of the Analytics team)


[Wiki-research-l] Maintenance window for the Hadoop cluster - Tue Oct 15th 14:30 CEST - 15:30 CEST

2019-10-14 Thread Luca Toscano
Hi everybody,

the Analytics team is going to stop HDFS and Yarn services for a
(hopefully) brief time window tomorrow, Tue Oct 15th, from 14:30 to 15:30
CEST.

We are going to swap the Zookeeper cluster from the one currently used by
all the Kafka production services to a dedicated one within the Analytics
VLAN. More details in https://phabricator.wikimedia.org/T217057

Side effects: the move should be transparent to all users, except the ones
relying on history in Yarn (for example, checking it via yarn.wikimedia.org).
The history is in fact stored in zookeeper, and to keep things simple we
are not copying znodes over to the new cluster. Please let us know if this
impacts you in any way.

As usual, if the time window affects your work please let us know and we'll
find a new maintenance window.

Thanks!

Luca (on behalf of the Analytics team).


[Wiki-research-l] Python 2 is going EOL on January 1st - Do you need Python 2 packages in Analytics?

2019-09-13 Thread Luca Toscano
Hi everybody,

as https://www.python.org/doc/sunset-python-2/ says Python 2 is finally
going EOL on January 1st. We (as Analytics team) have a lot of packages
deployed on stat/notebook/hadoop hosts via puppet that should be removed,
but before doing so we'd need to know if any of you is currently using
a Python-2-only environment to work/research/test/etc. If so, please
comment in the following task so we'll discuss your use case and possibly
find a Python-3 solution: https://phabricator.wikimedia.org/T204737
In the task we are going to add info about common packages that we know
(keras, tensorflow, pytorch, etc..) to help you migrate to Python 3 as
quickly and painlessly as possible, so if you are interested please
subscribe to the task.

Thanks in advance!

Luca (on behalf of the Analytics team)


Re: [Wiki-research-l] Urgent maintenance to an-coord1001 requires a brief stop of Oozie/Hive/Spark/etc..

2019-07-15 Thread Luca Toscano
Maintenance completed!

Luca

On Mon, Jul 15, 2019 at 16:22 Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> due to https://phabricator.wikimedia.org/T227941 we'd need to take down
> Oozie/Hive/etc.. on an-coord1001. The maintenance should not last long, but
> if you have any issue please reach out to us on IRC (#wikimedia-analytics
> on Freenode).
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>


[Wiki-research-l] Urgent maintenance to an-coord1001 requires a brief stop of Oozie/Hive/Spark/etc..

2019-07-15 Thread Luca Toscano
Hi everybody,

due to https://phabricator.wikimedia.org/T227941 we'd need to take down
Oozie/Hive/etc.. on an-coord1001. The maintenance should not last long, but
if you have any issue please reach out to us on IRC (#wikimedia-analytics
on Freenode).

Thanks!

Luca (on behalf of the Analytics team)


Re: [Wiki-research-l] [Analytics] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-11 Thread Luca Toscano
Hi Leila and Kate,

adding a few words after Nuria's email to clarify my original intentions.
My point was that any important and vital file that needs to be preserved
may be stored in HDFS rather than on stat/notebooks due to the absence of
backups of the home directories. My concern was that people had a different
understanding about backups and I wanted to clarify.
We (as Analytics team) don't have any good way at the moment to
periodically scan HDFS and home directories across hosts to find PII data
that is retained more than the allowed period of time. The main motivation
is that we'd need to find a way to check a huge amount of files, with
different names and formats, and figure out if the data contained in them
is PII and retained more than X days. It is not an impossible task, but it is
not easy or trivial either; we'd need a lot more staff in my opinion to create
and maintain something similar :) We started recently with the clean up of old
home directories (i.e. belonging to users not active anymore) and we
established a process with SRE to get pinged when a user is offboarded to
verify what data should be kept and what not (I know that both of you are
aware of this since you have been working with us on several tasks, I am
writing it to allow other people to get the context :). This is only a
starting point, I really hope to have something more robust and complete in
the future. In the meantime, I'd say that every user is responsible for the
data that they handle on the Analytics infrastructure, periodically
reviewing it and deleting when necessary. I don't have a specific
guideline/process to suggest, but we can definitely have a chat together
and decide something shared among our teams!
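
To make the "periodically reviewing it" part a bit more concrete, here is a
minimal sketch that lists files under a directory that have not been modified
for more than a given number of days. It only looks at file age: it cannot
tell whether a file contains PII, which is exactly the hard part described
above. The function name and the 90-day figure in the usage note are
illustrative assumptions, not an agreed retention period.

```python
import os
import time


def files_older_than(root, max_age_days, now=None):
    """Yield paths under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path
```

For example, `for p in files_older_than(os.path.expanduser("~"), 90): print(p)`
would print candidates in a home directory for manual review.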

Let me know if this makes sense or not :)

Thanks,

Luca

On Wed, Jul 10, 2019 at 23:15 Nuria Ruiz wrote:

> >I have one question for you: As you allow/encourage for more copies of
> >the files to exist
> To be extra clear, we do not encourage data to be on the notebook hosts
> at all; they have no capacity to either process or host large amounts of
> data. Data that you are working with is best placed in your
> /user/your-username database in Hadoop, so far from encouraging multiple
> copies we are rather encouraging you to keep the data outside the notebook
> machines.
>
> Thanks,
>
> Nuria
>
> On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman 
> wrote:
>
>> I second Leila's question. The issue of how we flag PII data and ensure
>> it's appropriately scrubbed came up in our team meeting yesterday. We're
>> discussing team practices for data/project backups tomorrow and plan to
>> come out with some proposals, at least for the short term.
>>
>> Are there any existing processes or guidelines I should be aware of?
>>
>> Thanks!
>> Kate
>>
>> --
>>
>> Kate Zimmerman (she/they)
>> Head of Product Analytics
>> Wikimedia Foundation
>>
>>
>> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia  wrote:
>>
>>> Hi Luca,
>>>
>>> Thanks for the heads up. Isaac is coordinating a response from the
>>> Research side.
>>>
>>> I have one question for you: As you allow/encourage for more copies of
>>> the files to exist, what is the mechanism you'd like to put in place
>>> for reducing the chances of PII to be copied in new folders that then
>>> will be even harder (for your team) to keep track of? Having an
>>> explicit process/understanding about this will be very helpful.
>>>
>>> Thanks,
>>> Leila
>>>
>>>
>>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano 
>>> wrote:
>>> >
>>> > Hi everybody,
>>> >
>>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
>>> > thought to reach out to everybody to make it clear that all the home
>>> > directories on the stat/notebook nodes are not backed up periodically. They
>>> > run on a software RAID configuration spanning multiple disks of course, so
>>> > we are resilient to a disk failure, but even if unlikely it might happen
>>> > that a host could lose all its data. Please keep this in mind when working
>>> > on important projects and/or handling important data that you care about.
>>> >
>>> > I just added a warning to
>>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
>>> > If you have really important data that is too big to back up, keep in mind
>>> > that you can use your home directory (/user/your-username) on HD
Re: [Wiki-research-l] [Analytics] Firewall on stat100x and notebook100x hosts

2019-07-10 Thread Luca Toscano
Hi Isaac,

On Wed, Jul 10, 2019 at 16:14 Isaac Johnson wrote:

> Hey Luca,
> We discussed this in Research and it all sounds good to us with one
> question below. If something else arises, we'll ping you. Thanks for the
> heads up!
>
> > We assumed that instructing Spark to use a predefined
> > range of random ports was not possible, but in
> > https://phabricator.wikimedia.org/T170826 we discovered that there is a
> > way (that seems to work fine from our tests).
>
> Will we need to change anything in our configuration or will this be
> automatic?
>

On the stat hosts the change is already live and your new Spark sessions
will pick it up automatically; on the notebooks we'll need to restart the
Spark sessions before enabling the firewall. I am planning to contact all
the owners of a Spark session on notebook100[3,4], so if you see an email
from me there will be an action for you to take, otherwise none :)

Luca


[Wiki-research-l] Firewall on stat100x and notebook100x hosts

2019-07-05 Thread Luca Toscano
TL;DR: In https://phabricator.wikimedia.org/T170826 the Analytics team
wants to add base firewall rules to the stat100x and notebook100x hosts. By
default, these will block any traffic that is neither localhost nor
explicitly known. Please let us know in the task if this is a problem for you.

Hi everybody,

the Analytics team has always left the stat100x and notebook100x hosts
without a set of base firewall rules to avoid impacting any
research/test/etc. activity on those hosts. This choice has a lot of
downsides. One of the most problematic is that environments like Python
venvs can install practically any package, and if the owner does not pay
attention to security upgrades then we may have a security problem if the
environment happens to bind to a network port and accept traffic from
anywhere.

One of the biggest problems was Spark: when somebody launches a shell using
Hadoop Yarn (--master yarn), a Driver component is created that needs to
bind to a random port to be able to communicate with the workers created on
the Hadoop cluster. We assumed that instructing Spark to use a predefined
range of random ports was not possible, but in
https://phabricator.wikimedia.org/T170826 we discovered that there is a way
(that seems to work fine from our tests). The other big use case that we
know, Jupyter notebooks, seems to require only localhost traffic flow
without restrictions.
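
For readers curious how a "predefined range of random ports" can look in
practice, here is a hedged sketch. It relies on documented Spark properties:
when `spark.driver.port` is set to a specific value, Spark retries on the next
port up to `spark.port.maxRetries` times, so the driver stays within a known
range. The port numbers below are made-up examples, the helper names are
hypothetical, and this is not necessarily the exact configuration used in the
task.

```python
def spark_port_range_conf(driver_port, block_manager_port, max_retries):
    """Return Spark properties that pin driver-side ports to known ranges."""
    return {
        "spark.driver.port": str(driver_port),
        "spark.driver.blockManager.port": str(block_manager_port),
        "spark.port.maxRetries": str(max_retries),
    }


def port_range(start_port, max_retries):
    """Ports Spark may try: start_port through start_port + max_retries."""
    return range(start_port, start_port + max_retries + 1)


def as_cli_flags(conf):
    """Render the properties as `--conf key=value` arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# Example values only (assumptions, not the production settings).
conf = spark_port_range_conf(12000, 13000, 100)
flags = as_cli_flags(conf)
```

With these example values a firewall rule would only need to allow ports
12000-12100 and 13000-13100 from the Hadoop workers.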

Please let us know in the task if you have a use case that requires your
environment to bind to a network port on stat100x or notebook100x and
accept traffic from other hosts. For example, having a python app that
binds to port 33000 on stat1007 and listens/accepts traffic from other stat
or notebook hosts.
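
To help decide whether a tool falls into this category, the sketch below shows
the difference at the socket level. A listener bound to `127.0.0.1` only
accepts connections from the same host, while one bound to `0.0.0.0` listens
on all interfaces and is the kind of use case we would like to hear about. The
helper name is hypothetical and the snippet is only an illustration, not part
of the firewall change itself.

```python
import socket


def open_listener(host, port=0):
    """Bind a TCP listening socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# Loopback only: reachable from the same host, not from other machines.
local_only = open_listener("127.0.0.1")
local_addr = local_only.getsockname()
local_only.close()

# By contrast, open_listener("0.0.0.0", 33000) would listen on every
# interface on port 33000, like the example described above.
```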

If we don't hear anything, we'll start adding base firewall rules to one
host at a time during the upcoming weeks, tracking our work on the
aforementioned task.

Thanks!

Luca (on behalf of the Analytics team)


[Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-04 Thread Luca Toscano
Hi everybody,

as part of https://phabricator.wikimedia.org/T201165 the Analytics team
thought to reach out to everybody to make it clear that all the home
directories on the stat/notebook nodes are not backed up periodically. They
run on a software RAID configuration spanning multiple disks of course, so
we are resilient to a disk failure, but even if unlikely it might happen
that a host could lose all its data. Please keep this in mind when working
on important projects and/or handling important data that you care about.

I just added a warning to
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
If you have really important data that is too big to back up, keep in mind
that you can use your home directory (/user/your-username) on HDFS (that
replicates data three times across multiple nodes).
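
A minimal sketch of copying a local file into an HDFS home directory, assuming
the standard `hdfs dfs` command-line client is available on the host. The
`backups` subdirectory and the function names are illustrative assumptions,
not an established convention.

```python
import subprocess


def hdfs_destination(username, subdir="backups"):
    """HDFS directory under the user's home, e.g. /user/<username>/backups/."""
    return f"/user/{username}/{subdir}/"


def hdfs_put_command(local_path, username, subdir="backups"):
    """Build (but do not run) the `hdfs dfs -put` command for a local file."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_destination(username, subdir)]


def copy_to_hdfs(local_path, username, subdir="backups"):
    """Create the destination directory if needed, then upload the file."""
    dest = hdfs_destination(username, subdir)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", dest], check=True)
    subprocess.run(hdfs_put_command(local_path, username, subdir), check=True)
```

Building the command separately from running it makes it easy to print and
double check the destination before anything is uploaded.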

Please let us know if you have comments/suggestions/etc.. in the
aforementioned task.

Thanks in advance!

Luca (on behalf of the Analytics team)


[Wiki-research-l] Hive and Oozie unavailable for maintenance on Wed 26th 9 AM CEST

2019-06-25 Thread Luca Toscano
Hi everybody,

as part of https://phabricator.wikimedia.org/T225306 I need to reboot the
an-coord1001 host, which runs the Hive server/metastore and Oozie. Tomorrow,
June 26th, I'll reboot the host at around 9 AM CEST; the maintenance window
should last roughly 10-15 minutes. This means that Hive jobs might
fail during that timeframe, please let me know if it is a problem.

Thanks in advance,

Luca (on behalf of the Analytics team)


[Wiki-research-l] Reboot of stat1004-6-7 and notebook1003-4 happening on May 21st (early EU morning)

2019-05-20 Thread Luca Toscano
Hi everybody,

the stat1004-6-7 and notebook1003-4 hosts will be rebooted tomorrow
morning, May 21st, during the EU morning for security upgrades (Linux
kernel upgrades). Please let me or anybody in the Analytics team know if
this is problematic for your work so we can schedule a better maintenance
window.

Thanks!

Luca (on behalf of the Analytics team)


[Wiki-research-l] DEPRECATION WARNING: dbstore1002 is going to be decommissioned on March 4th

2019-02-22 Thread Luca Toscano
Hi everybody,

the Analytics team has been working with the SRE Data Persistence team
during the last months to replace dbstore1002 with three brand new nodes,
dbstore100[3-5]. We are moving from a single mysql instance (multi-source)
to a multi-instance environment.

For more info please check:
* T210478 and related subtasks.
* https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas

We are planning to decommission the dbstore1002 host (namely stopping mysql
and shutting down the server) on Monday March 4th (EU morning). We have
recently been following up with a lot of users to help them migrate to the
new environment, so we are reasonably sure that this move should not
heavily impact anybody, but if we have left some use case aside please let
us know in https://phabricator.wikimedia.org/T215589. If we don't hear
anything before the March 4th deadline we'll proceed with the host
decommission maintenance.

Luca (on behalf of the Analytics team)


[Wiki-research-l] Unscheduled reboot of stat1007

2018-11-22 Thread Luca Toscano
Hi everybody,

as FYI today I rebooted stat1007 due to unexpected maintenance (an error
from my side) while investigating a Spark2 issue (that is now fixed).
Apologies if this has impacted your work!

Luca


Re: [Wiki-research-l] Upcoming move of users from stat1005 to stat1007

2018-11-20 Thread Luca Toscano
Hi everybody,

final email about this I promise :)

Current status: access to stat1005 is still allowed since some users need
more time to move their work to stat1007, but there will be no more rsyncs
of home directories to stat1007 to avoid impacting people that already
moved to the new host. If you need to copy data over please follow up with
me (or anybody in the Analytics team) in
https://phabricator.wikimedia.org/T205846. Please also keep in mind that
we'll eventually repurpose stat1005 to a new role (hopefully with a working
GPU) and we'll not keep the data on it forever, so please check all your
data on stat1007 as soon as possible (and let us know if you are missing
something).

Thanks a lot!

Luca (on behalf of the Analytics team)

On Wed, Nov 7, 2018 at 07:32 Luca Toscano wrote:

> Hi everybody,
>
> this is a reminder that in a week stat1005 will not be usable anymore.
> Please follow up with me or the Analytics team if you need more time or if
> you have any question :)
>
> Thanks!
>
> Luca
>
> On Wed, Oct 31, 2018 at 16:03 Luca Toscano <ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> as part of https://phabricator.wikimedia.org/T205846 we are going to ask
>> all stat1005 users to move to stat1007 during the next two weeks.
>> The deadline is November 14th, by which time ssh access to stat1005 will be
>> removed.
>>
>> Background: on stat1005 we have a GPU (more details in
>> https://phabricator.wikimedia.org/T148843) that has been sitting there
>> for almost two years, and it would be great to try to make it work during
>> the next months. This effort will require a lot of tests/reboots/etc.. that
>> can of course impact ongoing work of all of you, so we prefer to move
>> everybody to another identical machine beforehand.
>>
>> Please reach out to me or to the analytics team in T205846 or IRC
>> (#wikimedia-analytics on Freenode) if you have any
>> questions/doubts/blockers/etc.; we are not going to enforce the deadline if
>> anybody raises concerns or blockers of course. It would be great to
>> move everybody by Nov 14th but we surely don't want to disrupt any ongoing
>> important work.
>>
>> I am going to update the Wikitech documentation about stat1005 and
>> stat1007 as soon as possible, for the moment keep in mind that stat1007
>> will take over completely everything that stat1005 currently does.
>>
>> I have already copied over all the stat1005 directories to stat1007, and
>> I'll periodically sync them during the following days. If you can't find
>> something important, please add a note in T205846.
>>
>> Thanks a lot and sorry for the trouble,
>>
>> Luca (on behalf of the Analytics team)
>>
>


[Wiki-research-l] Analytics Hadoop Cloudera upgrade scheduled for Nov 12th (Monday) at 14:00 CEST

2018-11-08 Thread Luca Toscano
Hi everybody,

the Analytics team will completely shut down the Hadoop cluster for a couple
of hours on Monday Nov 12th at 14:00 CEST to upgrade the Cloudera
distribution to 5.15 (currently 5.10). No big updates but only a collection
of small/medium fixes that (hopefully) will improve the reliability of our
cluster. For more info, please check
https://phabricator.wikimedia.org/T204759.

This means that tools like HDFS, Hive, Oozie, etc.. will not be available
during the maintenance window, so if this impacts your work please reach
out to us so we can chat about it and possibly re-schedule if needed (in
the task or #wikimedia-analytics on Freenode IRC).

Thanks a lot for the patience, we are trying to do our best to keep all our
systems as up to date as possible :)

Luca (on behalf of the Analytics team)


Re: [Wiki-research-l] Upcoming move of users from stat1005 to stat1007

2018-11-06 Thread Luca Toscano
Hi everybody,

this is a reminder that in a week stat1005 will not be usable anymore.
Please follow up with me or the Analytics team if you need more time or if
you have any question :)

Thanks!

Luca

On Wed, Oct 31, 2018 at 16:03 Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> as part of https://phabricator.wikimedia.org/T205846 we are going to ask
> all stat1005 users to move to stat1007 during the next two weeks.
> The deadline is November 14th, by which time ssh access to stat1005 will be
> removed.
>
> Background: on stat1005 we have a GPU (more details in
> https://phabricator.wikimedia.org/T148843) that has been sitting there
> for almost two years, and it would be great to try to make it work during
> the next months. This effort will require a lot of tests/reboots/etc.. that
> can of course impact ongoing work of all of you, so we prefer to move
> everybody to another identical machine beforehand.
>
> Please reach out to me or to the analytics team in T205846 or IRC
> (#wikimedia-analytics on Freenode) if you have any
> questions/doubts/blockers/etc.; we are not going to enforce the deadline if
> anybody raises concerns or blockers of course. It would be great to
> move everybody by Nov 14th but we surely don't want to disrupt any ongoing
> important work.
>
> I am going to update the Wikitech documentation about stat1005 and
> stat1007 as soon as possible, for the moment keep in mind that stat1007
> will take over completely everything that stat1005 currently does.
>
> I have already copied over all the stat1005 directories to stat1007, and
> I'll periodically sync them during the following days. If you can't find
> something important, please add a note in T205846.
>
> Thanks a lot and sorry for the trouble,
>
> Luca (on behalf of the Analytics team)
>


Re: [Wiki-research-l] [Analytics] Hive and Oozie unavailable for maintenance on Tue Oct 9th 10 AM CEST

2018-10-10 Thread Luca Toscano
Thanks for the note Neil, I should have been clearer! I am also going to fix
all the analytics1003 references in Wikitech later on today :)
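
For anyone with several scripts to update, here is a small sketch of the
hostname swap Neil describes in the note quoted below. The two hostnames come
from that note; the JDBC URL in the example, including port 10000, is a
hypothetical illustration rather than a documented connection string.

```python
OLD_HOST = "analytics1003.eqiad.wmnet"
NEW_HOST = "an-coord1001.eqiad.wmnet"


def update_hive_host(text):
    """Return the updated text and how many references were replaced."""
    count = text.count(OLD_HOST)
    return text.replace(OLD_HOST, NEW_HOST), count


# Hypothetical line from a user script.
example = "jdbc:hive2://analytics1003.eqiad.wmnet:10000/default"
updated, replaced = update_hive_host(example)
```

Returning the replacement count makes it easy to report which files actually
contained the old hostname.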

Luca

On Wed, Oct 10, 2018 at 01:08 Neil Patel Quinn <nqu...@wikimedia.org> wrote:

> Quick note: since the Hive coordinator has moved, you'll have to update
> its url from *analytics1003.eqiad.wmnet *to *an-coord1001.eqiad.wmnet *in
> any scripts you have.
>
> On Fri, 5 Oct 2018 at 09:54, Luca Toscano  wrote:
>
>> Hi everybody,
>>
>> the Analytics team is going to move the Oozie and Hive daemons from the
>> analytics1003 host to an-coord1001 (new host, hardware refresh) on Tuesday
>> Oct 9th at 10 AM CEST. This will require downtime for Oozie and Hive, so
>> some jobs might fail or not work at all during the maintenance. We have
>> allocated two hours for this procedure but it should require less time.
>>
>> Tracking task: T205509
>>
>> As always, please follow up with me or anybody in the analytics team for
>> clarifications and/or comments (via Phabricator or IRC Freenode
>> #wikimedia-analytics).
>>
>> Thanks for the patience!
>>
>> Luca (on behalf of the Analytics team)
>>
>
>
> --
> Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
> (he/him/his)
> product analyst, Wikimedia Foundation
> ___
> Analytics mailing list
> analyt...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


[Wiki-research-l] Hive and Oozie unavailable for maintenance on Tue Oct 9th 10 AM CEST

2018-10-05 Thread Luca Toscano
Hi everybody,

the Analytics team is going to move the Oozie and Hive daemons from the
analytics1003 host to an-coord1001 (new host, hardware refresh) on Tuesday
Oct 9th at 10 AM CEST. This will require downtime for Oozie and Hive, so
some jobs might fail or not work at all during the maintenance. We have
allocated two hours for this procedure but it should require less time.

Tracking task: T205509

As always, please follow up with me or anybody in the analytics team for
clarifications and/or comments (via Phabricator or IRC Freenode
#wikimedia-analytics).

Thanks for the patience!

Luca (on behalf of the Analytics team)


[Wiki-research-l] Brief unavailability scheduled for the Event Logging database replica

2018-09-26 Thread Luca Toscano
Hi everybody,

Tomorrow Sept 27th at 10 CEST db1108 (alias analytics-slave) will be down
for a brief (max 30 mins) maintenance (MariaDB and Linux kernel upgrade).
This means that the log database will not be available for querying during
this time frame. Please reach out to me or to the Analytics team if this
impacts your work (elukey or #wikimedia-analytics on IRC Freenode).

Thanks!

Luca


Re: [Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th

2018-09-25 Thread Luca Toscano
Hi everybody,

maintenance just completed; it took a bit longer than planned, but no issues
have been registered so far. We weren't able to swap analytics1003 (where
Oozie/Hive/etc. run) in this maintenance window, so I'll likely send another
email next week to schedule downtime (not for the entire cluster but mostly
for Hive/Oozie only), please don't hate me :)

If you see any issue please contact us (via
https://phabricator.wikimedia.org/T203635 or IRC Freenode
#wikimedia-analytics).

Thanks!

Luca

On Mon, Sep 24, 2018 at 16:50, Luca Toscano <
ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> this is a reminder that the maintenance will happen tomorrow (Tue 25th, 10
> CEST).
>
> Luca
>
> On Fri, Sep 14, 2018 at 12:13, Luca Toscano <
> ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> the Analytics team needs to replace the Hadoop master node hosts
>> (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of
>> a regular hardware refresh (hosts getting out of warranty). In order to do
>> things safely we decided to proceed with a full cluster shutdown on Sept
>> 25th at 10 AM CEST. The maintenance should last a couple of hours, and
>> there shouldn't be any noticeable change for the Hadoop users.
>>
>> This means that during the maintenance:
>> - HDFS will not be available
>> - Yarn will not be available
>> - Hive/Spark (cluster mode)/Oozie/etc. will not be available
>>
>> Please let us know if this impacts your work in
>> https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics
>> Freenode IRC channel.
>>
>> Thanks a lot!
>>
>> Luca
>>
>


Re: [Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th

2018-09-24 Thread Luca Toscano
Hi everybody,

this is a reminder that the maintenance will happen tomorrow (Tue 25th, 10
CEST).

Luca

On Fri, Sep 14, 2018 at 12:13, Luca Toscano <
ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> the Analytics team needs to replace the Hadoop master node hosts
> (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of
> a regular hardware refresh (hosts getting out of warranty). In order to do
> things safely we decided to proceed with a full cluster shutdown on Sept
> 25th at 10 AM CEST. The maintenance should last a couple of hours, and
> there shouldn't be any noticeable change for the Hadoop users.
>
> This means that during the maintenance:
> - HDFS will not be available
> - Yarn will not be available
> - Hive/Spark (cluster mode)/Oozie/etc. will not be available
>
> Please let us know if this impacts your work in
> https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics
> Freenode IRC channel.
>
> Thanks a lot!
>
> Luca
>


[Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th

2018-09-14 Thread Luca Toscano
Hi everybody,

the Analytics team needs to replace the Hadoop master node hosts
(analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of
a regular hardware refresh (hosts getting out of warranty). In order to do
things safely we decided to proceed with a full cluster shutdown on Sept
25th at 10 AM CEST. The maintenance should last a couple of hours, and
there shouldn't be any noticeable change for the Hadoop users.

This means that during the maintenance:
- HDFS will not be available
- Yarn will not be available
- Hive/Spark (cluster mode)/Oozie/etc. will not be available

Please let us know if this impacts your work in
https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics
Freenode IRC channel.

Thanks a lot!

Luca


[Wiki-research-l] Upcoming reboot of stat* and notebook* hosts - Sept 13th

2018-09-11 Thread Luca Toscano
Hi everybody,

on Thursday Sept 13th (EU morning) I am planning to reboot the stat hosts
(stat1004, stat1005 and stat1006) and the notebook hosts (notebook1003,
notebook1004) for Linux kernel upgrades. Please let me know if this impacts
your work in https://phabricator.wikimedia.org/T203165 or on IRC (elukey -
#wikimedia-analytics).

Thanks!

Luca


Re: [Wiki-research-l] Kafka Main Eqiad outage and failover of Eventbus/Eventstreams to codfw

2018-07-12 Thread Luca Toscano
[Adding some other mailing lists in Cc]

Hi everybody,

as a lot of you have probably already noticed yesterday reading the
operations@ mailing list, we had an outage of the Kafka Main eqiad cluster
that forced us to switch the Eventbus and Eventstreams services to codfw.

All the precise timings will be listed in
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad,
but for a quick glimpse:

2018-07-11 17:00 UTC - Eventbus service switched to codfw
2018-07-11 18:44 UTC - Eventstreams service switched to codfw

We are going to switch those services back to eqiad within the next couple
of hours. The consumers of the Eventstreams service may get some failures
or data drops, apologies in advance for the trouble.

Cheers,

Luca

On Thu, Jul 12, 2018 at 00:00, Luca Toscano <
ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> as you might have seen in the operations channel on IRC, the Kafka Main
> Eqiad cluster (kafka100[1-3].eqiad.wmnet) suffered a long outage due to new
> topics created with names that were too long (causing filesystem operation
> issues, etc.). I'll update this email thread tomorrow EU time with more
> details, tasks, precise root cause, etc., but the important bit to know is
> that Eventbus and Eventstreams have been failed over to the Kafka Main Codfw
> cluster. This should be transparent to everybody but please let us know
> otherwise.
>
> Thanks for the patience!
>
> (a very sleepy :) Luca
>
>


[Wiki-research-l] Upcoming reboot of stat100[56] and analytics1003 (Hive, Oozie) for kernel security upgrades

2018-03-06 Thread Luca Toscano
Hi everybody,

tomorrow EU morning (Wed Mar 7th) I need to reboot stat100[56] and
analytics1003 for kernel security updates. Hive and Oozie (Analytics Hadoop
cluster) will not be available for a (hopefully) brief period of time.
Please let me know if you are doing important work that cannot be stopped,
and the maintenance will be postponed accordingly :)

Tracking task: https://phabricator.wikimedia.org/T188594

Thanks!

Luca (on behalf of the Analytics team)


Re: [Wiki-research-l] Analytics Hadoop cluster maintenance announce for Feb 6th

2018-02-05 Thread Luca Toscano
Hi everybody,

just a reminder that the upgrade is scheduled for tomorrow EU/CET morning.

Luca

2018-01-23 17:58 GMT+01:00 Luca Toscano <ltosc...@wikimedia.org>:

> *TL;DR*: The Analytics Hadoop cluster will be completely down for max 2h
> on *Feb 6th* (EU/CET morning) to upgrade all the daemons to Java 8.
>
> Hi everybody,
>
> we are planning to upgrade the Analytics Hadoop cluster to Java 8 on *Feb
> 6th* (EU/CET morning) for https://phabricator.wikimedia.org/T166248.
> Sadly we can't do a rolling upgrade of all the JVM-based Hadoop daemons
> since the distribution that we use (Cloudera) recommends performing the
> upgrade only after a complete cluster shutdown. This means that for a
> couple of hours (hopefully a lot less) all the Hadoop-based services will
> be unavailable (Hive, Oozie, HDFS, etc.).
>
> We have tested the new configuration in labs and all the regular Analytics
> jobs seem to work correctly, so we don't expect major issues after the
> upgrade, but if you have any questions or concerns please follow up in the
> task.
>
> Thanks!
>
> Luca and Andrew (on behalf of the Analytics team)
>


[Wiki-research-l] Analytics Hadoop cluster maintenance announce for Feb 6th

2018-01-23 Thread Luca Toscano
*TL;DR*: The Analytics Hadoop cluster will be completely down for max 2h
on *Feb 6th* (EU/CET morning) to upgrade all the daemons to Java 8.

Hi everybody,

we are planning to upgrade the Analytics Hadoop cluster to Java 8 on *Feb
6th* (EU/CET morning) for https://phabricator.wikimedia.org/T166248.
Sadly we can't do a rolling upgrade of all the JVM-based Hadoop daemons
since the distribution that we use (Cloudera) recommends performing the
upgrade only after a complete cluster shutdown. This means that for a
couple of hours (hopefully a lot less) all the Hadoop-based services will
be unavailable (Hive, Oozie, HDFS, etc.).

We have tested the new configuration in labs and all the regular Analytics
jobs seem to work correctly, so we don't expect major issues after the
upgrade, but if you have any questions or concerns please follow up in the
task.

Thanks!

Luca and Andrew (on behalf of the Analytics team)


[Wiki-research-l] Alter tables for the log database on the analytics slaves

2017-06-27 Thread Luca Toscano
Hi everybody,

the Analytics team is working on some alter tables to the Eventlogging
'log' database on analytics-store (dbstore1002) and analytics-slave
(db1047) as part of https://phabricator.wikimedia.org/T167162.

The list of alter tables is the following:
https://phabricator.wikimedia.org/P5570

This should be a transparent change, but I thought it would be better
to keep all of you informed in case of unintended regressions or
side-effects. The context of the alter tables is in T167162 but the TL;DR
is that we need nullable attributes across all the EL tables (except fields
like id, uuid and timestamp) to be able to sanitize data with our new
eventlogging_cleaner script (https://phabricator.wikimedia.org/T156933).
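
To illustrate why nullable columns matter here, the following is a minimal
sketch and not the eventlogging_cleaner script itself. It assumes that
sanitizing a field means setting it to NULL, and it uses Python's built-in
sqlite3 module as a stand-in for the MariaDB hosts. The table name Example_1
and the event_userAgent column are hypothetical.

```python
import sqlite3


def can_null_out(column_definition: str) -> bool:
    """Return True if a sanitizing UPDATE ... SET field = NULL succeeds.

    column_definition is the type/constraint part of a hypothetical
    event_userAgent column, e.g. "TEXT" or "TEXT NOT NULL".
    """
    # In-memory SQLite database, used only as a stand-in for MariaDB.
    conn = sqlite3.connect(":memory:")
    try:
        # id, uuid and timestamp stay NOT NULL, as in the announcement.
        conn.execute(
            "CREATE TABLE Example_1 ("
            "id INTEGER PRIMARY KEY, "
            "uuid TEXT NOT NULL, "
            "timestamp TEXT NOT NULL, "
            f"event_userAgent {column_definition})"
        )
        conn.execute(
            "INSERT INTO Example_1 (uuid, timestamp, event_userAgent) "
            "VALUES (?, ?, ?)",
            ("abc", "20170627000000", "Mozilla/5.0"),
        )
        try:
            # Assumed sanitization step: blank out the sensitive field.
            conn.execute("UPDATE Example_1 SET event_userAgent = NULL")
        except sqlite3.IntegrityError:
            # A NOT NULL constraint rejects the update.
            return False
        return True
    finally:
        conn.close()


if __name__ == "__main__":
    print(can_null_out("TEXT NOT NULL"))
    print(can_null_out("TEXT"))
```

With a NOT NULL column the update is rejected, while the nullable variant
accepts it, which is the situation the alter tables are meant to enable.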

Please let me know if you encounter any issue with this change.

Thanks in advance!

Luca


[Wiki-research-l] Upcoming reboots of stat100[234] and most of the Analytics hosts (Kafka and Hadoop)

2016-10-20 Thread Luca Toscano
Hi everybody,

due to a severe kernel vulnerability
(https://access.redhat.com/security/vulnerabilities/2706661) I need to
reboot the stat1002, stat1003 and stat1004 hosts to install the new kernel.
The reboots are scheduled for 9 AM CEST tomorrow (Oct 21st), please follow
up with me or anybody in the Analytics team if you have ongoing work that
can't be stopped.

The Analytics Hadoop and Kafka clusters will be rebooted too during the
next few hours. Even if this maintenance shouldn't cause any major issue, you
might experience some service degradation. More up-to-date information will
be available on IRC in the analytics and operations channels.

Thanks and apologies in advance for the trouble!

Luca