Re: [Wiki-research-l] Brief shutdown of stat1007 for maintenance - Thu Dec 12th 15:30 CET
Hi again, the maintenance has been postponed to tomorrow (Dec 18th) at around 15:00 CET. Please let me know if this is a problem for you. The whole maintenance shouldn't last more than 30 mins :) Thanks! Luca

On Wed, Dec 11, 2019 at 08:36, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> the Analytics team is going to shut down stat1007 for a few minutes on Thu
> Dec 12th at around 15:30 CET to check whether there is space for a GPU in
> the server's chassis. Please let us know if this will impact your work (so
> we can arrange a different maintenance window).
>
> Thanks!
>
> Luca

___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)
Hi everybody, the day has finally come: we'll start the procedure at around 13:00 CET. For any issue please reach out to the Analytics team on Freenode (#wikimedia-analytics) or https://phabricator.wikimedia.org/T238560. We hope to have the cluster back up and running in a couple of hours, but it might take longer if we encounter unexpected issues. Thanks for your patience! Luca

On Tue, Nov 26, 2019 at 09:46, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> to avoid any conflict with the start of the Fundraising season, we moved
> the Kerberos maintenance to Dec 16th at 10 AM CET (more info in
> https://phabricator.wikimedia.org/T238560).
>
> If you use Hadoop and you haven't requested credentials yet, please do so
> following
> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>
> On Mon, Nov 18, 2019 at 16:11, Luca Toscano <ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> the Analytics team is going to enable Kerberos authentication for Hadoop
>> on Monday December 2nd. The procedure will start around 10 AM CET and
>> will hopefully last 3-4 hours, but since this is an invasive change it
>> might last longer. If you have anything important that requires Hadoop on
>> this date, please let us know in advance.
>>
>> The most visible change from the user's point of view is the introduction
>> of a new account/password for using the Hadoop services (like
>> Hive/HDFS/Spark/Oozie). We created a user guide about what will change
>> with Kerberos:
>> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide.
>> There is also an open task to track any doubts, questions, or special use
>> cases during the next two weeks: https://phabricator.wikimedia.org/T238560.
>>
>> Feel free to reach out on IRC (#wikimedia-analytics on Freenode) too!
>>
>> Thanks!
>>
>> Luca (on behalf of the Analytics team)
[Wiki-research-l] Brief shutdown of stat1007 for maintenance - Thu Dec 12th 15:30 CET
Hi everybody, the Analytics team is going to shut down stat1007 for a few minutes on Thu Dec 12th at around 15:30 CET to check whether there is space for a GPU in the server's chassis. Please let us know if this will impact your work (so we can arrange a different maintenance window). Thanks! Luca
Re: [Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)
Hi everybody, to avoid any conflict with the start of the Fundraising season, we moved the Kerberos maintenance to Dec 16th at 10 AM CET (more info in https://phabricator.wikimedia.org/T238560). If you use Hadoop and you haven't requested credentials yet, please do so following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos Thanks! Luca (on behalf of the Analytics team)

On Mon, Nov 18, 2019 at 16:11, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> the Analytics team is going to enable Kerberos authentication for Hadoop
> on Monday December 2nd. The procedure will start around 10 AM CET and
> will hopefully last 3-4 hours, but since this is an invasive change it
> might last longer. If you have anything important that requires Hadoop on
> this date, please let us know in advance.
>
> The most visible change from the user's point of view is the introduction
> of a new account/password for using the Hadoop services (like
> Hive/HDFS/Spark/Oozie). We created a user guide about what will change
> with Kerberos:
> https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide.
> There is also an open task to track any doubts, questions, or special use
> cases during the next two weeks: https://phabricator.wikimedia.org/T238560.
>
> Feel free to reach out on IRC (#wikimedia-analytics on Freenode) too!
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
[Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)
Hi everybody, the Analytics team is going to enable Kerberos authentication for Hadoop on Monday December 2nd. The procedure will start around 10 AM CET and will hopefully last 3-4 hours, but since this is an invasive change it might last longer. If you have anything important that requires Hadoop on this date, please let us know in advance.

The most visible change from the user's point of view is the introduction of a new account/password for using the Hadoop services (like Hive/HDFS/Spark/Oozie). We created a user guide about what will change with Kerberos: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide. There is also an open task to track any doubts, questions, or special use cases during the next two weeks: https://phabricator.wikimedia.org/T238560.

Feel free to reach out on IRC (#wikimedia-analytics on Freenode) too!

Thanks!

Luca (on behalf of the Analytics team)
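For readers wondering what the day-to-day change looks like, the flow is roughly the following sketch. The commands are standard MIT Kerberos and Hadoop CLI tools; `your-username` is a placeholder, and the exact realm and prompts are whatever the linked user guide specifies:

```shell
# After the change, authenticate before using Hive/HDFS/Spark/Oozie.
# A ticket is valid for a limited time, so this is repeated when it expires.
kinit                               # prompts for your Kerberos password
klist                               # verify that you now hold a valid ticket
hdfs dfs -ls /user/your-username    # Hadoop commands then work as before
```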
[Wiki-research-l] Maintenance window for the Hadoop cluster - Tue Oct 15th 14:30 CEST - 15:30 CEST
Hi everybody, the Analytics team is going to stop the HDFS and Yarn services for a (hopefully) brief time window tomorrow, Tue Oct 15th, from 14:30 to 15:30 CEST. We are going to swap the Zookeeper cluster from the one currently used by all the Kafka production services to a dedicated one within the Analytics VLAN. More details in https://phabricator.wikimedia.org/T217057

Side effects: the move should be transparent to all users, except for the ones relying on history in Yarn (for example, checking it via yarn.wikimedia.org). The history is in fact stored in Zookeeper, and to keep things simple we are not copying znodes over to the new cluster. Please let us know if this impacts you in any way.

As usual, if the time window affects your work please let us know and we'll find a new maintenance window. Thanks!

Luca (on behalf of the Analytics team)
[Wiki-research-l] Python 2 is going EOL on January 1st - Do you need Python 2 packages in Analytics?
Hi everybody, as https://www.python.org/doc/sunset-python-2/ says, Python 2 finally goes EOL on January 1st. We (the Analytics team) have a lot of packages deployed on stat/notebook/hadoop hosts via puppet that should be removed, but before doing so we need to know if any of you is currently using a Python-2-only environment to work/research/test/etc. If so, please comment in the following task so we can discuss your use case and possibly find a Python 3 solution: https://phabricator.wikimedia.org/T204737

In the task we are going to add info about common packages that we know of (keras, tensorflow, pytorch, etc.) to help you migrate to Python 3 as quickly and painlessly as possible, so if you are interested please subscribe to the task. Thanks in advance!

Luca (on behalf of the Analytics team)
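A quick way to check whether an environment you use is Python-2-only is to ask each interpreter on your PATH for its version (a generic sketch, not a WMF-specific tool; run it inside an activated venv to check that venv):

```shell
# Print the version each Python interpreter on the PATH resolves to.
# A venv whose "python" reports 2.x is affected by the EOL cleanup.
for py in python python2 python3; do
  if command -v "$py" >/dev/null 2>&1; then
    printf '%s -> ' "$py"
    "$py" -c 'import sys; sys.stdout.write(sys.version.split()[0] + "\n")'
  fi
done
```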
Re: [Wiki-research-l] Urgent maintenance to an-coord1001 requires a brief stop of Oozie/Hive/Spark/etc..
Maintenance completed! Luca

On Mon, Jul 15, 2019 at 16:22, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> due to https://phabricator.wikimedia.org/T227941 we need to take down
> Oozie/Hive/etc.. on an-coord1001. The maintenance should not last long,
> but if you have any issue please reach out to us on IRC
> (#wikimedia-analytics on Freenode).
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
[Wiki-research-l] Urgent maintenance to an-coord1001 requires a brief stop of Oozie/Hive/Spark/etc..
Hi everybody, due to https://phabricator.wikimedia.org/T227941 we need to take down Oozie/Hive/etc.. on an-coord1001. The maintenance should not last long, but if you have any issue please reach out to us on IRC (#wikimedia-analytics on Freenode). Thanks! Luca (on behalf of the Analytics team)
Re: [Wiki-research-l] [Analytics] Analytics clients (stat/notebook hosts) and backups of home directories
Hi Leila and Kate, adding a few words after Nuria's email to clarify my original intentions. My point was that any important and vital file that needs to be preserved should be stored in HDFS rather than on the stat/notebook hosts, due to the absence of backups of the home directories. My concern was that people had a different understanding about backups and I wanted to clarify.

We (the Analytics team) don't currently have any good way to periodically scan HDFS and the home directories across hosts to find PII data that is retained longer than the allowed period of time. The main reason is that we'd need a way to check a huge number of files, with different names and formats, and figure out whether the data contained in them is PII and retained more than X days. It is not an impossible task, but it is not easy or trivial either; in my opinion we'd need a lot more staff to create and maintain something like that :)

We recently started cleaning up old home directories (i.e. those belonging to users who are no longer active) and we established a process with SRE to get pinged when a user is offboarded, to verify what data should be kept and what not (I know that both of you are aware of this since you have been working with us on several tasks; I am writing it so other people get the context :). This is only a starting point; I really hope to have something more robust and complete in the future. In the meantime, I'd say that every user is responsible for the data that they handle on the Analytics infrastructure, periodically reviewing it and deleting it when necessary. I don't have a specific guideline/process to suggest, but we can definitely have a chat together and decide on something shared among our teams!
Let me know if this makes sense or not :) Thanks, Luca

On Wed, Jul 10, 2019 at 23:15, Nuria Ruiz wrote:

> >I have one question for you: As you allow/encourage for more copies of
> >the files to exist
> To be extra clear, we do not encourage data to be on the notebook hosts
> at all; they have no capacity to either process or host large amounts of
> data. Data that you are working with is best placed in your
> /user/your-username database in Hadoop, so far from encouraging multiple
> copies we are rather encouraging you to keep the data outside the
> notebook machines.
>
> Thanks,
>
> Nuria
>
> On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman wrote:
>
>> I second Leila's question. The issue of how we flag PII data and ensure
>> it's appropriately scrubbed came up in our team meeting yesterday. We're
>> discussing team practices for data/project backups tomorrow and plan to
>> come out with some proposals, at least for the short term.
>>
>> Are there any existing processes or guidelines I should be aware of?
>>
>> Thanks!
>> Kate
>>
>> --
>>
>> Kate Zimmerman (she/they)
>> Head of Product Analytics
>> Wikimedia Foundation
>>
>>
>> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia wrote:
>>
>>> Hi Luca,
>>>
>>> Thanks for the heads up. Isaac is coordinating a response from the
>>> Research side.
>>>
>>> I have one question for you: As you allow/encourage more copies of
>>> the files to exist, what is the mechanism you'd like to put in place
>>> to reduce the chances of PII being copied into new folders that will
>>> then be even harder (for your team) to keep track of? Having an
>>> explicit process/understanding about this will be very helpful.
>>>
>>> Thanks,
>>> Leila
>>>
>>>
>>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano wrote:
>>> >
>>> > Hi everybody,
>>> >
>>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics
>>> > team thought to reach out to everybody to make it clear that the home
>>> > directories on the stat/notebook nodes are not backed up periodically.
>>> > They run on a software RAID configuration spanning multiple disks of
>>> > course, so we are resilient to a disk failure, but, even if unlikely,
>>> > it might happen that a host loses all its data. Please keep this in
>>> > mind when working on important projects and/or handling important
>>> > data that you care about.
>>> >
>>> > I just added a warning to
>>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
>>> > If you have really important data that is too big to back up, keep in
>>> > mind that you can use your home directory (/user/your-username) on HD
Re: [Wiki-research-l] [Analytics] Firewall on stat100x and notebook100x hosts
Hi Isaac,

On Wed, Jul 10, 2019 at 16:14, Isaac Johnson wrote:

> Hey Luca,
> We discussed this in Research and it all sounds good to us, with one
> question below. If something else arises, we'll ping you. Thanks for the
> heads up!
>
> > We assumed that instructing Spark to use a predefined
> > range of random ports was not possible, but in
> > https://phabricator.wikimedia.org/T170826 we discovered that there is a
> > way (that seems to work fine from our tests).
>
> Will we need to change anything in our configuration or will this be
> automatic?

On the stat hosts the change is already live and your new Spark sessions will pick it up automatically; on the notebooks we'll need to restart the Spark sessions before enabling the firewall. I am planning to contact all the owners of a Spark session on notebook100[3,4], so if anybody sees an email from me there will be an action to take, otherwise none :)

Luca
[Wiki-research-l] Firewall on stat100x and notebook100x hosts
TL;DR: In https://phabricator.wikimedia.org/T170826 the Analytics team wants to add base firewall rules to the stat100x and notebook100x hosts, which will cause any traffic that is not from localhost or from a known source to be blocked by default. Please let us know in the task if this is a problem for you.

Hi everybody, the Analytics team has always left the stat100x and notebook100x hosts without a set of base firewall rules, to avoid impacting any research/test/etc. activity on those hosts. This choice has a lot of downsides. One of the most problematic is that environments like Python venvs can install potentially any package, and if the owner does not pay attention to security upgrades we may end up with a security problem if the environment happens to bind to a network port and accept traffic from anywhere.

One of the biggest problems was Spark: when somebody launches a shell using Hadoop Yarn (--master yarn), a Driver component is created that needs to bind to a random port to communicate with the workers created on the Hadoop cluster. We assumed that instructing Spark to use a predefined range of random ports was not possible, but in https://phabricator.wikimedia.org/T170826 we discovered that there is a way (that seems to work fine in our tests). The other big use case that we know of, Jupyter notebooks, seems to require only unrestricted localhost traffic.

Please let us know in the task if you have a use case that requires your environment to bind to a network port on stat100x or notebook100x and accept traffic from other hosts. For example, a Python app that binds to port 33000 on stat1007 and listens for/accepts traffic from other stat or notebook hosts. If we don't hear anything, we'll start adding base firewall rules to one host at a time during the upcoming weeks, tracking our work in the aforementioned task. Thanks!
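As a rough illustration of the kind of change discussed in T170826: Spark drivers can be confined to a predictable port range via configuration. The property names below are standard Spark configuration keys; the concrete port numbers are made up for the example (the real values live in the Analytics puppet config):

```shell
# Launch a Yarn-backed Spark shell whose driver only binds within a known
# range, so firewall rules can allow exactly those ports instead of "any".
spark2-shell --master yarn \
  --conf spark.driver.port=12000 \
  --conf spark.driver.blockManager.port=13000 \
  --conf spark.port.maxRetries=99   # i.e. try 12000-12099 / 13000-13099
```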
Luca (on behalf of the Analytics team)
[Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories
Hi everybody, as part of https://phabricator.wikimedia.org/T201165 the Analytics team thought to reach out to everybody to make it clear that the home directories on the stat/notebook nodes are not backed up periodically. They run on a software RAID configuration spanning multiple disks of course, so we are resilient to a disk failure, but, even if unlikely, it might happen that a host loses all its data. Please keep this in mind when working on important projects and/or handling important data that you care about.

I just added a warning to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients. If you have really important data that is too big to back up, keep in mind that you can use your home directory (/user/your-username) on HDFS (which replicates data three times across multiple nodes). Please let us know if you have comments/suggestions/etc. in the aforementioned task. Thanks in advance!

Luca (on behalf of the Analytics team)
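Parking important files in the replicated HDFS home directory looks roughly like this sketch; the commands are the standard Hadoop filesystem shell, and the local directory name is a hypothetical example (`your-username` is a placeholder):

```shell
# Copy a local working directory from a stat/notebook host into the
# replicated HDFS home directory, then verify that it arrived.
hdfs dfs -mkdir -p /user/your-username/backups
hdfs dfs -put -f ~/important-project /user/your-username/backups/
hdfs dfs -ls /user/your-username/backups
```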
[Wiki-research-l] Hive and Oozie unavailable for maintenance on Wed 26th 9 AM CEST
Hi everybody, as part of https://phabricator.wikimedia.org/T225306 I need to reboot the an-coord1001 host, which runs the Hive server/metastore and Oozie. Tomorrow, June 26th, I'll reboot the host at around 9 AM CEST; the maintenance window should last roughly 10-15 minutes. This means that Hive jobs might fail during that timeframe; please let me know if this is a problem. Thanks in advance, Luca (on behalf of the Analytics team)
[Wiki-research-l] Reboot of stat1004/6/7 and notebook1003/4 happening on May 21st (early EU morning)
Hi everybody, the stat1004, stat1006, stat1007 and notebook1003, notebook1004 hosts will be rebooted tomorrow morning, May 21st, during the EU morning for security upgrades (Linux kernel upgrades). Please let me or anybody on the Analytics team know if this is problematic for your work, so we can schedule a better maintenance window. Thanks! Luca (on behalf of the Analytics team)
[Wiki-research-l] DEPRECATION WARNING: dbstore1002 is going to be decommissioned on March 4th
Hi everybody, the Analytics team has been working with the SRE Data Persistence team over the last few months to replace dbstore1002 with three brand new nodes, dbstore100[3-5]. We are moving from a single MariaDB instance (multi-source) to a multi-instance environment. For more info please check:

* T210478 and related subtasks.
* https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas

We are planning to decommission the dbstore1002 host (namely, stopping MariaDB and shutting down the server) on Monday March 4th (EU morning). We have recently been following up with a lot of users to help them migrate to the new environment, so we are reasonably sure that this move should not heavily impact anybody, but if we have left some use case aside please let us know in https://phabricator.wikimedia.org/T215589. If we don't hear anything before the March 4th deadline, we'll proceed with decommissioning the host.

Luca (on behalf of the Analytics team)
[Wiki-research-l] Unscheduled reboot of stat1007
Hi everybody, as an FYI: today I rebooted stat1007 for unexpected maintenance (an error on my side) while investigating a Spark2 issue (now fixed). Apologies if this has impacted your work! Luca
Re: [Wiki-research-l] Upcoming move of users from stat1005 to stat1007
Hi everybody, final email about this, I promise :) Current status: access to stat1005 is still allowed, since some users need more time to move their work to stat1007, but there will be no more rsyncs of home directories to stat1007, to avoid impacting people who have already moved to the new host. If you need to copy data over, please follow up with me (or anybody on the Analytics team) in https://phabricator.wikimedia.org/T205846. Please also keep in mind that we'll eventually repurpose stat1005 for a new role (hopefully with a working GPU) and we'll not keep its data forever, so please check all your data on stat1007 as soon as possible (and let us know if you are missing something). Thanks a lot! Luca (on behalf of the Analytics team)

On Wed, Nov 7, 2018 at 07:32, Luca Toscano wrote:

> Hi everybody,
>
> this is a reminder that in a week stat1005 will no longer be usable.
> Please follow up with me or the Analytics team if you need more time or
> if you have any questions :)
>
> Thanks!
>
> Luca
>
> On Wed, Oct 31, 2018 at 16:03, Luca Toscano <ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> as part of https://phabricator.wikimedia.org/T205846 we are going to ask
>> all of stat1005's users to move to stat1007 during the next two weeks.
>> The deadline is November 14th, by which time ssh access to stat1005 will
>> be removed.
>>
>> Background: on stat1005 we have a GPU (more details in
>> https://phabricator.wikimedia.org/T148843) that has been sitting there
>> for almost two years, and it would be great to try to make it work during
>> the next months. This effort will require a lot of tests/reboots/etc..
>> that can of course impact the ongoing work of all of you, so we prefer to
>> move everybody to another identical machine beforehand.
>>
>> Please reach out to me or to the Analytics team in T205846 or on IRC
>> (#wikimedia-analytics on Freenode) if you have any
>> questions/doubts/blockers/etc.. We are not going to enforce the deadline
>> if anybody raises concerns or blockers, of course. It would be great to
>> move everybody by Nov 14th, but we surely don't want to disrupt any
>> ongoing important work.
>>
>> I am going to update the Wikitech documentation about stat1005 and
>> stat1007 as soon as possible; for the moment keep in mind that stat1007
>> will completely take over everything that stat1005 currently does.
>>
>> I have already copied all the stat1005 directories over to stat1007, and
>> I'll periodically sync them during the following days. If you find
>> anything important missing, please add a note in T205846.
>>
>> Thanks a lot and sorry for the trouble,
>>
>> Luca (on behalf of the Analytics team)
[Wiki-research-l] Analytics Hadoop Cloudera upgrade scheduled for Nov 12th (Monday) at 14:00 CET
Hi everybody, the Analytics team will completely shut down the Hadoop cluster for a couple of hours on Monday Nov 12th at 14:00 CET to upgrade the Cloudera distribution to 5.15 (currently 5.10). No big updates, just a collection of small and medium fixes that will (hopefully) improve the reliability of our cluster. For more info, please check https://phabricator.wikimedia.org/T204759.

This means that tools like HDFS, Hive, Oozie, etc.. will not be available during the maintenance window, so if this impacts your work please reach out to us so we can chat about it and possibly re-schedule if needed (in the task or in #wikimedia-analytics on Freenode IRC). Thanks a lot for your patience, we are trying to do our best to keep all our systems as up to date as possible :)

Luca (on behalf of the Analytics team)
Re: [Wiki-research-l] Upcoming move of users from stat1005 to stat1007
Hi everybody, this is a reminder that in a week stat1005 will no longer be usable. Please follow up with me or the Analytics team if you need more time or if you have any questions :) Thanks! Luca

On Wed, Oct 31, 2018 at 16:03, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> as part of https://phabricator.wikimedia.org/T205846 we are going to ask
> all of stat1005's users to move to stat1007 during the next two weeks.
> The deadline is November 14th, by which time ssh access to stat1005 will
> be removed.
>
> Background: on stat1005 we have a GPU (more details in
> https://phabricator.wikimedia.org/T148843) that has been sitting there
> for almost two years, and it would be great to try to make it work during
> the next months. This effort will require a lot of tests/reboots/etc..
> that can of course impact the ongoing work of all of you, so we prefer to
> move everybody to another identical machine beforehand.
>
> Please reach out to me or to the Analytics team in T205846 or on IRC
> (#wikimedia-analytics on Freenode) if you have any
> questions/doubts/blockers/etc.. We are not going to enforce the deadline
> if anybody raises concerns or blockers, of course. It would be great to
> move everybody by Nov 14th, but we surely don't want to disrupt any
> ongoing important work.
>
> I am going to update the Wikitech documentation about stat1005 and
> stat1007 as soon as possible; for the moment keep in mind that stat1007
> will completely take over everything that stat1005 currently does.
>
> I have already copied all the stat1005 directories over to stat1007, and
> I'll periodically sync them during the following days. If you find
> anything important missing, please add a note in T205846.
>
> Thanks a lot and sorry for the trouble,
>
> Luca (on behalf of the Analytics team)
Re: [Wiki-research-l] [Analytics] Hive and Oozie unavailable for maintenance on Tue Oct 9th 10 AM CEST
Thanks for the note Neil, I should have been clearer! I am also going to fix all the analytics1003 references on Wikitech later today :) Luca

On Wed, Oct 10, 2018 at 01:08, Neil Patel Quinn <nqu...@wikimedia.org> wrote:

> Quick note: since the Hive coordinator has moved, you'll have to update
> its URL from analytics1003.eqiad.wmnet to an-coord1001.eqiad.wmnet in
> any scripts you have.
>
> On Fri, 5 Oct 2018 at 09:54, Luca Toscano wrote:
>
>> Hi everybody,
>>
>> the Analytics team is going to move the Oozie and Hive daemons from the
>> analytics1003 host to an-coord1001 (a new host, hardware refresh) on
>> Tuesday Oct 9th at 10 AM CEST. This will require downtime for Oozie and
>> Hive, so some jobs might fail or not work at all during the maintenance.
>> We have allocated two hours for this procedure, but it should require
>> less time.
>>
>> Tracking task: T205509
>>
>> As always, please follow up with me or anybody on the Analytics team
>> for clarifications and/or comments (via Phabricator or on IRC, Freenode
>> #wikimedia-analytics).
>>
>> Thanks for your patience!
>>
>> Luca (on behalf of the Analytics team)
>
> --
> Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
> (he/him/his)
> product analyst, Wikimedia Foundation
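If the old hostname is hard-coded in many scripts, one quick way to update them all is a grep/sed one-liner; the `~/scripts` directory here is a hypothetical example of wherever your scripts live:

```shell
# Replace the old Hive coordinator hostname with the new one in every
# file under ~/scripts that mentions it (GNU sed's -i edits in place).
grep -rl 'analytics1003\.eqiad\.wmnet' ~/scripts \
  | xargs sed -i 's/analytics1003\.eqiad\.wmnet/an-coord1001.eqiad.wmnet/g'
```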
[Wiki-research-l] Hive and Oozie unavailable for maintenance on Tue Oct 9th 10 AM CEST
Hi everybody, the Analytics team is going to move the Oozie and Hive daemons from the analytics1003 host to an-coord1001 (a new host, hardware refresh) on Tuesday Oct 9th at 10 AM CEST. This will require downtime for Oozie and Hive, so some jobs might fail or not work at all during the maintenance. We have allocated two hours for this procedure, but it should require less time. Tracking task: T205509 As always, please follow up with me or anybody on the Analytics team for clarifications and/or comments (via Phabricator or on IRC, Freenode #wikimedia-analytics). Thanks for your patience! Luca (on behalf of the Analytics team)
[Wiki-research-l] Brief unavailability scheduled for the Event Logging database replica
Hi everybody, tomorrow, Sept 27th, at 10 AM CEST, db1108 (alias analytics-slave) will be down for a brief (max 30 mins) maintenance (MariaDB and Linux kernel upgrades). This means that the log database will not be available for querying during this time frame. Please reach out to me or to the Analytics team if this impacts your work (elukey or #wikimedia-analytics on IRC Freenode). Thanks! Luca
Re: [Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th
Hi everybody, maintenance just completed; it took a bit longer than planned, but no issues registered so far. We weren't able to swap analytics1003 (where Oozie/Hive/etc.. run) in this maintenance window, so I'll likely send another email next week to schedule downtime (not for the entire cluster, but mostly for Hive/Oozie only), please don't hate me :) If you see any issue please contact us (via https://phabricator.wikimedia.org/T203635 or on IRC, Freenode #wikimedia-analytics). Thanks! Luca

On Mon, Sep 24, 2018 at 16:50, Luca Toscano <ltosc...@wikimedia.org> wrote:

> Hi everybody,
>
> this is a reminder that the maintenance will happen tomorrow (Tue 25th,
> 10 AM CEST).
>
> Luca
>
> On Fri, Sep 14, 2018 at 12:13, Luca Toscano <ltosc...@wikimedia.org> wrote:
>
>> Hi everybody,
>>
>> the Analytics team needs to replace the Hadoop master node hosts
>> (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of a
>> regular hardware refresh (the hosts are getting out of warranty). In
>> order to do things safely we decided to proceed with a full cluster
>> shutdown on Sept 25th at 10 AM CEST. The maintenance should last a
>> couple of hours and there shouldn't be any noticeable change for the
>> Hadoop users.
>>
>> This means that during the maintenance:
>> - HDFS will not be available
>> - Yarn will not be available
>> - Hive/Spark (cluster mode)/Oozie/etc.. will not be available
>>
>> Please let us know if this impacts your work in
>> https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics
>> Freenode IRC channel.
>>
>> Thanks a lot!
>>
>> Luca
Re: [Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th
Hi everybody,

this is a reminder that the maintenance will happen tomorrow (Tue 25th, 10 CEST).

Luca

On Fri 14 Sep 2018 at 12:13, Luca Toscano <ltosc...@wikimedia.org> wrote:
> Hi everybody,
>
> the Analytics team needs to replace the Hadoop master node hosts (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of a regular hardware refresh (hosts getting out of warranty). In order to do things safely we decided to proceed with a full cluster shutdown on Sept 25th at 10 AM CEST. The maintenance should last a couple of hours, and there shouldn't be any noticeable change for Hadoop users afterwards.
>
> This means that during the maintenance:
> - HDFS will not be available
> - Yarn will not be available
> - Hive/Spark (cluster mode)/Oozie/etc.. will not be available
>
> Please let us know if this impacts your work in https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics Freenode IRC channel.
>
> Thanks a lot!
>
> Luca
[Wiki-research-l] Analytics Hadoop cluster full shutdown scheduled for Sept 25th
Hi everybody,

the Analytics team needs to replace the Hadoop master node hosts (analytics100[1,2]) and the Hive/Oozie host (analytics1003) as part of a regular hardware refresh (hosts getting out of warranty). In order to do things safely we decided to proceed with a full cluster shutdown on Sept 25th at 10 AM CEST. The maintenance should last a couple of hours, and there shouldn't be any noticeable change for Hadoop users afterwards.

This means that during the maintenance:
- HDFS will not be available
- Yarn will not be available
- Hive/Spark (cluster mode)/Oozie/etc.. will not be available

Please let us know if this impacts your work in https://phabricator.wikimedia.org/T203635 or on the #wikimedia-analytics Freenode IRC channel.

Thanks a lot!

Luca
[Wiki-research-l] Upcoming reboot of stat* and notebook* hosts - Sept 13th
Hi everybody,

on Thursday Sept 13th (EU morning) I am planning to reboot the stat hosts (stat1004, stat1005 and stat1006) and the notebook hosts (notebook1003, notebook1004) for Linux kernel upgrades. Please let me know if this impacts your work in https://phabricator.wikimedia.org/T203165 or on IRC (elukey - #wikimedia-analytics).

Thanks!

Luca
Re: [Wiki-research-l] Kafka Main Eqiad outage and failover of Eventbus/Eventstreams to codfw
[Adding some other mailing lists in Cc]

Hi everybody,

as a lot of you have probably already noticed yesterday reading the operations@ mailing list, we had an outage of the Kafka Main eqiad cluster that forced us to switch the Eventbus and Eventstreams services to codfw. All the precise timings will be listed in https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad, but for a quick glimpse:

2018-07-11 17:00 UTC - Eventbus service switched to codfw
2018-07-11 18:44 UTC - Eventstreams service switched to codfw

We are going to switch those services back to eqiad during the next couple of hours. Consumers of the Eventstreams service may see some failures or data drops; apologies in advance for the trouble.

Cheers,

Luca

On Thu 12 Jul 2018 at 00:00, Luca Toscano <ltosc...@wikimedia.org> wrote:
> Hi everybody,
>
> as you might have seen from the operations channel on IRC, the Kafka Main Eqiad cluster (kafka100[1-3].eqiad.wmnet) suffered a long outage due to new topics pushed out with names that were too long (causing filesystem operation issues, etc..). I'll update this email thread tomorrow EU time with more details, tasks, precise root cause, etc.., but the important bit to know is that Eventbus and Eventstreams have been failed over to the Kafka Main Codfw cluster. This should be transparent to everybody, but please let us know otherwise.
>
> Thanks for the patience!
>
> (a very sleepy :) Luca
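As a quick sanity check of the timeline above, the gap between the two failovers works out to 1 hour and 44 minutes; for instance:

```python
from datetime import datetime

# Failover times taken from the incident timeline above (UTC).
eventbus_switch = datetime(2018, 7, 11, 17, 0)
eventstreams_switch = datetime(2018, 7, 11, 18, 44)

# Eventstreams followed Eventbus to codfw by 1 hour and 44 minutes.
gap = eventstreams_switch - eventbus_switch
print(gap)  # 1:44:00
```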
[Wiki-research-l] Upcoming reboot of stat100[56] and analytics1003 (Hive, Oozie) for kernel security upgrades
Hi everybody,

tomorrow EU morning (Wed Mar 7th) I need to reboot stat100[56] and analytics1003 for kernel security updates. Hive and Oozie (Analytics Hadoop cluster) will not be available for a (hopefully) brief period of time. Please let me know if there is important work that you are doing that cannot be stopped, and the maintenance will be postponed accordingly :)

Tracking task: https://phabricator.wikimedia.org/T188594

Thanks!

Luca (on behalf of the Analytics team)
Re: [Wiki-research-l] Analytics Hadoop cluster maintenance announce for Feb 6th
Hi everybody,

just a reminder that the upgrade is scheduled for tomorrow EU/CET morning.

Luca

2018-01-23 17:58 GMT+01:00 Luca Toscano <ltosc...@wikimedia.org>:
> *TL;DR*: The Analytics Hadoop cluster will be completely down for max 2h on *Feb 6th* (EU/CET morning) to upgrade all the daemons to Java 8.
>
> Hi everybody,
>
> we are planning to upgrade the Analytics Hadoop cluster to Java 8 on *Feb 6th* (EU/CET morning) for https://phabricator.wikimedia.org/T166248. Sadly we can't do a rolling upgrade of all the JVM-based Hadoop daemons, since the distribution that we use (Cloudera) suggests performing the upgrade only after a complete cluster shutdown. This means that for a couple of hours (hopefully a lot less) all the Hadoop-based services will be unavailable (Hive, Oozie, HDFS, etc..).
>
> We have tested the new configuration in labs and all the regular Analytics jobs seem to work correctly, so we don't expect major issues after the upgrade, but if you have any question or concern please follow up in the task.
>
> Thanks!
>
> Luca and Andrew (on behalf of the Analytics team)
[Wiki-research-l] Analytics Hadoop cluster maintenance announce for Feb 6th
*TL;DR*: The Analytics Hadoop cluster will be completely down for max 2h on *Feb 6th* (EU/CET morning) to upgrade all the daemons to Java 8.

Hi everybody,

we are planning to upgrade the Analytics Hadoop cluster to Java 8 on *Feb 6th* (EU/CET morning) for https://phabricator.wikimedia.org/T166248. Sadly we can't do a rolling upgrade of all the JVM-based Hadoop daemons, since the distribution that we use (Cloudera) suggests performing the upgrade only after a complete cluster shutdown. This means that for a couple of hours (hopefully a lot less) all the Hadoop-based services will be unavailable (Hive, Oozie, HDFS, etc..).

We have tested the new configuration in labs and all the regular Analytics jobs seem to work correctly, so we don't expect major issues after the upgrade, but if you have any question or concern please follow up in the task.

Thanks!

Luca and Andrew (on behalf of the Analytics team)
[Wiki-research-l] Alter tables for the log database on the analytics slaves
Hi everybody,

the Analytics team is working on some alter tables for the EventLogging 'log' database on analytics-store (dbstore1002) and analytics-slave (db1047) as part of https://phabricator.wikimedia.org/T167162. The list of alter tables is the following: https://phabricator.wikimedia.org/P5570

This should be a transparent change, but I thought it best to keep you all informed in case of unintended regressions or side effects. The context of the alter tables is in T167162, but the TL;DR is that we need nullable attributes across all the EL tables (except fields like id, uuid and timestamp) to be able to sanitize data with our new eventlogging_cleaner script (https://phabricator.wikimedia.org/T156933).

Please let me know if you encounter any issue with this change. Thanks in advance!

Luca
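To illustrate the kind of change involved (the actual statements are in the paste linked above), making a table's columns nullable while preserving the identifier fields could be scripted roughly as below. The table and column names here are hypothetical, the MODIFY syntax sketched is MariaDB-style, and this is not the Analytics team's actual tooling.

```python
# Fields that must keep their NOT NULL constraint (per the email above).
PRESERVED = {"id", "uuid", "timestamp"}

def nullable_alters(table, columns):
    """Build ALTER TABLE statements making every non-preserved column nullable.

    columns is a list of (name, type) pairs from the table schema.
    """
    return [
        f"ALTER TABLE {table} MODIFY {name} {ctype} NULL;"
        for name, ctype in columns
        if name not in PRESERVED
    ]

# Hypothetical EventLogging table and schema, for illustration only.
stmts = nullable_alters(
    "log.Edit_13457736",
    [("id", "int"), ("uuid", "varchar(32)"), ("event_action", "varchar(64)")],
)
```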
[Wiki-research-l] Upcoming reboots of stat100[234] and most of the Analytics hosts (Kafka and Hadoop)
Hi everybody,

due to a severe kernel vulnerability (https://access.redhat.com/security/vulnerabilities/2706661) I need to reboot the stat1002, stat1003 and stat1004 hosts to install the new kernel. The reboots are scheduled for 9 AM CEST tomorrow (Oct 21st); please follow up with me or anybody in the Analytics team if you have ongoing work that can't be stopped.

The Analytics Hadoop and Kafka clusters will be rebooted too during the next hours. Even if this maintenance shouldn't cause any major issues, you might experience some service degradation. More up-to-date information will be on IRC in the analytics and operations channels.

Thanks and apologies in advance for the trouble!

Luca