[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-07-19 Thread wiki_willy
wiki_willy added a comment.


  Talked to @Gehel, @faidon, and @RobH, and we're going to proceed with RAID0 
for this install.  Although RAID0 is not ideal, we can make an exception here, 
as long as a drive failure that takes the system down won't result in any 
dc-ops onsite repair emergencies.
  
  Thanks,
  Willy

TASK DETAIL
  https://phabricator.wikimedia.org/T221632

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Gehel, wiki_willy
Cc: wiki_willy, mark, RobH, faidon, Smalyshev, Aklapper, Gehel, darthmon_wmde, 
Nandana, Sario528, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-07-19 Thread Gehel
Gehel added a comment.


  I've just updated the task description to make it clear that even if we move 
storage to RAID0, we'll keep the OS on RAID1 (same scheme used by elasticsearch 
servers).

To: Gehel
Cc: mark, RobH, faidon, Smalyshev, Aklapper, Gehel, darthmon_wmde, Nandana, 
Sario528, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, 
EBjune, merbst, LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-07-09 Thread Smalyshev
Smalyshev added a comment.


  > wdqs could lose a single node, but could it lose two?
  
  Depends. Losing two hosts in the same cluster in the same datacenter would be 
a huge issue. But we could temporarily borrow a node between clusters (e.g. 
internal -> public) if such a thing happens. Also, we should be adding one host 
to the public cluster shortly (we want to do it regardless of anything else), 
so this would make it more resilient against such a scenario.

To: Gehel, Smalyshev
Cc: RobH, faidon, Smalyshev, Aklapper, Gehel, darthmon_wmde, Nandana, Lahi, 
Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-07-09 Thread RobH
RobH added a comment.


  I'd like to note that I'm against the trend of our raid configurations 
becoming raid0 per node, and relying on horizontal node redundancy over disk 
redundancy.
  
  Losing a single disk in a raid0 node invalidates the entire node, making it a 
single point of urgency for our on-site engineers to address quickly.  If we 
stick with some kind of actual disk redundancy, we lessen the urgency of those 
disk failures by an order of magnitude.
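
  The difference in exposure can be sketched with a small calculation. This is 
only an illustration under simplifying assumptions: disk failures are treated 
as independent, the mirror case ignores the rebuild window, and the disk count 
and failure rate are hypothetical values that do not come from this task.

```python
def raid0_loss_probability(disk_failure_prob: float, disks: int) -> float:
    """Chance that a RAID0 array is lost: any single disk failure is fatal."""
    return 1.0 - (1.0 - disk_failure_prob) ** disks


def raid1_pair_loss_probability(disk_failure_prob: float) -> float:
    """Chance that both disks of a two-disk mirror fail (independent failures)."""
    return disk_failure_prob ** 2


# Hypothetical inputs: 4 disks per node, 2% failure chance per disk per year.
p = 0.02
print(f"RAID0, 4 disks: {raid0_loss_probability(p, 4):.4f}")
print(f"RAID1 pair:     {raid1_pair_loss_probability(p):.4f}")
```

  With striping, every added disk raises the chance that the node must be 
rebuilt, while a mirror only loses data if both copies fail.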

To: Gehel, RobH
Cc: RobH, faidon, Smalyshev, Aklapper, Gehel, darthmon_wmde, Nandana, Lahi, 
Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-04-25 Thread Smalyshev
Smalyshev added a comment.


  > I don't think it makes sense to perpetuate a vertical scaling model
  
  With regard to disk space, I don't think we have a choice, at least for the 
next FY. Even if we find an alternative backend that can be a drop-in 
replacement for Blazegraph (or somehow miraculously discover an easy way to 
shard the data without killing performance) - and my current estimate for this 
optimistic scenario is "very low probability" - implementing that is likely 
going to take time. And in all that time we'd have to continue serving queries 
on the existing platform, which means each node has to store the whole set of 
the query data.
  
  We can discuss future sharding solutions, and it's totally fine, but 
realistically I personally do not see any scenario where we have any such 
solution implemented and ready to the point we can 100% migrate our Wikidata 
query load to it in a year. Right now we are no further along this road than 
"we should think about trying some options", and we aren't even clear on what 
these options would be. It's a long road from here to a working sharding 
solution, and I don't see a way to make it through in a shorter time. If 
somebody sees a way, please tell me. For me this means that at least for the 
next FY, we need to plan for the current model - which implies hosting all the 
data on each server - to be what we have.
  
  Right now our data size is about 840G, and it keeps growing. I know Wikidata 
has growth projections, so I'll add them here later, but you can look at the DB 
size graph and draw your own conclusions. For me, this says we'll start to run 
out of disk space before the end of this calendar year. So we need to find a 
solution for that - be it new disks, RAID0 or whatever magic there is.
  
  So when we discuss the budget for the next FY, I think we should base it on 
the current model and the growth projections we have now, as the facts on the 
ground. If we find a better model (and we do allocate time and resources to 
look for it), great, but we should not be basing our planning on miracles 
happening.
  
  > Taking machines offline and rebuilding them from scratch just because a 
disk failed or because we need more storage is really something that we need to 
avoid
  
  So far, in my experience, this has happened every time we did a capacity 
upgrade. Has this changed (regardless of the RAID0 move, I'm talking about the 
current situation), or do we still need to take the host offline when we add 
disk capacity now? If it hasn't changed, is there a model (not involving 
changing WDQS platform software, etc.) that can allow us to upgrade capacity 
without reimaging?

To: Gehel, Smalyshev
Cc: faidon, Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-04-25 Thread Gehel
Gehel added a comment.


  In T221632#5137095, @faidon wrote:
  
  > I don't think it makes sense to perpetuate a vertical scaling model.
  
  
  I could not agree more! And there is work going in this direction (see this 
page). But changing the full stack while not breaking the current clients is 
not going to happen overnight. So we need a solution in the meantime.

To: Gehel
Cc: faidon, Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-04-25 Thread faidon
faidon added a comment.


  I don't think it makes sense to perpetuate a vertical scaling model. Both of 
the options listed here (adding disks, RAID 0) are things that we generally do 
not do, due to the hidden costs and burdens for everyone involved. Taking 
machines offline and rebuilding them from scratch just because a disk failed or 
because we need more storage is really something that we need to avoid, and 
something that the data center operations team cannot really support with its 
existing staffing (esp. taking into account the failure rate of disks).
  
  PoC/MVPs with vertical scaling are obviously OK and we can be somewhat 
flexible in HW needs for those, but for a service with the maturity and 
popularity of WDQS I think it makes sense to start designing it for scale at 
this point. It's clear that it's here to stay, and that going down a vertical 
scaling route today would only mean that we're deferring the problem until the 
next storage expansion, i.e. creating more tech debt. Horizontal scaling and 
quick/cheap operations for expansion (= adding boxes over time, staggering them 
over multiple FY) is really the way to go here.
  
  Budget-wise we can be flexible indeed! We can allocate some funds without 
defining exactly how we would spend them (within reason), and defer that 
decision until there is more clarity on the technical front. Does that make 
sense?

To: Gehel, faidon
Cc: faidon, Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-04-25 Thread Gehel
Gehel added a comment.


  In T221632#5132147, @Smalyshev wrote:
  
  > This also makes me note that if we do introduce sharding (in any shape, 
either with Blazegraph or another solution) we'd need even more servers, since 
each shard would need to be at least on 2-3 servers to survive, so for sharding 
to make any sense we'd need at least 6, maybe even more servers, otherwise we'd 
just store every or nearly every shard on every server, which makes sharding 
pointless.
  >
  > So if we'd want resilience to loss of a server and meaningful sharding, 
we'd need something like 2x servers probably.
  
  
  As discussed on IRC, this depends a lot on what the future solution is, how 
it shards / replicates, and what the scaling strategy is. I don't think we can 
take any decision on the future architecture at this point. We might want to 
reserve //some// budget for whatever solution we come up with, but the scope 
isn't defined enough yet to have any meaningful estimate.

To: Gehel
Cc: Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331


[Wikidata-bugs] [Maniphest] [Commented On] T221632: Storage capacity upgrade for WDQS

2019-04-23 Thread Smalyshev
Smalyshev added a comment.


  This also makes me note that if we do introduce sharding (in any shape, 
either with Blazegraph or another solution) we'd need even more servers, since 
each shard would need to be at least on 2-3 servers to survive, so for sharding 
to make any sense we'd need at least 6, maybe even more servers, otherwise we'd 
just store every or nearly every shard on every server, which makes sharding 
pointless.
  
  So if we'd want resilience to loss of a server and meaningful sharding, we'd 
need something like 2x servers probably.
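
  The arithmetic behind this can be sketched as follows. It is a simplified 
model, assuming shards of equal size spread evenly across servers; the function 
name and the server counts are illustrative and not taken from any existing 
tooling.

```python
def data_fraction_per_server(servers: int, replicas_per_shard: int) -> float:
    """Fraction of the full dataset each server stores with evenly spread shards."""
    if replicas_per_shard > servers:
        raise ValueError("cannot place more replicas than there are servers")
    return replicas_per_shard / servers


# With 3 replicas per shard, compare a few illustrative cluster sizes.
for servers in (3, 6, 9):
    fraction = data_fraction_per_server(servers, replicas_per_shard=3)
    print(f"{servers} servers: each stores {fraction:.0%} of the data")
```

  When the number of servers equals the replica count, every server still holds 
the whole dataset, so sharding saves no disk space; the saving only appears 
once the server count grows past the replica count.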

To: Gehel, Smalyshev
Cc: Smalyshev, Aklapper, Gehel, alaa_wmde, Nandana, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Jonas, Xmlizer, jkroll, Wikidata-bugs, 
Jdouglas, aude, Tobias1984, Manybubbles, Mbch331