On Friday, November 13, 2015 at 5:38:58 PM UTC, Darragh Bailey wrote:
> Hi,
>
> I've spent some time previously digging into the ansible provisioner and
> some issues we have at work with a development environment where we are
> looking to spin up multiple VMs in parallel and then provision them to a
> base configuration using ansible.
>
> There was a pull request accepted from one of my colleagues
> (https://github.com/mitchellh/vagrant/pull/5551), however we've still
> found that some issues remain.
>
> The crux of the issue appears to be an assumption in the ansible
> provisioner that, when it is executed, all active machines will be ready
> for SSH access; otherwise they are considered broken and should be
> recreated. This doesn't hold true when using the vagrant up --parallel
> command, as depending on the provider one or more of the VMs may still
> be booting.
>
> In turn, this means the common inventory file that is generated may get
> regenerated as the other machines reach the ansible provisioner phase,
> thus truncating said file while it is being read by the ansible process
> spun up by other threads.
>
> So there are two problems here:
>
> 1) the inventory file may be truncated
> 2) the inventory may be incomplete or incorrect because the thread
>    didn't allow enough time for SSH to become available
>
> I suggest reading the comments on the pull request for more details.
>
> Item one is trivial to fix: just switch to an inventory file per
> machine, so that each subprocess gets its own file. That way none will
> overwrite or remove an inventory file already being read.
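The per-machine fix can be sketched standalone; the file names, directory layout, and helper names below are illustrative only, not the provisioner's actual paths or API:

```ruby
require "fileutils"

# Sketch: give each machine its own inventory file so parallel
# provisioner threads never truncate a file another thread is reading.
def inventory_path_for(inventory_dir, machine_name)
  FileUtils.mkdir_p(inventory_dir)
  File.join(inventory_dir, "vagrant_ansible_inventory_#{machine_name}")
end

def write_inventory(inventory_dir, machine_name, lines)
  path = inventory_path_for(inventory_dir, machine_name)
  # Write to a temp file and rename into place, so a concurrent reader
  # never sees a half-written or truncated inventory.
  tmp = "#{path}.tmp"
  File.write(tmp, lines.join("\n") + "\n")
  File.rename(tmp, path)
  path
end
```

With one file per machine, the thread provisioning machine `web` only ever rewrites `web`'s file, and the ansible-playbook run for that machine is pointed at that file alone, so no thread can truncate an inventory another thread's ansible process is reading.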
> Number two appears to be a little more tricky. Looking at the code at
> https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/host.rb#L121-L142,
> I tried adding something like the following:
>
> ---------------------
> @machine.env.active_machines.each do |am|
>   begin
>     m = @machine.env.machine(*am)
>     retryable(on: Vagrant::Errors::SSHNotReady,
>               tries: config.inventory_retries.to_i, sleep: 2) do
>       m_ssh_info = m.ssh_info
>       if !m_ssh_info.nil?
>         forced_ssh_user = ""
>         if config.force_remote_user
>           forced_ssh_user = "ansible_ssh_user='#{m_ssh_info[:username]}' "
>         end
>         machines += "#{m.name} ansible_ssh_host=#{m_ssh_info[:host]} " <<
>           "ansible_ssh_port=#{m_ssh_info[:port]} " <<
>           "#{forced_ssh_user}ansible_ssh_private_key_file=" <<
>           "'#{m_ssh_info[:private_key_path][0]}'\n"
>         inventory_machines[m.name] = m
>       else
>         # there may be more valid states to retry on, but not for libvirt
>         if [:preparing, :running].include?(m.state.id)
>           @logger.info("Machine '#{am[0]} (#{am[1]})' not yet responding to ssh.")
>           raise Vagrant::Errors::SSHNotReady
>         else
>           @logger.info("Machine '#{am[0]} (#{am[1]})' not running, ignoring.")
>           machines += "# NOT RUNNING: ignoring non-running machine '#{m.name}'.\n"
>         end
>       end
>     end
>   rescue Vagrant::Errors::SSHNotReady => e
>     @logger.error("Auto-generated inventory: Impossible to get SSH information for " <<
>       "machine '#{m.name} (#{m.provider_name})'. This machine should be recreated.")
>     # Leave a note about this missing machine in the inventory
>     machines += "# MISSING: '#{m.name}' machine was probably removed without using " <<
>       "Vagrant. This machine should be recreated.\n"
>   rescue Vagrant::Errors::MachineNotFound => e
>     @logger.info("Auto-generated inventory: Skip machine '#{am[0]} (#{am[1]})', " <<
>       "which is not configured for this Vagrant environment.")
>   end
> end
> ---------------------
>
> Basically I was looking to retry in the cases where there is a chance of
> the machine eventually providing SSH but it isn't ready yet. In doing
> this I discovered that accessing m.state.id can trigger a "machine is
> locked" exception, because querying the machine state triggers an update
> of the machine index cached state, which requires the machine state
> method to explicitly unlock the returned entry. See
> https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/machine.rb#L485-L503
>
> Of course, with the ansible provisioner for each machine in a
> multi-machine configuration all calling this method on each machine from
> different threads, there is a high likelihood of two threads trying to
> modify the same entry state at the same time.
>
> This also seems a little flaky anyway, since it depends on the state
> being reported by the provider, which in turn is defined by each
> provider.
>
> It seems like vagrant multi-machine environments would benefit from
> having some standard mechanism to establish whether a machine has been
> created by a provider, has finished booting, and should now have valid
> ssh_info.
>
> At the moment this appears to only benefit the ansible provisioner, but
> I'm sure it would be useful elsewhere to know that a machine is
> available and should be able to respond, rather than trying to infer
> that from whether ssh_info is nil or not.
>
> Btw, using communicate.ready? has its own issues with the ssh
> communicator, in how it synchronizes around the insecure ssh private key
> switch. It still also doesn't
Forgot to complete this: if you use communicate.ready? on a machine from a
different thread, the second thread can begin communicating with the
machine over the existing connection and then have that connection closed
out from under it, as the first thread reaches
https://github.com/mitchellh/vagrant/blob/c1508cd893e4b3bd5282f1dde3399140955b80e4/plugins/communicators/ssh/communicator.rb#L183-L184
while the second is in the middle of a connect block using the connection:
https://github.com/mitchellh/vagrant/blob/c1508cd893e4b3bd5282f1dde3399140955b80e4/plugins/communicators/ssh/communicator.rb#L312
Basically, the communicator isn't thread safe.

> Desired Enhancement
> --------------------------------
>
> To call it out explicitly, it would be useful to consistently and
> independently determine certain machine/environment states, including
> but not limited to whether the machine is still being brought up by the
> provider, is ready for anything needing it booted, or has failed.
>
> I'm more than happy to implement the needed changes, and plan to upload
> a pull request for some of the changes to achieve the above once I've
> synced with the latest head.
>
> I'm not sure whether there is something in there already that I can use
> to infer the needed state information consistently, or where best to add
> such information? Perhaps in lib/vagrant/action/builtin/provision.rb, by
> setting some env flags around the app.call(env) call at
> https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/action/builtin/provision.rb#L79-L80
> ?
>
> I've opened https://github.com/mitchellh/vagrant/issues/6526 to cover
> this, with some details added about where I'm looking to fix it.
> Hopefully feedback on the issue, combined with discussion here, will
> direct my efforts towards a fix in the preferred manner.
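As an aside, the retryable(on:, tries:, sleep:) pattern the inventory snippet relies on can be illustrated standalone. This is a minimal stand-in for that helper (Vagrant ships its own in Vagrant::Util::Retryable; this sketch only mirrors the call shape used above):

```ruby
# Minimal stand-in for retryable(on:, tries:, sleep:): re-run the block
# when the given error class is raised, up to `tries` attempts total,
# sleeping between attempts, then re-raise the error.
class SSHNotReady < StandardError; end

def retryable(on:, tries:, sleep: 0)
  error_class = on
  delay = sleep  # the `sleep` kwarg shadows Kernel#sleep in this scope
  attempts = 0
  begin
    attempts += 1
    yield
  rescue error_class
    raise if attempts >= tries
    Kernel.sleep(delay)
    retry
  end
end

# Usage: keep retrying while the machine is plausibly still booting.
calls = 0
retryable(on: SSHNotReady, tries: 5, sleep: 0) do
  calls += 1
  raise SSHNotReady if calls < 3  # "ssh not up yet" on the first tries
end
```

The important property for the provisioner is the bounded `tries`: a machine that is genuinely broken still surfaces SSHNotReady to the caller instead of blocking the inventory generation forever.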
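For what it's worth, the connection-closed race described above is the classic shape that serializing on a single mutex removes: every use of the shared connection and every teardown of it takes the same lock. A rough standalone illustration (GuardedConnection is a hypothetical name, not Vagrant's communicator, and connect here is a placeholder for establishing a real SSH session):

```ruby
# Rough sketch: one mutex guards both use and teardown of a shared
# connection, so a thread can never close the connection while another
# thread is mid-way through using it.
class GuardedConnection
  def initialize
    @lock = Mutex.new
    @connection = nil
  end

  # All reads/writes on the connection happen under the lock.
  def with_connection
    @lock.synchronize do
      @connection ||= connect
      yield @connection
    end
  end

  # Dropping the connection (e.g. around an insecure-key switch) also
  # takes the lock, so it cannot interleave with with_connection.
  def reset!
    @lock.synchronize { @connection = nil }
  end

  private

  # Placeholder for establishing a real SSH session.
  def connect
    Object.new
  end
end
```

The cost is that connection use is fully serialized; a finer-grained design (per-thread connections, or a reader count that defers the close) would keep more parallelism, but the single-mutex version is the simplest fix for the close-while-in-use race.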
--
Darragh Bailey
