Hi,
I've spent some time previously digging into the ansible provisioner and some issues we have at work with a development environment where we spin up multiple VMs in parallel and then provision them to a base configuration using ansible. A pull request from one of my colleagues was accepted (https://github.com/mitchellh/vagrant/pull/5551), however we've found that some issues still remain.

The crux of the issue appears to be that the ansible provisioner assumes that, when it is executed, all active machines are ready for SSH access; otherwise they are considered broken and should be recreated. This doesn't hold true when using vagrant up --parallel, as depending on the provider one or more of the VMs may still be booting. In turn, this means the common inventory file that is generated may get regenerated as the other machines reach the ansible provisioner phase, truncating the file while it is being read by an ansible process spawned by another thread.

So there are two problems here:

1) the inventory file may be truncated
2) the inventory may be incomplete or incorrect because the thread didn't allow enough time for SSH to become available

I suggest reading the comments on the pull request for more details.

Item one is trivial to fix: just switch to an inventory file per machine, so that each subprocess gets its own file. That way none will overwrite or remove an inventory file already being read.

Number two appears to be a little more tricky. Looking at the code at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/host.rb#L121-L142, I tried adding something like the following:

---------------------
@machine.env.active_machines.each do |am|
  begin
    m = @machine.env.machine(*am)
    retryable(on: Vagrant::Errors::SSHNotReady, tries: config.inventory_retries.to_i, sleep: 2) do
      m_ssh_info = m.ssh_info
      if !m_ssh_info.nil?
        forced_ssh_user = ""
        if config.force_remote_user
          forced_ssh_user = "ansible_ssh_user='#{m_ssh_info[:username]}' "
        end
        machines += "#{m.name} ansible_ssh_host=#{m_ssh_info[:host]} " <<
          "ansible_ssh_port=#{m_ssh_info[:port]} " <<
          "#{forced_ssh_user}ansible_ssh_private_key_file=" <<
          "'#{m_ssh_info[:private_key_path][0]}'\n"
        inventory_machines[m.name] = m
      else
        # there may be more valid states to retry on, but not for libvirt
        if [:preparing, :running].include?(m.state.id)
          @logger.info("Machine '#{am[0]} (#{am[1]})' not yet responding to ssh.")
          raise Vagrant::Errors::SSHNotReady
        else
          @logger.info("Machine '#{am[0]} (#{am[1]})' not running, ignoring.")
          machines += "# NOT RUNNING: ignoring non-running machine '#{m.name}'.\n"
        end
      end
    end
  rescue Vagrant::Errors::SSHNotReady => e
    @logger.error("Auto-generated inventory: Impossible to get SSH information for " <<
      "machine '#{m.name} (#{m.provider_name})'. This machine should be recreated.")
    # Leave a note about this missing machine in the inventory
    machines += "# MISSING: '#{m.name}' machine was probably removed without using " <<
      "Vagrant. This machine should be recreated.\n"
  rescue Vagrant::Errors::MachineNotFound => e
    @logger.info("Auto-generated inventory: Skip machine '#{am[0]} (#{am[1]})', " <<
      "which is not configured for this Vagrant environment.")
  end
end
---------------------

Basically I was looking to retry in the cases where there is a chance of the machine providing SSH but it isn't ready to yet. In doing this I discovered that accessing m.state.id can trigger a "machine is locked" exception, because querying the machine state triggers an update of the machine index's cached state, which requires the machine's state method to explicitly unlock the returned entry.
See the machine-index locking code at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/machine.rb#L485-L503.

Of course, with the ansible provisioner for each machine in a multi-machine configuration calling this method on every machine from different threads, there is a high likelihood of two threads trying to modify the same entry's state at the same time. This also seems a little flaky anyway, since it depends on the state reported by the provider, which each provider defines for itself.

It seems like vagrant multi-machine environments would benefit from some standard mechanism to establish whether a machine has been created by a provider and has finished coming up, and so should now have valid ssh_info. At the moment this would only benefit the ansible provisioner, but I'm sure it would be useful elsewhere to know that a machine is available and should be able to respond, rather than trying to infer that from whether ssh_info is nil or not. Btw, using communicate.ready? has its own issues for the ssh communicator, in how it synchronizes around the insecure ssh private key switch. It still also doesn't

Desired Enhancement
--------------------------------

To call it out explicitly: it would be useful to be able to consistently and independently determine certain machine/environment states, including (but not limited to) whether a machine is still being brought up by the provider, is ready for anything that needs it booted, or has failed.

I'm more than happy to implement the needed changes, and plan to upload a pull request for some of them once I've synced with the latest head. I'm just not sure if there is something already in there that I can use to infer the needed state information consistently, or where best to add such information?
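In the meantime, one way to sidestep the locking problem is to avoid machine.state entirely and just poll ssh_info until it becomes available. A rough sketch, where the method name, retry count, and interval are my own assumptions rather than anything in Vagrant:

```ruby
# Hypothetical sketch: poll a machine's ssh_info until it is available,
# instead of consulting machine.state (which updates the machine index's
# cached state and can raise a "machine is locked" error when called from
# several threads at once).
def wait_for_ssh_info(machine, tries: 10, interval: 2)
  tries.times do
    info = machine.ssh_info
    return info unless info.nil?
    sleep interval
  end
  nil # caller decides whether to treat the machine as missing/broken
end
```

The downside is that this can't distinguish "still booting" from "removed outside Vagrant", which is exactly why a standard readiness mechanism would be preferable.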
Perhaps in lib/vagrant/action/builtin/provision.rb, by setting some env flags around the app.call(env) call at https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/action/builtin/provision.rb#L79-L80?

--
Regards,
Darragh Bailey
