Hi,

I've previously spent some time digging into the ansible provisioner and 
some issues we have at work with a development environment where we are 
looking to spin up multiple VMs in parallel and then provision them to a 
base configuration using ansible.

A pull request from one of my colleagues was accepted 
(https://github.com/mitchellh/vagrant/pull/5551), however we've found that 
some issues remain.

The crux of the issue appears to be an assumption in the ansible 
provisioner that, when it is executed, all active machines will be ready 
for SSH access, and that otherwise they are broken and should be recreated. 
This doesn't hold true when using the vagrant up --parallel command, as 
depending on the provider one or more of the VMs may still be booting.

In turn, this means the shared generated inventory file may subsequently be 
regenerated as the other machines reach the ansible provisioner phase, 
truncating it while it is being read by the ansible processes spun up by 
other threads.

So there are two problems here:

1) the inventory file may be truncated
2) the inventory may be incomplete or incorrect, because the thread didn't 
allow enough time for SSH to become available


I suggest reading the comments on the pull request for more details.


Item one is trivial to fix: just switch to an inventory file per machine, 
so that each subprocess gets its own file. That way none will overwrite or 
remove an inventory file that is already being read.
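
A minimal sketch of the idea (if I'm reading host.rb right, the directory 
layout below mirrors where the auto-generated inventory currently lives; 
the per-machine file name suffix is the proposed change):

---------------------
        # Sketch: derive the generated inventory path from the machine
        # name, so each parallel "vagrant up" thread writes its own file
        # instead of all threads sharing one.
        def generate_inventory_path
          @machine.env.local_data_path.join(
            "provisioners", "ansible", "inventory",
            "vagrant_ansible_inventory_#{@machine.name}")
        end
---------------------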

Number two appears to be a little more tricky. Looking at the code at 
https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/plugins/provisioners/ansible/provisioner/host.rb#L121-L142, 
I tried adding something like the following code:

---------------------
        @machine.env.active_machines.each do |am|
          begin
            m = @machine.env.machine(*am)
            retryable(on: Vagrant::Errors::SSHNotReady,
                      tries: config.inventory_retries.to_i, sleep: 2) do
              m_ssh_info = m.ssh_info
              if !m_ssh_info.nil?
                forced_ssh_user = ""
                if config.force_remote_user
                  forced_ssh_user = "ansible_ssh_user='#{m_ssh_info[:username]}' "
                end
                machines += "#{m.name} ansible_ssh_host=#{m_ssh_info[:host]} " <<
                  "ansible_ssh_port=#{m_ssh_info[:port]} " <<
                  "#{forced_ssh_user}ansible_ssh_private_key_file=" <<
                  "'#{m_ssh_info[:private_key_path][0]}'\n"
                inventory_machines[m.name] = m
              else
                # There may be more valid states to retry on, but not for libvirt.
                if [:preparing, :running].include?(m.state.id)
                  @logger.info("Machine '#{am[0]} (#{am[1]})' not yet responding to ssh.")
                  raise Vagrant::Errors::SSHNotReady
                else
                  @logger.info("Machine '#{am[0]} (#{am[1]})' not running, ignoring.")
                  machines += "# NOT RUNNING: ignoring non-running machine '#{m.name}'.\n"
                end
              end
            end
          rescue Vagrant::Errors::SSHNotReady
            @logger.error("Auto-generated inventory: Impossible to get SSH information " <<
              "for machine '#{m.name} (#{m.provider_name})'. This machine should be " <<
              "recreated.")
            # Leave a note about this missing machine in the inventory
            machines += "# MISSING: '#{m.name}' machine was probably removed without " <<
              "using Vagrant. This machine should be recreated.\n"
          rescue Vagrant::Errors::MachineNotFound
            @logger.info("Auto-generated inventory: Skip machine '#{am[0]} (#{am[1]})', " <<
              "which is not configured for this Vagrant environment.")
          end
        end
---------------------

Basically, I was looking to retry in the cases where there is a chance of 
the machine eventually providing SSH but it isn't ready yet. In doing this 
I discovered that accessing m.state.id can trigger a "machine locked" 
exception, because querying the machine state triggers an update of the 
machine index's cached state, which in turn requires the machine state 
method to explicitly unlock the returned entry. See 
https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/machine.rb#L485-L503

Of course, with the ansible provisioner for each machine in a multi-machine 
configuration calling this method on every machine from different threads, 
there is a high likelihood of two threads trying to modify the same entry's 
state at the same time.
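
One way around the lock issue would be to retry the state query itself; a 
fragment along these lines, slotting into the loop above (assuming the 
failure surfaces as Vagrant::Errors::MachineLocked, which is what the 
machine index raises when another thread holds the entry):

---------------------
                begin
                  machine_state = m.state.id
                rescue Vagrant::Errors::MachineLocked
                  # Another thread is updating the cached index entry for
                  # this machine; back off briefly and query again.
                  sleep 1
                  retry
                end
---------------------

That only papers over the race though; it doesn't remove it.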

Also, this seems a little flaky anyway, since it depends on the state being 
reported by the provider, and the set of possible states is defined by each 
provider.

It seems like Vagrant multi-machine environments would benefit from having 
some standard mechanism to establish whether a machine has been created by 
a provider, whether the provider has finished bringing it up, and whether 
it should now have valid ssh_info.

At the moment it appears this would only benefit the ansible provisioner, 
but I'm sure it would be useful elsewhere to know that a machine is 
available and should be able to respond, rather than trying to infer this 
from whether ssh_info is nil or not.

Btw, using communicate.ready? has its own issues for the SSH communicator, 
in how it synchronizes around the insecure SSH private key switch. It 
still also doesn't ...


Desired Enhancement
--------------------------------

To call it out explicitly: it would be useful to be able to determine 
certain machine/environment states consistently and independently of the 
provider, including (but not limited to) whether a machine is still being 
brought up by the provider, is ready for anything that needs it booted, or 
has failed.
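
Purely to illustrate the shape I have in mind (none of this API exists in 
Vagrant today; all names here are made up):

---------------------
        # Hypothetical sketch -- a provider-independent view of where a
        # machine is in its lifecycle, instead of inferring it from
        # provider-specific state ids or a nil ssh_info.
        module Vagrant
          class Machine
            # :booting -- the provider is still bringing the machine up
            # :ready   -- boot has finished; ssh_info should now be valid
            # :failed  -- bring-up failed; the machine should be recreated
            def lifecycle_state
              return :failed if boot_failed?     # hypothetical flag
              return :ready  if boot_completed?  # hypothetical flag
              :booting
            end
          end
        end
---------------------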


I'm more than happy to implement the needed changes, and plan to upload a 
pull request for some of the changes to achieve the above once I've synced 
with the latest head.

I'm not sure if there is something already in there that I can use to infer 
the needed state information consistently, or where best to add such 
information. Perhaps in lib/vagrant/action/builtin/provision.rb, by setting 
some env flags around the app.call(env) call at 
https://github.com/mitchellh/vagrant/blob/a3c077cbe0b27339bb14c7bcd404ea64fefd16d4/lib/vagrant/action/builtin/provision.rb#L79-L80 
?
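
To make that concrete, something along these lines (the env keys are 
hypothetical, not existing Vagrant flags; in the Provision middleware the 
app.call(env) hands off to the rest of the chain, i.e. the actual boot, 
with the provisioners running after it returns):

---------------------
          # Sketch only: bracket the downstream call with lifecycle flags,
          # so anything consulting this machine's state can tell "still
          # booting" apart from "booted, provisioning" or "broken".
          env[:machine_boot_started] = true
          @app.call(env)
          env[:machine_boot_completed] = true
          # Provisioners run from here on, by which point ssh_info
          # should be valid for this machine.
---------------------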


--
Regards,
Darragh Bailey
