#ansible #ssh

Nature of Ansible's "Timeout (12s) waiting for privilege escalation prompt" error

Ansible can occasionally fail with the following error: "Timeout (12s) waiting for privilege escalation prompt".

At some point I wanted to finally learn exactly what this "privilege escalation prompt timeout" is, where I can change its value and where it says that it defaults to 12 seconds.

Clearly it relates to sudo (or other become mechanism), but I could not find anything about it in Ansible documentation - nothing about such specific timeout. In fact, it seemed that there is no setting in Ansible as a whole with a default value of 12 seconds at all. And I did not have any such value set anywhere on my side, too.

Inspecting Ansible source

Luckly, searching Ansible source code for the exact place where this error originates from turned out to be trivial:

$ git log -1 --pretty=oneline
82529e534dd3edd84aba03d86b337f88c58b9982 (HEAD, tag: v2.19.0) New release v2.19.0 (#85513)

$ grep -Rn "waiting for privilege escalation prompt" lib/
lib/ansible/plugins/connection/ssh.py:1245:                        raise AnsibleConnectionFailure('Timeout (%ds) waiting for privilege escalation prompt: %s' % (timeout, to_native(b_stdout)))

From there it's easy to track down the whole context that is important here. First we have the code that starts ssh in a subprocess (with a proper connection timeout passed as an argument, e.g. as -o ConnectTimeout to openssh client):

p = subprocess.Popen(...)

Then we construct the timeout value the we discuss in this note:

# select timeout should be longer than the connect timeout, otherwise
# they will race each other when we can't connect, and the connect
# timeout usually fails
timeout = 2 + self.get_option('timeout')

And we use it just a moment later, when we wait for data from the connection:

selector = selectors.DefaultSelector()
selector.register(p.stdout, selectors.EVENT_READ)
selector.register(p.stderr, selectors.EVENT_READ)

...

while True:
    poll = p.poll()
    events = selector.select(timeout)

    # We pay attention to timeouts only while negotiating a prompt.

    if not events:
        # We timed out
        if state <= states.index('awaiting_escalation'):
            # If the process has already exited, then it's not really a
            # timeout; we'll let the normal error handling deal with it.
            if poll is not None:
                break
            self._terminate_process(p)
            raise AnsibleConnectionFailure('Timeout (%ds) waiting for privilege escalation prompt: %s' % (timeout, to_native(b_stdout)))

The two key things to note here are:

  1. The value of 12 seconds mentioned in the title, which is used as a timeout for reaching privilege escalation prompt, comes from the value of SSH connection timeout (which defaults to 10 seconds) extended by hardcoded extra 2 seconds.

  2. Countdown for both timeouts (connection and reaching privilege escalation prompt) is started approximately at the same time - when connection initiation starts, i.e. they more or less overlap (as opposed to prompt timeout starting its countdown after the connection is ready).

For defaults we have 10 seconds to connect to the host (timeout passed to the ssh process), and then we have all what is left from these 10 seconds, plus extra 2 seconds, to reach privilege escalation prompt from the remote system (assuming we requested such escalation).

Side note on these two extra seconds

It looks like this 2 seconds extension was added a long time ago (2015), to avoid a situation where the connection timeout managed by the ssh process and the reaching-escalation-prompt timeout managed inside the Python process would race each other upon slow connection (or no connectivity at all) and then lead to other confusing error message.

Meaning that this extension is not motivated by wanting to have an extra time to get to the escalation prompt after connecting, but is rather a trick to solve different issue, with a side effect of having this one extra timeout value.

Conclusions

There is a dedicated timeout for obtaining privilege escalation prompt on the host, but its value is an implementation detail and is not exposed to the user as a separate setting. It's simply derived from SSH connection timeout by extending it by 2 seconds, which, if default settings are used, results in the value of 12 seconds. At the same time this value is present in the timeout error message, which may be slightly confusing as to what is the exact nature of the timeout and its value.

Changing the SSH connection timeout setting is the way to influence how long Ansible will wait until triggering the timeout that generates the discussed error.

Slow connection process (taking substantial part of the connection timeout) reduces time left for awaiting escalation prompt after connection succeeded. On a heavy loaded systems this may trigger timeouts reported as "Timeout (12s) waiting for privilege escalation prompt", while the culprit may mostly come from some networking issues (as this reported timeout overlaps the connection process).

On the other hand, "Timeout (12s) waiting for privilege escalation prompt", may also mean that ssh connection succeeded with some delay, and yes it's the host that is not too responsive, but it does not mean we have spent the whole 12 seconds between connection being established and us stopping awaiting for the prompt - it might as well be, say, half that time (rest being used up during the connection process).