Hi,

Eliezer asked me to forward this to the Singularity list... it is an
excerpt from a work-in-progress of his and is relevant to some current
discussions on the list.

-- Ben G

---------- Forwarded message ----------
From: Eliezer S. Yudkowsky <[EMAIL PROTECTED]>
Date: Sep 15, 2006 3:43 PM
Subject: Please fwd to Singularity list
To: Ben Goertzel <[EMAIL PROTECTED]>


Ben, please forward this to your Singularity list.

** Excerpts from a work in progress follow. **

Imagine that I'm visiting a distant city, and a local friend volunteers
to drive me to the airport.  I don't know the neighborhood.  Each time
my friend approaches a street intersection, I don't know whether my
friend will turn left, turn right, or continue straight ahead.  I can't
predict my friend's move even as we approach each individual
intersection - let alone, predict the whole sequence of moves in advance.

Yet I can predict the result of my friend's unpredictable actions: we
will arrive at the airport.  Even if my friend's house were located
elsewhere in the city, so that my friend made a completely different
sequence of turns, I would just as confidently predict our arrival at
the airport.  I can predict this long in advance, before I even get into
the car.  My flight departs soon, and there's no time to waste; I
wouldn't get into the car in the first place, if I couldn't confidently
predict that the car would travel to the airport along an unpredictable
pathway.

You cannot build Deep Blue by programming in a good chess move for every
possible chess position.  First of all, it is impossible to build a
chess player this way, because you don't know exactly which positions it
will encounter.  You would have to record a specific move for zillions
of positions, more than you could consider in a lifetime with your slow
neurons.  And second, even if you did this, the resulting program would
not play chess any better than you do.

This holds true on any level where an answer has to meet a sufficiently
high standard.  If you want any answer better than you could come up
with yourself, you necessarily sacrifice your ability to predict the
exact answer in advance.

But you don't sacrifice your ability to predict *everything*.  As my
coworker, Marcello Herreshoff, says:  "We never run a program unless we
know something about the output and we don't know the output."  Deep
Blue's programmers didn't know which moves Deep Blue would make, but
they must have known something about Deep Blue's output which
distinguished that output from the output of a pseudo-random move
generator.  After all, it would have been much simpler to create a
pseudo-random move generator; but instead the programmers felt obligated
to carefully craft the complex program that is Deep Blue.  In both
cases, the programmers wouldn't know the move - so what was the key
difference?  What was the fact that the programmers knew about Deep
Blue's output, if they didn't know the output?

They didn't know for certain that Deep Blue would win, but they knew
that it would try; they knew how to describe the compact target region
into which Deep Blue was trying to steer the future, as a fact about its
source code.

It is not possible to prove strong, non-probabilistic theorems about the
external world, because the state of the external world is not fully
known.  Even if we could perfectly observe every atom, there's a little
thing called the "problem of induction".  If every swan ever observed
has been white, it doesn't mean that tomorrow you won't see a black
swan.  Just because every physical interaction ever observed has obeyed
conservation of momentum, doesn't mean that tomorrow the rules won't
change.  It's never happened before, but to paraphrase Richard Feynman,
you have to go with what your experiments tell you.  If tomorrow your
experiments start telling you that apples fall up, then that's what you
have to believe.

So you can't build an AI by specifying the exact action - the particular
chess move, the precise motor output - in advance.  It also seems that
it would be impossible to prove any statement about the real-world
consequences of the AI's actions.  The real world is not knowably
knowable.  Even if we possessed a model that was, in fact, complete and
correct, we could never have absolute confidence in that model.  So what
could possibly be a "provably Friendly" AI?

You can try to prove a theorem along the lines of:  "Providing that the
transistors in this computer chip behave the way they're supposed to,
the AI that runs on this chip will always *try* to be Friendly."  You're
going to prove a statement about the search the AI carries out to find
its actions.  To prove this formally, you would have to precisely define
"try to be Friendly": the complete criterion that the AI uses to choose
among its actions - including how the AI learns a model of reality from
experience, how the AI identifies the goal-valent aspects of the reality
it learns to model, and how the AI chooses actions on the basis of their
extrapolated goal-valent consequences.

Once you've formulated this precise definition, you still can't prove an
absolute certainty that the AI will be Friendly in the real world,
because a series of cosmic rays could still hit all of the transistors
at exactly the wrong time to overwrite the entire program with an evil
AI.  Or Descartes's infinitely powerful deceiving demon could have
fooled you into thinking that there was a computer in front of you, when
in fact it's a hydrogen bomb.  Or the Dark Lords of the Matrix could
reach into the computer simulation that is our world, and replace the AI
with Cthulhu.  What you can prove with mathematical certitude is that if
all the transistors in the chip work correctly, the AI "will always try
to be Friendly" - after you've given "try to be Friendly" a precise
definition in terms of how the AI learns a model of the world,
identifies the important things in it, and chooses between actions,
*these all being events that happen inside the computer chip*.

Since human programmers aren't good at writing error-tolerant code,
computer chips are constructed (at a tremendous expense in heat
dissipation) to be as close to perfect as the engineers can make them.
For a computer chip to not make a single error in a day, the millions of
component transistors that switch billions of times per second have to
perform quintillions of error-free operations in a day.  The inside of
the computer chip is an environment that is very close to totally knowable.
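
As a rough back-of-the-envelope sketch of that arithmetic - the
transistor count and switching rate below are illustrative
assumptions, not measured figures:

    # Rough arithmetic only; both figures are illustrative assumptions.
    transistors = 155e6         # transistors on a high-end 2006 chip
    switches_per_second = 1e9   # order-of-magnitude switching rate
    seconds_per_day = 86_400

    ops_per_day = transistors * switches_per_second * seconds_per_day
    print(f"{ops_per_day:.1e}")  # ~1.3e22: thousands of quintillions
                                 # of error-free operations per day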

Computer chips are not actually perfect.  The next step up would be to
prove - or more likely, ask a maturing AI to prove - that the AI remains
Friendly given any possible single bitflip, then any possible two
bitflips.  A proof for two bitflips would probably drive the real-world
probability of corruption to very close to zero, although this
probability itself would not have been proven.  Eventually one would
dispense with such adhockery, and let the AI design its own hardware -
choosing for itself the correct balance of high-precision hardware and
fault-tolerant software, with the final infinitesimal probability of
failure being proven on the assumption that the observed laws of physics
continue to hold.  The AI could even write error-checking code to
protect against classes of non-malicious changes in physics.  You can't
defend against infinitely powerful deceiving demons; but there are
realistic steps you can take to defend yourself against cosmic rays and
new discoveries in physics.

In real life, a transistor has a substantially higher probability than
one-in-a-million of failing on any given day.  After all, someone might
spoon ice cream into the computer; lightning might strike the electrical
line and fry the chip; the heatsink might fail and melt the chip... that
sort of thing happens much more often than once in every 2,700 years
or so, the frequency implied by a 0.000001/day failure rate.  So if you
look at one lone transistor, nothing else, and ask the probability that
it will go on functioning correctly through the whole day, the chance of
failure is clearly greater than one in a million.

But there are millions of transistors on the chip - perhaps 155 million,
for a high-end 2006 processor.  Clearly, if each lone transistor has a
probability of failure greater than one in a million, the chance of the
entire chip working is infinitesimal.
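
For concreteness, here is that arithmetic spelled out in a short
sketch, with an illustrative per-transistor failure probability of one
in a million per day:

    # The reasoning above, spelled out: treat every transistor's
    # failure as independent, at one-in-a-million per day each
    # (illustrative numbers).
    from math import exp, log1p

    n = 155_000_000      # transistors on the chip
    p_fail = 1e-6        # per-transistor daily failure probability

    p_chip_ok = exp(n * log1p(-p_fail))   # (1 - p)^n, computed stably
    print(p_chip_ok)     # ~5e-68: the chip would "never" survive a day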

What is the flaw in this reasoning?  The probability of failure is not
conditionally independent between transistors.  Spooning ice cream into
the computer will destroy the whole chip - millions of transistors will
fail at the same time.  If we are told solely that one transistor has
failed, we should guess a much higher probability that a neighboring
transistor has also failed, since most causes of failure destroy many
transistors at once.  Conversely, if we are told that one transistor is
still working properly, this considerably increases the chance that the
neighboring transistor is still working.  If event A has a probability
of 1/2, and event B has a probability of 1/2, then the joint probability
of A and B both occurring can be 1/4, 1/2, 0, or anything in between.
The key is the conditional probability p(B|A), the
probability that B occurs given that A occurs - the two events are not
necessarily independent.  The chance that it rains and that the sidewalk
gets wet is not the product of the probability that it rains and the
probability that the sidewalk gets wet.

The reason a computer chip can work deterministically is that the
conditionally independent component of a transistor's chance of failure
is very small - that is, the individual contribution of each extra
transistor to the overall chip's chance of failure is infinitesimal.  If
this were not true, if each additional transistor had any noticeable
independent chance of failing, it would be impossible to build a
computer chip.  You'd be limited to a few dozen transistors at best -
especially if they had to switch trillions of times per day.
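
Running the arithmetic in reverse (again with illustrative numbers)
shows how tiny that independent component has to be:

    # How small must the independent per-transistor daily failure
    # probability q be for a 155-million-transistor chip to survive a
    # day with probability 0.999999?  Illustrative, not engineering data.
    from math import expm1, log

    n = 155_000_000
    target = 0.999999

    q_max = -expm1(log(target) / n)   # solve (1 - q)^n >= target for q
    print(q_max)                      # ~6.5e-15 per transistor per day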

For a Friendly AI to continue in existence, the cumulative probability
of catastrophic failure must be bounded over the intended working
lifespan of the AI.  (The actual intended working lifespan might be,
say, a million years; I hope that humanity will not need the original
Friendly AI for anything like this length of time.  But we would
calculate the cumulative bound over a googol clock ticks, just to leave
an error margin.)  If the Friendly AI accidentally slices off a human's
arm, but is properly "horrified" in a decision-theoretic sense - retains
the same goals, and revises its planning to avoid ever doing it again -
this is not a *catastrophic* failure.  An error in self-modification -
an error in the AI rewriting its own source code - can be catastrophic;
a failure of this type can warp the AI's goals so that the AI now
chooses according to the criterion of slicing off as many human arms as
possible.

Therefore, for a Friendly AI to rewrite its own source code, the
cumulative probability of catastrophic error must be bounded over
billions of sequential self-modifications.  The billionth version of the
source code, designing the billionth-and-first version, must preserve
with fidelity the Friendly invariant - the optimization target that
describes what the AI is trying to do as efficiently as possible.

Therefore, the independent component in the probability of failure on
each self-modification must be effectively zero.  That doesn't mean the
probability of the entire AI failing somehow-or-other has a real-world
value of zero.  It means that, whatever this probability of failure is,
we think it's pretty much the same after ten billion self-modifications
as after one billion self-modifications.
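
A sketch of what "effectively zero" has to mean over a billion
sequential self-modifications, with purely illustrative per-step
numbers:

    # Survival over n sequential self-modifications, assuming an
    # independent per-step catastrophic failure probability eps.
    from math import exp, log1p

    n = 1_000_000_000
    for eps in (1e-9, 1e-12, 1e-15):
        p_survive = exp(n * log1p(-eps))   # (1 - eps)^n
        print(eps, p_survive)
    # eps = 1e-9  -> ~0.37  (likelier to fail than to survive)
    # eps = 1e-12 -> ~0.999
    # eps = 1e-15 -> ~0.999999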

Sounds difficult, doesn't it?  George Polya advises us to think of a
similar problem that has been solved before.

We find that, interestingly and suggestively, a formal mathematical
proof of ten billion steps can be as strong as a proof of ten steps. The
proof is as strong as its axioms, even for extremely long proofs. This
doesn't mean that the conclusion of a formal proof is perfectly
reliable.  Your axioms could be wrong; you could have overlooked a
fundamental mistake.  But it is at least *theoretically possible* for
the system to survive ten billion steps, because *if* you got the axioms
right, then the stochastically independent failure probabilities on each
step don't add up.  Even if the proof-checker has a nonzero independent
chance of making a mistake on each step, you can get arbitrarily low
error probabilities by double-checking or triple-checking.
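
A sketch of that double-checking arithmetic, under the strong
assumption that each verification pass errs independently:

    # If a simple proof-checker wrongly accepts a bad step with
    # probability q, and k independent checks must all be fooled,
    # the residual error probability is q**k.  Independence of the
    # checks is the assumption doing all the work here.
    q = 1e-6                 # illustrative per-check error probability
    for k in (1, 2, 3):
        print(k, q**k)       # 1e-06, 1e-12, 1e-18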

When computer engineers *prove* a chip valid - a good idea if the chip
has 155 million transistors and you can't issue a patch afterward - the
engineers use human-guided, machine-verified formal proof.  Human beings
cannot be trusted to check a purported proof of ten billion steps by hand;
we have too high a chance of missing an error.  And present-day
theorem-proving techniques are not smart enough to design and prove an
entire computer chip on their own - current algorithms undergo an
exponential explosion in the search space.  Human mathematicians can
prove theorems far more complex than modern theorem-provers can handle,
without being defeated by exponential explosion.  But human mathematics
is informal and unreliable; occasionally someone discovers a flaw in a
previously accepted informal proof.

The upshot is that human engineers guide a theorem-prover through the
intermediate steps of a proof.  The human chooses the next lemma, and a
complex theorem-prover generates a formal proof, and a simple verifier
checks the steps.  That's how modern engineers build reliable machinery
with 155 million interdependent parts.
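
As a toy illustration of that division of labor - using Lean 4 here
only as a stand-in, not as the prover actually used in chip
verification - the human states the lemma, automation or a library
lemma supplies the proof, and a small trusted kernel re-checks every
step:

    -- The human chooses what to prove; `decide` searches for a proof
    -- of the decidable statement; Lean's small kernel then verifies it.
    example : 2 + 2 = 4 := by decide

    -- Same division of labor, with a library lemma supplying the proof
    -- term, which the kernel again checks step by step.
    example (a b : Nat) : a + b = b + a := Nat.add_comm a b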

Proving a computer chip correct requires a synergy of human intelligence
and computer algorithms, as currently neither suffices on its own.  The
idea is that a Friendly AI would use a similar combination of abilities
when modifying its own code - could both invent large designs without
being defeated by exponential explosion, and also verify its steps with
extreme reliability.  That is one way a Friendly AI might remain
*knowably, provably* Friendly even after carrying out a large number of
self-modifications.

And this proof comes with many caveats:  The proven guarantee of
"Friendliness" would actually specify some invariant internal behavior -
the optimization target, the search carried out, the criterion for
choosing between actions - and if the programmers screw this up, the
"Friendly" AI won't actually be friendly in the real world.  Moreover,
there would still be the standard problem of induction - maybe the
previously undiscovered "sorcery addenda" to the laws of physics state
that the program we've written is the exact ritual which materializes
Azathoth into our solar system.  Which only goes to say that mere
mathematical proof would not give us real-world *certainty*.

But if you *can't even prove mathematically* that the AI is Friendly,
it's practically *guaranteed* to fail.  Mathematical proof does not give
us real-world certainty.  But if you proved mathematically that the AI
was "Friendly" given its transistors, then it would be *possible* to
win.  You would not *automatically* fail.

**

Note also that, although very few transhumanists seem to realize it,
there's an analogous problem for stability of self-modifications in
humans.  Your brain was not designed to be end-user-modifiable.  If we
were a lot smarter, we might be able to tinker with a messy, biological
human brain that has no user-serviceable parts and evolved all its
elements to operate within very narrow design parameters that don't
include self-improvement. Unfortunately, we can't make ourselves a lot
smarter without hacking the brain - rather drastically, if we want
augments that are smarter than the smartest existing humans.  Catch-22.

One possible solution is to focus our messy, fragile intelligence down
onto the source code of a compact, extremely reliable seed that even our
tiny brains can check over.  Then the seed reliably grows, along
deterministic pathways, into something superintelligent enough to tackle
the much more difficult problem of upgrading a messy biological human.
It's not the only possible way over the hump; you could try extremely
careful and conservative intelligence-improvement modifications on a
large group of humans checking each other, and hope that they got smart
enough to spot new errors in more extreme modifications.  Frankly,
though, eventually one of them would just build a damn AI, and to hell
with the overcomplicated paranoia of carbon chauvinism.

--
Eliezer S. Yudkowsky                          http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence
