Hi, Eliezer asked me to forward this to the Singularity list... it is an excerpt from a work-in-progress of his and is relevant to some current discussions on the list.
-- Ben G

---------- Forwarded message ----------
From: Eliezer S. Yudkowsky <[EMAIL PROTECTED]>
Date: Sep 15, 2006 3:43 PM
Subject: Please fwd to Singularity list
To: Ben Goertzel <[EMAIL PROTECTED]>

Ben, please forward this to your Singularity list.

** Excerpts from a work in progress follow. **

Imagine that I'm visiting a distant city, and a local friend volunteers to drive me to the airport. I don't know the neighborhood. Each time my friend approaches a street intersection, I don't know whether my friend will turn left, turn right, or continue straight ahead. I can't predict my friend's move even as we approach each individual intersection, let alone predict the whole sequence of moves in advance. Yet I can predict the result of my friend's unpredictable actions: we will arrive at the airport. Even if my friend's house were located elsewhere in the city, so that my friend made a completely different sequence of turns, I would just as confidently predict our arrival at the airport. I can predict this long in advance, before I even get into the car. My flight departs soon, and there's no time to waste; I wouldn't get into the car in the first place if I couldn't confidently predict that the car would travel to the airport along an unpredictable pathway.

You cannot build Deep Blue by programming in a good chess move for every possible chess position. First, it is impossible to build a chess player this way, because you don't know exactly which positions it will encounter. You would have to record a specific move for zillions of positions, more than you could consider in a lifetime with your slow neurons. Second, even if you did this, the resulting program would not play chess any better than you do.

This holds true on any level where an answer has to meet a sufficiently high standard. If you want any answer better than you could come up with yourself, you necessarily sacrifice your ability to predict the exact answer in advance. But you don't sacrifice your ability to predict *everything*. As my coworker Marcello Herreshoff says: "We never run a program unless we know something about the output and we don't know the output." Deep Blue's programmers didn't know which moves Deep Blue would make, but they must have known something about Deep Blue's output which distinguished that output from the output of a pseudo-random move generator. After all, it would have been much simpler to create a pseudo-random move generator; instead the programmers felt obligated to carefully craft the complex program that is Deep Blue. In both cases the programmers wouldn't know the move, so what was the key difference? What fact did the programmers know about Deep Blue's output, if they didn't know the output? They didn't know for certain that Deep Blue would win, but they knew that it would try; they knew how to describe, as a fact about its source code, the compact target region into which Deep Blue was trying to steer the future.
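To make that distinction concrete, here is a minimal sketch in Python. It is not Deep Blue's actual code; `position`, `legal_moves`, and `evaluate` are hypothetical stand-ins:

    import random

    def random_player(position, legal_moves):
        # A pseudo-random move generator: nothing can be proven about
        # where it steers the game; its move is drawn uniformly at random.
        return random.choice(legal_moves)

    def optimizing_player(position, legal_moves, evaluate):
        # An optimizer: the move is still unpredictable in advance, but one
        # fact about the output is provable from the source code alone:
        # whatever move is returned maximizes `evaluate`. That is the
        # "compact target region" into which this player steers the future.
        return max(legal_moves, key=lambda move: evaluate(position, move))

Both players return an unpredictable move; only the second supports a provable statement about its output.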
It is not possible to prove strong, non-probabilistic theorems about the external world, because the state of the external world is not fully known. Even if we could perfectly observe every atom, there's a little thing called the "problem of induction". If every swan ever observed has been white, that doesn't mean you won't see a black swan tomorrow. Just because every physical interaction ever observed has obeyed conservation of momentum doesn't mean the rules won't change tomorrow. It's never happened before, but to paraphrase Richard Feynman, you have to go with what your experiments tell you. If tomorrow your experiments start telling you that apples fall up, then that's what you have to believe.

So you can't build an AI by specifying the exact action in advance - the particular chess move, the precise motor output. It also seems impossible to prove any statement about the real-world consequences of the AI's actions. The real world is not knowably knowable. Even if we possessed a model that was, in fact, complete and correct, we could never have absolute confidence in that model.

So what could possibly be a "provably Friendly" AI? You can try to prove a theorem along the lines of: "Provided that the transistors in this computer chip behave the way they're supposed to, the AI that runs on this chip will always *try* to be Friendly." You're going to prove a statement about the search the AI carries out to find its actions.

To prove this formally, you would have to precisely define "try to be Friendly": the complete criterion the AI uses to choose among its actions - including how the AI learns a model of reality from experience, how it identifies the goal-valent aspects of the reality it learns to model, and how it chooses actions on the basis of their extrapolated goal-valent consequences. Even once you've formulated this precise definition, you still can't prove with absolute certainty that the AI will be Friendly in the real world, because a series of cosmic rays could still hit all the transistors at exactly the wrong times and overwrite the entire program with an evil AI. Or Descartes's infinitely powerful deceiving demon could have fooled you into thinking there was a computer in front of you, when in fact it's a hydrogen bomb. Or the Dark Lords of the Matrix could reach into the computer simulation that is our world and replace the AI with Cthulhu.

What you can prove with mathematical certitude is that if all the transistors in the chip work correctly, the AI "will always try to be Friendly" - after you've given "try to be Friendly" a precise definition in terms of how the AI learns a model of the world, identifies the important things in it, and chooses between actions, *these all being events that happen inside the computer chip*.

Since human programmers aren't good at writing error-tolerant code, computer chips are constructed (at tremendous expense in heat dissipation) to be as close to perfect as the engineers can make them. For a computer chip to go a whole day without a single error, its millions of component transistors, switching billions of times per second, must perform quintillions of error-free operations. The inside of a computer chip is an environment that is very close to totally knowable.

Computer chips are not actually perfect, of course. The next step up would be to prove - or, more likely, ask a maturing AI to prove - that the AI remains Friendly given any possible single bitflip, and then given any possible two bitflips. A proof for two bitflips would probably drive the real-world probability of corruption very close to zero, although that probability itself would not have been proven.
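As a toy illustration of the shape of such a claim, here is a minimal sketch, idealizing the program state as a small integer bit-vector and using a hypothetical `invariant_holds` predicate in place of the real, formally defined Friendliness criterion:

    from itertools import combinations

    def flip_bits(state, bits):
        # Return a copy of `state` (an int used as a bit-vector)
        # with the given bit positions inverted.
        for i in bits:
            state ^= 1 << i
        return state

    def survives_bitflips(state, n_bits, invariant_holds, max_flips=2):
        # Exhaustively check that the invariant still holds under every
        # possible combination of up to `max_flips` bit errors.
        for k in range(1, max_flips + 1):
            for bits in combinations(range(n_bits), k):
                if not invariant_holds(flip_bits(state, bits)):
                    return False
        return True

A real program state is billions of bits, so the actual guarantee would come from a formal proof quantifying over the fault space, not from brute-force enumeration like this; the sketch only shows the shape of the statement being proven.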
Eventually one would dispense with such adhockery and let the AI design its own hardware - choosing for itself the correct balance of high-precision hardware and fault-tolerant software, with the final infinitesimal probability of failure proven on the assumption that the observed laws of physics continue to hold. The AI could even write error-checking code to protect against classes of non-malicious changes in physics. You can't defend against infinitely powerful deceiving demons, but there are realistic steps you can take to defend yourself against cosmic rays and new discoveries in physics.

In real life, a transistor has a substantially higher probability than one in a million of failing on any given day. After all, someone might spoon ice cream into the computer; lightning might strike the electrical line and fry the chip; the heatsink might fail and melt the chip... that sort of thing happens much more often than once in every three thousand years, which is roughly the frequency implied by a 0.000001/day failure rate. So if you look at one lone transistor, nothing else, and ask for the probability that it goes on functioning correctly through the whole day, the chance of failure is clearly greater than one in a million. But there are millions of transistors on a chip - perhaps 155 million, for a high-end 2006 processor. Clearly, if each lone transistor has a failure probability greater than one in a million, the chance of the entire chip working is infinitesimal.

What is the flaw in this reasoning? The probability of failure is not conditionally independent between transistors. Spooning ice cream into the computer destroys the whole chip - millions of transistors fail at the same time. If we are told only that one transistor has failed, we should guess a much higher probability that a neighboring transistor has also failed, since most causes of failure destroy many transistors at once. Conversely, if we are told that one transistor is still working properly, this considerably increases the chance that its neighbor is still working too.

If event A has probability 1/2, and event B has probability 1/2, the joint probability of A and B both occurring can be 0, 1/4, 1/2, or anything in between. The key is the conditional probability p(B|A), the probability that B occurs given that A occurs: p(A&B) = p(A) * p(B|A), and the two events are not necessarily independent. The chance that it rains and that the sidewalk gets wet is not the product of the probability that it rains and the probability that the sidewalk gets wet.

The reason a computer chip can work deterministically is that the conditionally independent component of each transistor's chance of failure is very small - that is, the individual contribution of each extra transistor to the overall chip's chance of failure is infinitesimal. If this were not true - if each additional transistor had any noticeable independent chance of failing - it would be impossible to build a computer chip. You'd be limited to a few dozen transistors at best, especially if they had to switch trillions of times per day.
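To put rough numbers on this, here is a minimal sketch; the failure rate and transistor count are the illustrative figures from the text, not measurements:

    P_FAIL = 1e-6      # assumed independent per-transistor failure probability per day
    N = 155_000_000    # transistor count for a high-end 2006 processor (from the text)

    # If every transistor failed independently at this rate, the chance of
    # the whole chip getting through one day without error would be:
    p_chip_ok = (1 - P_FAIL) ** N
    print(p_chip_ok)   # about 5e-68, i.e. exp(-155): effectively impossible

    # With correlated failures the picture changes. If nearly all the risk
    # comes from a common cause (ice cream, lightning, a dead heatsink), then
    #     p(chip fails) ~= p(common cause) + N * p_independent
    # so the chip is reliable exactly when the *independent* component
    # p_independent of each transistor's failure probability is effectively zero.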
For a Friendly AI to continue in existence, the cumulative probability of catastrophic failure must be bounded over the intended working lifespan of the AI. (The actual intended working lifespan might be, say, a million years; I hope humanity will not need the original Friendly AI for anything like that length of time. But we would calculate the cumulative bound over a googol clock ticks, just to leave an error margin.) If the Friendly AI accidentally slices off a human's arm, but is properly "horrified" in a decision-theoretic sense - retains the same goals, and revises its planning to avoid ever doing it again - this is not a *catastrophic* failure.

An error in self-modification - an error in the AI rewriting its own source code - can be catastrophic: a failure of this type can warp the AI's goals so that the AI now chooses according to the criterion of slicing off as many human arms as possible. Therefore, for a Friendly AI to rewrite its own source code, the cumulative probability of catastrophic error must be bounded over billions of sequential self-modifications. The billionth version of the source code, designing the billion-and-first version, must preserve with fidelity the Friendliness invariant - the optimization target that describes what the AI is trying to do as efficiently as possible. Therefore, the independent component in the probability of failure on each self-modification must be effectively zero. That doesn't mean the probability of the entire AI failing somehow-or-other has a real-world value of zero. It means that, whatever this probability of failure is, we think it's pretty much the same after ten billion self-modifications as after one billion self-modifications.

Sounds difficult, doesn't it? George Polya advises us, facing a hard problem, to think of a similar problem that has already been solved. We find, interestingly and suggestively, that a formal mathematical proof of ten billion steps can be as strong as a proof of ten steps. The proof is as strong as its axioms, even for extremely long proofs. This doesn't mean the conclusion of a formal proof is perfectly reliable: your axioms could be wrong, or you could have overlooked a fundamental mistake. But it is at least *theoretically possible* for the system to survive ten billion steps, because *if* you got the axioms right, the stochastically independent failure probabilities on each step don't add up. Even if the proof-checker has a nonzero independent chance of making a mistake on each step, you can get arbitrarily low error probabilities by double-checking or triple-checking.

When computer engineers *prove* a chip valid - a good idea if the chip has 155 million transistors and you can't issue a patch afterward - they use human-guided, machine-verified formal proof. Human beings cannot be trusted to peer over a purported proof of ten billion steps; we have too high a chance of missing an error. And present-day theorem-proving techniques are not smart enough to design and prove an entire computer chip on their own - current algorithms undergo an exponential explosion of the search space. Human mathematicians can prove theorems far more complex than modern theorem-provers can handle, without being defeated by exponential explosion; but human mathematics is informal and unreliable, and occasionally someone discovers a flaw in a previously accepted informal proof. The upshot is that human engineers guide a theorem-prover through the intermediate steps of a proof: the human chooses the next lemma, a complex theorem-prover generates a formal proof of it, and a simple verifier checks the steps. That's how modern engineers build reliable machinery with 155 million interdependent parts.

Proving a computer chip correct thus requires a synergy of human intelligence and computer algorithms; currently neither suffices on its own. The idea is that a Friendly AI would use a similar combination of abilities when modifying its own code - it could both invent large designs without being defeated by exponential explosion, and also verify its steps with extreme reliability.
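The arithmetic behind "the independent failure component must be effectively zero", and behind double-checking the checker, looks like this (a minimal sketch with illustrative numbers):

    def p_survive(n_steps, eps):
        # Probability of zero catastrophic errors across n sequential steps,
        # when each step carries an independent failure probability eps.
        return (1 - eps) ** n_steps

    print(p_survive(10**9, 1e-6))    # exp(-1000), prints 0.0: failure near-certain
    print(p_survive(10**9, 1e-15))   # ~0.999999: a billion steps is survivable

    # If a proof-checker errs independently with probability err per step,
    # k independent checks multiply the error rates together:
    err = 1e-9
    for k in (1, 2, 3):
        print(k, err ** k)           # double-checking: err^2; triple-checking: err^3

The number of steps matters far less than whether each step carries an irreducible independent error rate; that is why a ten-billion-step proof can be as strong as a ten-step proof.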
That is one way a Friendly AI might remain *knowably, provably* Friendly even after carrying out a large number of self-modifications. And the proof comes with many caveats. The proven guarantee of "Friendliness" would actually specify some invariant internal behavior - the optimization target, the search carried out, the criterion for choosing between actions - and if the programmers screw this up, the "Friendly" AI won't actually be friendly in the real world. Moreover, there would still be the standard problem of induction: maybe the previously undiscovered "sorcery addenda" to the laws of physics state that the program we've written is the exact ritual which materializes Azathoth into our solar system. Which is only to say that mere mathematical proof would not give us real-world *certainty*.

But if you *can't even prove mathematically* that the AI is Friendly, it's practically *guaranteed* to fail. Mathematical proof does not give us real-world certainty. But if you proved mathematically that the AI was "Friendly" given its transistors, then it would be *possible* to win. You would not *automatically* fail.

**

Note also that, although very few transhumanists seem to realize it, there's an analogous problem for the stability of self-modifications in humans. Your brain was not designed to be end-user-modifiable. If we were a lot smarter, we might be able to tinker with a messy biological human brain that has no user-serviceable parts and evolved all its elements to operate within very narrow design parameters that don't include self-improvement. Unfortunately, we can't make ourselves a lot smarter without hacking the brain - rather drastically, if we want augments smarter than the smartest existing humans. Catch-22.

One possible solution is to focus our messy, fragile intelligence down onto the source code of a compact, extremely reliable seed that even our tiny brains can check over. Then the seed reliably grows, along deterministic pathways, into something superintelligent enough to tackle the much harder problem of upgrading a messy biological human. It's not the only possible way over the hump; you could try extremely careful and conservative intelligence-improvement modifications on a large group of humans checking each other, and hope that they got smart enough to spot new errors in more extreme modifications. Frankly, though, eventually one of them would just build a damn AI, and to hell with the overcomplicated paranoia of carbon chauvinism.

-- 
Eliezer S. Yudkowsky                          http://singinst.org/
Research Fellow, Singularity Institute for Artificial Intelligence