Reinforcement learning agents interacting with a complex environment like the real world are un- likely to behave optimally all the time. If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions — harmful either for the agent or for the environment — and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interrup- tions, for example by disabling the red button — which is an undesirable outcome.
As artificially intelligent systems grow in intelli- gence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite de- fault incentives for rational agents to resist at- tempts to shut them down or modify their preferences.
Artificial Intelligence gives us a uniquely fascinating and clear perspective at the nature of our minds and our relationship to reality. We will discuss perception, mental representation, agency, consciousness, selfhood, and how they can arise in a computational system, like our brain.