Olivier Georgeon's research blog—also known as the story of little Ernest, the developmental agent.

Keywords: situated cognition, constructivist learning, intrinsic motivation, bottom-up self-programming, individuation, theory of enaction, developmental learning, artificial sense-making, biologically inspired cognitive architectures, agnostic agents (without ontological assumptions about the environment).

Wednesday, October 15, 2008

Soar Ernest 3.0

This new environment expects A and B alternately during the first 10 rounds, then it toggles to always expecting B. This toggle happens roughly at the middle of the video.
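For concreteness, here is a minimal Python sketch of such an environment. The class name, the toggle round, and the assumption that the alternation starts with A are all illustrative choices of this sketch; the original agent was implemented in Soar.

```python
class ToggleEnvironment:
    """Expects A and B alternately for the first rounds, then always B.
    Responds 'Y' when the action matches the expectation, 'X' otherwise."""

    def __init__(self, toggle_round=10):
        self.round = 0
        self.toggle_round = toggle_round

    def respond(self, action):
        if self.round < self.toggle_round:
            # Alternating phase; starting with A is an assumption here.
            expected = 'A' if self.round % 2 == 0 else 'B'
        else:
            # After the toggle, the environment always expects B.
            expected = 'B'
        self.round += 1
        return 'Y' if action == expected else 'X'
```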

Like Ernest 2.0, Ernest 3.0 has an innate preference for schemas that have an expectation of Y, but in addition, he has an innate dislike for schemas that have an expectation of X. That is, if Ernest has a schema that matches the current context and has an expectation of X, then he will avoid repeating the same action.

As we can see in the trace, compared to Ernest 2.0, this conjunction of "like" and "dislike" drastically speeds up the learning process. In this example, after the fourth round, Ernest always manages to get a Y until the environment toggles.

In this trace, schemas are represented as a quintuple (ca, ce, a, e, w). They are the same as Ernest 2.0's, but in addition, they are weighted: ca is the context's previous action, ce is the context's previous response from the environment, a is the action, and e is the expectation of the schema. w is the schema's weight, that is, the number of times the schema was reinforced.
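As an illustration, the quintuple could be coded as follows. The field names mirror the trace notation; keeping the weight w as a dictionary value rather than a fifth field is a representational choice of this sketch, not of the original.

```python
from typing import NamedTuple

class Schema(NamedTuple):
    ca: str  # context: the action of the previous round
    ce: str  # context: the environment's response in the previous round
    a: str   # the action tried in this context
    e: str   # the expectation: the response 'Y' or 'X' that was obtained

# The weight w of each schema is kept as the dict value, so a quintuple
# (ca, ce, a, e, w) is a Schema key plus its reinforcement count.
memory = {}  # maps Schema -> int (the weight w)
```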

Like Ernest 2.0, after receiving the response from the environment, Ernest 3.0 memorizes the schema if it is not already known. In addition, if the schema is already known, Ernest reinforces it by incrementing its weight.
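Under the representation above, this memorize-or-reinforce step is a one-liner (the function name memorize is hypothetical):

```python
def memorize(memory, ca, ce, a, e):
    """Store the schema with weight 1 if it is new; otherwise
    reinforce it by incrementing its weight."""
    s = Schema(ca, ce, a, e)
    memory[s] = memory.get(s, 0) + 1
```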

To choose an action, Ernest recalls all the schemas that match the current context, computes the sum of their weights for each action (counting the weights of schemas with an expectation of X negatively), and then chooses the action with the highest sum. If the sums are equal, he chooses randomly. For example, in the last decision of this trace, the context is BY and there are three matching schemas: (BY B X 1), (BY A Y 3), and (BY A X 5). That means, in this context, there was one bad experience of choosing B, three good experiences of choosing A, and five bad experiences of choosing A. Thus, Ernest chooses to do B because w(B) = -1 > w(A) = 3 - 5 = -2.
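In the sketch's terms, this selection rule, including the innate "dislike" of X implemented as negative counting, could look like this (choose_action is a hypothetical name):

```python
import random

def choose_action(memory, ca, ce, actions=('A', 'B')):
    """Sum the weights of matching schemas per action, counting schemas
    that expect X negatively; return the action with the highest sum.
    Ties (including an empty memory) are broken randomly."""
    sums = {act: 0 for act in actions}
    for s, w in memory.items():
        if s.ca == ca and s.ce == ce:
            sums[s.a] += w if s.e == 'Y' else -w
    best = max(sums.values())
    return random.choice([act for act, v in sums.items() if v == best])
```

Replaying the last decision of the trace confirms the arithmetic:

```python
trace_memory = {Schema('B', 'Y', 'B', 'X'): 1,   # (BY B X 1)
                Schema('B', 'Y', 'A', 'Y'): 3,   # (BY A Y 3)
                Schema('B', 'Y', 'A', 'X'): 5}   # (BY A X 5)
choose_action(trace_memory, 'B', 'Y')  # -> 'B', since -1 > 3 - 5 = -2
```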

At the middle of the video, when the environment toggles, the previously learned schemas no longer meet their expectations, and they get an X response instead of a Y. This results in the learning of new schemas with an expectation of X. When these new "X" schemas reach a higher weight than the "Y" ones, the wrong action is no longer chosen. That is, Ernest becomes more "afraid" of getting X than "confident" of getting Y if he does A. Thus, Ernest starts sticking to B and gets Ys again.
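Wiring the pieces of the sketch together shows this readaptation dynamic; the bootstrap context for the first round is an assumption of the sketch, since the post does not say how Ernest starts.

```python
env = ToggleEnvironment(toggle_round=10)
memory = {}
ca, ce = 'A', 'Y'  # arbitrary bootstrap context for the first round
for _ in range(30):
    a = choose_action(memory, ca, ce)
    response = env.respond(a)
    memorize(memory, ca, ce, a, response)
    ca, ce = a, response  # the new context for the next round
```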

Ernest can now adapt to at least two different environments, and can readapt if the environment changes from one to the other. He has two adaptation mechanisms: the first is based on learning behavioral patterns adapted to the short-term context, and the second on long-term reinforcement of these behavioral patterns.
