The objective is to find an optimal policy that maximizes the expected average reward per time step over an infinite horizon.
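
One common way to write this objective formally is sketched here. The notation is an assumption introduced for illustration and does not come from the text: $\pi$ denotes a policy, $s_t$ and $a_t$ the state and action at time step $t$, $r(s_t, a_t)$ the reward received at that step, and $\rho(\pi)$ the resulting average reward. The $\liminf$ is used because the plain limit need not exist for every policy.

\[
\rho(\pi) \;=\; \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \right],
\qquad
\pi^{*} \in \arg\max_{\pi} \rho(\pi).
\]

In general $\rho(\pi)$ can depend on the initial state; that dependence is suppressed here.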
