Data Science · Case study · 2024

Learning rewards from behavior

An academic project, implemented from scratch. Instead of learning how to act given a reward, Inverse Reinforcement Learning asks the opposite question: given how an agent does act, what reward was it pursuing? This write-up rebuilds the full Bayesian IRL pipeline — environment, forward solver, demonstrations, and the PolicyWalk sampler — and reports honestly what the experiments show.

PythonNumPyMarkov Decision ProcessesMCMCBayesian inferenceMatplotlib

View code

The problem: inverting RL

Classic reinforcement learning takes a reward function and searches for a policy that maximizes it. Inverse reinforcement learning (IRL) flips the arrow: we observe an agent behaving — a sequence of states and the actions it chose — and we try to recover the reward function that best explains that behavior.

This matters whenever the reward is the hard part. Hand-specifying what we want an agent to value is notoriously brittle; it is often easier to demonstrate good behavior than to write down its objective. IRL is the formal route from demonstrations back to objectives.

The Bayesian framing, following Ramachandran & Amir (2007), treats the reward function as a random variable: we place a prior over reward functions, define a likelihood of the observed actions given a reward, and sample from the resulting posterior. The goal of this project was to build that entire chain by hand, in a controlled gridworld, to understand each moving part.

The environment: a gridworld MDP

Everything happens on an N×N grid modeled as a Markov Decision Process. Each cell is a state; the actions are the four moves L, R, U, D. The world is stochastic: with probability 0.8 a move goes where intended, and with a slip probability of 0.2 the agent veers to a perpendicular cell instead. The discount factor is γ = 0.9. Moves that would leave the grid keep the agent in place.

Structured gridworld MDP with hand-set rewards — The structured PartialMDP. Each cell carries a reward — a goal at H1 (+2.0), penalties at B4 and C7 (-1.0), and mild positive/negative zones. This is the ground-truth reward the agent will be assumed to follow.

To stress-test the method beyond one hand-crafted map, a second class RandomPartialMDP populates the grid with random semantic features — Treasure, Bomb, Mud, Water, Mountain — each mapped to a reward value. This gives an endless supply of fresh worlds to evaluate the algorithm on.

Randomly generated gridworld MDP — A randomly generated MDP. Feature counts (6 treasures, 12 bombs, 10 mud…) are scattered across the grid and converted to rewards, producing a varied reward landscape for testing.

Solving the forward problem: Policy Iteration

Before we can invert anything, we need to be able to solve the forward problem — given a reward, find the optimal policy. I implemented Policy Iteration with the three classic steps:

Evaluation — iteratively compute the value of every state under the current policy until it converges.
Improvement — at each state, switch to the action with the best expected value.
Iteration — repeat evaluation and improvement until the policy stops changing.

This solver is the workhorse reused throughout the rest of the project: every candidate reward proposed by the Bayesian search is turned into a policy through Policy Iteration.

Generating demonstrations: an imperfect tutor

IRL needs observed behavior to learn from. Rather than assume a perfect demonstrator, I built an imperfect tutor: it mostly follows the optimal policy, but with a 5% chance picks a random action (optimal_action_prob = 0.95). Starting from an initial state, it rolls out a trajectory and records the resulting sequence of (state, action) pairs.

This noise is deliberate and realistic — real demonstrators make mistakes — and it is exactly what the Bayesian likelihood is designed to tolerate, rather than requiring the observations to be perfectly consistent with a single reward.

The Bayesian framework

With an environment, a solver, and demonstrations in hand, the inference machinery has three pieces.

Priors over rewards

I implemented three candidate priors — UniformPrior (any reward within a bound is equally likely), GaussianPrior (rewards near zero are favored, penalizing extreme values), and BetaPrior. The Gaussian prior is the one carried through the experiments.

Likelihood: a Boltzmann policy

The likelihood links a reward to the observed actions. For a given reward we compute the Q-values (via a value-iteration-style sweep), then assume the tutor picks actions with a softmax / Boltzmann distribution over those Q-values: P(a | s) = exp(α·Q(s,a)) / Σ exp(α·Q(s,·)). The parameter α encodes how rational (deterministic) the agent is assumed to be. The likelihood of a set of observations is the product over all observed (state, action) pairs.

Posterior and acceptance ratio

The posterior is simply prior × likelihood. Because we will compare candidate rewards rather than normalize, the key quantity is the ratio of posteriors P(R₁ | O) / P(R₂ | O) — exactly what a Metropolis-Hastings sampler needs to decide whether to accept a new proposal.

PolicyWalk: sampling the reward space

PolicyWalk is the MCMC algorithm at the heart of Bayesian IRL. The reward space is too large to explore exhaustively, so we random-walk through it: start from a random reward, repeatedly propose a small perturbation to a neighboring reward, recompute the induced policy, and accept the proposal with probability min(1, posterior(R')/posterior(R)). Over many steps this traces out the posterior — concentrating on rewards that explain the demonstrations well.

The recovered reward vectors are directly interpretable: a positive value marks a state the agent is inferred to find desirable, a negative one a state it avoids. Reading them back against the grid is the qualitative sanity check that the inference is doing something sensible.

A simulated-annealing variant

I also implemented CoolingPolicyWalk, which adds simulated annealing: a temperature T that decays each iteration and reshapes the acceptance probability via exp(Δlog-posterior / T). Early on, high temperature permits exploratory, occasionally-worse moves; as T cools, the walk settles toward high-posterior regions. The aim is to escape poor local optima that a plain walk can get stuck in.

Evaluation: 0-1 policy loss

How do we score a recovered reward? Not by comparing reward numbers directly — many different rewards induce the same optimal behavior — but by behavior. The 0-1 policy loss is the fraction of observed (state, action) pairs where the policy implied by the recovered reward disagrees with what the tutor actually did. I compared plain PolicyWalk against the cooling variant across grid sizes 5, 8, and 10.

Policy loss vs iterations, grid size 5 — Grid 5×5. PolicyWalk fluctuates around the demonstrations (dipping but also spiking to 0.6); the cooling variant stays flat at 0.5 — stable, but not improving.

Policy loss vs iterations, grid size 8 — Grid 8×8. PolicyWalk briefly reaches zero loss — recovering a reward fully consistent with the demonstrations — before drifting back up. The cooling walk holds steady near zero.

Policy loss vs iterations, grid size 10 — Grid 10×10. Both methods plateau at a high loss (~0.9) with little movement — on the largest grid, with these demonstrations, neither recovers behavior well.

The honest reading: the results are mixed and noisy, and that is itself informative. PolicyWalk is high-variance — it can find a zero-loss reward (grid 8) but does not reliably stay there. The cooling variant trades that volatility for stability, but on these runs stability sometimes means being stuck. On the 10×10 grid both struggle, illustrating how the reward posterior gets harder to explore as the state space grows and the demonstrations stay short.

Takeaways

This was a from-scratch build rather than a polished product, and the value is in what it forced me to understand end to end:

Component	What it demonstrates
MDP + stochastic transitions	Modeling sequential decisions under uncertainty
Policy Iteration	Dynamic programming, the RL forward problem
Imperfect tutor	Realistic, noisy demonstrations
Priors, Boltzmann likelihood, posterior	Bayesian modeling from first principles
PolicyWalk (MCMC)	Metropolis-Hastings sampling over a structured space
Simulated annealing variant	Exploration vs exploitation in samplers
0-1 policy loss across grid sizes	Behavior-based evaluation, honest reporting

What I take away: implementing a method from the paper up — environment, solver, likelihood, sampler, evaluation — teaches far more than calling a library. The noisy results aren't a failure to hide; they show the real difficulty of reward inference, and reading them honestly is part of the work.