← Back to projects

Data Science · Case study · 2024

Learning rewards from behavior

An academic project, implemented from scratch. Instead of learning how to act given a reward, Inverse Reinforcement Learning asks the opposite question: given how an agent does act, what reward was it pursuing? This write-up rebuilds the full Bayesian IRL pipeline — environment, forward solver, demonstrations, and the PolicyWalk sampler — and reports honestly what the experiments show.

PythonNumPyMarkov Decision ProcessesMCMCBayesian inferenceMatplotlib

01

The problem: inverting RL

Classic reinforcement learning takes a reward function and searches for a policy that maximizes it. Inverse reinforcement learning (IRL) flips the arrow: we observe an agent behaving — a sequence of states and the actions it chose — and we try to recover the reward function that best explains that behavior.

This matters whenever the reward is the hard part. Hand-specifying what we want an agent to value is notoriously brittle; it is often easier to demonstrate good behavior than to write down its objective. IRL is the formal route from demonstrations back to objectives.

The Bayesian framing, following Ramachandran & Amir (2007), treats the reward function as a random variable: we place a prior over reward functions, define a likelihood of the observed actions given a reward, and sample from the resulting posterior. The goal of this project was to build that entire chain by hand, in a controlled gridworld, to understand each moving part.

02

The environment: a gridworld MDP

Everything happens on an N×N grid modeled as a Markov Decision Process. Each cell is a state; the actions are the four moves L, R, U, D. The world is stochastic: with probability 0.8 a move goes where intended, and with a slip probability of 0.2 the agent veers to a perpendicular cell instead. The discount factor is γ = 0.9. Moves that would leave the grid keep the agent in place.

Structured gridworld MDP with hand-set rewards
The structured PartialMDP. Each cell carries a reward — a goal at H1 (+2.0), penalties at B4 and C7 (-1.0), and mild positive/negative zones. This is the ground-truth reward the agent will be assumed to follow.

To stress-test the method beyond one hand-crafted map, a second class RandomPartialMDP populates the grid with random semantic features — Treasure, Bomb, Mud, Water, Mountain — each mapped to a reward value. This gives an endless supply of fresh worlds to evaluate the algorithm on.

Randomly generated gridworld MDP
A randomly generated MDP. Feature counts (6 treasures, 12 bombs, 10 mud…) are scattered across the grid and converted to rewards, producing a varied reward landscape for testing.

03

Solving the forward problem: Policy Iteration

Before we can invert anything, we need to be able to solve the forward problem — given a reward, find the optimal policy. I implemented Policy Iteration with the three classic steps:

  • Evaluation — iteratively compute the value of every state under the current policy until it converges.
  • Improvement — at each state, switch to the action with the best expected value.
  • Iteration — repeat evaluation and improvement until the policy stops changing.

This solver is the workhorse reused throughout the rest of the project: every candidate reward proposed by the Bayesian search is turned into a policy through Policy Iteration.

04

Generating demonstrations: an imperfect tutor

IRL needs observed behavior to learn from. Rather than assume a perfect demonstrator, I built an imperfect tutor: it mostly follows the optimal policy, but with a 5% chance picks a random action (optimal_action_prob = 0.95). Starting from an initial state, it rolls out a trajectory and records the resulting sequence of (state, action) pairs.

This noise is deliberate and realistic — real demonstrators make mistakes — and it is exactly what the Bayesian likelihood is designed to tolerate, rather than requiring the observations to be perfectly consistent with a single reward.

05

The Bayesian framework

With an environment, a solver, and demonstrations in hand, the inference machinery has three pieces.

Priors over rewards

I implemented three candidate priors — UniformPrior (any reward within a bound is equally likely), GaussianPrior (rewards near zero are favored, penalizing extreme values), and BetaPrior. The Gaussian prior is the one carried through the experiments.

Likelihood: a Boltzmann policy

The likelihood links a reward to the observed actions. For a given reward we compute the Q-values (via a value-iteration-style sweep), then assume the tutor picks actions with a softmax / Boltzmann distribution over those Q-values: P(a | s) = exp(α·Q(s,a)) / Σ exp(α·Q(s,·)). The parameter α encodes how rational (deterministic) the agent is assumed to be. The likelihood of a set of observations is the product over all observed (state, action) pairs.

Posterior and acceptance ratio

The posterior is simply prior × likelihood. Because we will compare candidate rewards rather than normalize, the key quantity is the ratio of posteriors P(R₁ | O) / P(R₂ | O) — exactly what a Metropolis-Hastings sampler needs to decide whether to accept a new proposal.

06

PolicyWalk: sampling the reward space

PolicyWalk is the MCMC algorithm at the heart of Bayesian IRL. The reward space is too large to explore exhaustively, so we random-walk through it: start from a random reward, repeatedly propose a small perturbation to a neighboring reward, recompute the induced policy, and accept the proposal with probability min(1, posterior(R')/posterior(R)). Over many steps this traces out the posterior — concentrating on rewards that explain the demonstrations well.

The recovered reward vectors are directly interpretable: a positive value marks a state the agent is inferred to find desirable, a negative one a state it avoids. Reading them back against the grid is the qualitative sanity check that the inference is doing something sensible.

A simulated-annealing variant

I also implemented CoolingPolicyWalk, which adds simulated annealing: a temperature T that decays each iteration and reshapes the acceptance probability via exp(Δlog-posterior / T). Early on, high temperature permits exploratory, occasionally-worse moves; as T cools, the walk settles toward high-posterior regions. The aim is to escape poor local optima that a plain walk can get stuck in.

07

Evaluation: 0-1 policy loss

How do we score a recovered reward? Not by comparing reward numbers directly — many different rewards induce the same optimal behavior — but by behavior. The 0-1 policy loss is the fraction of observed (state, action) pairs where the policy implied by the recovered reward disagrees with what the tutor actually did. I compared plain PolicyWalk against the cooling variant across grid sizes 5, 8, and 10.

Policy loss vs iterations, grid size 5
Grid 5×5. PolicyWalk fluctuates around the demonstrations (dipping but also spiking to 0.6); the cooling variant stays flat at 0.5 — stable, but not improving.
Policy loss vs iterations, grid size 8
Grid 8×8. PolicyWalk briefly reaches zero loss — recovering a reward fully consistent with the demonstrations — before drifting back up. The cooling walk holds steady near zero.
Policy loss vs iterations, grid size 10
Grid 10×10. Both methods plateau at a high loss (~0.9) with little movement — on the largest grid, with these demonstrations, neither recovers behavior well.

The honest reading: the results are mixed and noisy, and that is itself informative. PolicyWalk is high-variance — it can find a zero-loss reward (grid 8) but does not reliably stay there. The cooling variant trades that volatility for stability, but on these runs stability sometimes means being stuck. On the 10×10 grid both struggle, illustrating how the reward posterior gets harder to explore as the state space grows and the demonstrations stay short.

08

Takeaways

This was a from-scratch build rather than a polished product, and the value is in what it forced me to understand end to end:

ComponentWhat it demonstrates
MDP + stochastic transitionsModeling sequential decisions under uncertainty
Policy IterationDynamic programming, the RL forward problem
Imperfect tutorRealistic, noisy demonstrations
Priors, Boltzmann likelihood, posteriorBayesian modeling from first principles
PolicyWalk (MCMC)Metropolis-Hastings sampling over a structured space
Simulated annealing variantExploration vs exploitation in samplers
0-1 policy loss across grid sizesBehavior-based evaluation, honest reporting

What I take away: implementing a method from the paper up — environment, solver, likelihood, sampler, evaluation — teaches far more than calling a library. The noisy results aren't a failure to hide; they show the real difficulty of reward inference, and reading them honestly is part of the work.