What is DQN?
DQN stands for Deep Q‑Network. It is a reinforcement‑learning algorithm that combines the classic Q‑learning method with a deep neural network to estimate the value of each action in a given situation (state). By using a neural network, DQN can handle very large or even visual state spaces, where a traditional Q‑table would be impractically large.
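To make that concrete, here is a minimal sketch of the "network that scores actions" idea, assuming PyTorch; the QNetwork name, layer sizes, and state/action dimensions are purely illustrative, not part of the algorithm itself:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (illustrative sizes)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),  # one output per possible action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Acting greedily: pick the action with the highest predicted Q-value.
q_net = QNetwork(state_dim=4, n_actions=2)   # e.g., a CartPole-sized problem
state = torch.randn(1, 4)                    # a dummy state for illustration
action = q_net(state).argmax(dim=1).item()
```

In practice the first layers would be convolutional when the state is an image, but the idea is the same: one forward pass gives a score for every action at once.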
Let's break it down
- Q‑learning: learns a function Q(s, a) that predicts how good taking action a in state s will be, measured by the expected future (discounted) reward.
- Deep neural network: replaces the huge Q‑table with a network that takes the state as input (e.g., an image) and outputs Q‑values for all possible actions.
- Experience replay: stores past experiences (state, action, reward, next state) in a buffer and samples random mini‑batches to train the network, which breaks the correlation between consecutive samples and stabilises learning.
- Target network: a second copy of the Q‑network that is updated more slowly; it provides stable targets for the learning updates, preventing wild swings (the sketch after this list shows how replay and the target network fit into one update step).
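Putting those pieces together, here is a rough sketch of one DQN update step, again assuming PyTorch and the QNetwork class above; the buffer size, discount factor, loss choice, and sync helper are illustrative assumptions, not fixed parts of the algorithm:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Replay buffer: a bounded deque of (state, action, reward, next_state, done) tuples.
buffer = deque(maxlen=100_000)

q_net = QNetwork(state_dim=4, n_actions=2)
target_net = QNetwork(state_dim=4, n_actions=2)
target_net.load_state_dict(q_net.state_dict())     # target starts as an exact copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma = 0.99                                        # discount factor

def train_step(batch_size: int = 32) -> None:
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)       # random sampling breaks correlation
    states, actions, rewards, next_states, dones = map(
        lambda xs: torch.as_tensor(xs, dtype=torch.float32), zip(*batch)
    )
    actions = actions.long()

    # Q-values the online network predicts for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target: reward plus the discounted best Q-value from the slow-moving target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * next_q * (1.0 - dones)

    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target() -> None:
    """Called every few thousand steps: copy online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```

A real implementation would also need an ε‑greedy exploration loop that fills the buffer, and the original paper uses a Huber‑style loss rather than plain MSE, but the core update looks like this.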
Why does it matter?
Because it lets computers learn good policies directly from raw, high‑dimensional inputs like video frames. Before DQN, reinforcement learning struggled with anything beyond tiny, handcrafted state representations. DQN showed that a single algorithm could master many Atari games at human‑level performance, opening the door to applying RL to vision‑based tasks, robotics, and more.
Where is it used?
- Playing video games (Atari, retro consoles) and board games.
- Training robots to grasp objects or navigate using camera feeds.
- Autonomous driving simulations where the car learns from visual road data.
- Finance: learning trading strategies from market price charts.
- Recommendation systems that adapt to user behavior over time.
Good things about it
- Works with raw pixels; no need for manual feature engineering.
- Off‑policy: it can learn from data generated by an older or exploratory behaviour policy (e.g., ε‑greedy), not just the policy currently being improved.
- Experience replay makes learning more data‑efficient and stable.
- Simple to implement once the basic components (network, replay buffer, target network) are set up.
Not-so-good things
- Requires a lot of training data and compute; learning can be slow.
- Sensitive to hyper‑parameters (learning rate, replay size, network architecture).
- Can become unstable or diverge if the target network is updated too frequently.
- Not well‑suited to continuous action spaces without modification; actor‑critic variants such as DDPG address this.
- Often overfits to the training environment and may not generalise well to slightly different settings.