Rubik’s Cube 2x2x2
The 2x2x2 Pocket Rubik’s Cube is a combinatorial puzzle with 3,674,160 states under the FRU move system (Front, Right, Upper face rotations). The goal is to reach the single solved state from a randomly scrambled configuration using the minimum number of moves.
This environment wraps the third-party Gymnasium environment rubiks-cube-222-v0 from the rubiks-cube-gym package via WrEnvGYM2MLPro, exposing it as a fully MLPro-compatible single-agent RL environment.
Observation Space
The observation is a single integer index in the range 0 to 3,674,159 (inclusive) that uniquely identifies the current cube configuration in a precomputed state dictionary. This compact representation avoids raw pixel or tile-array observations and lets the agent distinguish all reachable states unambiguously.
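The state count quoted above can be checked with a short calculation. The FRU moves never touch the down-back-left corner, so the remaining 7 corners can be permuted freely and 6 of them oriented freely, while the orientation of the last corner follows from the others. The sketch below is illustrative only; the helper name is_valid_observation is hypothetical and not part of the package.

```python
from math import factorial

# 7 movable corners can be arranged in 7! ways; 6 of them can each take
# 3 orientations, and the orientation of the 7th is then determined.
NUM_STATES = factorial(7) * 3 ** 6


def is_valid_observation(index: int) -> bool:
    """Hypothetical helper: True if index lies in the documented range."""
    return 0 <= index < NUM_STATES


print(NUM_STATES)
```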
Action Space
The agent has three discrete actions corresponding to clockwise quarter-turns of the three available faces:
| Action | Move | Description |
|---|---|---|
| 0 | F | Front face clockwise |
| 1 | R | Right face clockwise |
| 2 | U | Upper face clockwise |
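Because only clockwise quarter-turns are available, a counter-clockwise turn has to be expressed as three clockwise turns of the same face. The following sketch illustrates the action-to-move mapping from the table and how a sequence of actions could be undone; the names ACTION_TO_MOVE, actions_to_moves and inverse_actions are hypothetical helpers, not part of the environment.

```python
# Mapping taken from the action table above.
ACTION_TO_MOVE = {0: "F", 1: "R", 2: "U"}


def actions_to_moves(actions):
    """Render a list of action indices as a move string, e.g. 'F R U'."""
    return " ".join(ACTION_TO_MOVE[a] for a in actions)


def inverse_actions(actions):
    """Return the actions that undo the given sequence.

    Four clockwise quarter-turns of one face restore it, so three
    clockwise turns undo one. The sequence is undone in reverse order.
    """
    undo = []
    for a in reversed(actions):
        undo.extend([a, a, a])
    return undo
```

For example, undoing the two-move scramble F R takes six actions: R R R F F F.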
Reward
Two reward modes are available via the p_shaped_reward parameter of
RubiksCube222.
Default sparse reward (p_shaped_reward=False)
The agent receives +100 when the cube is fully solved and -1 on every other step. The -1 step penalty discourages unnecessary moves and encourages the agent to solve the cube as efficiently as possible. However, because a randomly acting agent almost never reaches the solved state by chance, the agent receives almost no positive learning signal in early training, making convergence extremely slow or infeasible for longer scrambles.
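As a rough illustration of the sparse scheme, the sketch below computes the per-step reward and the undiscounted return of an episode. It assumes the solving step itself earns +100 with no additional -1; the function names are hypothetical and the code does not reproduce the wrapped environment's implementation.

```python
SOLVED_REWARD = 100.0
STEP_PENALTY = -1.0


def sparse_reward(solved: bool) -> float:
    """Reward for a single step under the default sparse mode."""
    return SOLVED_REWARD if solved else STEP_PENALTY


def episode_return(num_steps: int, solved_on_last_step: bool) -> float:
    """Undiscounted sum of rewards over an episode of num_steps steps."""
    if num_steps <= 0:
        return 0.0
    if solved_on_last_step:
        # Every step but the last is penalised; the last one earns +100.
        return (num_steps - 1) * STEP_PENALTY + SOLVED_REWARD
    return num_steps * STEP_PENALTY
```

Under these assumptions a 5-step solve returns 96, while 5 steps without a solve return -5, which shows how little separates unsuccessful episodes from one another.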
Shaped reward with curriculum (p_shaped_reward=True)
Activates the built-in ShapedRewardCubeWrapper which replaces the sparse
signal with a dense, milestone-based reward. The reward at each step equals the
highest milestone score reached so far in the current episode, minus a small decay
penalty to discourage wandering (0.05 during the bottom layer phase, 0.001 during the
top layer phase).
This design gives the agent a meaningful learning signal from the very first episode, even before it has ever reached the solved state.
Reward Shaping Strategy
The ShapedRewardCubeWrapper decomposes the solve into two phases, bottom layer first and then top layer, mirroring the layer-by-layer method used by human beginners.
Progress is tracked across 10 milestones, numbered 0 to 9:
| Milestone | Description | Score |
|---|---|---|
| 0 | Yellow sticker visible on bottom face | 2.0 |
| 1 | 1st bottom corner fully solved | 9.0 |
| 2 | 2nd bottom corner fully solved | 23.0 |
| 3 | 3rd bottom corner solved (58.0 if 4th corner is in slot) | 46.0 |
| 4 | Bottom layer complete | 69.0 |
| 5 | Top layer permutation complete | 75.0 |
| 6 | 1st top corner oriented (white facing up) | 78.0 |
| 7 | 2nd top corner oriented | 81.0 |
| 8 | 3rd top corner oriented | 84.0 |
| 9 | Cube fully solved | 87.0 |
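The milestone scores can be combined with the decay penalty roughly as follows. This is a minimal sketch under several assumptions that the text does not confirm: the penalty is a flat per-step deduction, the top layer phase begins once milestone 4 is reached, and the 58.0 variant of milestone 3 is left out. The function shaped_reward is hypothetical and is not the wrapper's actual code.

```python
# Scores copied from the milestone table above.
MILESTONE_SCORES = {
    0: 2.0, 1: 9.0, 2: 23.0, 3: 46.0, 4: 69.0,
    5: 75.0, 6: 78.0, 7: 81.0, 8: 84.0, 9: 87.0,
}

BOTTOM_LAYER_COMPLETE = 4   # assumed boundary between the two phases
BOTTOM_PHASE_DECAY = 0.05
TOP_PHASE_DECAY = 0.001


def shaped_reward(best_milestone: int) -> float:
    """Score of the highest milestone reached so far, minus the phase decay."""
    if best_milestone < BOTTOM_LAYER_COMPLETE:
        decay = BOTTOM_PHASE_DECAY
    else:
        decay = TOP_PHASE_DECAY
    return MILESTONE_SCORES[best_milestone] - decay
```

Because the reward tracks the best milestone reached rather than the current one, temporarily breaking a solved corner does not lower the score within an episode under this reading.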
Three additional mechanisms accelerate learning:
Checkpoint save/restore — when the agent reaches milestone 3 or higher and then fails to make progress within the allowed step budget, the cube is automatically reverted to that milestone state. This prevents the agent from accidentally undoing its own progress and forces it to focus only on the unsolved part of the cube.
Dynamic patience — the step budget for the top layer starts at 20 and grows by 5 each time the agent hits the limit, up to a maximum of 50. This adapts the allowed exploration window to the actual difficulty the agent is experiencing at each training stage rather than using a fixed hardcoded limit.
Curriculum learning — the scramble length starts at 1 (one move away from solved) and increases by 1 after every 3 successful solves, up to a maximum of 15. This ensures the agent is never overwhelmed by a problem far beyond its current capability and always has a realistic chance of receiving a positive reward signal.
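The dynamic patience and curriculum rules described above can be sketched as two small counters. The class and method names are hypothetical, and the sketch assumes that successful solves are counted cumulatively rather than consecutively, which the description leaves open.

```python
class DynamicPatience:
    """Step budget that starts at 20 and grows by 5 per limit hit, up to 50."""

    def __init__(self, start=20, step=5, maximum=50):
        self.budget = start
        self.step = step
        self.maximum = maximum

    def on_limit_hit(self):
        """Call when the agent exhausts the budget; returns the new budget."""
        self.budget = min(self.budget + self.step, self.maximum)
        return self.budget


class Curriculum:
    """Scramble length that starts at 1 and grows by 1 every 3 solves, up to 15."""

    def __init__(self, start=1, solves_per_level=3, maximum=15):
        self.scramble_length = start
        self.solves_per_level = solves_per_level
        self.maximum = maximum
        self._solves = 0

    def on_solve(self):
        """Call after each successful solve; returns the scramble length to use next."""
        self._solves += 1
        if self._solves % self.solves_per_level == 0:
            self.scramble_length = min(self.scramble_length + 1, self.maximum)
        return self.scramble_length
```

With these defaults the budget reaches its cap after six limit hits, and the scramble length reaches 15 after 42 solves.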
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| p_shaped_reward | bool | False | Activates ShapedRewardCubeWrapper |
| p_visualize | bool | True | Opens a live rendering window |
| p_logging | int | LOG_ALL | MLPro log level |
Usage Example
from mlpro_int_gymnasium.envs import RubiksCube222
# Default sparse reward
env = RubiksCube222(p_shaped_reward=False, p_visualize=True)
# With shaped reward and curriculum learning
env = RubiksCube222(p_shaped_reward=True, p_visualize=True)
Cross Reference
Howto RL AGENT 010 — Train a SB3 PPO agent on this environment