Blog — mkelly.me

External Tilesets with Tiled and Phaser

March 3, 2019 phaser gamedev

Tiled is a popular tilemap editor, and Phaser has great built-in support for it. One feature of Tiled that Phaser doesn't support is external tilesets.

In Tiled, a tileset can either be internal, meaning all the data for the tileset is included in the tilemap itself, or external, meaning that the tileset is a standalone file separate from the tilemap. The main benefit of external tilesets is that they can be shared between maps. You can update and change the tileset without having to update per-tilemap copies everywhere.

Phaser, however, requires that tilesets be stored internally in the tilemaps they're used in. I finally ran into a point where I wanted multiple tilemaps in my game and wrote a custom loader that supports external tilesets.

It's called phaser-tiled-json-external-loader, and you can install it via NPM or manually download a JavaScript bundle to load in your game's HTML file. The README has more information on how to install and use the library. I've also got a Glitch project showing the library in action:

Why doesn't Phaser support external tilesets?

I can only speculate based on the code¹. Phaser loads tilemaps as JSON, and doesn't actually parse that JSON until you attempt to create a tilemap object during a scene's create phase. While parsing the Tiled JSON, it tries to load each tileset:

//  name, firstgid, width, height, margin, spacing, properties
var set = json.tilesets[i];

if (set.source)
{
    console.warn('Phaser can\'t load external tilesets. Use the Embed Tileset button and then export the map again.');
}

By the time this code runs, we're past the preload step of the scene, and the API for creating tilemaps isn't asynchronous, so going back and loading another external JSON file isn't really an option.

In my opinion, a "proper" fix would be similar to how the images for tilemaps are handled. Even with internal tilesets, the images used in the tilesets must be loaded separately and passed when creating a tileset:

const scene = {
  preload() {
    // The tileset image is not automagically loaded by Phaser
    this.load.image('tilesetImage', 'https://cdn.glitch.com/1780c601-5e7d-42f6-8757-c55452affe65%2Ftiles.png?1551608607854');
    this.load.tilemapTiled('tilemap', 'tilemap.json');
  },

  create() {
    const tilemap = this.make.tilemap({key: 'tilemap'});
    // tilesetImage here is referring to the manually-loaded tileset image above
    const tileset = tilemap.addTilesetImage('tiles', 'tilesetImage');
    tilemap.createStaticLayer('layer1', tileset, 0, 0);
  },
};

Similarly, external tilesets should probably be a new type of thing that you could load in the preload step and associate with one (or many) tilemaps.

I tried to figure out how to write a patch like this to fix Phaser directly, but there are multiple types of tilemaps and tilesets supported in Phaser, and I don't really understand the internals well enough yet.

So how does the loader work?

If Phaser only supports internal tilesets, and doesn't parse the tilemap until the create step, what if we loaded and inserted the external tilesets into our tilemaps before Phaser tried parsing them? Some people on the Phaser Discord recommended I write a preprocessor to do this (as they had been doing for a while), but I wanted to build something a bit more broadly reusable.

I spent a few hours reading the code for how loaders work in Phaser and found out that there are things called MultiFile loaders that support loading dependent files based on the contents of a manifest-like file. Using the MultiAtlasFile loader as a base, I wrote a new loader that:

Loads the tilemap JSON
Finds all tilesets that have a source property
Processes each source property as relative to the tilemap's URL to get the URL for each tileset
Loads each external tileset
Inserts each loaded tileset back into the tilemap JSON
Adds the modified tilemap JSON into the tilemap cache
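The URL-resolution and re-insertion steps above can be sketched roughly as follows. This is not the library's actual source: resolveTilesetUrl and embedTilesets are hypothetical helper names, the example URLs are made up, and the sketch assumes the tilemap URL is absolute and that each external entry in the Tiled JSON carries only firstgid and source.

```javascript
// Hypothetical helpers illustrating the steps; not the loader's real code.

// Resolve a tileset's `source` relative to the tilemap's URL.
// Assumes tilemapUrl is absolute, since the URL constructor needs an absolute base.
function resolveTilesetUrl(tilemapUrl, source) {
  return new URL(source, tilemapUrl).href;
}

// Return a copy of the tilemap JSON with each external tileset entry replaced
// by its loaded data. `loadedBySource` maps a `source` string to parsed tileset JSON.
function embedTilesets(tilemapJson, loadedBySource) {
  const tilesets = tilemapJson.tilesets.map((entry) => {
    if (!entry.source) {
      return entry; // Already an internal tileset
    }
    const data = loadedBySource[entry.source];
    if (!data) {
      throw new Error(`Tileset not loaded: ${entry.source}`);
    }
    // Keep firstgid from the tilemap and drop `source` so the entry looks internal.
    return {...data, firstgid: entry.firstgid};
  });
  return {...tilemapJson, tilesets};
}

// Usage with made-up data
const tilemap = {
  tilesets: [{firstgid: 1, source: '../tilesets/tiles.json'}],
};
const url = resolveTilesetUrl(
  'https://example.com/maps/level1.json',
  tilemap.tilesets[0].source
);
const embedded = embedTilesets(tilemap, {
  '../tilesets/tiles.json': {name: 'tiles', image: 'tiles.png', tilewidth: 32, tileheight: 32},
});
```

Returning a copy rather than mutating the original JSON is a choice made for this sketch; a real loader could just as well modify the parsed data in place before caching it.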

I tested with my own game and it seemed to work fine. The remaining steps were to add a webpage-ready bundle for projects that aren't using NPM or a JavaScript bundler, write instructions, and publish the package on NPM.

Caveats

There are still some caveats to this method of loading external tilesets:

Tilesets are duplicated between tilemaps in memory. Sharing references to tilesets would require some more complex coordination in the loader, or a more detailed refactor of how Phaser handles tilesets.
I don't actually know if Phaser will avoid re-requesting a tileset if it is referenced by more than one tilemap. Presumably the browser will cache the request, at least.
The loader is limited to JSON tilemaps and tilesets. TMX-formatted tilesets are not supported.

But it works for me and I did it for free so whooooooooooo caressssssss

¹ I mean, I actually could just ask the maintainer if I really wanted to.

Phaser Tutorial Series: Finite State Machine

February 23, 2019 mozilla phaser gamedev

I've been working on a game using Phaser in my spare time.

One thing that's made adding new features really easy is using finite-state machines to model behavior. Almost everything in the animation above is backed by a state machine: the player, the platform, the grappling hook, the statue, and the fireballs.

This post is going to assume some familiarity with the basics of Phaser, such as the preload/create/update steps, Arcade physics, and keyboard input. You may also be able to follow along if you're not familiar with Phaser, but it's okay if not! This use of state machines isn't specific to Phaser.

What is a finite-state machine? Fuck that let's make games

Let's start with a fairly empty example project. Here it is on Glitch. You can use the remix button to create your own copy and follow along with the tutorial as we go:

Pretty much all of our work is happening in client.js. It starts out looking something like this:

/* global Phaser */

const config = {
  type: Phaser.AUTO,
  width: 400,
  height: 300,
  pixelArt: true,
  zoom: 2,
  physics: {
    default: 'arcade'
  },
  scene: {
    preload() {
      this.load.spritesheet('hero', 'https://cdn.glitch.com/59aa1c5f-c16d-41a1-bfd2-09072e84a538%2Fhero.png?1551136698770', {
        frameWidth: 32,
        frameHeight: 32,
      });
      this.load.image('bg', 'https://cdn.glitch.com/59aa1c5f-c16d-41a1-bfd2-09072e84a538%2Fbg.png?1551136995353');
    },

    create() {
      // Static background
      this.add.image(200, 200, 'bg');

      // The movable character
      this.hero = this.physics.add.sprite(200, 150, 'hero', 0);
    },

    update() {

    },
  }
};

window.game = new Phaser.Game(config);

We're loading some images in the preload step, and adding the background and hero sprite in the create step. The hero is drawn on the background, but nothing else happens.

MAKE IT WALK

Let's add a this.keys variable for reading input from the keyboard. We can use that in the update method to check which keys are being pressed and set the hero's velocity appropriately:

@@ -19,6 +19,8 @@
     },

     create() {
+      this.keys = this.input.keyboard.createCursorKeys();
+
       // Static background
       this.add.image(200, 200, 'bg');

@@ -27,7 +29,20 @@
     },

     update() {
-
+      // Stop movement from last update
+      this.hero.setVelocity(0);
+
+      // Set new velocity based on input
+      if (this.keys.up.isDown) {
+        this.hero.setVelocityY(-100);
+      } else if (this.keys.down.isDown) {
+        this.hero.setVelocityY(100);
+      }
+      if (this.keys.left.isDown) {
+        this.hero.setVelocityX(-100);
+      } else if (this.keys.right.isDown) {
+        this.hero.setVelocityX(100);
+      }
     },
   }
 };

MAKE IT LOOK LIKE IT'S WALKING

Now the hero is moving about the map, but it doesn't look like he's walking. To do that, we'll need to do two things:

Define some animations from our sprite sheet in the create function. Our sheet is split into 32x32 pixel squares, so we can use generateFrameNumbers to generate animation data by giving it start and end indexes for the animation frames. These are numbered from top left to bottom right.
Trigger the proper animations in the update function. We also track whether the player is moving or not, and if they aren't, we stop the current animation to stop the player from walking. Note the true passed to the play function: this tells Phaser to not restart the animation if it's already playing.

@@ -26,22 +26,61 @@

       // The movable character
       this.hero = this.physics.add.sprite(200, 150, 'hero', 0);
+
+      // Animation definitions
+      this.anims.create({
+        key: 'walk-down',
+        frameRate: 8,
+        repeat: -1,
+        frames: this.anims.generateFrameNumbers('hero', {start: 0, end: 3}),
+      });
+      this.anims.create({
+        key: 'walk-right',
+        frameRate: 8,
+        repeat: -1,
+        frames: this.anims.generateFrameNumbers('hero', {start: 4, end: 7}),
+      });
+      this.anims.create({
+        key: 'walk-up',
+        frameRate: 8,
+        repeat: -1,
+        frames: this.anims.generateFrameNumbers('hero', {start: 8, end: 11}),
+      });
+      this.anims.create({
+        key: 'walk-left',
+        frameRate: 8,
+        repeat: -1,
+        frames: this.anims.generateFrameNumbers('hero', {start: 12, end: 15}),
+      });
     },

     update() {
       // Stop movement from last update
+      let moving = false;
       this.hero.setVelocity(0);

       // Set new velocity based on input
       if (this.keys.up.isDown) {
         this.hero.setVelocityY(-100);
+        this.hero.anims.play('walk-up', true);
+        moving = true;
       } else if (this.keys.down.isDown) {
         this.hero.setVelocityY(100);
+        this.hero.anims.play('walk-down', true);
+        moving = true;
       }
       if (this.keys.left.isDown) {
         this.hero.setVelocityX(-100);
+        this.hero.anims.play('walk-left', true);
+        moving = true;
       } else if (this.keys.right.isDown) {
         this.hero.setVelocityX(100);
+        this.hero.anims.play('walk-right', true);
+        moving = true;
+      }
+
+      if (!moving) {
+        this.hero.anims.stop();
       }
     },
   }

MAKE IT UNNECESSARILY VIOLENT

Next, let's make the player swing their sword when we press the space key. This actually involves a few steps:

Check if the space key is pressed.
Stop player movement while the sword is being swung. We'll need to know if the hero is currently swinging their sword, so we'll add a swinging variable on this.hero that determines if the swinging animation is still playing.
Determine which direction the player is facing. Figuring out the direction requires that we add a new variable called direction to keep track between walking and swinging. Storing this on the this.hero object makes it clear that the direction isn't for, say, an enemy we may add later.
Play the sword-swinging animation for the appropriate direction.
Once the animation is done playing, switch back to the non-sword-swinging sprites and allow movement again.

Doing all of this with the movement code is tricky, and difficult to split into single code changes. You may want to take a bit to look over the diff to understand the changes:

@@ -26,6 +26,8 @@

       // The movable character
       this.hero = this.physics.add.sprite(200, 150, 'hero', 0);
+      this.hero.direction = 'down';
+      this.hero.swinging = false;

       // Animation definitions
       this.anims.create({
@@ -52,6 +54,32 @@
         repeat: -1,
         frames: this.anims.generateFrameNumbers('hero', {start: 12, end: 15}),
       });
+
+      // NOTE: Sword animations do not repeat
+      this.anims.create({
+        key: 'swing-down',
+        frameRate: 8,
+        repeat: 0,
+        frames: this.anims.generateFrameNumbers('hero', {start: 16, end: 19}),
+      });
+      this.anims.create({
+        key: 'swing-up',
+        frameRate: 8,
+        repeat: 0,
+        frames: this.anims.generateFrameNumbers('hero', {start: 20, end: 23}),
+      });
+      this.anims.create({
+        key: 'swing-right',
+        frameRate: 8,
+        repeat: 0,
+        frames: this.anims.generateFrameNumbers('hero', {start: 24, end: 27}),
+      });
+      this.anims.create({
+        key: 'swing-left',
+        frameRate: 8,
+        repeat: 0,
+        frames: this.anims.generateFrameNumbers('hero', {start: 28, end: 31}),
+      });
     },

     update() {
@@ -59,28 +87,43 @@
       let moving = false;
       this.hero.setVelocity(0);

-      // Set new velocity based on input
-      if (this.keys.up.isDown) {
-        this.hero.setVelocityY(-100);
-        this.hero.anims.play('walk-up', true);
-        moving = true;
-      } else if (this.keys.down.isDown) {
-        this.hero.setVelocityY(100);
-        this.hero.anims.play('walk-down', true);
-        moving = true;
-      }
-      if (this.keys.left.isDown) {
-        this.hero.setVelocityX(-100);
-        this.hero.anims.play('walk-left', true);
-        moving = true;
-      } else if (this.keys.right.isDown) {
-        this.hero.setVelocityX(100);
-        this.hero.anims.play('walk-right', true);
-        moving = true;
-      }
-
-      if (!moving) {
-        this.hero.anims.stop();
+      // If we're swinging a sword, wait for the animation to finish
+      if (!this.hero.swinging) {
+        // Swinging a sword overrides movement
+        if (this.keys.space.isDown) {
+          this.hero.swinging = true;
+          this.hero.anims.play(`swing-${this.hero.direction}`, true);
+          this.hero.once('animationcomplete', () => {
+            this.hero.anims.play(`walk-${this.hero.direction}`, true);
+            this.hero.swinging = false;
+          });
+        } else {
+          // Set new velocity based on input
+          if (this.keys.up.isDown) {
+            this.hero.setVelocityY(-100);
+            this.hero.direction = 'up';
+            moving = true;
+          } else if (this.keys.down.isDown) {
+            this.hero.setVelocityY(100);
+            this.hero.direction = 'down';
+            moving = true;
+          }
+          if (this.keys.left.isDown) {
+            this.hero.setVelocityX(-100);
+            this.hero.direction = 'left';
+            moving = true;
+          } else if (this.keys.right.isDown) {
+            this.hero.setVelocityX(100);
+            this.hero.direction = 'right';
+            moving = true;
+          }
+
+          if (!moving) {
+            this.hero.anims.stop();
+          } else {
+            this.hero.anims.play(`walk-${this.hero.direction}`, true);
+          }
+        }
       }
     },
   }

MAKE IT DO MORE?

Okay, so the hero is now swinging their sword. Next we want to add the ability for them to jump, or maybe we want to handle collision detection, or maybe add some enemy logic to the update loop, or... well, you get the idea. We've barely added some basic functionality to the game and already the update loop is getting difficult to manage.

The core problem here is that, to add some new feature to the player, like a new weapon or ability, we need to think about every other thing the player can do. What happens if the player uses a hookshot while moving? What if they use a jump power while moving? One may freeze the player in place while the other retains their momentum. There's too much state to keep in our heads.

Enter state machines. The idea is to model the player's behavior by assigning them a single "state" to be in. When a player is in a "state", they can "transition" to another state if a condition is met, which replaces the current state with a new one. If we design our states and transitions correctly, we can control the amount of info we need to keep in our head when writing new features.

I find the state machine from the Wikipedia article on state machines to be a great example:

A state machine modelling a turnstile — A state machine diagram for a subway turnstile. The "Locked" state is the initial state.

The diagram above illustrates a subway turnstile that is locked until you drop a coin into it, which unlocks it and allows one person to walk through before becoming locked again. The state machine has two states:

Locked: The turnstile is locked. Pushing it will not let you through, and it remains in the "Locked" state, but inserting a coin will transition to the "Unlocked" state.
Unlocked: The turnstile is unlocked. Inserting another coin will keep the turnstile "Unlocked", but pushing it will allow you through and transition back to the "Locked" state.
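As a small illustration, the two states and their transitions can be written as a lookup table. This is only a sketch of the turnstile described above; the turnstile table and nextState function are names made up for this example.

```javascript
// Transition table: current state -> event -> next state.
const turnstile = {
  Locked: {coin: 'Unlocked', push: 'Locked'},
  Unlocked: {coin: 'Unlocked', push: 'Locked'},
};

// Look up the state we end up in after an event.
function nextState(state, event) {
  return turnstile[state][event];
}

// Walk through a few events, starting from the initial "Locked" state.
let state = 'Locked';
const visited = [state];
for (const event of ['push', 'coin', 'coin', 'push']) {
  state = nextState(state, event);
  visited.push(state);
}
// visited is now: Locked, Locked, Unlocked, Unlocked, Locked
```

Everything the turnstile can do is in that one table, which is the property we want for the hero as well.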

In the same way that this diagram models the behavior of the real turnstile, we can create a similar diagram that models how we want our player to behave:

A state machine modelling the hero — I am not the best diagram-maker.

The entire diagram itself is a little messy, but the point is that this model allows us to implement each state in isolation, resulting in cleaner, easier-to-maintain code.

Coding a State Machine

We're going to create a StateMachine class that handles storing the current active state, storing a list of all possible states, and transitioning from the current state to a new state. But transitioning alone doesn't really do anything.

Besides transitioning, we also want to:

Run a function when we first transition to a new state. This lets us modify the hero when we transition between states, like starting the attack animation when we enter the swing state. We'll call this the enter function.
Run a function during each update call depending on the current state. We'll call this the execute function.

There are several options for how to represent a state in our code. One is to use classes, which allows us to inherit from a base State class to get default enter and execute functions.

@@ -1,5 +1,46 @@
 /* global Phaser */

+class StateMachine {
+  constructor(initialState, possibleStates, stateArgs=[]) {
+    this.initialState = initialState;
+    this.possibleStates = possibleStates;
+    this.stateArgs = stateArgs;
+    this.state = null;
+
+    // State instances get access to the state machine via this.stateMachine.
+    for (const state of Object.values(this.possibleStates)) {
+      state.stateMachine = this;
+    }
+  }
+
+  step() {
+    // On the first step, the state is null and we need to initialize the first state.
+    if (this.state === null) {
+      this.state = this.initialState;
+      this.possibleStates[this.state].enter(...this.stateArgs);
+    }
+
+    // Run the current state's execute
+    this.possibleStates[this.state].execute(...this.stateArgs);
+  }
+
+  transition(newState, ...enterArgs) {
+    this.state = newState;
+    this.possibleStates[this.state].enter(...this.stateArgs, ...enterArgs);
+  }
+}
+
+class State {
+  enter() {
+
+  }
+
+  execute() {
+
+  }
+}
+
+
 const config = {
   type: Phaser.AUTO,
   width: 400,

There are two things to note in the code above:

possibleStates is an object whose keys refer to the state name, and whose values are instances of the State class (or subclasses). We assign the stateMachine property on each instance so that they can call this.stateMachine.transition whenever they want to trigger a transition.
stateArgs is a list of arguments passed to the enter and execute functions. This lets us pass commonly-used values (such as the hero or the current Phaser scene) to the state methods.
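One detail that is easy to miss: transition also accepts extra arguments after the state name, and these are passed to the new state's enter function after the stateArgs. Here is a minimal sketch of that outside of Phaser. The StateMachine is a condensed copy of the class above, while the plain hero object, the hurt state, and the incomingDamage field are hypothetical and exist only for this example.

```javascript
// Condensed copy of the StateMachine class from the diff above.
class StateMachine {
  constructor(initialState, possibleStates, stateArgs = []) {
    this.initialState = initialState;
    this.possibleStates = possibleStates;
    this.stateArgs = stateArgs;
    this.state = null;
    for (const state of Object.values(this.possibleStates)) {
      state.stateMachine = this;
    }
  }

  step() {
    if (this.state === null) {
      this.state = this.initialState;
      this.possibleStates[this.state].enter(...this.stateArgs);
    }
    this.possibleStates[this.state].execute(...this.stateArgs);
  }

  transition(newState, ...enterArgs) {
    this.state = newState;
    this.possibleStates[this.state].enter(...this.stateArgs, ...enterArgs);
  }
}

// Hypothetical states that operate on a plain object instead of a sprite.
class IdleState {
  enter(hero) {}

  execute(hero) {
    if (hero.incomingDamage > 0) {
      // Arguments after the state name are forwarded to enter()
      this.stateMachine.transition('hurt', hero.incomingDamage);
    }
  }
}

class HurtState {
  // Receives the stateArgs (hero) first, then the extra transition argument
  enter(hero, damage) {
    hero.health -= damage;
    hero.incomingDamage = 0;
  }

  execute(hero) {
    this.stateMachine.transition('idle');
  }
}

const hero = {health: 10, incomingDamage: 3};
const machine = new StateMachine('idle', {
  idle: new IdleState(),
  hurt: new HurtState(),
}, [hero]);

machine.step(); // Enters idle, then idle's execute transitions to hurt
machine.step(); // hurt's execute transitions back to idle
```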

With this state machine implementation, we can replace our nest of if statements with classes for each state we modeled on our diagram:

@@ -27,7 +68,14 @@
       // The movable character
       this.hero = this.physics.add.sprite(200, 150, 'hero', 0);
       this.hero.direction = 'down';
-      this.hero.swinging = false;
+
+      // The state machine managing the hero
+      this.stateMachine = new StateMachine('idle', {
+        idle: new IdleState(),
+        move: new MoveState(),
+        swing: new SwingState(),
+      }, [this, this.hero]);
+

       // Animation definitions
       this.anims.create({
@@ -83,50 +131,79 @@
     },

     update() {
-      // Stop movement from last update
-      let moving = false;
-      this.hero.setVelocity(0);
-
-      // If we're swinging a sword, wait for the animation to finish
-      if (!this.hero.swinging) {
-        // Swinging a sword overrides movement
-        if (this.keys.space.isDown) {
-          this.hero.swinging = true;
-          this.hero.anims.play(`swing-${this.hero.direction}`, true);
-          this.hero.once('animationcomplete', () => {
-            this.hero.anims.play(`walk-${this.hero.direction}`, true);
-            this.hero.swinging = false;
-          });
-        } else {
-          // Set new velocity based on input
-          if (this.keys.up.isDown) {
-            this.hero.setVelocityY(-100);
-            this.hero.direction = 'up';
-            moving = true;
-          } else if (this.keys.down.isDown) {
-            this.hero.setVelocityY(100);
-            this.hero.direction = 'down';
-            moving = true;
-          }
-          if (this.keys.left.isDown) {
-            this.hero.setVelocityX(-100);
-            this.hero.direction = 'left';
-            moving = true;
-          } else if (this.keys.right.isDown) {
-            this.hero.setVelocityX(100);
-            this.hero.direction = 'right';
-            moving = true;
-          }
-
-          if (!moving) {
-            this.hero.anims.stop();
-          } else {
-            this.hero.anims.play(`walk-${this.hero.direction}`, true);
-          }
-        }
-      }
+      this.stateMachine.step();
     },
   }
 };

+class IdleState extends State {
+  enter(scene, hero) {
+    hero.setVelocity(0);
+    hero.anims.play(`walk-${hero.direction}`);
+    hero.anims.stop();
+  }
+
+  execute(scene, hero) {
+    const {left, right, up, down, space} = scene.keys;
+
+    // Transition to swing if pressing space
+    if (space.isDown) {
+      this.stateMachine.transition('swing');
+      return;
+    }
+
+    // Transition to move if pressing a movement key
+    if (left.isDown || right.isDown || up.isDown || down.isDown) {
+      this.stateMachine.transition('move');
+      return;
+    }
+  }
+}
+
+class MoveState extends State {
+  execute(scene, hero) {
+    const {left, right, up, down, space} = scene.keys;
+
+    // Transition to swing if pressing space
+    if (space.isDown) {
+      this.stateMachine.transition('swing');
+      return;
+    }
+
+    // Transition to idle if not pressing movement keys
+    if (!(left.isDown || right.isDown || up.isDown || down.isDown)) {
+      this.stateMachine.transition('idle');
+      return;
+    }
+
+    hero.setVelocity(0);
+    if (up.isDown) {
+      hero.setVelocityY(-100);
+      hero.direction = 'up';
+    } else if (down.isDown) {
+      hero.setVelocityY(100);
+      hero.direction = 'down';
+    }
+    if (left.isDown) {
+      hero.setVelocityX(-100);
+      hero.direction = 'left';
+    } else if (right.isDown) {
+      hero.setVelocityX(100);
+      hero.direction = 'right';
+    }
+
+    hero.anims.play(`walk-${hero.direction}`, true);
+  }
+}
+
+class SwingState extends State {
+  enter(scene, hero) {
+    hero.setVelocity(0);
+    hero.anims.play(`swing-${hero.direction}`);
+    hero.once('animationcomplete', () => {
+      this.stateMachine.transition('idle');
+    });
+  }
+}
+
 window.game = new Phaser.Game(config);

This is a lot to unpack. Some highlights of the changes:

We can remove the swinging variable now, as it's been effectively replaced by swing being the current state. Since SwingState doesn't do anything in its execute function, there's no fear of accidentally moving during the swing.
Note how the transitions from the state machine we modeled above typically appear as early if statements that transition and return if their condition passes.
Some code is repeated, such as checking if the spacebar is being pressed and transitioning to the swing state. You could factor these out to avoid repeated code, but I find that ends up coupling code in a way that is harder to maintain vs keeping them separate.

Okay but why?

At first glance it may seem that the state machine code is longer than the old update method and more complex, and to some degree this is true. The reduction in complexity is not due to less code, but is instead due to less cognitive load. When we're working on the move state, we don't have to think about interfering with the idle and swing state logic as much as we previously did.

Let's say we want to add a dash in the current direction when the Shift key is pressed. Under the old code, we'd have to figure out where in the nest of if statements to check the shift key, and then probably add another level of conditions to avoid moving or attacking during a dash. With a state machine, we can add a new dash state and modify the existing states that can validly transition to a dash:

@@ -74,6 +74,7 @@
         idle: new IdleState(),
         move: new MoveState(),
         swing: new SwingState(),
+        dash: new DashState(),
       }, [this, this.hero]);


@@ -144,7 +145,7 @@
   }

   execute(scene, hero) {
-    const {left, right, up, down, space} = scene.keys;
+    const {left, right, up, down, space, shift} = scene.keys;

     // Transition to swing if pressing space
     if (space.isDown) {
@@ -152,6 +153,12 @@
       return;
     }

+    // Transition to dash if pressing shift
+    if (shift.isDown) {
+      this.stateMachine.transition('dash');
+      return;
+    }
+
     // Transition to move if pressing a movement key
     if (left.isDown || right.isDown || up.isDown || down.isDown) {
       this.stateMachine.transition('move');
@@ -162,7 +169,7 @@

 class MoveState extends State {
   execute(scene, hero) {
-    const {left, right, up, down, space} = scene.keys;
+    const {left, right, up, down, space, shift} = scene.keys;

     // Transition to swing if pressing space
     if (space.isDown) {
@@ -170,6 +177,12 @@
       return;
     }

+    // Transition to dash if pressing shift
+    if (shift.isDown) {
+      this.stateMachine.transition('dash');
+      return;
+    }
+
     // Transition to idle if not pressing movement keys
     if (!(left.isDown || right.isDown || up.isDown || down.isDown)) {
       this.stateMachine.transition('idle');
@@ -204,6 +217,32 @@
       this.stateMachine.transition('idle');
     });
   }
+}
+
+class DashState extends State {
+  enter(scene, hero) {
+    hero.setVelocity(0);
+    hero.anims.play(`swing-${hero.direction}`);
+    switch (hero.direction) {
+      case 'up':
+        hero.setVelocityY(-300);
+        break;
+      case 'down':
+        hero.setVelocityY(300);
+        break;
+      case 'left':
+        hero.setVelocityX(-300);
+        break;
+      case 'right':
+        hero.setVelocityX(300);
+        break;
+    }
+
+    // Wait a third of a second and then go back to idle
+    scene.time.delayedCall(300, () => {
+      this.stateMachine.transition('idle');
+    });
+  }
 }

 window.game = new Phaser.Game(config);

Is this fast?

No idea. I haven't hit issues with my own game. I'm not terribly concerned about performance as my game is just a demo right now, so take that with a grain of salt.

I don't think there are any glaring issues with it performance-wise, but I suspect having a bunch of state machines running each update loop might start to cause issues with their overhead. Some clever engineering could reuse states or even state machines between sprites, which might help.

What else could we do with this?

There are a lot of ideas I haven't touched upon here that are worth exploring:

If states are classes, it stands to reason you can make more than one instance and accept parameters in their constructor.
States could also subclass other states to share common logic or code between them.
In my personal game, there are exit handlers as well as enter ones.
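As a rough sketch of two of these ideas, here is a variation of the StateMachine whose transition calls an exit function on the old state, plus a state that takes a parameter in its constructor. This is not the code from the game mentioned above; the exit method, the speed parameter, and the plain hero object are assumptions made for illustration.

```javascript
class State {
  enter() {}
  execute() {}
  exit() {} // Hypothetical addition: runs when leaving this state
}

class StateMachine {
  constructor(initialState, possibleStates, stateArgs = []) {
    this.initialState = initialState;
    this.possibleStates = possibleStates;
    this.stateArgs = stateArgs;
    this.state = null;
    for (const state of Object.values(this.possibleStates)) {
      state.stateMachine = this;
    }
  }

  step() {
    if (this.state === null) {
      this.state = this.initialState;
      this.possibleStates[this.state].enter(...this.stateArgs);
    }
    this.possibleStates[this.state].execute(...this.stateArgs);
  }

  transition(newState, ...enterArgs) {
    // Let the old state clean up before the new one takes over
    if (this.state !== null) {
      this.possibleStates[this.state].exit(...this.stateArgs);
    }
    this.state = newState;
    this.possibleStates[this.state].enter(...this.stateArgs, ...enterArgs);
  }
}

class DashState extends State {
  constructor(speed) {
    super();
    this.speed = speed; // One class, differently-tuned instances
  }

  enter(hero) {
    hero.speed = this.speed;
  }

  exit(hero) {
    hero.speed = 0; // Runs no matter which state comes next
  }
}

const hero = {speed: 0};
const machine = new StateMachine('idle', {
  idle: new State(),
  dash: new DashState(300),
}, [hero]);

machine.step();
machine.transition('dash');
const speedWhileDashing = hero.speed; // 300
machine.transition('idle');           // DashState.exit resets speed to 0
```

Putting cleanup in exit means a state doesn't need to know which state follows it, which keeps each state isolated in the same way the enter function does.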

Final Project

Here's the final version of the code used for this post, available as another Glitch project for your reading and remixing pleasure:

Data Collection at Mozilla: Browser Errors

April 11, 2018 mozilla

I’ve spent the past few months working on a project involving data collection from users of Nightly, the pre-release channel of Firefox that updates twice a day. I’d like to share the process from conception to prototype to illustrate:

One of the many ways ideas become reality at Mozilla, and
How we care about and protect user privacy with regards to data collection.

Maybe JavaScript errors are a bad thing

The user interface of Firefox is written in JavaScript (along with XUL, HTML, and CSS). JavaScript powering the UI is “privileged” JavaScript, which is separate from JavaScript in a normal webpage, and can do things that normal webpages cannot do, such as read the filesystem.

When something goes wrong and an error occurs in this privileged JavaScript (let’s call them “browser errors”), it ends up logged to the Browser Console. Most users aren’t looking at the Browser Console, so these errors often go unnoticed.

While working on Shield, I found that our QA cycle¹ involved a lot of time noticing and reporting errors in the Browser Console. Our code would often land on the Nightly channel before QA review, so why couldn’t we just catch errors thrown from our code and report them somewhere?²

So let’s a great plan

I told my boss a few times that browser error collection was a problem that I was interested in solving. I was pretty convinced that there was useful info to be gleaned from collecting these errors, but my beliefs aren’t really enough to justify building a production-quality error collection service. This was complicated by the fact that errors may contain info that can personally identify a user:

There are no limits or checks on what goes into an error message in Firefox, so we can’t guarantee that error messages don’t contain things like, say, an auth token for a URL that we couldn’t connect to.
Tracebacks for errors may signal that a user was using a specific feature in Firefox, like private browsing. It’s not clear whether “user was using private browsing” is private user data or not, but it’s gray enough to be concerning.

On top of all that, we didn’t even know how often these errors were occurring in the wild. Was this a raging fire of constant errors we were just ignoring, or was I getting all worried about nothing?

In the end, I proposed a 3-step research project:

Run a study to measure the number of errors occurring in Nightly as well as the distribution of signatures.
Estimate potential load using the study data, and build a prototype service. Grant access to the data to a limited set of employees and discover whether the data helps us find and diagnose errors.
Shut down the prototype after 6 months or so and evaluate if we should build a production version of the system.

I wrote up this plan as a document that could be shared among people asking why this was an important problem to solve. Eventually, my boss ran the idea past Firefox leadership, who agreed that it was a problem worth pursuing.

What even is happening out there

The first step was to find out how many errors we’d be collecting. One tool at our disposal at Mozilla is Shield, which lets us run small studies on targeted subsets of users. In this case, I wanted to collect data on how many errors were being logged on the Nightly channel.

To run the study, I had to fill out a Product Hypothesis Document (PHD) describing my experiment. The PHD is approved by a group in Mozilla with data science and experiment design experience. It’s an important step that checks multiple things:

Do you know how to interpret the results of your experiment? Is success vs failure clear?
Have you enumerated the user data you’ll need to collect? Mozilla has a classification system for user data that needs to be applied to prevent collection of sensitive data.
Are you sending your experiment to the minimally-effective group? If we can make do with only collecting data from 3000 users rather than 30,000, we should avoid the over-collection of data.

Once the PHD was approved, I implemented the code for my study and created a Bugzilla bug for final review. Mozilla has a group of “data stewards” who are responsible for reviewing data collection to ensure it complies with our policies. Studies are not allowed to go out until they’ve been reviewed, and the results of the review are, in most cases, public and available in Bugzilla.

In our case, we decided to compute hashes from the error stacktraces and submit those to Mozilla’s data analysis pipeline. That allowed us to count the number of errors and view the distribution of specific errors without accidentally collecting personal data that may be in file paths.
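As a rough illustration of the idea (not the study’s actual code), here is a minimal sketch of hashing stack traces and counting the resulting signatures. The choice of SHA-256, the `signature` helper, and the sample traces are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def signature(stacktrace):
    """Return a hex digest that stands in for the raw stack trace.

    SHA-256 is an assumption; the post does not say which hash was used.
    """
    return hashlib.sha256(stacktrace.encode("utf-8")).hexdigest()


# Hypothetical stack traces. The second one has a username in a file path.
errors = [
    "TypeError: foo is undefined\n  at bar (resource://app/a.js:10)",
    "Error: read failed\n  at load (file:///home/alice/profile/b.js:3)",
    "TypeError: foo is undefined\n  at bar (resource://app/a.js:10)",
]

# Only the digests are counted, so the file path never leaves the client.
counts = Counter(signature(e) for e in errors)
total_errors = sum(counts.values())
distinct_signatures = len(counts)
```

Identical traces collapse into the same signature, which is what makes the distribution visible without the raw text.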

I am perfect and infallible

The last steps after passing review in the bug were to announce the study on a few mailing lists to both solicit feedback from Firefox developers, and to inform our release team that we intended to ship a new study to users. Once the release team approved our launch plan, we launched and started to collect data. Yay!

A few days after launching, Ekr, who had noticed the study on the mailing lists, reached out and voiced some concerns with our study.

While we were hashing errors before sending them, an adversary could precompute the hashes by running Firefox, triggering bugs they were interested in, and generating their own hash using the same method we were using. This, paired with direct access to our telemetry data, would reveal that an individual user had run a specific piece of code.
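To make the concern concrete, here is a small sketch of the precomputation. It assumes a plain SHA-256 of the trace text with no secret involved; the trace and the labels are invented for the example.

```python
import hashlib


def signature(stacktrace):
    """Same deterministic hash a client would compute (SHA-256 is an assumption)."""
    return hashlib.sha256(stacktrace.encode("utf-8")).hexdigest()


# The adversary triggers a bug locally and records the hash it produces.
known_trace = "Error: oops\n  at privateBrowsingHelper (resource://app/pb.js:42)"
precomputed = {signature(known_trace): "private browsing bug"}

# Later they see a telemetry record that contains only a hash.
submitted_hash = signature(known_trace)

# A dictionary lookup is enough to tell which code the user ran.
match = precomputed.get(submitted_hash)
```

Because the hash is deterministic and needs no secret, anyone who can reproduce the error can reproduce the signature.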

It was unclear if knowing that a user had run a piece of code could be considered sensitive data. If, for example, the error came from code involved with private browsing mode, would that constitute knowing that the user had used private browsing mode for something? Was that sensitive enough for us to not want to collect?

We decided to turn the study off while we tried to address these concerns. By that point, we had collected 2-3 days’ worth of data, and decided that the risk wasn’t large enough to justify dropping the data we already had. I was able to perform a limited analysis on that data and determine that we were seeing tens of millions of errors per day, which was enough of an estimate for building the prototype. With that question answered, we opted to keep the study disabled and consider it finished rather than re-tool it based on Ekr’s feedback.

Can I collect the errors now

Mozilla already runs our own instance of Sentry for collecting and aggregating errors, and I have prior experience with it, so it seemed the obvious choice for the prototype.

With roughly 50 million errors per day, I figured we could sample them and send only 0.1% to the collection service, or about 50,000 per day. The operations team that ran our Sentry instance agreed that an extra 50,000 errors per day wasn’t an issue.
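The arithmetic and the sampling decision can be sketched like this. It is only an illustration: the real patch lives in Firefox, and the `should_send` helper and the seeded generator are assumptions for the example.

```python
import random

ERRORS_PER_DAY = 50_000_000  # rough estimate from the study
SAMPLE_RATE = 0.001          # 0.1%

# Expected number of errors that reach the collection service each day.
expected_sent_per_day = round(ERRORS_PER_DAY * SAMPLE_RATE)


def should_send(rng, rate=SAMPLE_RATE):
    """Decide independently for each error whether to report it."""
    return rng.random() < rate


# Simulate a million errors with a seeded generator so runs are repeatable.
rng = random.Random(0)
sent = sum(should_send(rng) for _ in range(1_000_000))
```

Sampling each error independently keeps the client stateless, at the cost of the daily total only being approximately 0.1% of the errors.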

I spent a few weeks writing up a Firefox patch that collected the errors, mangled them into a Sentry-compatible format, and sent them off. Once the patch was ready, I had to get a technical review from a Firefox peer and a privacy review from a data steward. The patch and review process can be seen in the Bugzilla bug.

The process, as outlined on the Data Collection wiki page, involves three major steps:

Requesting Review

First, I had to fill out a form with several questions asking me to describe the data collection. I’m actually a huge fan of this form, because the questions force you to consider many aspects about data collection that are easy to ignore:

“Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?”: It’s really easy to let curiosity or mild suspicion drive big chunks of work. The point of this question is to force you to think of a reason for doing the collection. Collecting data just because it is mildly convenient or interesting isn’t a good enough reason; it needs a purpose.
“What alternative methods did you consider to answer these questions? Why were they not sufficient?”: Data collection can’t simply be the first tool you reach for to answer your questions. If we want to be respectful of user privacy, we need to consider other ways of answering questions that don’t involve collecting data.
“List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.”: The classification system we use for data makes it very clear how to apply our policies to the data you’re collecting. Browser errors, for example, are mostly category 2 data, but may potentially contain category 3 data and as such must be held to a higher standard.
“How long will this data be collected?”: If we can limit the time period in which we collect a piece of data, we can reduce the impact of data collection on users. I didn’t actually know time-limited collection was something to consider until I saw this question for the first time, but in fact several of our data collection systems enforce time limits by default.

Reviewing Request

Data stewards have their own form to fill out when reviewing a collection request. This form helps stewards be consistent in their judgement. Besides reviewing the answers to the review form from above, reviewers are asked to confirm a few other things:

Is the data collection documented in a publicly accessible place?: Sufficiently technical users should be able to see the schema for data being collected without having to read through the Firefox source code. Failing to provide this documentation mandates a failing review.
Is there a way for users to disable the collection?: There must be some way for users to disable the data collection. Missing this is also considered grounds for failure.

It’s important to note that this mechanism doesn’t need to be, say, a checkbox in the preferences UI. Depending on the context of the data collection, an about:config preference or some other mechanism may be good enough.

Rereing Viewquest?

In certain cases, requests may be escalated to Mozilla’s legal team if they involve changes to our privacy policy or other special circumstances. In the case of browser error collection, we wanted a legal review to double-check whether a user having used private browsing mode was considered category 2 or 3 data, as well as to approve our proposal for collecting category 3 data in error messages and file paths.

Our approach was to mimic what Mozilla already does with crashes; we collect the data and restrict access to the data to a subset of employees who are individually approved access. This helps make the data accessible only to people who need it, and their access is contingent on employment³. Legal approved the plan, which we implemented using built-in Sentry access control.

Welcome to errortown

With code and privacy review finished, I landed the patch and waited patiently for Sentry to start receiving errors. And it did!

Since we started receiving the data, I’ve spent most of my time recruiting Firefox developers who want to search through the errors we’re collecting, and refining the data we’re collecting to make it more useful to those developers. Of course, changes to the data collection require new privacy reviews, although the smaller the changes are, the easier it is to fill out and justify the data collection.

But from my standpoint as a Mozilla employee, these data reviews are the primary way I see Mozilla making good on its promise to respect user privacy and avoid needless data collection. A lot of thought has gone into this process, and I can personally attest to its effectiveness.

Firefox uses tons of automated testing, but we also have manual testing for certain features. In Shield's case, the time being wasted was in the manual phase. ↩
Actually, we already do collect crashes as part of the Socorro project (http://socorro.readthedocs.io/en/latest/), which I currently work on. But Socorro does not collect any info about the browser errors in question. ↩
Only some parts of crash data are actually private, and certain contributors who sign an NDA are also allowed access to that private data. We use centralized authorization to control access. ↩

Using NPM Libraries in Firefox via Webpack

July 11, 2017 mozilla

I work on a system add-on for Firefox called the Shield Recipe Client. We develop it in a monorepo on Github along with the service it relies on and a few other libraries. One of these libraries is mozJexl, an expression language that we use to specify how to filter experiments and surveys we send to users.

The system add-on relies on mozJexl, and for a while we were pulling in the dependency by copying it from node_modules and using a custom CommonJS loader to make require() calls work properly. This wasn't ideal for a few reasons:

We had to determine manually which file contained the exports we needed, instead of being able to use the documented exports that you'd get from a require() call.
Because library files could require() any other file within node_modules, we copied the entire directory into our add-on.
We didn't hit this with mozJexl, but I'm pretty sure that if a library we wanted to include had dependencies of its own, our custom loader wouldn't have resolved the paths properly.

While working on another patch, I hit a point where I wanted to pull in ajv to do some schema validation and decided to see if I could come up with something better.

Webpack

I already knew that a few components within Firefox are using Webpack, such as debugger.html and Activity Stream. As far as I can tell, they bundle all of their code together, which is standard for Webpack.

I wanted to avoid this, because we sometimes get fixes from Firefox developers that we upstream back to Github. We also get help in the form of debugging from developers investigating issues that lead back to our add-on. Both of these would be made more difficult by landing webpacked code that is different from the source code we normally work on.

Instead, my goal was to webpack only the libraries that we want to use in a way that provided a similar experience to require(). Here's the Webpack configuration that I came up with:

/* eslint-env node */
var path = require("path");
var ConcatSource = require("webpack-sources").ConcatSource;
var LicenseWebpackPlugin = require("license-webpack-plugin");

module.exports = {
  context: __dirname,
  entry: {
    mozjexl: "./node_modules/mozjexl/",
  },
  output: {
    path: path.resolve(__dirname, "vendor/"),
    filename: "[name].js",
    library: "[name]",
    libraryTarget: "this",
  },
  plugins: [
    /**
     * Plugin that appends "this.EXPORTED_SYMBOLS = ["libname"]" to assets
     * output by webpack. This allows built assets to be imported using
     * Cu.import.
     */
    function ExportedSymbols() {
      this.plugin("emit", function(compilation, callback) {
        for (const libraryName in compilation.entrypoints) {
          const assetName = `${libraryName}.js`; // Matches output.filename
          compilation.assets[assetName] = new ConcatSource(
            "/* eslint-disable */", // Disable linting
            compilation.assets[assetName],
            `this.EXPORTED_SYMBOLS = ["${libraryName}"];` // Matches output.library
          );
        }
        callback();
      });
    },
    new LicenseWebpackPlugin({
      pattern: /^(MIT|ISC|MPL.*|Apache.*|BSD.*)$/,
      filename: `LICENSE_THIRDPARTY`,
    }),
  ],
};

(See also the pull request itself.)

Each entry point in the config is a library that we want to use, with the key being the name we're using to export it, and the value being the path to its directory in node_modules¹. The output of this config is one file per entry point inside a vendor subdirectory. You can then import these files as if they were normal .jsm files:

Cu.import("resource://shield-recipe-client/vendor/mozjexl.js");
const jexl = new mozjexl.Jexl();

output.library

The key turned out to be Webpack's options for bundling libraries:

output.library: Name of the library you want to export.
output.libraryTarget: How you want to expose your library.

By setting output.library to a name like mozJexl, and output.libraryTarget to this, you can produce a bundle that assigns the exports from your entry point to this.mozJexl. In the configuration above, I use the webpack variable [name] to set it to the name for each export, since we're exporting multiple libraries with one config.

ExportedSymbols

Assuming that the bundle will work in a chrome environment, this is very close to being a JavaScript code module. The only thing missing is this.EXPORTED_SYMBOLS to define what names we're exporting. Luckily, we already know the name of the symbols being exported, and we know the filename that will be used for each entry point.

I used this info to write a small Webpack plugin that prepends an eslint-disable comment to the start of each generated file (since we don't want to lint bundled code) and appends this.EXPORTED_SYMBOLS to the end of each generated file:

function ExportedSymbols() {
  this.plugin("emit", function(compilation, callback) {
    for (const libraryName in compilation.entrypoints) {
      const assetName = `${libraryName}.js`; // Matches output.filename
      compilation.assets[assetName] = new ConcatSource(
        "/* eslint-disable */", // Disable linting
        compilation.assets[assetName],
        `this.EXPORTED_SYMBOLS = ["${libraryName}"];` // Matches output.library
      );
    }
    callback();
  });
}

Licenses

During code review, mythmon brought up an excellent question: how do we retain licensing info for these files when we sync to mozilla-central? Turns out, there's a rather popular Webpack plugin called license-webpack-plugin that collects license files found during a build and outputs them into a single file:

new LicenseWebpackPlugin({
  pattern: /^(MIT|ISC|MPL.*|Apache.*|BSD.*)$/,
  filename: `LICENSE_THIRDPARTY`,
}),

(Why MIT/ISC/MPL/etc.? I just used what I thought were common licenses for libraries we were likely to use.)

Future Improvements

This is already a useful improvement over our old method of pulling in dependencies, but there are some potential improvements I'd eventually like to get to:

The file size of third-party libraries is not insignificant, especially with their own dependencies. I'd like to consider minifying the bundles, potentially with source maps to aid debugging. I'm not even sure that's a thing for chrome code, though.
Some libraries may rely on browser globals, like fetch. I'd like to figure out how to auto-prepend Components.utils.importGlobalProperties to library files that need certain globals that aren't normally available.
If several system add-ons use this pattern, we might end up with multiple copies of the same library in mozilla-central. Deduplicating this code where possible would be nice.
If there's enough interest in it, I'd be interested in pulling this pattern out into some sort of plugin/preset so that other system add-ons can also use npm libraries with ease.

Did you know that Webpack will automatically use the main module defined in package.json as the entry point if the path points to a directory with that file? ↩

Q is Scary

June 8, 2017 mozilla

q is the hands-down winner of my "Libraries I'm Terrified Of" award. It's a Python library for outputting debugging information while running a program.

On the surface, everything seems fine. It logs everything to /tmp/q (configurable), which you can watch with tail -f. The basic form of q is passing it a variable:

import q

foo = 7
q(foo)

Take a good long look at that code sample, and then answer me this: What is the type of q?

If you said "callable module", you are right. Also, that is not a thing that exists in Python.

Also, check out the output in /tmp/q:

0.0s <module>: foo=7

It knows the variable name. It also knows that it's being called at the module level; if we were in a function, <module> would be replaced with the name of the function.

You can also divide (/) or bitwise OR (|) values with q to log them as well. And you can decorate a function with it to trace the arguments and return value. It also has a method, q.d(), that starts an interactive session.

And it does all this in under 400 lines, the majority of which is either a docstring or code to format the output.

How in the Hell

So first, let's get this callable module stuff out of the way. Here are the last two lines in q.py:

# Install the Q() object in sys.modules so that "import q" gives a callable q.
sys.modules['q'] = Q()

Turns out sys.modules is a dictionary with all the loaded modules, and you can just stuff it with whatever nonsense you like.
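If you want to try the trick yourself, here's a minimal sketch. The `Echo` class and the `callable_demo` module name are made up for the example; this is not q's code, just the same sys.modules stunt in miniature.

```python
import sys


class Echo:
    """Hypothetical stand-in for q's Q class."""

    def __call__(self, value):
        return value


# Register an instance under a made-up module name.
sys.modules["callable_demo"] = Echo()

# The import system finds the entry in sys.modules and hands it back,
# so no file named callable_demo.py needs to exist.
import callable_demo

result = callable_demo("hello")
```

After the import, `callable_demo` is just the `Echo` instance wearing a module's name tag.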

The Q class itself is super-fun. Check out the declaration:

# When we insert Q() into sys.modules, all the globals become None, so we
# have to keep everything we use inside the Q class.
class Q(object):
    __doc__ = __doc__  # from the module's __doc__ above

    import ast
    import code
    import inspect
    import os
    import pydoc
    import sys
    import random
    import re
    import time
    import functools

"When we insert Q() into sys.modules, all the globals become None"

What? Why?! I mean I can see how that's not an issue for modules, which are usually the only things inside sys.modules, but still. I tried chasing this down, but the entire sys module is written in C, and that ain't my business.

Most of the other bits inside Q are straightforward by comparison; a few helpers for outputting stuff cleanly, overrides for __truediv__ and __or__ for those weird operator versions of logging, etc. If you've never heard of callable types¹ before, that's the reason why an instance of this class can be both called as a function and treated as a value.
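For the curious, here's a small sketch of how one object can support all three spellings. The `Logger` class is hypothetical and only collects values in a list; q's real methods do a lot more formatting than this.

```python
class Logger:
    """Hypothetical mini-logger; not q's actual implementation."""

    def __init__(self):
        self.seen = []

    def _log(self, value):
        self.seen.append(value)
        return value  # hand the value back so logging can wrap an expression

    def __call__(self, value):     # log(value)
        return self._log(value)

    def __truediv__(self, value):  # log / value
        return self._log(value)

    def __or__(self, value):       # log | value
        return self._log(value)


log = Logger()
a = log(1)
b = log / 2
c = log | 3
```

Python looks these methods up on the class, so any instance of `Logger` can be called, divided, and OR'd.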

So what's __call__ do?

Ghost Magic

def __call__(self, *args):
    """If invoked as a decorator on a function, adds tracing output to the
    function; otherwise immediately prints out the arguments."""
    info = self.inspect.getframeinfo(self.sys._getframe(1), context=9)

    # ... snip ...

Welcome to the inspect module. Turns out, Python has a built-in module that lets you get all sorts of fun info about objects, classes, etc. It also lets you get info about stack frames, which store the state of each subroutine in the chain of subroutine calls that led to running the code that's currently executing.

Here, q is using a CPython-specific function sys._getframe to get a frame object for the code that called q, and then using inspect to get info about that code.
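Here's a stripped-down sketch of the same frame lookup, using a hypothetical `caller_info` helper rather than anything from q itself. It relies on sys._getframe, so treat it as CPython-only.

```python
import inspect
import sys


def caller_info():
    """Return the name and line number of whoever called this function."""
    frame = sys._getframe(1)  # one level up the stack: our caller
    info = inspect.getframeinfo(frame, context=1)
    return info.function, info.lineno


def outer():
    return caller_info()


function_name, line_number = outer()
```

Called from inside `outer`, the helper reports `outer` as the function name, which is the same mechanism q uses to fill in the `<module>` slot in its output.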

# info.index is the index of the line containing the end of the call
# expression, so this gets a few lines up to the end of the expression.
lines = ['']
if info.code_context:
    lines = info.code_context[:info.index + 1]

# If we see "@q" on a single line, behave like a trace decorator.
for line in lines:
    if line.strip() in ('@q', '@q()') and args:
        return self.trace(args[0])

...and then it just does a text search of the source code to figure out if it was called as a function or as a decorator. Because it can't just guess by the type of the argument being passed (you might want to log a function object), and it can't just return a callable that can be used as a decorator either.

trace is pretty normal, whatever that means. It just logs the intercepted arguments and return value / raised exception.

# Otherwise, search for the beginning of the call expression; once it
# parses, use the expressions in the call to label the debugging
# output.
for i in range(1, len(lines) + 1):
    labels = self.get_call_exprs(''.join(lines[-i:]).replace('\n', ''))
    if labels:
        break
self.show(info.function, args, labels)
return args and args[0]

The last bit pulls out labels from the source code; this is how q knows the name of the variable that you pass in. I'm not going to go line-by-line through get_call_exprs, but it uses the ast module to parse the function call into an Abstract Syntax Tree, and walks through that to find the variable names.
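To give a feel for the approach without reproducing get_call_exprs, here's a minimal sketch with a hypothetical `call_labels` helper. It assumes the whole call fits in one string and needs Python 3.8 or later for ast.get_source_segment.

```python
import ast


def call_labels(source):
    """Return the source text of each positional argument in a call."""
    source = source.strip()
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return []  # not a complete expression yet
    call = tree.body
    if not isinstance(call, ast.Call):
        return []
    # Each argument node records where it starts and ends in the source.
    return [ast.get_source_segment(source, arg) for arg in call.args]


labels = call_labels("q(foo, bar + 1)")
```

For the call above the labels are `foo` and `bar + 1`, which is the kind of text that ends up next to the values in /tmp/q.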

It goes without saying that you should never do any of this. Ever. Nothing is sacred when it comes to debugging, though, and q is incredibly useful when you're having trouble getting your program to print anything out sanely.

Also, if you're ever bored on a nice summer evening, check out the list of modules in the Python standard library. It's got everything:

Check out this page and search for "Callable Types" and/or __call__. ↩
