Thursday, January 19, 2012

TrueSkill

I mentioned earlier that Yucata has two rating systems. The rating system they use for each individual game is something called TrueSkill that broke down into 3 numbers but I didn't really know what they meant. I've had games where I won but both players had their rating go up. I've won games without an increase in my rating while my opponent lost rating. I come from worlds (Magic, League of Legends) that use an Elo system and neither of those things could happen there. I turned to my trusty companion Google to see if I could find out more...

It turns out TrueSkill is a rating system developed by Microsoft Research for use with Xbox Live. It's a Bayesian system that has the same basic idea as the Elo system. The system has an idea of what the players have as current ratings. It predicts the likely outcome of the match. Then it looks at what actually happened and uses that information to update the ratings of the players. Win and it goes up. How much it goes up depends on how likely you were to win. Lose and it goes down.
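For comparison, here is a minimal sketch of the predict-then-update loop in plain Elo, using the standard logistic formula. The K-factor of 32 is just a common choice I'm assuming here, not anything Magic or League of Legends actually uses. It shows why neither of the odd things I saw can happen under Elo: whatever one player gains, the other loses.

```python
def elo_expected(rating_a, rating_b):
    """Expected score (roughly, win probability) for player A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both new ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss.

    k=32 is an assumed K-factor, chosen only for illustration.
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses: the update is zero-sum.
    return rating_a + delta, rating_b - delta


# Two evenly rated players; A wins.
new_a, new_b = elo_update(1500, 1500, score_a=1)
print(new_a, new_b)  # 1516.0 1484.0
```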

But wait... I've seen losers gain. I've won and stayed even. That doesn't line up... Or does it? I read their paper on the topic to learn more. Part of me is now wishing I'd done more to keep my math skills sharp but I managed to fight my way through it. The basic idea is you don't actually have one number associated with your rating, you have two. The first number is what the system thinks your rating might be right now. The second number is how uncertain the system is about whether your 'real' rating matches the first number. Then when the system reports a single number as your rating it is actually reporting a number representing a low bound of what your rating might be. (It subtracts three times the uncertainty from your rating. These numbers are a close approximation to a Gaussian distribution with the uncertainty as the standard deviation. Knocking three standard deviations off gives you a number that your 'real' rating is at least as high as, with about 99.9% certainty.)
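Here's a small sketch of that lower bound, assuming the rating really is a Gaussian with the uncertainty as its standard deviation. The function names and the 1500/100 example numbers are mine, made up for illustration. The familiar 99.7% figure is the chance of landing within three standard deviations on both sides; the chance of being above the bottom edge alone is a bit higher.

```python
import math


def conservative_rating(mu, sigma, k=3):
    """Reported rating: the estimate minus k standard deviations."""
    return mu - k * sigma


def prob_skill_above_bound(k=3):
    """P(skill > mu - k*sigma) for a Gaussian, i.e. the standard normal CDF at k."""
    return 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))


# Hypothetical player: estimate 1500, uncertainty 100.
print(conservative_rating(1500, 100))       # 1200
print(round(prob_skill_above_bound(3), 4))  # 0.9987
```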

Elo is also only used for 2-player games. Microsoft's whole goal was to make a system that worked for multi-player games and team games (Halo and all the different game types there, for example) and this one does. It also converges to your 'real' rating much faster than Elo does.


When it comes to Yucata in particular they set all initial ratings with a value of 1200 and an uncertainty of 400. This means your reported rating for every game on Yucata is 0 before you play a game. (The system has _no_ confidence in your rating of 1200 since you've done nothing to indicate you can play at all.) Apparently they've also set a parameter individually on each game to indicate how much randomness they think is in the game. The more randomness in the game the longer it takes to converge on your 'real' rating. (Maybe you won because you're better than me at Can't Stop. Maybe you won because you rolled better than I did. But when I beat you at Six that's all skill, baby!)

The more games you play the lower and lower your uncertainty number becomes. I've played 441 games of Roll Through The Ages, for example, and my uncertainty has come down from that initial 400 to a mere 19. On the other hand I've played only one game of Pompeii and my uncertainty there is 344. I guess beating Andrew doesn't say all that much about my skill, huh? My estimated rating is actually around the same in both games (1413 vs 1411) but the uncertainty difference (thanks to playing 440 more games) is huge. My listed rating in RTTA is 1356 which is good for 5th of 2710. Pompeii? A mere 378 rating. Good for 709th of 1055. As for Andrew, his one loss puts him at -44. (The listed rating floors to 0 but a little math gets the real number.) But with only the one data point there's not any real certainty there either. As far as the system knows I might be the second worst Pompeii player in the world! Andrew might be the second best! Or maybe I just got lucky!
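The "little math" is just the rating minus three times the uncertainty, floored at 0. A quick sketch with my own numbers from above (the function name is made up; the floor at 0 is what the site appears to do):

```python
def displayed_rating(mu, sigma):
    """Listed rating: estimate minus three uncertainties, floored at 0."""
    return max(0, mu - 3 * sigma)


print(displayed_rating(1200, 400))  # 0     brand new player
print(displayed_rating(1413, 19))   # 1356  Roll Through The Ages
print(displayed_rating(1411, 344))  # 379   Pompeii
```

The Pompeii result is off by one from the 378 the site lists, presumably because the numbers it shows me are rounded.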


It does explain how both players can gain rating from a game. The loser just needs to have his uncertainty number shrink by more than one third of the amount of rating he lost. I do wish it would actually list the change in all the numbers instead of just the lower bound.
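To see the one-third rule in action, here's a tiny sketch of how the listed number moves when the estimate and the uncertainty both change. The changes plugged in are hypothetical, not taken from any real game of mine.

```python
def displayed_change(mu_change, sigma_change):
    """Change in the listed rating (mu - 3*sigma), ignoring the floor at 0."""
    return mu_change - 3 * sigma_change


# Hypothetical loser whose estimate drops by 30 points:
print(displayed_change(-30, -15))  # 15   uncertainty shrank by more than 10: listed rating goes UP
print(displayed_change(-30, -10))  # 0    shrank by exactly a third of the loss: no change
print(displayed_change(-30, -5))   # -15  shrank by less: listed rating goes down
```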

1 comment:

Andrew said...

My vote is you got lucky.