Tuesday, July 10, 2012

TrueSkill vs TrueSkill Mean

I finally got suckered into reading the Yucata.de forums about the recent change to the metagame system. Or rather, I read a couple of the numerous threads with countless posts on the topic. The Yucata forum is more civilized than most of the gaming forums I've read over the years but it doesn't make my head hurt any less. For the most part it's people making up numbers they think further their point of view and talking past each other in an effort to make their point known. I got dragged into it partially because I can't stand that someone might be wrong on the internet and partially because I truly believe the system is flawed and want somehow to convince the people in charge to change it. So I parsed some data from my recent games and went flying in headfirst to provide my own made-up numbers. *sigh*

At any rate, on the bus ride home I actually had an epiphany on my biggest issue. I was trying to talk myself out of starting a new account when I figured out just what it is that bothers me so much about the new system. I even talked about it a little bit in my post last week when I mentioned that no one is ever overvalued so you never get to take advantage of someone else's TrueSkill being inflated. This is the big problem! New players start off super-deflated and everyone slowly works their way towards where they should be. Two people who are where they should be can play just fine and have positive metagame EV. The problem comes because you sometimes face someone deflated and never face anyone inflated so if you keep playing random people you get hurt. The core issue is that the new ranking system compares TrueSkill values instead of TrueSkill means. I'm going to go into some details about the two and why I strongly feel the wrong one is being used.

I intend to post this on their forums tomorrow, assuming no one finds any flaws in my reasoning between now and then...


The basic idea behind TrueSkill is that some engineers at Microsoft wanted to find a way to boil down how good someone was at Halo to a single number. The reason they wanted to do this was threefold: better matchmaking in online games, bragging rights among the players, and so they could award invites/byes to big events. It's a great idea, and not really anything new since chess and Magic had a similar rating scheme going (Elo rating). The problem with using Elo was that it was designed as a two-player system and Halo can have huge free-for-all games. So the Microsoft guys crunched their numbers and came up with a new system that would work for multiplayer games and which actually gave each person two relevant numbers, not one. Instead of the system giving someone a single definite rating it gives each person a range of ratings where their actual rating is likely to reside. This range is a Gaussian distribution (bell curve) and is defined by an estimated average rating (mean) and how certain the system is that your actual rating is near the mean (sigma or standard deviation).

If you play a game for the first time the system knows nothing at all about you. As such it assigns your TrueSkill mean to be the average of the entire population (an arbitrary number defined by the system, on Yucata that number is 1200) and gives you a huge sigma since it has no clue how good you might actually be. This value is again arbitrarily defined by the system and determines how big the range of ratings will actually be. On Yucata sigma starts at 400. Because the range is a bell curve we know that there's a 99.7% chance that your actual rating will lie within 3 sigmas of your mean. On Yucata this means that 99.7% of people will have an actual rating somewhere between 0 and 2400.
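To make the range concrete, here's a small sketch of that arithmetic. The 1200 and 400 are the Yucata starting values mentioned above; the function names are just ones I made up for illustration, and it assumes the rating really is a plain bell curve.

```python
from statistics import NormalDist

START_MEAN = 1200   # Yucata's starting mean, as described above
START_SIGMA = 400   # Yucata's starting sigma, as described above

def three_sigma_range(mean, sigma):
    """Return the (low, high) band that lies within 3 sigmas of the mean."""
    return mean - 3 * sigma, mean + 3 * sigma

def coverage(mean, sigma, low, high):
    """Chance that a normal(mean, sigma) rating falls between low and high."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(high) - dist.cdf(low)

low, high = three_sigma_range(START_MEAN, START_SIGMA)
print(low, high)                                               # 0 2400
print(round(coverage(START_MEAN, START_SIGMA, low, high), 3))  # 0.997
```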

Now, numbers tend to confuse people. I'd expect people who play board games online are less likely to be confused than your average person but they're still confusing. As such, leaderboards don't want to show you two numbers and expect you to understand Gaussian distributions. People want a single number to look at and brag about. People like to see numbers that go up! Having half your population go down from the starting value (which is expected; half the people are below average) isn't cool. So while the TrueSkill system behind the scenes knows about the mean and the sigma you still want a single number that people can look at and see get bigger. As such, for leaderboards, the system is designed to spit out the very lowest value of that 99.7% range. In formula form, TrueSkill = mean-3*sigma.

Note that on Yucata this value will be 0 for a player new to a game. It will tend to get bigger, even if someone is bad at a game, because the system will become more and more sure of their actual rating as they play games. Someone with a real rating of 600 could expect to see a mean of 700 with a sigma of 50, for example. This still gives a TrueSkill number of 550 (we're 99.85% certain they're at least 550) but they're really subpar at the game.
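The leaderboard formula is simple enough to write down as a one-liner. This is only a sketch of the displayed number as described here (mean minus three sigmas); the function name is my own and says nothing about how Yucata actually codes it.

```python
def leaderboard_value(mean, sigma):
    """Displayed TrueSkill number: the bottom of the 3-sigma range."""
    return mean - 3 * sigma

# Brand new player on Yucata: mean 1200, sigma 400.
print(leaderboard_value(1200, 400))  # 0

# The subpar-but-established player from the example: mean 700, sigma 50.
print(leaderboard_value(700, 50))    # 550
```

Notice the second player shows a higher number than the newcomer even though their mean is 500 points lower. All of that comes from the shrinking sigma.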

This is fantastic for getting people to keep playing the game. They're bad, but they won't get discouraged by having their shown TrueSkill value keep sinking. Instead it will actually creep upwards as the system homes in on just where they should be. It's just fine for leaderboards as well since in order to get a really high value you need to play a lot of games (to shrink your sigma) and win a large percentage of them (to grow your mean). I really like it for both of those reasons.

It's really important to note that the formulas used by the TrueSkill system behind the scenes don't use this displayed TrueSkill number for any reason. After a game ends it uses the old means and sigmas of the players to compute the new means and sigmas. Those are the values that actually mean something as far as how good someone is. Knowing what their extreme lower bound is doesn't actually help a whole lot. Comparing two extreme lower bounds is really of questionable use.

It's this piece of information that I think is key to my problem with the new metagame ranking system. In order to determine how many metagame points are earned after a game the system compares the leaderboard TrueSkill number instead of the TrueSkill mean. This sticks three times each person's sigma into the mix and a new player's sigma is so massive it dominates the entire formula. My sigma in Roll Through The Ages, for example, is 18. A new player's is 400. Multiply each by three and that's a difference of 1146! Let's look at the difference between the two comparisons:

TrueSkill mean:

u1-u2 = 1394-1200 = 194

TrueSkill leaderboard value:

(u1-3*s1)-(u2-3*s2)=(1394-3*18)-(1200-3*400)=194+1146=1340

Do note that in the first formula we're not very confident in the 1200 and that's not being taken into account at all. We're assuming the new guy is average when it comes to how he'll impact the metagame ranking. In the second formula we're actually assuming he's one of the worst players on the planet. 99.85% of people rate to be better than this guy!

If the first formula was used sometimes we'd punish the new guy. Sometimes we'd give him a boost. And his opponent in that game would sometimes get a boost and would sometimes get punished. In fact, since we're starting with an assumption he's exactly average, these two will cancel out in the long run. Short term there will be some fluctuations but those sorts of things are expected in this sort of system.

If the second formula was used we'd almost always boost the new guy. Only one person in 667 is actually worse than we're assuming here. The other 666 people are getting a big boost from this formula. And by extension, the people they play against are getting a big penalty.
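For anyone who wants to check the one-in-667 figure, here's a quick sketch. It assumes ratings follow the bell curve exactly, which is the model's assumption and not something I've verified against real Yucata data. The 667 comes from splitting the rounded 99.7% evenly into two tails; the exact bell curve tail is a little smaller, which only makes the point stronger.

```python
from statistics import NormalDist

# Rule-of-thumb tail: 99.7% inside 3 sigmas leaves 0.15% below the range.
rounded_tail = (1 - 0.997) / 2

# Exact tail of a standard bell curve below -3 sigmas.
exact_tail = NormalDist().cdf(-3)

print(round(1 / rounded_tail))  # 667
print(round(1 / exact_tail))    # 741
```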

I believe the new metagame system is quite reasonable when two people with established ratings play against each other. It isn't a coincidence that the two formulas above actually approach each other in that situation. bk375 (the #4 guy on the RTTA leaderboard and someone who has started a new account) only has a sigma of 15. Compare me to him using the two formulas again:

TrueSkill mean:

u1-u2 = 1394-1418 = -24

TrueSkill leaderboard value:

(u1-3*s1)-(u2-3*s2)=(1394-3*18)-(1418-3*15)=-24-9=-33

Comparing me to the new guy resulted in 194 vs 1340, or roughly a 590% increase. Comparing me to bk375 resulted in -24 vs -33, or roughly a 38% increase in the size of the gap. That's a fantastically huge disparity!
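Here are both comparisons in one place, as a sketch you can rerun with your own numbers. The means and sigmas are the ones quoted above; the function and variable names are mine, and this only computes the two rating gaps, not the actual metagame points (I don't know that formula).

```python
def leaderboard_value(mean, sigma):
    """Displayed TrueSkill number: mean minus three sigmas."""
    return mean - 3 * sigma

def gaps(player, opponent):
    """Return (gap between means, gap between leaderboard values)."""
    mean_gap = player[0] - opponent[0]
    board_gap = leaderboard_value(*player) - leaderboard_value(*opponent)
    return mean_gap, board_gap

# (mean, sigma) pairs quoted in this post
ziggyny = (1394, 18)
opponents = {"new player": (1200, 400), "bk375": (1418, 15)}

for name, opponent in opponents.items():
    mean_gap, board_gap = gaps(ziggyny, opponent)
    # How much bigger the leaderboard gap is than the mean gap, in percent
    growth = (board_gap / mean_gap - 1) * 100
    print(f"{name}: {mean_gap} vs {board_gap} ({growth:.1f}% bigger)")

# new player: 194 vs 1340 (590.7% bigger)
# bk375: -24 vs -33 (37.5% bigger)
```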


My question is, why does the system use the TrueSkill leaderboard value instead of the TrueSkill mean? Is there something I'm missing that makes one want to benefit people with a high sigma? (And, by extension, punish those with a low sigma who play against them.) I was trying to figure out why starting a new account felt so appealing, and this one issue is why.

2 comments:

David Nicholson said...

TrueSkill will be more effective at encouraging people to play new games. It is rewarding people that spend their time on the site playing new games. If you want a good metagame ranking you need to simply play some of the 30 games available that you have played less than 10 times. That seems a reasonable way to encourage people to diversify games being played.

Ziggyny said...

And once I run out of those games? The problem with that argument is you're saying there's a pool of free points everyone deserves to get (the ones from their first 10 wins in each game) but I can't get most of those points on Ziggyny because I've already got 10 wins in many of the available games.

Why should I not start a new account in order to get access to the full pool?