Method to My Madness
Secret to making the 41st percentile
If you read my last post or just paid attention to the leaderboard back in March during the March Madness Algorithm competition, you know I didn't do so hot - especially compared to the previous year. The good news though is that I can share my brilliant techniques with you now - if I can remember them (it's been a while).
Transitive Property of Sports
For anybody who has followed a sports team, this idea should be pretty familiar and easy to follow. Hypothetically speaking (emphasis on hypothetically), if the Bears beat the Packers, and the Packers beat the Seahawks, you will probably try to assure yourself that the Bears will beat the Seahawks. If you've followed a team for very long though, you know this doesn't hold up quite like the transitive property: A > B and B > C, so A > C. But that doesn't mean that it's not still useful in making predictions.
A new feature I added this year took this idea and weighted any information it could produce by how many intermediate teams were needed to make the connection. For instance, team A may have played team C in the regular season, so that result should be weighted the most. But there may be another connection between team A and team C where team A beat team X, who beat team Y, who beat team C. It seems reasonable to weight this result less, because some teams just match up well with certain other teams - the more direct the connection, the better. On the other hand, the more intermediate teams you allow, the more information you can accrue. Team A may not have had the opportunity to play team C, who they are now about to face, but they may have played several teams who did, each of which has the potential to provide evidence one way or another as to team A's likelihood of beating team C.
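To make that concrete, here is a minimal sketch of how that kind of weighted transitive evidence could be collected. This is not my exact implementation - the game list format, the hop limit, and the decay factor of 0.5 are all made up for illustration.

```python
from collections import defaultdict

def transitive_evidence(games, team_a, team_c, max_hops=3, decay=0.5):
    """Accumulate evidence that team_a beats team_c through chains of wins,
    down-weighting each chain by how many intermediate teams it needs."""
    beat = defaultdict(set)
    for winner, loser in games:
        beat[winner].add(loser)

    evidence = 0.0
    frontier = {team_a}          # teams reachable from team_a through a chain of wins
    visited = {team_a}
    for hops in range(1, max_hops + 1):
        next_frontier = set()
        for team in frontier:
            for beaten in beat[team]:
                if beaten == team_c:
                    # a direct win (hops == 1) counts fully; every extra
                    # intermediate team cuts the weight in half
                    evidence += decay ** (hops - 1)
                elif beaten not in visited:
                    next_frontier.add(beaten)
                    visited.add(beaten)
        frontier = next_frontier
    return evidence

# A beat C directly, and also A -> X -> Y -> C through two intermediate teams
games = [("A", "X"), ("X", "Y"), ("Y", "C"), ("A", "C")]
print(transitive_evidence(games, "A", "C"))   # 1.25 = 1.0 (direct) + 0.25 (two intermediates)
```

The point is just that each extra intermediate team cuts the weight of the evidence, while the hop limit controls how much indirect information you are willing to accrue.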
Bayes Bracket Theorem
Before getting to the bracket portion of the theorem, let's refresh our memories on my favorite theorem to forget and reprove to myself - Bayes Theorem.
1. p(A|B) = probability A happens GIVEN B happened
2. p(A and B) = probability A happens AND B happens
3. p(A|B) * p(B) = p(A and B) = p(B and A) = p(B|A) * p(A)
4. p(A|B) * p(B) = p(B|A) * p(A)
5. p(A|B) = (p(B|A) * p(A)) / p(B)
Basically the only part that isn't pure algebraic manipulation is line 3. All you have to understand is that the probability of two events happening together, p(A and B), is the same as the probability that event A happens, p(A), times the probability of event B happening given event A has happened, p(B|A). I'm not finding any great purely intuition-based explanations for why that is, but one way to see it is that for both events to happen, A has to happen, and then, restricted to the cases where A happened, B also has to happen - which is exactly p(A) * p(B|A). If you think about it long enough, I think it becomes pretty intuitive.
The final line is the result that we want though. It's probably not abundantly clear what it's useful for, so let's use the xkcd comic as an example. We want to know the probability that the sun has exploded given that the machine said it did. So the first term on the right side of the equation would logically be the probability that the machine said the sun exploded given that the sun actually did explode. This would be 35/36 because the only way it would lie is if two sixes were rolled. Next is the probability the sun exploded - let's just say 1/1,000,000. Last is the probability the machine says the sun exploded. I'm cheating here, but that will basically just be ~1/36 because it will pretty much only say the sun exploded when it rolls double sixes. The exact numbers aren't that important, but the result of solving for the left side of the equation is pretty useful.
p(sun exploded | machine said sun exploded) = ((35/36)*(1/1,000,000))/(1/36) = 0.000035 or 0.0035%
This tells us that if we accept those probabilities we used, the probability that the sun actually exploded given that the machine told us it did is actually incredibly small.
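And if you'd rather let the computer do that plug-and-chug, it's a one-liner (same made-up numbers as above):

```python
def bayes(p_b_given_a, p_a, p_b):
    """p(A|B) = p(B|A) * p(A) / p(B)"""
    return p_b_given_a * p_a / p_b

p = bayes(p_b_given_a=35 / 36,      # machine says "exploded" given it really did
          p_a=1 / 1_000_000,        # prior probability the sun exploded
          p_b=1 / 36)               # machine says "exploded" at all (approximately)
print(f"{p:.6f}")                   # 0.000035
```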
Now About the Bracket Part

Since predictions are made for all possible matchups before the tournament starts, after a couple rounds your algorithm can't know anything for certain about the teams one team would have had to beat in order to get where they are. But fortunately we have a bunch of predictions that are supposed to tell us the probability that one team beats another. This means we can use Bayes Theorem to find something like the probability that team A just beat team E given that team A got to the 4th round. I'll get to why that might be useful to know, but let's keep going with the math. The way the bracket is set up, we know there is a 100% chance that team A got to round 4 given that it just beat team E. So there's the first term for Bayes Theorem. Skipping to the last term, this is just the probability that team A beat one of E, F, G, or H. Then the middle term is similar but just the probability that A beats E.

Now you might think these are super simple lookups into the prediction model, but note that these predictions don't know whether E, for instance, would even make it far enough to play A. To figure out how likely it is that E even makes it that far, you essentially do the same thing over again, but from E's perspective. You keep working backward to earlier rounds until you get back to 100% probabilities - the matchups that were explicitly chosen for the first round. Once this has been done for all teams at all positions, you know how likely it is to see any particular matchup.
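Here is a rough sketch of what that backward chaining could look like in code. The data format is assumed, not the competition's actual one: `bracket` is the list of teams in bracket order, and `p_win[(a, b)]` is the pre-tournament prediction that team a beats team b.

```python
# `bracket` holds the 2**k teams in bracket order (slots 0 and 1 play in round 1,
# the winners of slots 0-1 and 2-3 meet in round 2, and so on).

def possible_opponents(slot, rnd):
    """Slots that the team in `slot` could face in round `rnd` (1-indexed)."""
    block = 2 ** rnd                      # group of slots that round `rnd` resolves
    start = (slot // block) * block
    half = block // 2
    if slot - start < half:               # opponent comes from the other half
        return range(start + half, start + block)
    return range(start, start + half)

def reach_probs(bracket, p_win, num_rounds):
    """reach[r][team] = probability `team` wins its first r - 1 games."""
    reach = {1: {t: 1.0 for t in bracket}}      # round 1 matchups are fixed: 100%
    for rnd in range(2, num_rounds + 1):
        reach[rnd] = {}
        for slot, team in enumerate(bracket):
            # sum over every opponent the team could have faced in round rnd - 1
            p_won_last_round = sum(
                reach[rnd - 1][bracket[o]] * p_win[(team, bracket[o])]
                for o in possible_opponents(slot, rnd - 1)
            )
            reach[rnd][team] = reach[rnd - 1][team] * p_won_last_round
    return reach

def beat_given_reached(reach, p_win, team, opp, rnd):
    """Bayes: p(team beat opp in round rnd - 1 | team reached round rnd).
    p(reached round rnd | just beat opp) is 1, so this is simply
    p(met and beat opp) / p(reached round rnd)."""
    met_and_beat = reach[rnd - 1][team] * reach[rnd - 1][opp] * p_win[(team, opp)]
    return met_and_beat / reach[rnd][team]
```

`reach_probs` is the part that keeps working backward until it hits the fixed first-round matchups, and `beat_given_reached` is the Bayes Theorem step: the numerator is the probability the two teams met and team A won, the "given" term is 1, and the denominator is the probability team A reached that round at all.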
So why is this useful? I am probably not yet taking full advantage of it, but so far I have been able to use it to characterize the teams a team probably would have had to beat to get to the position it is currently in. Expected value is pretty useful here. For instance, you could take the probability that a team beat any particular other team by a certain point, and then use that probability as a weight on each team's seed. This would allow you to see the "average" seed you would expect this team to have beaten so far.
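Continuing the sketch above (and assuming some hypothetical `seeds` lookup), the expected-value idea is just that conditional probability used as a weight:

```python
def expected_seed_beaten(reach, p_win, seeds, bracket, team, rnd):
    """Expected seed of the opponent `team` beat in round rnd - 1,
    given that `team` reached round rnd."""
    slot = bracket.index(team)
    return sum(
        beat_given_reached(reach, p_win, team, bracket[o], rnd) * seeds[bracket[o]]
        for o in possible_opponents(slot, rnd - 1)
    )
```

Averaging that over every round a team has survived gives the "average seed beaten so far" number described above.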
It becomes more clearly useful with team features other than seed, because the bracket can be far more unbalanced when you look at something like historic tournament success. Imagine a team could easily end up in the championship game without having to play a single team that ever made it past the first round - or something crazy like that. But maybe their opponent couldn't possibly have made it to the championship game without beating at least four previous champions. This is an extreme example, but it shows the kind of bracket imbalance you can take advantage of this way.
I don't really like using the official seed data in my prediction model. Not only does it fail to capture the potential lack of balance in the bracket, but it is also an integer rank that doesn't necessarily tell you how much better one team is than another. That's why I have always created my own, more relative seeding feature to use instead.
Genetic Programming
My first two years in this competition, this seed alternative feature was a sort of metric based on regular season game results. I tend to call it my own ranking system. Previously, I just made this metric/formula by hand. For instance, I may have said a home loss is 3 times worse than an away loss, or something equally arbitrary. Then I would just accumulate a score for every game. After getting a proper introduction to genetic algorithms and genetic programming during my final semester at Taylor, I thought this would be a fun place to try genetic programming out.
Genetic programming is an application of genetic algorithms - which you can learn about in my first real post (the 5th paragraph is the most useful, if you're lazy). One major difference between that genetic algorithm approach and this genetic programming approach is that the individuals/offspring/solutions are much more dynamic. For the Traveling Salesman Problem (TSP), the solutions were just: go from point A to B to C to D.... For this March Madness "ranking" problem, a solution might be as simple as: opponent wins / opponent games played. Or it could get much more complicated by giving the program the ability to use more operators (addition, subtraction, multiplication, exponentiation, tetration) and more operands (home or away, score, turnovers, etc.). At first the program will just be trying random formulas, but as evolution occurs, certain sub-formulas may be combined into better formulas and in theory those useful sub-formulas will become extremely common within the pool of solutions to select from for "breeding" more solutions.
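To give a feel for what those formula individuals look like, here is a toy version of the representation: a formula is just an expression tree, and crossover swaps a random sub-formula from one parent into another. The operand names (`opp_wins`, `home`, and so on) are invented stand-ins, not my actual feature set.

```python
import random

OPERATORS = ["+", "-", "*", "/"]
OPERANDS = ["opp_wins", "opp_games", "home", "score_margin", "turnovers"]

def random_formula(depth=3):
    """A formula is either an operand (leaf) or a tuple (operator, left, right)."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(OPERANDS)
    return (random.choice(OPERATORS), random_formula(depth - 1), random_formula(depth - 1))

def evaluate(formula, game):
    """Evaluate a formula tree against one game's stats (a dict of operand values)."""
    if isinstance(formula, str):
        return game[formula]
    op, left, right = formula
    a, b = evaluate(left, game), evaluate(right, game)
    if op == "+": return a + b
    if op == "-": return a - b
    if op == "*": return a * b
    return a / b if b != 0 else 0.0       # "protected" division, a common GP trick

def random_subtree(formula):
    """Pick a random node (sub-formula) out of a formula tree."""
    nodes = []
    def collect(node):
        nodes.append(node)
        if isinstance(node, tuple):
            collect(node[1]); collect(node[2])
    collect(formula)
    return random.choice(nodes)

def crossover(parent1, parent2):
    """Offspring: parent1 with one sub-formula replaced by a sub-formula of parent2."""
    donor, target = random_subtree(parent2), random_subtree(parent1)
    done = [False]
    def rebuild(node):
        if not done[0] and node is target:
            done[0] = True
            return donor
        if isinstance(node, tuple):
            return (node[0], rebuild(node[1]), rebuild(node[2]))
        return node
    return rebuild(parent1)

# The simple example from above, opponent wins / opponent games played:
simple = ("/", "opp_wins", "opp_games")
print(evaluate(simple, {"opp_wins": 20, "opp_games": 30}))   # 0.666...
```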
Similar to the TSP genetic algorithm, there needed to be some sort of fitness function that decided which solutions were suitable for breeding. Mine got a little complicated, but essentially every regular season game would either add to or subtract from a team's rank/score. At the end, the teams were ordered from highest score to lowest. Since the ultimate goal is to produce a score that helps predict how far a team will go in the tournament, the ranking was supposed to match historical data on how far teams got in the tournament. This seemed like a better option than trying to match a random power ranking or seeding, since it was actually based on how far the teams ended up going.
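The fitness piece might look something like this; the data shapes are assumed, and I'm using Spearman rank correlation here as a stand-in for whatever the real "how well does my ordering match how far teams actually got" score was:

```python
from scipy.stats import spearmanr

def fitness(formula, season_games, tourney_wins):
    """Score a candidate formula by how well the ranking it produces lines up
    with how far teams actually went in that year's tournament.

    season_games: {team: [per-game stat dicts]}        (hypothetical format)
    tourney_wins: {team: tournament games actually won}
    """
    # accumulate a score for every regular season game, as described above
    scores = {team: sum(evaluate(formula, game) for game in games)
              for team, games in season_games.items()}
    teams = list(scores)
    corr, _ = spearmanr([scores[t] for t in teams],
                        [tourney_wins[t] for t in teams])
    return corr   # higher rank correlation = fitter formula
```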
It didn't take many generations for the genetic program to produce a solution that outperformed the one I had spent hours deciding upon and tweaking back when I was doing things manually. However, my genetic program is still far from perfect. Preventing the solutions from turning into gigantic, complex formulas would probably require building in some of the algebraic laws like associativity and commutativity. That could help simplify those complex formulas and also keep the same formula from showing up in 20 different forms.
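One possible way to do that (which I haven't built in) would be to run every individual through a symbolic simplifier so that algebraically identical formulas collapse to one canonical form. A quick sympy sketch:

```python
import sympy

opp_wins, opp_games = sympy.symbols("opp_wins opp_games")

# Two individuals that encode the same formula in different shapes
f1 = (opp_wins + opp_games) / opp_games
f2 = opp_wins / opp_games + 1

print(sympy.simplify(f1 - f2) == 0)   # True: algebraically identical
```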
There's always next year
While these ideas didn't pan out this year, I will refine and expand them for next year. Hopefully before then, though, I will put together an easy-to-follow post on the research I started junior year at Taylor and have continued to tweak and refine ever since. Until then...