Give me a layman's description of your rating system.
First, to avoid confusion, be aware that I publish two different sets of rankings:
- the "Massey Ratings", which utilize actual game scores and margins in a diminishing returns fashion
- the BCS compliant version, which does not use the actual score
I will summarize the latter, since that is the one relevant to college football fans.
Massey's BCS ratings are the equilibrium point for a probability model applied to the binary (win or loss) outcome of each game.
All teams begin the season rated the same. After each week, the entire season is re-analyzed so that the new ratings
best explain the observed results. Actual game scores are not used, but homefield advantage is factored in,
and there is a slight de-weighting of early season games. Schedule strength is implicit in the model,
and plays a large role in determining a team's rating. Results of games between well-matched opponents naturally
carry more weight in the teams' ratings.
The final rating is essentially a function of the team's wins and losses relative to the schedule faced.
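To make the idea of "an equilibrium of a win/loss probability model" more concrete, here is a minimal sketch using a Bradley-Terry style logistic model. It is not the actual Massey BCS algorithm: the function name, the `home_edge` value, the small `prior` pull toward zero, and the toy game list are all assumptions for illustration, and the de-weighting of early season games is left out.

```python
import math

def fit_win_loss_ratings(games, home_edge=0.2, prior=0.1, steps=5000, lr=0.05):
    """Fit ratings so that P(home wins) = logistic(r_home - r_away + home_edge).

    games: list of (home_team, away_team, home_won) tuples; scores are never used.
    prior: small pull toward zero so that unbeaten teams keep finite ratings.
    """
    teams = sorted({t for h, a, _ in games for t in (h, a)})
    r = {t: 0.0 for t in teams}  # all teams begin rated the same
    for _ in range(steps):
        # Gradient of the penalized log-likelihood over the whole season.
        grad = {t: -prior * r[t] for t in teams}
        for home, away, home_won in games:
            p = 1.0 / (1.0 + math.exp(-(r[home] - r[away] + home_edge)))
            err = (1.0 if home_won else 0.0) - p  # observed minus expected
            grad[home] += err
            grad[away] -= err
        for t in teams:
            r[t] += lr * grad[t]
    return r

games = [
    ("A", "B", True),   # A beat B at home
    ("B", "C", True),   # B beat C at home
    ("C", "A", False),  # A won at C
    ("D", "C", False),  # C won at D
]
ratings = fit_win_loss_ratings(games)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Every pass re-fits the entire season at once, so each team's rating depends on its opponents' ratings, which is how schedule strength enters without being computed as a separate term.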
How are your BCS ratings different than the others?
All of the BCS computer rankings
are based on wins and losses relative to schedule faced.
Most of the differences can be attributed to the particular
mathematical model used to generate the ratings.
There is no tidy term in a one line formula I can point to as the difference between mine and the others.
Here are some small data-related differences:
- does the rating utilize homefield?
- does the rating start every team at zero, or with a preseason value?
- does the rating weight more recent games more?
- does the rating include all teams, or just FBS?
Overall, the BCS computer rankings probably correlate more than a random selection of six human poll ballots would.
Give a quick bio / resume
Somebody has given me a Wikipedia page.
I can't vouch that it is 100% accurate, but it's a good place to start.
Kenneth Massey is a professor of mathematics at Carson-Newman University in
Jefferson City, TN. He received his B.S. from Bluefield College and a master's degree (ABD) in mathematics from Virginia Tech.
His research involved Krylov subspaces in the field of numerical linear algebra.
Kenneth is a partner with Limfinity consulting,
and produces the Massey Ratings, which provide objective
team evaluation for professional, college, and high school sports. His college football
ratings have been a component of the Bowl Championship Series since 1999.
Massey has also worked with USA Today High School Sports since 2008.
How did you get involved with the BCS?
I started working on college football ratings as an honors project in
mathematics while at Bluefield College in 1995.
Continuing this interest as a hobby, I developed a web page and helped pioneer
the organization of college football rankings via my composite.
The BCS, which started in 1997, realized the need to expand its sample of computer ratings
from three to seven. My web site became a central resource point as the BCS officials
searched for quality, respected, and well-established computer ratings.
I received a phone call from SEC commissioner Roy Kramer in the spring of 1999 to discuss
the prospect of adding my ratings to the BCS formula.
Mine were chosen because of their demonstrated
accuracy and conformance to the consensus, and my personal expertise in the field.
How do you evaluate a computer rating to know which one is the best?
By nature, a rating system tries to explain past results according to some
model of the probabilistic and dynamic aspects of competitive sports.
Therefore a rating system may by definition be the "best" according to
the objective function it tries to optimize, e.g. to minimize the MSE between
actual and model scores (in hindsight).
Depending on the objective, data used, and weightings, some systems may be
designed to "predict" future outcomes, instead of merely "retro-dicting" past outcomes.
This is a valid approach if you accept the maxim that past results are an indicator
of future performance.
On the college football and basketball
composite pages, I compute two metrics: correlation to consensus and ranking violation percentage.
These are not meant to assess the quality of a rating system, but only to provide a crude reference point
for comparing and contrasting different systems.
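The two metrics are named above but not defined, so the following is only a sketch under common definitions: a "ranking violation" is assumed to be a game whose winner is ranked below the loser, and "correlation to consensus" is assumed to be a Spearman rank correlation. The function names and the toy data are hypothetical.

```python
def violation_pct(rank, results):
    """Percent of games in which the winner is ranked below the loser.

    rank: dict mapping team -> ordinal rank (1 is best).
    results: list of (winner, loser) pairs.
    """
    violations = sum(1 for w, l in results if rank[w] > rank[l])
    return 100.0 * violations / len(results)

def spearman(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same teams (no ties)."""
    n = len(rank_a)
    d2 = sum((rank_a[t] - rank_b[t]) ** 2 for t in rank_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

results = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
system = {"A": 1, "B": 2, "C": 3, "D": 4}
consensus = {"A": 1, "B": 3, "C": 2, "D": 4}

pct = violation_pct(system, results)  # C beat A, so 1 of the 4 games is a violation
rho = spearman(system, consensus)     # the two rankings differ only in B and C
```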
Todd Beck's prediction tracker
documents predictions made a priori. It has proven difficult for any rating system to consistently
be superior to the Vegas lines.
Explain how your system uses margin of victory (MOV).
The BCS compliant version does not use MOV at all. There is no distinction between a 21-20
nailbiter, and a 63-0 blowout.
The main version does consider scoring margin, but its effect is
diminished as the game becomes a blowout. The score of each game is
translated into a number between 0 and 1. For example, 30-29 might give 0.5270,
while 45-21 gives 0.9433 and 56-3 gives around 0.9998.
The maximum is capped at 1, so the curve flattens out for blowout scores.
In addition, I do a Bayesian correction to reward each winner, regardless of the game's score.
The net effect is that there is no incentive to
run up the score. However, a "comfortable" margin (say 10 points) is preferred to
a narrow margin (say 3 points).
In summary, winning games against quality competition overshadows blowout scores against inferior opponents.
Each week, the results from the entire season are re-evaluated based on the latest results.
Consistent winners are rewarded, and a blowout score has only marginal effect on a team's rating.
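The function that translates a score into a number between 0 and 1 is not given here, so the sketch below only illustrates the diminishing returns shape. It assumes a normal CDF applied to the point margin with an assumed `spread` of 15 points; it will not reproduce the example values quoted above exactly, and it leaves out the Bayesian correction.

```python
import math

def game_outcome_value(winner_pts, loser_pts, spread=15.0):
    """Map a final score to a value between 0.5 and 1 for the winner.

    Uses the normal CDF of margin / spread; `spread` is an assumed constant.
    """
    margin = winner_pts - loser_pts
    return 0.5 * (1.0 + math.erf(margin / (spread * math.sqrt(2.0))))

close = game_outcome_value(30, 29)        # barely above 0.5
comfortable = game_outcome_value(45, 21)  # most of the way to 1
blowout = game_outcome_value(56, 3)       # only slightly higher than that
```

Going from a 1 point win to a 24 point win moves the value a long way, while going from 24 points to 53 points adds very little, so there is nothing to gain from running up the score.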
Explain how your system uses strength of schedule (SOS).
Results and SOS are the yin and yang of computer ratings. Simply put, a team's rating
measures their performance relative to the schedule they faced. In a true computer rating,
rating and SOS are inter-dependent, and are calculated in conjunction with each other.
This relationship is implied by the solution to a large system of simultaneous equations, which
represents an equilibrium of some mathematical model. Ratings are a function of SOS, and vice-versa.
Many fans are familiar with crude RPI type systems that may account for SOS by the records of opponents
and opponents' opponents. A sophisticated computer rating system does not utilize such artificial and ad hoc factors.
Instead, because SOS is an implicit part of the model, it accounts for opponent strength to an effectively infinite
number of levels. It makes no sense to say that the SOS component is a certain percentage of the total rating.
To gain a glimpse of how SOS can be implicit, consider this simple equation:
rating = performance + SOS
Here performance could be related to win-loss record, or some other objective measure of success.
Each team has an equation of that form, which may be re-arranged to get:
SOS = rating - performance
Rating and schedule are functions of each other. All possible connections
between teams are accounted for as the cream rises to the top and an equilibrium is reached.
As a corollary, there is no need for explicit reference to conference affiliation.
Conference strength is accounted for automatically as the model surveys the full topology
of the team/game network.
Aggressive scheduling is not penalized, but instead raises the potential rating that a team may reach.
However, scheduling alone doesn't earn a high rating - there must be success against it.
Faith without works is dead. -- James 2:20
For a single game, it is better to defeat a poor team than lose to a good team. However,
that team's ranking may fall because another team had a more impressive win.
Depending on the gap in SOS,
an 11-1 team with a tough SOS may be rated higher than a 12-0 team that faced an easier schedule.
Of course, there are symmetric forces at work toward the lower end of the rating spectrum.
Since the actual rating model involves non-linear equations, the notion of "average" SOS may
be misleading. Games between equally matched teams are more influential to a team's overall rating.
For example, the #1 team's risk/reward is greater for playing #2 and #80 than for playing #30 and #31.
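One way to see why evenly matched games are more influential: under a logistic win probability model (an assumption here, not necessarily the model actually used), the information a single win/loss result carries about the rating difference is p(1 - p), which peaks when the teams are equal. The function names are hypothetical.

```python
import math

def win_prob(rating_diff):
    """Logistic win probability for a team rated `rating_diff` above its opponent."""
    return 1.0 / (1.0 + math.exp(-rating_diff))

def game_information(rating_diff):
    """Fisher information p * (1 - p) that one win/loss result carries about rating_diff."""
    p = win_prob(rating_diff)
    return p * (1.0 - p)

even_match = game_information(0.0)  # 0.25, the maximum possible
mismatch = game_information(3.0)    # heavy favorite: the result says much less
```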
Does conference affiliation affect the ratings?
I don't do any prior weighting of conferences, and therefore conference affiliation
plays no direct role in the ratings. Schedule differences are implicit in the model. Conferences
that perform well in inter-conference matchups will naturally be rated higher. Since these
games provide a reference point for the entire conference, the rising tide lifts all boats.
For this reason, non-conference games are in some sense more important than conference games.
I want to develop my own rating system. Where do I start?
If you want to research existing rating systems, visit the theory page.
In particular, I recommend David Wilson's directory.
To get the feel for how computer ratings work, you may want to try this iterative procedure:
- set each team's rating to zero
- calculate each team's SOS to be the average rating of their opponents
- calculate each team's rating to be their average net margin of victory plus their SOS
- go back to step 2 and repeat until the ratings converge
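The steps above can be sketched in a few lines of Python. The function name, the tuple layout of `games`, the stopping tolerance, and the three-team schedule are assumptions made for this example.

```python
def iterate_ratings(games, max_iter=10000, tol=1e-10):
    """games: list of (team_a, team_b, a_points, b_points) tuples."""
    opponents, net_margin = {}, {}
    for a, b, a_pts, b_pts in games:
        opponents.setdefault(a, []).append(b)
        opponents.setdefault(b, []).append(a)
        net_margin[a] = net_margin.get(a, 0) + (a_pts - b_pts)
        net_margin[b] = net_margin.get(b, 0) + (b_pts - a_pts)
    avg_margin = {t: net_margin[t] / len(opponents[t]) for t in opponents}

    rating = {t: 0.0 for t in opponents}                      # step 1
    for _ in range(max_iter):
        sos = {t: sum(rating[o] for o in opps) / len(opps)    # step 2
               for t, opps in opponents.items()}
        new = {t: avg_margin[t] + sos[t] for t in opponents}  # step 3
        change = max(abs(new[t] - rating[t]) for t in new)
        rating = new
        if change < tol:                                      # step 4: converged
            break
    return rating, sos

games = [("A", "B", 28, 14), ("B", "C", 21, 17), ("A", "C", 35, 10)]
rating, sos = iterate_ratings(games)
```

At convergence each team satisfies rating = average margin + SOS. Convergence is not guaranteed for every schedule: for example, if the teams split into two groups that only ever play across groups, the plain iteration can oscillate, and averaging each new rating with the previous one is a common remedy.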
The data page
provides data for many leagues.
Do you encourage sports wagering using your numbers?
Absolutely not! Please read the disclaimer.
How does the betting line get set?
Oddsmakers use computer rating models as a tool when setting initial lines for games.
However, they also incorporate data related to injuries, officials, matchups, motivation, travel,
days off, weather, etc.
Actual wagers then determine how the line moves to balance the bets so that the bookmaker is hedged.
What software do you use?
I use only open-source software, including the common LAMP framework (Linux, Apache, MySQL, PHP/Python).
The main rating software is written in C++. Data is stored in a MySQL database.
If you would like to write your own rating software, Octave
is a good high-level mathematical language.
How do you deal with forfeits?
Forfeited results are not factored into the computer ratings. If an on-the-field
outcome was later forfeited, the original score is used in the calculations,
but the result is stricken from the win-loss records.
How do you deal with exhibition games?
Occasionally, especially in college basketball, one team counts
a game in official records, but their opponent doesn't.
This is a conundrum since we can't really judge the sincerity of strategy and effort in such contests.
However, for record keeping practicality, I have decided that a game does not
get marked exhibition unless both participating teams deem it so.
Are you satisfied with the BCS formula?
Over the years, the BCS has gotten criticized for fine-tuning its formula. Recent changes
have simplified the system for the better and removed extraneous redundancies.
The current setup is a good balance of the traditional human polls, which the fan base is most comfortable with,
and the objective computer component. Over the years, the two methods have tended to converge as
the computers have revealed to the human voters the dangers of regional bias and misunderstanding of schedule strength.
There will always be controversy when the formula must split hairs between #2 and #3, but the system
is stable and beginning to be accepted by the media and fans.
What would you change?
I think the biggest thing that is hurting college football is the lack of quality inter-conference
games. Due to the premium placed by the media on win-loss records, most athletic directors are
trying to assure themselves of 3-4 easy home wins each year in their out-of-conference schedule.
Great matchups like Texas vs Ohio St in 2005 are few and far between. I would like to see
something like an ACC-SEC challenge, whereby teams are matched up for 12 games over one weekend.
This would get fans and media excited, and also provide a more solid basis for comparing teams
from different conferences.
An increase in good matchups would require a shift in philosophy by the human polls. Currently
they start with a preseason notion of who will be good, and adjust each week according
to a predictable pattern. They are reluctant to go back and re-evaluate earlier results
in light of new information, and thus prior biases are likely to compound as the season goes on.
One result is that an 8-3 team that played a brutal schedule is often penalized, while an 11-0
team with a weak schedule is rewarded. Sometimes this strategy backfires (e.g. Auburn 2004),
but in general padding one's record with wins means a higher poll ranking. Delaying the release
of human polls until mid-season would help minimize bias and develop more respect for teams
that play challenging early season non-conference schedules.
Do you favor a college football playoff?
Every year we hear the arguments that college football should have a
playoff system. The true champion should be decided "on the field", not
by biased sportswriters or computer geeks. To hear such talk, it is easy
to believe that a playoff would produce an undisputed national champion.
I argue that the BCS is in fact a 2-team playoff.
Regardless of the playoff field's size, there will always be an (n+1)st team that claims to deserve a chance to prove itself.
The issue is therefore to develop a system that picks the most deserving teams to participate.
The BCS's job is to place the two best teams in the national championship game.
Perhaps paradoxically, the best team has a better chance of winning the college football championship
than it does in college basketball or the NFL. The fewer rounds that must be navigated, the less
chance that the best team will be upset.
A playoff provides a relatively small sample of game results.
The best system should make the regular season most important.
The BCS system is the best college football has ever had to determine an undisputed champion.
It is really a two team playoff. So the correct question is whether I favor
expanding the number of teams in the playoff. College football is unique among all sports
in that every regular season game is vital. Also, the bowl system provides a great reward to
players and fans, allowing many schools to finish the year on a high note.
A playoff should not ruin this. I would favor a 4 team playoff (the so called "plus one" system),
or possibly 8 teams if it was done right, but no more than that.
What is the purpose of this web site?
I maintain this web site primarily as a hobby which creatively combines my interests
in math, sports, and computer programming. I enjoy sharing my work
with the visitors to my site. The Massey Ratings also serve an official purpose,
most notably as a component of college football's BCS.
How long have you done ratings?
I began my foray into computer ratings with college football in 1995.
Since then, I have produced ratings for many pro, college, and high school sports.
I continually work to improve the content and quality of my ratings and web site.
The Massey Ratings have been part of the BCS since 1999.
What is the purpose of the computer ratings?
In any competitive league, there should
be an objective and robust method to measure the performance of each team/individual.
Win-loss records may be misleading if teams play disparate schedules, and polls
suffer from human limitations and subjectivity.
After devising a mathematical model for the sport, an algorithm is implemented,
and the resulting computer ratings objectively quantify the strength of each team based on the
How does your rating system work?
There is really no simple answer to this question (although I was once asked
to provide one during a live interview on ESPN). Basically, the ratings
are the solution to a large system of equations, which comes from a statistical
model and actual game data. For more details,
see the Massey Rating Description.
How much time does it take?
I have written fairly robust software to automate the calculations and
web page generation. Daily updates require little intervention on my part.
The bulk of my time is spent maintaining data files and writing computer code.
How big is your operation?
I am the sole proprietor of the Massey Ratings.
Hats I wear include: researcher, developer,
programmer, database manager, webmaster, and marketer.
Of course, I don't work in a vacuum, and this effort would
be impossible without internet resources and generous folks that
have contributed over the years. See the credits.
Where do you get your data?
Most scores are collected electronically from a variety of public domain
sources. Many individuals have graciously shared the
results of their own data collection efforts.
When convenient, basic consistency checks are run on multiple independent sources
to verify the data's accuracy.
I have written software that parses web pages and other sources, extracts
the pertinent data, and merges it with my own database.
Corrections and hard-to-find scores are entered manually.
How often do you update your pages?
Automated daily updates are scheduled for many of the mainstream sports.
Each week (usually Monday), I do a full update of all the leagues.
Is the Massey computer system the best?
This depends on what goals you feel a rating system should meet. Should the
rating system be predictive, or should it only measure and reward past performance
(such as to determine who deserves the college football national
championship)? What data is available? How is the model defined?
Basically, any rating system
can claim to be the best with respect to what it sets out to accomplish.
That said, I believe that the Massey Ratings satisfy all of the desirable
properties of a rating system. The
sophistication of the model and algorithm is beyond any other method I'm
aware of. Every feature of my system is based on sound statistical assumptions
regarding the nature of sports and games.
There are rarely any skewed or highly abnormal results, and the Massey
Ratings are highly correlated with the consensus.
My rating model has undergone several revisions.
These changes are necessary to improve the quality of my ratings in light of
new ideas, gained experience, and access to more historical data with which
to refine the method.
What's the difference between "rating", "ranking", and "poll"?
A "ranking" is simply an ordinal number (such as 1st, 2nd, 3rd,...) that indicates
a team's placement in a strictly non-quantitative sense. In contrast, a team's
"rating" is generally a measurement on a continuous scale and must be interpreted by
comparing it with other teams' ratings. For example, I can rank three teams as
follows: (1) Team A, (2) Team B, (3) Team C. This tells me that according to my
ranking criteria, A is better than B, and B is better than C. However, it does
not tell me how much
better. If ratings are assigned as
(A = 9.7, B = 9.5, C = 1.2), then it is easy to see that in fact A and B are quite
competitive while C is significantly inferior.
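The example can be written out directly: a ranking is what remains after sorting the ratings and discarding the gaps. The variable names below are chosen for illustration.

```python
ratings = {"Team A": 9.7, "Team B": 9.5, "Team C": 1.2}

# Sort by rating, best first, and number the teams 1, 2, 3, ...
ordered = sorted(ratings, key=ratings.get, reverse=True)
ranking = {team: place for place, team in enumerate(ordered, start=1)}

# The gaps between neighbors are visible in the ratings but not in the ranking.
gaps = [round(ratings[a] - ratings[b], 1) for a, b in zip(ordered, ordered[1:])]
```

Here `ranking` says only that A is ahead of B and B is ahead of C, while `gaps` shows that A and B are separated by 0.2 and B and C by 8.3.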
A poll is fundamentally different from a rating.
Polls typically result from the tabulation
of votes. For example, each ballot in college football's AP poll is the opinion
of one writer as to who should be #1, #2, #3, etc. So a poll is really a composite of
many opinions or preferences. In contrast, a computer ranking
is obtained from a single "measurement" of how good each team is based on the defining
Team A beat Team B, so why do your ratings still have B ahead of A?
This situation is usually called an "upset."
It is generally impossible to order the teams in a way that eliminates all inconsistencies
in actual game outcomes. Teams are not evaluated on the basis of one game,
in which there is potential for high deviation from typical performance levels.
Instead, a team's rating is based on its body of work, in some sense an "average" level of performance over the entire season.
Your ratings stink! Why isn't my team ranked higher?
The implementation of a computer rating
algorithm is completely objective. So if the computer gives your team a bad (or good)
rating, it shouldn't be taken personally. You have the right to disagree
with the computer, but more than likely this is evidence of your own subjectivity.
I do not meddle with the algorithm to "fix" the ratings. The
model defines certain criteria that determine a team's rating, and the results
are published on this web site without any human intervention.
What about predictions?
Ratings are designed to reflect past performance, namely:
winning games, winning against good competition, and winning convincingly.
As a consequence, the ratings have some ability to predict the outcome of future games.
For many sports, I post predictions of upcoming games and monitor their success.
In most cases, I would trust a computer's prediction over a human's.
However, while this is often the most popular and entertaining
application of computer ratings, it is not my primary purpose.
Predictions are obtained
by extrapolating the analysis of past performance to estimate future performance.
Usually, the past is a reasonable indicator of what to expect in the future.
However, sporting events are to a great extent random, so upsets will occur.
Computer ratings are also ignorant of many important factors such as injury, weather,
motivation, and other intangibles. With this in mind, it is not wise to hold
unrealistic expectations of the predictions.
The purpose of the Massey rankings is to order teams based on
achievement. This objective may occasionally yield some surprising results:
for instance, having good teams from weak conferences ranked higher
than one might expect. This is not to say that such a team is
"better" than all teams below it. It is simply being rewarded for
its success at winning the games it has played.
It is incorrect to assume that the Massey model
is predicting that the higher ranked team should
defeat a lower ranked team. Model predictions can be derived from latent variables, and
may not agree with the rankings.
Why do you post three different rating systems?
While the algorithms that produce computer ratings are objective, the choice of the
model itself is not. Multiple systems provide the opportunity to compare alternative
interpretations of the same data. Although there is general agreement,
computer ratings are also quite diverse.
The Massey Ratings are my creation, while
the Markov and Sauceda models were developed with help from friends of mine.
Are your ratings biased toward Virginia Tech?
I went to graduate school at Virginia Tech, and I am a proud Hokie fan.
However, my ratings are completely objective with
regard to every team, including Virginia Tech.
To the computer, "Virginia Tech" is just a name, and the ratings
would be the same if every team were anonymous.
The integrity of my ratings has been verified by the BCS.
Historical results show that the Hokies' Massey ranking
conforms to those produced by other polls and computer ratings.
How do you generate preseason ratings?
The BCS compliant ratings do not use preseason information, so everyone starts at zero.
A team's rating may look funny or fluctuate wildly until there is enough evidence to get a more precise
measurement of the team's strength. As games are played, the computer gradually 'learns' and
the cream rises to the top.
For the main version, preseason ratings compensate for the lack of data early in a given season.
They give the computer
a realistic starting point from which to evaluate teams that have played zero or few games.
This limits dramatic changes that could be caused by isolated results not buffered by the context of other games.
The effect of the preseason ratings gradually diminishes each week. When every
team has played a sufficient number of games to be accurately evaluated
based on this year alone, the preseason bias will be phased out.
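How the preseason information is phased out is not specified here. Purely as an illustration of a weight that diminishes as games accumulate, the sketch below blends a preseason value with an in-season value; the function name, the blending formula, and the `half_weight_games` constant are all assumptions, and with this particular formula the preseason weight shrinks toward zero without ever reaching it exactly.

```python
def blended_rating(preseason, in_season, games_played, half_weight_games=4.0):
    """Blend a preseason estimate with the estimate from this season's games.

    The preseason weight is 1 before any games and shrinks as games are played;
    it equals 0.5 once `half_weight_games` games have been played (assumed constant).
    """
    w = half_weight_games / (half_weight_games + games_played)
    return w * preseason + (1.0 - w) * in_season

week0 = blended_rating(10.0, 0.0, 0)    # all preseason
week4 = blended_rating(10.0, 0.0, 4)    # half preseason, half this season
week12 = blended_rating(10.0, 0.0, 12)  # mostly this season
```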
Preseason ratings are based on an extrapolation of recent years' results,
tuned to fit historical trends and regression to the mean.
A team's future performance is expected to be consistent with the strength of the program,
but sometimes there may be temporary spikes.
Other potentially significant indicators (e.g. returning starters, coaching changes, and recruiting) are ignored.
Therefore, preseason ratings should not be taken too seriously.
How can X be ranked so high after losing their last game so badly?
The model assesses each team's body of work, and is not prone to over-reacting to one result.
Recognizing the variation in outcomes, some games can be classified as upsets,
and blowout margins are not given undue influence.
The entire season is constantly re-evaluated, and the quality of each win, or the
severity of each loss will be adjusted in light of the opponents' other results.
The BCS version doesn't consider margin of victory, so there's no sense even mentioning how
many points X lost by. Schedule strength becomes a dominant criterion.
When two highly ranked teams play, the loser
is not penalized much if they have already beaten some quality teams.
Why do you have a BCS-compliant version of the ratings?
At the request of the BCS, I developed the BCS-compliant ratings, which
use only the bare minimum of data: winner and location. There are several good reasons
for such a system.
- It completely eliminates any incentive for the unsportsmanlike conduct of running up the score.
- It purely rewards wins and penalizes losses regardless of the score. "You play to win the game."
- Occam's Razor: we intuitively should prefer a simpler model. There's a certain
elegance to using only the win/loss outcome without regard for how it was obtained.
- When many games have been played, a good model should converge to the win-loss model anyway.
- It poses an interesting challenge, given the sparsity and variability of the data.
What advantage does a computer ranking have over human polls?
Computer ratings have two main advantages. First, they can deal with
an enormous amount of data (hundreds of teams and thousands of games).
Second, they can analyze objectively - every team is treated the same.
This latter property is often a two-edged sword because it can cause
disagreement with public opinion, which is stoked by the media.
The public demands an objective system that plays no favorites and
doesn't encourage a team to run up the score. However, they don't
always agree with the inevitable consequences of such a system.
True, insufficient data can produce abnormal and flawed results.
However, computers have no ego, so a good model will correct itself,
and provide remarkable insights long
before a human will become aware of (and admit) his mistake.
In general, it doesn't take long for computer ratings to overtake
a human poll in terms of accuracy and fairness.