Staff Blog: Reviews -- What's in a number?

On 11/21/2011 at 03:17 AM by Jason Ross

Seriously. What's in a number, anyway?

I mean that. The other day, I mentioned the reviews from Tom Mc Shea for Zelda: Skyward Sword and Uncharted 3. The former received a 7.5, the latter received a 9.0.

What's the difference? One's higher than the other, sure. Ok. There's that difference. But, I mean, do we approach this like one would approach the grading scale? I think most would agree that games tend to range in the higher areas of the 100-point scale. It doesn't even take MetaCritic to figure that out. That's just the way things are.

So what's the difference between a 7.5/75 and a 9.0/90? Does that mean out of every hundred games the reviewer has played, the former scored game would just barely make a top-25 list, while the latter would make the top-ten? No, not really. Does it mean the first score is like earning a C? I believe most grading scales currently give 75s a C grade, right? That means the latter game is a low “A,” right? But C is supposed to be average, right? If these numbers correspond with letter grades, then why not just use letter grades? People already associate those with things, right?

So does that mean Tom Mc Shea thinks Skyward Sword is a C game? I'm not sure. I know GameSpot qualifies a 7.5 as “good.” An 8 is “great.” What's the difference between a good game and a great one? Numerically, apparently anything starting with a 7 is good. Next time someone asks you to guess a number between one and ten, don't just guess a good number, guess a great one! Pick 8. GameSpot says that's great!

Anyway, if numbers correspond to grades, what's the point of all those failing numbers? 60-69 is a D, yeah. Below that? F. So if we do quantify a failing grade as those below 60, what qualifies failure in a game?

I mean, how do you decide a game is only 60% of what a game should be? Is there a grading rubric? I'm referencing GameSpot a lot here, yeah. Deal with it.

Looking at GameSpot, they don't even have something as comprehensive as PixlBit's Policies page, which grants reviewers and readers an idea of how to score games. On our scale, three stars is average. More is above average. Less is below average. One star generally means aside from being able to function, the game isn't much good. No stars means the game is literally broken. Four stars means practically everyone should consider the game. It might not be for everyone, but likely there are qualities everyone can appreciate in it. Five-star games are conceivably the best games out there. Do we always match this standard? Probably not. I know I try to, and I encourage fellow staff to use this rubric as the basis for scoring, as do Nick and Chessa, but again, we can't play every game, and even between us, with clear language defining these areas, there are places in which we disagree.

So what's the difference between a 7.5 and a 9? 1.5. The difference between “Good” and “Editor's Choice”? Looking at Metacritic, a game scoring 90-100 has “Universal Acclaim.” 75-89 means “Generally Favorable Reviews.” 50-74 means “Mixed or Average Reviews.” 20-49 means “Generally Unfavorable Reviews,” and anything in the bottom 0-19 means there's an overwhelming dislike. Worth noting: for movies, TV, and music, Metacritic attempts to normalize the scores so that the same labels cover roughly even blocks, where each 20 points represents a tier on their scale.

What's notable, even for games? Being in the middle tier on this list only requires a consensus score of 50. That's well below a “D” on the grading scale. For other types of entertainment media, the score is even lower! Does this mean the most popular review aggregator on the web doesn't find much value in interpreting scores as grades? I'm not intending to put words in anyone's mouth, but I personally believe that's the case. Also notable? The bottom 2/5ths of their rubric contains ½ of the possible points. From this, we can infer what we likely already knew: Review scores do tend to skew toward the higher end of the 1-100 range. In fact, showing further imbalance, the first tier's range is just 10, the second's a larger 15, and the third's a pretty massive 25. This definitely isn't your high-school grading scale.
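For anyone who wants to poke at those tiers, here's a rough Python sketch. The tier floors are just the ones quoted above for games, the letter-grade cutoffs assume a typical 90/80/70/60 school scale, and the function names are made up for illustration. None of this is anything Metacritic publishes as code.

```python
# Tier floors for game metascores, as quoted in this post (an assumption,
# not an official data source).
GAME_TIERS = [
    (90, "Universal Acclaim"),
    (75, "Generally Favorable Reviews"),
    (50, "Mixed or Average Reviews"),
    (20, "Generally Unfavorable Reviews"),
    (0, "Overwhelming Dislike"),
]

# Assumed school grading scale: 90 A, 80 B, 70 C, 60 D, below that F.
GRADE_FLOORS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def game_tier(score):
    """Return the tier label for a 0-100 metascore."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, label in GAME_TIERS:
        if score >= floor:
            return label


def letter_grade(score):
    """Return the letter grade for a 0-100 score on the assumed scale."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"


def tier_widths():
    """Return how many points each tier spans, top tier first."""
    widths = {}
    ceiling = 100
    for floor, label in GAME_TIERS:
        widths[label] = ceiling - floor
        ceiling = floor
    return widths


if __name__ == "__main__":
    for score in (90, 76, 72, 50):
        print(score, game_tier(score), letter_grade(score))
    print(tier_widths())
```

Run it and a 50 lands in the middle tier while earning an F on the school scale, and the widths come out uneven from the top down, which is the whole mismatch in a nutshell.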

But when looking at scores, what do they mean? How is an 82 substantially different than an 89? How would a 76 metascore mean much different from even a 72? They're in different tiers, right?

I want to make something clear now. I don't know where I'm going with this. I'm doing the light research while I write, while I think. There's a reason this is a blog entry and not an editorial. Don't expect a grand conclusion.

I think the only point I can make is this: Even Metacritic, with all of its review averages, says something: Not all score gaps are created equal in modern gaming reviews, and more, 100 possible ranking places are too many. Seriously. Look: They take their averages, and qualify them into five categories, right? Sure. Yeah. But they go one step beyond that, and divide them into just three more-encompassing areas. Green, Yellow, and Red. I must presume that green means go. Yellow scores must warrant consideration, and red scores likely mean something like stop. Kind of like traffic lights, aye?

The point is, while the premise of ranking something with 100 potential points seems like it might be useful, so little meaningful information can be conveyed in something like a score that having 100 different places it can go just means there are too many for any real meaning. Oh, hey, I'm not talking about Metacritic anymore. Now I'm back to normal reviews. What I'm saying is that I do find fault in those publications that give scores like 82, 83, and 84. It's pointless. Irrelevant. Meaningless. As if a number can accurately portray very slight differences between qualities of games. Even the aggregate place with all the numbers from a large variety of reviewers wants to make things simpler to process and more informative through the basic generalities.

So what do we have so far? First, I struggle to find value in numbers in general, especially when those numbers don't have some predefined weight. Secondly, when using numerical scores, generally placing games in tiers seems to be a lot easier to understand. If one were to say that Uncharted 3 is in the top tier of video games this year, and the same person were to say Zelda: Skyward Sword is in the second or third-highest tier, that holds a lot more weight than 9.0 and 7.5, right?

In this respect, that's one thing I like about PixlBit. The fact is, we've got 11 different tier placements, where six have clear meanings, and the other five are in-between. There's enough variety in the design of the meanings that the middle-ground is pretty much in the middle, and due to our definitions, let me say, I am not afraid to use the middles. Do I see Epic Mickey as a below-average title? Yes. Did I think in the big picture, Sonic Colors was right underneath that line of average platformers? Certainly. What did my score say about Naruto Shippuden: Dragon Blade Chronicles? It was a functional game, but it didn't really have any inherent value in being played.

Alternatively, I think Kirby's Return to Dreamland is a great light-hearted platformer for individuals and groups of players. It isn't the best; it's conceivable that an earlier Kirby game, Super Star, had some better points. I gave it four stars. To me, that says it's absolutely a game someone interested in the genre should consider buying. No, not everyone in the world would find its value, but it's up there. Spider-Man: Dimensions for the Nintendo DS received 3.5 stars. It had faults, but also had several strong points.

Essentially, I'm just touting my own record with this. Yes, I did find Spider-Man: Dimensions to be a better game than Sonic Colors, both in construction and fun-factor. Generally, I believe the games sit on two separate, clearly marked tiers. While both sit between our defined stars, that's because these games kind of rest in the mean of the other tiers, plain enough. Other publications I haven't addressed do use letter grades, like in school, and some use 20 or 21 data points over a 10-point scale, like GameSpot currently does. My thought is that anything really over 11 is a little bit of overkill. Sometimes I wonder if 11 is a bit too much, but in my reviewing history, I have to say the range has served me pretty well. Additionally, a visual of “stars” helps kind of keep the idea of numbers a little bit out of readers' heads, at least in my perspective. I don't personally believe that 3.5 PixlBit stars is exactly the same as a 7/10, even though it relates to that mathematically. That's just my own feeling. Is it lower or higher? Eh. Like I said, I can't figure out what the numbers always mean.
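If you want to see where the 11, 20, and 21 come from, here's a quick Python sketch that counts the places a score can land on a scale with half-point steps. The function names and the choice of lowest possible score are my own assumptions for illustration, not anything a publication documents.

```python
def count_points(low, high, step):
    """Count the distinct scores from low to high, inclusive, at a fixed step."""
    return round((high - low) / step) + 1


def stars_to_ten(stars):
    """Straight mathematical conversion of a 5-star score to a 10-point score."""
    return stars * 2


if __name__ == "__main__":
    # Zero to five stars in half-star steps: six whole-star tiers, five in-between.
    print(count_points(0, 5, 0.5))
    # A 10-point scale in half-point steps, starting at 0 or at 0.5.
    print(count_points(0, 10, 0.5), count_points(0.5, 10, 0.5))
    # The purely arithmetic link between 3.5 stars and a 7/10.
    print(stars_to_ten(3.5))
```

The conversion is only arithmetic, of course. Whether 3.5 stars reads the same as a 7/10 to an actual person is exactly the part the math can't answer.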

I mention utilizing fewer data points not just to laud PixlBit's system, but to introduce another idea. There's a small, but likely growing demand from people looking for publications to rid themselves of scores. Understandably, they, like me, have difficulty discerning the value in one number over another, and like me, they know most people ignore writing, and instead opt to quickly view a score, each with his or her method to evaluate said scores. For a while, that “blog” Kotaku tried eliminating scores, and it didn't really work out. Why not? Because in a couple cases, the review would contain very little criticism or analysis. It would read more like a block of information. Reviews are meant to pass judgment and opinion on in a way that lets the reader make informed decisions, but plain information really can be just as meaningless as pinpointing a game's position on an abstract 100-point scale to those who are truly undecided.

Similarly, other individuals and publications want their reviews to say one of three things: Buy, Rent, or Pass. I cannot conceive of a situation where those three values are simply enough. Sometimes a game may receive a review that considers it to be a masterpiece... but what if it doesn't have much merit in playing again? What if it's only a few hours long? Take a look at Super Metroid. It's very easy to believe that Super Metroid is a game players can beat in a handful of hours. Playing through the game just once will provide gamers with a clear, complete experience. While the game can be played several times over in several different ways, that idea isn't appealing to everyone out there. It's more a matter of choice. What one person would consider a must-buy due to the game's expansive replayability, another would say is a rental because it's relatively short. Similarly, take a look at the often-criticized movie-like games, like the slew of mid-90's PC titles that used live-action video to create something of a “Choose-your-own adventure” story. People buy and rent movies, don't they? In the 90's, would reliving the experience be something valued, worth purchase over rental? Don't ask me.

More contemporary, how about an Ace Attorney game? These things play out kind of like mystery novellas, and in all reality, they're pretty short. You've seen the story once, would you want to see it again? Not everyone would, but some people are fanatical over Phoenix Wright, right? How about Zelda? Much of the gameplay is described as being emblematic of puzzles. Once a puzzle has been completed, in most cases, players will know how to complete said puzzle in the foreseeable future, right? For at least some, wouldn't that make some titles in the franchise worthy of renting without buying?

Moreover, people enjoy different genres to different degrees. One man might say “Pass” to Modern Warfare because he has a strong aversion to violence, for whatever reason. Another might say “Pass” to Super Mario World, because it's too cartoony for him. Buy/Rent/Pass just isn't a viable option, because it generalizes too much about everyone out there, and as a result, like a hundred numbers, loses quite a bit of meaning pretty quickly.

I've droned on to the fourth page in my word editor this evening. I think that's about enough talk about review numbers for now. I think I've come to a stronger-made conclusion tonight than I did last night about the necessity of consistency. What have I concluded? Too many data points create a lack of meaning, while too few create a lack of meaning, too. In my book, I'm happiest with a handful or two of points to rate things on that have well-written, thought-out meaning to them. I don't know how aggregate scores work around that, not exactly, but that's not really of my interest, in all fairness. Oh, and I also don't see why any individual gaming enthusiast would really care about something like a metascore being altered by any one review outside of its prior range, but maybe we'll get into that later.

Maybe.

Nick DiMola Director

11/21/2011 at 08:35 AM

Reviews are an interesting subject and I've enjoyed what you have here, Jason, even the memory stuff. As I've stated a number of times in the past, Chessa and I thought long and hard about how we'd be scoring games on this site before it was opened.

At one point we had even discussed the Buy, Try, Avoid system, which we actually thought was pretty good - but there were a few problems. #1, a recommendation of that sort is a little ambiguous (who does it apply to, does every good game deserve to be bought, etc). #2, a recommendation is just that, a recommendation. It's not a critical evaluation of the game, it's just what you think someone should do. While your text should absolutely reflect your critical opinion, like it or not, it's a critic's job to provide a rating. #3, you can't get picked up on Metacritic. Despite anyone's feelings on the site, they are a key player in this industry and being aggregated there is extremely important.

Of course, this has resulted in the 5 star + Recommendation system we have here. We never quite go with the Buy, Try, Avoid wording, but try and steer recommendations to tell different players what's worth their money. Some games can hit all three recommendations given their content - Demon's/Dark Souls comes to mind.

Naturally, I agree with you in terms of using 5 stars. I think it's one of the best mediums to convey a score. It has enough granularity to properly classify a game and leaves little ambiguity.

Now, let me share a story. Back when I was a kid, I also read Nintendo Power. If you remember correctly, for a long time they used the 100 point scale to rate games. I was really into this rating scale. In my mind, it helped categorize and order every game out there. The granular scores meant, to me, that I could definitively say one game was better than another and if you were to make a list, x game would come before y game because x got a 95 and y got a 93.

If you remember still, Nintendo Power switched their rating system to the same 5 star scale we use today. I was infuriated. How on earth could I possibly get an idea if I should buy one game over another? Both x and y now have a 5 star score - which one is better?

The fact of the matter is, it doesn't matter, it all comes down to preference because they're clearly made from the same stuff. What I didn't understand and maybe Nintendo Power didn't at first either, is that the 100 point scale may give you the flexibility to score one game over another by a minute amount and that works great for a while, but what happens when two games get the same score and you clearly state one is better than the other? Well the whole ordering system goes out the window, that's what.

Fact of the matter is, it's not our job, in my opinion, as critics to figure out where each game sits in the grand scheme of things. Metacritic should be worrying about that one. It's our job to score the game based on some quantifiable and well defined scale. I appreciate that you use our policy guidelines, as I consult them each time I write a review to help clearly designate how I feel about a game. I can't imagine having the burden of choosing a score out of 100. How do you justify it? What do you say to the inevitable ranking questions?

I think any system, outside of the 100 point scale, can work great as long as the guidelines are in place that define each score. I'll probably write a staff blog of my own on why I think scores tend to skew high and an interesting cycle I've begun to notice just within the PixlBit ecosystem.

Jason Ross Senior Editor

11/22/2011 at 12:40 AM

Tough luck, Nick. I'm not going to have another one of these up tonight. Got too much other stuff to do, including my own Holiday Buyer's Guide, a GameCube retrospective piece, a Sonic Generations Review, and some editing of my Otomedius Excellent review... Dun dun dun!
