Thursday, June 10, 2010

Read My Lipsync

My recent experiences playing Red Dead Redemption, along with Kirk's recent complaints in his Splinter Cell: Conviction review, are a reminder of a silent scourge amongst the narrative games of today. Despite all the technological innovations we've seen, particularly in graphics, I'm struck by one key area we appear to have mostly failed at: syncing the lips (and body language) of digital characters to their audio. I know, it sounds nit-picky, but for me it's one of the most annoying shortcomings of many games, particularly narrative titles.

As games become more realistic in how they render their characters, the problem is becoming more and more apparent - photo-realistically rendered bodies, guns, accessories and clothes... and lips that flap like something out of The Muppets. Alternatively, games can have hyper-realistic character models with decent lip-syncing but that suffer issues with the uncanny valley - dead eyes, slack or improper facial expressions and just general creepiness (Fallout 3, I'm looking at you and your low-light eye reflections). Back in the day this wasn't too big of an issue because the graphics were, on the whole, relatively poor and unrealistic. Describing most games as "stylized" would be polite. Take a look at this clip from the end of the first Half-Life (if you need a spoiler warning, you're either under 18 or need to get on Steam right now and install it. Or both.):

Now, at the time the lip syncing in Valve's flagship title was pretty impressive, and the fact that the game largely did the lip-syncing in real time based on the audio files was super helpful to Valve and HL mod developers. I worked on a mod called Half-Life: Chronicles, and I remember loving the fact that we could just have the audio recorded, link it in Hammer, and presto-changeo, we had brand-new talking NPCs.
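The real-time approach the engine used boils down to something surprisingly simple: drive the mouth open and shut based on how loud the audio is at any given moment. Here's a minimal sketch of that idea (the window size and loudness threshold are my own illustrative choices, not Valve's actual values):

```python
import math

def mouth_openness(samples, window=256):
    """Return one 0..1 mouth-openness value per window of audio samples.

    Each window's RMS energy is scaled against a "loud" threshold, so
    louder speech opens the mouth wider - the classic jaw-flap technique.
    """
    openness = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        if not chunk:
            break
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        openness.append(min(1.0, rms / 0.3))  # 0.3 ~ arbitrary "loud" level
    return openness

# A quiet stretch followed by a loud one: mouth stays nearly shut, then opens.
quiet = [0.01 * math.sin(i / 5) for i in range(256)]
loud = [0.5 * math.sin(i / 5) for i in range(256)]
values = mouth_openness(quiet + loud)
print(values)
```

This is exactly why it was so convenient for mod teams - no per-line animation work - and also exactly why the result looks like a Muppet: amplitude tells you *when* the mouth moves, but nothing about *which shape* it should make.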

Now, let's compare that lip syncing to that in the finale of Half-Life 2: Episode 2, which was a bit more involved to accurately generate:

The animations and lip-syncing of Alyx in particular (as well as other characters) have been pretty widely praised by critics and gamers at large...though given the disturbing amount of Alyx Vance related porn I found on YouTube while looking for the two videos I just posted, it may be a manifestation of... Lara Croft Syndrome, or an irrational over-attachment by some gamers to attractive female game characters. (I think these days they're calling it "Bayonetta Was Too A Well-Developed Character" Syndrome - Ed.)

So, what's the big challenge with lip-syncing? It's been done successfully in animation for years, and all game designers are doing is animating their game character models, right? No bigs, create an algorithm to match the lips to the dialog and then break for lunch.

You wish.

First, lip syncing in cartoons and other types of traditional animation is somewhat easier because the images are typically stylized. This is why lip-syncing is less of an issue in the original Half-Life video - the characters look, for lack of a better term, cartoony. As characters have become more realistic, we expect their faces to move like those of real people, and we notice when they don't - we're back in the uncanny valley.

Obviously, when we speak we're communicating with far more than our lips; Valve seized on the work of Dr. Paul Ekman, who's done extensive research into how people show emotion, particularly in their facial expressions. Ekman's work on the Facial Action Coding System, or FACS, has been used by a variety of professions from law enforcement to entertainment to help generate and authenticate the display of human emotions. You might be familiar with the concept from the FOX show "Lie to Me", which is loosely based on Ekman's work.
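To give a flavor of what FACS actually encodes: it breaks expressions down into numbered Action Units (AUs), each corresponding to a particular muscle movement, and emotions become combinations of AUs. The tiny sketch below uses a handful of commonly cited AU prototypes (e.g. happiness as cheek raiser plus lip corner puller); a real game rig would drive a blend shape or bone group per AU rather than just naming them.

```python
# A small, illustrative subset of FACS Action Units.
ACTION_UNITS = {
    1: "inner brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",
    15: "lip corner depressor",
}

# Commonly cited AU combinations for two basic emotions.
EXPRESSIONS = {
    "happiness": [6, 12],
    "sadness": [1, 4, 15],
}

def describe(expression):
    """List the muscle movements that make up an expression."""
    return [ACTION_UNITS[au] for au in EXPRESSIONS[expression]]

print(describe("happiness"))
```

The appeal for animators is that AUs are a shared vocabulary: once your character rig exposes them, any expression Ekman's system can describe is a recipe of slider values.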

Clearly part of the problem is that while we can automate the movement of lips to an extent, there are still the broader issues of acting and choreography. Some companies try to skirt the body language issue with motion capture, but doing MoCap on facial animations is a more involved process. As it stands, about the only way to really fine-tune the facial expressions that go along with dialog is to adjust them by hand or through a semi-automated process, which is pretty painstaking.

However, some upcoming games look like they're taking advantage of new technology to overcome the limitations of MoCap. Rockstar's upcoming LA Noire uses a new technology called "Motion Scan" which according to a recent feature on the game in EDGE scans the entirety of an actor's head using 32 cameras at once. According to EDGE, "The result is not so much an animation – indeed, animators are required to do little more than compose the moving model in a scene – but a 3D film of the actor, make-up and all." Sounds promising...assuming we're dealing with more than just marketing hype.

Another possible solution could be to explore using tone and pitch to do rough-casting of facial animations. It sounds a bit goofy, but for all I know it's already being used. Basically, the system would make value judgements on tone and inflection in an effort to create a sort of rough keyframe that an animator could then come in and fine-tune. I understand that Mass Effect 2 might have used a technology similar to this, and if that's true, I'm fairly impressed; while ME2 had its share of issues, I was pretty happy overall with the facial animations and emotions.
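Here's a hypothetical sketch of what that rough-casting pass might look like: classify each window of audio by loudness (RMS energy) and a crude pitch proxy (zero-crossing rate), then emit a rough emotion keyframe for an animator to refine. The categories and thresholds are entirely made up for illustration - I have no idea what BioWare's actual pipeline does.

```python
import math

def zero_crossing_rate(chunk):
    """Fraction of adjacent sample pairs that change sign - a crude pitch proxy."""
    crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
    return crossings / max(1, len(chunk) - 1)

def rough_keyframe(chunk):
    """Guess a rough emotion keyframe from one window of audio samples."""
    rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
    zcr = zero_crossing_rate(chunk)
    if rms > 0.3 and zcr > 0.2:
        return "agitated"   # loud and high-pitched: shouting?
    if rms > 0.3:
        return "emphatic"   # loud but low-pitched
    return "neutral"

# High-amplitude, high-frequency signal vs. a quiet murmur.
shout = [0.6 * math.sin(i * 1.5) for i in range(512)]
murmur = [0.05 * math.sin(i * 0.2) for i in range(512)]
print(rough_keyframe(shout), rough_keyframe(murmur))
```

Even something this crude would give an animator a timeline of "probably angry here, probably calm there" to start from, which beats keyframing every line from scratch.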

Hopefully, when combined with skilled facial animators who pay great attention to the smallest details, these new technologies will result in more engaging and interesting characters in future games. I'm waiting for that moment in a game when a simple glance or facial expression from an NPC says as much as a real person's would. I'm not too worried about it though; as graphical flair becomes commoditized, developers will be forced to find new ways to make their titles stand out. The smart money says that focusing on the believability of their character animations is one of the first things they'll do.


Andrew said...

This is a major AI issue (you mentioned the Uncanny Valley of course).

Animation can only go so far if you actually want content. The cutscenes of Halo or Gears of War might have sufficiently detailed animations of course, but Mass Effect? They can't hand-craft every line - so they just animate a small set of gestures and apply them as they see fit to each line (each one has an animation attached, hence the constant camera changes to hide the jumps).

So it's AI, or a subset of it, to do the actual emotional responses set to each voice piece if it ever came to that.

There are some interesting prototypes of this kind of tech but none of it is very mature. The amount of detail needed to get a good set of procedural, AI-generated responses is mind-bogglingly complex; well, at least to not just look rubbish anyway.

I think a lot could be gained if a game took a leap at a lower level: animated or cartoony characters (as mentioned, they have a higher uncanny valley threshold), with sliders for different emotions (tied into dialogue, or perhaps even modulated a bit on the fly!). Applied to something with fewer human features, and thus less complexity, it could work well.

Needs to walk before running though. There's currently no real AI in any game for voice stuff (it might be pre-generated using algorithms, as in Half-Life 2 etc.), so getting some at some point would be nice!

Andrew said...

Oh, and also: that's not to say there aren't AI or procedural emotions for movement/responses/animations - Spore, I think, used some subset of this for its procedural models, and other games have it in bits and pieces.

Much easier to alter how people walk or how exaggerated an animation is than to do facial stuff.

Andrew said...

Oh, also, I'm more a sucker for realistic AI behaviour than the facial stuff; I think that goes without saying for most people, though of course it's nothing to do with the topic at hand.

Likely any developer is working on behaviour as a priority (sorry, not even likely; it is the priority, according to programmers I've talked to), since that still sucks a lot in general and is more gameplay-oriented anyway! :)

Sam Shahrani said...


I'm not totally sure I agree that this is a problem that could be solved purely with AI. I think that, at least for the foreseeable future, it will have to be a hybrid solution.

You make a great point about setting the bar lower, though. What if we tried the system out on Mario first, or Sam & Max, or even the next Penny Arcade game? Or set it even lower, and work something like it into Flash, where it could use the elements of traditional animation to auto-animate character lips?

Ashelia said...

Right on the ball--the weird animations seem to be the last vestiges of the late 1990s and early 2000s in gaming animation. Even the games that do it much better still have some awkward moments. I mean, as stunning as even Heavy Rain looks, it has some very weird-looking scenes regarding lip syncing, and even an odd step or two.

At least voice acting is picking up, though. I remember the days of the original Resident Evil and now look at UNCHARTED 2, Mass Effect, and so on. Leaps and bounds, but it took its time. Perhaps this is something that just needs a little more time too.

Jay said...

Sam, I'm surprised you use Red Dead Redemption as an example. Of all the dialogue-heavy games to choose from, I thought RDR fared incredibly well when it came to lip-syncing and gestures.

Anyone who's played a previous Rockstar game knows exactly what you're talking about, but I was amazed at the vast improvements to their character models' visual communication. Maybe I was one of the lucky ones who didn't experience as many animation glitches that could lead to the audio/video going out-of-sync (I played on Xbox 360, if it makes a difference). But man, I was hella impressed with the strides Rockstar made since GTA4.

Certain characters, like Seth and Nigel West Dickens for example, were incredibly well animated imo. The gestures of West Dickens as he hawked his various wares felt like the real deal to me, as did Seth's anxious, spazz-tastic demeanor. And while Irish may not have been one of the better characters on paper, he became more and more realistic with every staggered step and slurred word. And John, my God especially John, seemed to never miss a beat when it came to his lips moving to the dialogue. And considering how much of it he had throughout the game, that's a hell of an accomplishment.

So I get what you're saying here in general, and I agree that at its worst this problem can totally take you out of the experience. I just think there are SO many better examples to cite than Red Dead Redemption. Or maybe I'm just blind :D

Andrew said...

Oh, in the near future it wouldn't be an AI-only solution (it's not technically feasible to climb out of the uncanny valley on that basis just yet), but in the far future I don't see why it couldn't be fully AI-driven once the thing is hand-tuned to begin with (I'm using "AI" loosely here, but it does fall under that umbrella, if only a limited part of it).

I could find some example videos to show what I mean, but procedural human facial stuff is still not quite there so not really worth me digging out the company names and examples.

It would be worth trying on some game like Sam and Max though, especially since they can be exaggerated a lot, something that's really impossible to do with a limited set of animations.

Sam Shahrani said...


You're right that Red Dead has unusually good syncing, but there definitely were some bad moments; paradoxically, they were more noticeable because of the overall good quality.

I agree that other Rockstar titles are much worse, but I'm also trying to point to relatively recent games that exhibit this, and Red Dead is pretty big right now. Still, in future I'll be more careful with the examples. Thanks for the feedback!

Jay said...

Sam, I didn't mean to come off like I was bashing your example of choice. Obviously, you're speaking from your own experience, which may have been different from mine (or anybody else's for that matter).

BUT... In hindsight, I think I got so hung up on trying to put all of the pieces of your article into context specifically with RDR that I overlooked your point about the relationship between technology and the uncanny valley. Just as you say, the bugs are much more noticeable in a game that's super polished (like RDR). It's like the depth of the uncanny valley is directly proportional to technological advancement. As the quality goes up, the potential for interactive dissonance goes up with it. *CLICK* (light bulb turns on over head).

Anyway, I guess it's been a while since I studied communication theory, so I just mean to say I get it, and I dig. Good stuff, Sam!

Kirk Hamilton said...

The thing I notice most is that even in AAA games that feature really great cutscene lip-syncing (Brutal Legend, Uncharted 2, RDR), the in-game syncing always lags well behind the cutscenes.

A cutscene will finish and the requisite post-cutscene in-game dialogue ("Let's go!" "We mustn't tarry!") always sort of floats out of the character models as their jaws sort of flap up and down, Half-Life style.

Of course, that's a result of developers having to focus their resources and processing power on so many other things - I'd imagine having accurate in-game lip syncing is lower on the list of priorities than, say, keeping a stable framerate or maintaining the physics engine.

But as Ashelia points out, in time, it'll improve. The last big shift I can remember was when we got away from CG cutscenes and started making in-engine cutscenes that looked as good or better than CG did. I bet that the next jump will be to bring the in-game animations and lip-syncing up to the level of the in-engine cutscenes. Uncharted 2 already sported some impressively seamless transitions between cutscene and gameplay, as did MGS4.

And of course, you know this already Sam, but I share your cautious optimism about the tech in LA Noire. I'm hopeful about that game for a lot of reasons, actually - that EDGE preview is pretty interesting.