So there’s the video from E3 of Peter Molyneux showing off a project his company is working on, based on Microsoft’s Project Natal. It’s a character named Milo living in a tiny virtual world. The video shows a woman named Claire interacting with Milo in ways that seem wondrous and amazing.
But how much of what we think we see is what we’re actually seeing?
Let’s go through the video step by step.
First Claire says “Hi Milo, how are you doing?” Milo stops swinging and walks toward the camera. What happens here? Milo’s voice recognition hears “Milo” and triggers the switch from the swinging loop to interacting with the person. Milo walks towards the portion of the screen near Claire; the camera could be coordinating that move to that location. If more than one person were in the room, would Milo know where to go? Possibly, with a voiceprint matched to facial recognition. There was also a cue icon on the screen that seemed to indicate what the user was to do to start the encounter.
Milo says “Hi Claire, are you ok?” Probably a canned response. Is the name based on voiceprint? Face? Scripted? How is “Claire” articulated? Prerecorded? Carefully built from phonemes? The same goes for Milo’s voice in general, for that matter. “Are you ok?” is a bit of an odd choice. Was there some sort of stress detected in her voice?
Milo: “You? Nervous?” Voice recognition? Milo’s face is a little surprised. Eye contact is direct; camera tracking at work?
Claire: “This is the first time thousands of people are going to see this.” Milo: “Thousands of people?” To me, this is the most suspicious part of the whole interaction. How is this accomplished? How does Milo identify the phrase to repeat? Voice emphasis from Claire? Again, how is the phrase articulated? Built from phonemes? How big is Milo’s vocabulary? What’s the icon on the screen indicating? It seems to be a microphone.
Milo’s eyes wander nervously. Why? Because thousands of people are watching? No way, too much cognition there, I don’t believe it. Reading Claire’s mood from face and voice cues and reflecting it? Possibly. Possibly.
“Let me beat you at football, that is if you finished your homework”. No reaction to “football”, which you’d sort of expect to be a keyword in a gaming system, if Milo is some sort of operating system interface anyway. He’s looking anxiously off to the side during this, possibly indicating the fishing activity he wants to get to?
“Homework” is a clear vocal keyword, triggering emotional cues from Milo expressing resentment at being reminded of his shirked responsibility. It’s possible that her scolding tone and “school projects” further the shame reaction from Milo.
Milo seems to be confessing, though we can’t hear him clearly under the narration.
Claire’s mention of “help” in a cheery way seems to trigger Milo’s own cheery response, though he immediately forgets the homework assignment and walks over to the pond. Proximity triggers pond-approach, or was this a plan all along?
Walking along the rocks seems pretty scripted, but note Claire’s turning to the side. Does this trigger the camera to follow along as if she is walking beside him?
Milo sort of ignores her, says everything they need is there. Seems to go into brief idle mode until she says “let’s get started”, possible keyword.
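If the keyword guesses above are right, the behaviour could be as simple as a small state machine that waits in each animation loop until it hears a trigger phrase. The sketch below is purely illustrative: the state names, the trigger phrases, and the functions `normalize`, `next_state`, and `run` are all my own assumptions drawn from what the video appears to show, not anything known about how Milo is actually built.

```python
# Hypothetical states and trigger phrases, loosely named after moments in the demo.
# Nothing here is taken from the real system.
TRANSITIONS = {
    ("swinging", "milo"): "greeting",
    ("greeting", "homework"): "sulking",
    ("sulking", "help"): "walking_to_pond",
    ("walking_to_pond", "let's get started"): "tossing_goggles",
}


def normalize(utterance: str) -> str:
    """Lowercase and straighten curly apostrophes so phrases match reliably."""
    return utterance.lower().replace("\u2019", "'")


def next_state(state: str, utterance: str) -> str:
    """Return the state after hearing `utterance`; stay put if nothing matches."""
    heard = normalize(utterance)
    for (source, keyword), target in TRANSITIONS.items():
        if source == state and keyword in heard:
            return target
    return state


def run(utterances, state="swinging"):
    """Feed a sequence of utterances through the machine and record each state."""
    history = [state]
    for utterance in utterances:
        state = next_state(state, utterance)
        history.append(state)
    return history
```

Note that in this sketch “football” does nothing, because no transition listens for it, which would match Milo’s lack of reaction. A state with no matching keyword simply keeps idling, much like the pause before “let’s get started”.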
Then the goggle tossing, which is brilliantly done with visual and aural cues. Notice the “slapping” sound that catching the goggles makes.
Milo shows how to put goggles on, perhaps indicating the gesture that Natal will recognize for this action. If so, it’s a nice, subtly natural way of educating the user. It’s backed up by an icon at the bottom of the screen, which is somewhat clumsier.
Approach to the water is silent, nothing from Milo. Triggered by putting on goggles? Clearly some computational pausing here, then a version of Claire appears reflected in the water. Another small but brilliant cue. Possibly done via Natal’s skeletal model and then mapping colors via the camera?
The interaction with the water seems to be basic Natal. Track hand motions and animate based on that. Some prodding from Milo to push the user into further interaction.
Is Milo’s response “They’re only fish” a response to Claire’s compliment? Impressive if so, implying vocal tonal cues and possibly vocal vocabulary, maybe expression recognition. But also possibly just canned.
Passing the pic into the screen is a simple but brilliantly immersive trick. Full points!
Milo seems to react to the color of the drawing? Again, simple but effective trick.
A goodbye script triggered by either vocal cues or body language. Nice touch of reminding of Mom’s birthday.
So overall there’s a lot that’s being accomplished by some basic tricks. These tricks aren’t really “fake”, they’re just effective interactional cues. Another layer seems to be accomplished via an ELIZA-like interface, though there’s some implied vocal analysis and synthesis I question.
And a great deal is accomplished just by affective computing: reading, responding to, and synthesizing vocal and kinesthetic emotional cues.
Is the system as intelligent as it appears on a surface reading? No, probably not. But does it need to be that smart in order to be effective? No, I don’t think so. I think the basic tricks it seems to use are valid, and I think they can be quite powerful.
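To show how little machinery an ELIZA-style echo needs, here is a minimal sketch of the “Thousands of people?” trick, assuming the speech has already been turned into text (which is the part I’m skeptical about). The pattern `QUANTITY_PHRASE` and the function `echo_reply` are hypothetical names of my own; this is one way such a reply could be produced, not a claim about what the demo actually does.

```python
import re

# Hypothetical pattern: a quantity word followed by "of" and a single noun.
QUANTITY_PHRASE = re.compile(
    r"\b(hundreds|thousands|millions) of (\w+)", re.IGNORECASE
)


def echo_reply(utterance: str, fallback: str = "Really?") -> str:
    """Echo a striking quantity phrase back as a question, ELIZA-style.

    If no such phrase is found, return a canned fallback instead.
    """
    match = QUANTITY_PHRASE.search(utterance)
    if match is None:
        return fallback
    quantity, noun = match.groups()
    return f"{quantity.capitalize()} of {noun}?"


if __name__ == "__main__":
    print(echo_reply("This is the first time thousands of people are going to see this"))
```

No understanding is involved here: the reply is a fragment of the input handed back with a question mark, which is exactly why it can feel attentive without any cognition behind it.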
What we really need is more footage, of course!
