Unambiguous speech between humans relies on two things:
Shared frames of reference
Humans share "common sense" about the world. Given the statement "Bob is constantly disciplining little Bobby Jr.; he can be a real nuisance," a human listener can deduce that "he" refers to Bobby Jr., not Bob himself. This is difficult for computers to do for a variety of reasons.
Humans thus have a good sense of how other humans think about the world and how the world works. They are also excellent at identifying a speaker's frame of reference. When someone says "these shoes are killing me," we all know they are referring to the shoes on their feet, not someone else's shoes or the shoes at the mall. Humans can launch into almost any topic without warning and expect listeners to know which frame of reference to use to interpret the speech. Computers, perhaps because they lack "common sense," have no mechanism for this.
Nuanced language use
Humans use slang, interactions between syntax and semantics, and connotations to convey very specific messages in ways that are often difficult for computers to understand. Examples?
Multi-modal communication
One possible way to get around the lack of a shared frame of reference in human-computer communication is to continuously present the computer's frame of reference to the human in a mode other than language. For instance, Terry Winograd's SHRDLU provides a visualization of a 3D world in addition to a natural language interface. This shared frame of reference allows a human user to ground their language in the features they see in the visualization, and allows the computer to assume that language will be constrained to what it displays to the user.
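The idea of constraining language to what is displayed can be illustrated with a minimal sketch. This is not SHRDLU's actual implementation. The Block type, the resolve function, and the three-block world are all hypothetical names invented for illustration. The sketch assumes that a referring phrase is interpreted only against the attributes of objects currently shown to the user, so the visible scene acts as the shared frame of reference.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A hypothetical object in the displayed scene."""
    name: str
    color: str
    shape: str


def resolve(phrase, visible):
    """Return the visible blocks compatible with the attribute words in phrase.

    The vocabulary is taken from the visible scene itself: only colors and
    shapes that are currently displayed are recognized. Any other word is
    ignored in this sketch.
    """
    words = set(re.findall(r"[a-z]+", phrase.lower()))
    wanted_colors = words & {b.color for b in visible}
    wanted_shapes = words & {b.shape for b in visible}
    return [
        b for b in visible
        if (not wanted_colors or b.color in wanted_colors)
        and (not wanted_shapes or b.shape in wanted_shapes)
    ]


# The scene shown to the user (assumed for illustration).
world = [
    Block("b1", "red", "cube"),
    Block("b2", "green", "cube"),
    Block("b3", "red", "pyramid"),
]

# One match: the phrase is unambiguous within the displayed scene.
print([b.name for b in resolve("Pick up the red pyramid.", world)])

# Two matches: the system could ask the user which cube is meant.
print([b.name for b in resolve("Pick up the cube.", world)])
```

When a phrase matches more than one displayed object, the display also gives the computer a natural way to ask for clarification, for example by highlighting the candidates.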
Related work