A Robot with Vision Wants A Body
SAKURA--My understanding is that you were originally interested in researching machine vision, which then led to work in motor skills and body issues. Where did you find the limitations to purely visual research, and why did you move on from those studies?
ASADA--I felt limitations in the recognition issues themselves. At that time I was interested in machine recognition, so I entered research in the fields of pattern recognition and computer vision, but I didn't realize what I was getting myself into. The problem you're faced with is, to give one famous example, asking a machine "Is this an orange or an apple?" The machine takes a picture of an apple, analyzes its color, shape and size, and calculates that a round red object, approximately 15 cm across, "is an apple." But how do you know that it actually recognized it as an apple? When we recognize an apple, we're not relying only on visual recognition. We have senses of smell, touch and weight. When we bite into an apple we taste the tart sweetness. Our gums might even bleed. We have so many ways of recognizing it as an apple, and it is only through the sum of these that an apple takes on the meaning of what we know to be "apple." We live in a three-dimensional environment, and the experiences of modeling "apple" and recognizing "apple" take place simultaneously in our minds. There is considerable doubt in my mind that bypassing that process, and simply using template matching[*5] to tell the computer that an apple is something "red, round and about 15 cm in diameter," actually produces recognition. It is not the symbols but the body that is important. Only by having a body--one that holds the apple, touches it, smells it and bites into it--do we finally learn what an apple is. I believe that the semantics of recognition come from our corporeal experience, and not from the symbolic confines of a computer's interior.
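The kind of template matching Asada questions here can be illustrated with a toy sketch: recognition reduced to comparing a few measured features against a stored symbolic template. The feature names, template values, and tolerance below are all invented for illustration, not taken from any actual system.

```python
# A toy "template match" classifier of the kind criticized above:
# an "apple" is whatever fits a stored symbolic description.
# All feature names, values, and the tolerance are illustrative.

APPLE_TEMPLATE = {"color": "red", "shape": "round", "diameter_cm": 15.0}

def matches_template(observed: dict, template: dict, tolerance_cm: float = 3.0) -> bool:
    """Return True if the observed features fit the stored template."""
    if observed["color"] != template["color"]:
        return False
    if observed["shape"] != template["shape"]:
        return False
    return abs(observed["diameter_cm"] - template["diameter_cm"]) <= tolerance_cm

observation = {"color": "red", "shape": "round", "diameter_cm": 14.0}
print(matches_template(observation, APPLE_TEMPLATE))  # prints True: the machine declares "apple"
```

Asada's point is precisely that a True here tells us nothing about whether anything was "recognized": the match is over symbols the programmer supplied, not over any corporeal experience of an apple.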
SAKURA--From the perspective of someone who researches living creatures, the faculty of vision is an extremely high-level information activity. Most mammals rely on their sense of smell as their primary information medium. Humans and other primates are the only mammals that rely on vision first. Among all other animals, birds are about the only ones that rely on vision; most other animals can see only in black and white, or can only distinguish degrees of brightness. A sense of smell or pheromones, in short a sensitivity to chemical compounds, is the primary sensory medium of most living things. This is why, if we trace the process of animal evolution in phylogenetic terms, physical recognition comes first and slowly grows more sophisticated, with visual recognition coming much later. I always thought it interesting that, because it is humans doing the research, when they begin to build robots the visual faculties always "naturally" come first. Or when doing AI research, they tend toward linguistic processing, and quickly run into the extraordinarily high barrier that presents. But what you're saying is that researchers are recognizing the limitations of this approach, and changing their AI and robotics approaches to more closely reflect other characteristics of life on earth?
ASADA--Well, that's precisely what I've done. And the mechanical issues are exactly the same. When you begin researching human recognition, there is a kind of tacit approval for dealing with the visual faculties right from the beginning. The problem is that, precisely as you've mentioned, vision is a capability that only came at the end of a long process of refinement. Even then it is only one element. And when you try to study processes of recognition in a living environment, it is futile to use only this one element, because you invariably run up against the frame problem.[*6] That's why you cannot look at vision, or any other function, without looking at issues of the entire body. You also mentioned the issue of language. This is another area that can only begin to be understood from body issues, because visual information only acquires meaning when the robot's relation to its environment is abstracted, behavior patterns emerge from the robot's relations with specific situations, and these become symbols within the stimulus-response diagram.[*7] In other words, the robot's reactions to recurring situations are codified. It is not that these were symbols to begin with, but rather that the robot's conduct produced a symbol. I am hopeful that a kind of language may emerge when such a symbol is shared among multiple agents. In the case of RoboCup, there are multiple agents in cooperation, so it is essential that some form of communication language emerge. Moreover, this language must be quite tacit, so that once "eye contact" has been made, both players share a common symbology. If this is indeed possible, then a case can be made that a common linguistic structure has been established. This is, of course, not simply an experiment for the unit's visual faculties. There are many experiments I have in mind that include linguistic functions. I have many test cases to articulate.
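The idea of conduct producing a symbol, rather than symbols existing in advance, can be sketched very loosely in code: an agent that mints a fresh symbol the first time a (situation, behavior) pair occurs, and returns the same symbol whenever that pair recurs. The situation and behavior labels and the symbol scheme below are invented for illustration; this is not Asada's actual architecture.

```python
# A loose sketch of "conduct producing a symbol": symbols are not given
# in advance, but assigned to (situation, behavior) pairs as the agent
# experiences them. Labels and symbol names are purely illustrative.

class SymbolEmergence:
    def __init__(self):
        self.symbols = {}   # (situation, behavior) -> symbol id
        self.counts = {}    # how often each pair has recurred

    def experience(self, situation: str, behavior: str) -> str:
        """Record one situation-behavior episode and return its symbol."""
        pair = (situation, behavior)
        self.counts[pair] = self.counts.get(pair, 0) + 1
        if pair not in self.symbols:
            # First occurrence: mint a new symbol for this pair.
            self.symbols[pair] = f"S{len(self.symbols)}"
        return self.symbols[pair]

agent = SymbolEmergence()
agent.experience("ball-ahead", "kick")   # mints "S0"
agent.experience("ball-left", "turn")    # mints "S1"
print(agent.experience("ball-ahead", "kick"))  # prints S0: the same pair yields the same symbol
```

A shared "language" in Asada's sense would then require a further step this sketch omits: two such agents converging on the same symbol for the same situation through interaction, rather than each minting symbols privately.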
SAKURA--In the spring of 1998, when we were both on the panel at Yokohama's Minato Mirai, the topic of a "theory of mind," concerning how animals understand each other's feelings, came up. It is commonly thought that humans understand each other through language, when in fact, when any animal communicates, it does so primarily through eye contact and implicit gestures. And your comment was, if I recall correctly, that when it came to porting these to a robot, the operating algorithm itself has to change. Now I'd like to ask, simply, whether you know of any computer program presently available that would allow us to do this. As someone admittedly unfamiliar with the issue, I imagine that because computer programs are languages, the discussion must begin and end with how best to use them. And yet issues of "theory of mind" and understanding the other come not from usage, but from the murky and ill-defined elements which "emerge"[*8] into language as a structure of and for understanding; there seems, therefore, a very real obstacle that remains completely unaddressed.
ASADA--When considering the structure of the robot's brain, you need to clarify whether your concern is creating the essence of language, or using the brain as a tool for imagining how to reproduce language. For example, if you were able to use a wetware (biological) body, the changes in the body's structure would invalidate all concepts of traditional computers. I'd be happy to make it that far, but in my research facilities we're busy trying to verify important concepts by simulating linguistic processes. Creating a body capable of growth is quite difficult. Right now we're sticking with a fixed mechanical one that runs on computer software, tinkering with the mechanics while we look for our results. Of course, in the end we will need to consider evolving wetware bodies, or we will never reach the truth about our work.