Robotics has reached a stage where machines can begin to bridge the sensory gap between seeing and feeling objects. The latest research from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has produced predictive artificial intelligence (AI) algorithms that can learn to see by touching and learn to feel by seeing.

Researchers from MIT CSAIL used a web camera to record about 200 everyday objects, including tools, household products and fabrics, with the aim of creating a predictive AI model that can estimate tactile signals without actually touching an object.

The researchers recorded these objects being touched by a GelSight tactile sensor more than 12,000 times. The resulting video clips were broken down into static frames to compile VisGel, a dataset of roughly 3 million images used to train the predictive touch AI. A robotic arm fitted with the sensor was used to pick up and move objects, allowing the system to judge an object’s shape and the amount of force required to grasp it. When the robot views an object, the algorithm takes a frame from its video feed and compares it against the dataset to infer how the object would feel to touch.

Robot-Object Interaction

So how does the robot-object interaction take place? In the experiment, the research team used a simple web camera to record nearly 200 objects, such as tools, household products, fabrics, and more, being touched more than 12,000 times. Breaking those 12,000 video clips down into static frames, the team compiled “VisGel,” a dataset of more than 3 million visual/tactile-paired images.
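
To make the data-collection step concrete, here is a minimal sketch of how paired visual/tactile frames could be pulled out of two synchronized recordings using OpenCV. The file names, frame step and output layout are assumptions for illustration, not the actual VisGel tooling.

```python
# Minimal sketch: extract paired visual/tactile frames from synchronized clips.
# Paths, frame step and naming scheme are hypothetical.
import os
import cv2  # pip install opencv-python

def extract_paired_frames(webcam_video, gelsight_video, out_dir, step=5):
    """Walk two synchronized videos and save visual/tactile frame pairs."""
    os.makedirs(out_dir, exist_ok=True)
    cam, gel = cv2.VideoCapture(webcam_video), cv2.VideoCapture(gelsight_video)
    idx, saved = 0, 0
    while True:
        ok_cam, visual = cam.read()
        ok_gel, tactile = gel.read()
        if not (ok_cam and ok_gel):
            break
        if idx % step == 0:  # subsample so neighbouring frames are not near-identical
            cv2.imwrite(os.path.join(out_dir, f"visual_{saved:06d}.png"), visual)
            cv2.imwrite(os.path.join(out_dir, f"tactile_{saved:06d}.png"), tactile)
            saved += 1
        idx += 1
    cam.release()
    gel.release()
    return saved

# Example with hypothetical file names:
# n_pairs = extract_paired_frames("webcam_clip.mp4", "gelsight_clip.mp4", "visgel_pairs")
```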

The research team trains generative adversarial networks (GANs) on the VisGel dataset; the networks take an image in one modality, visual or tactile, and generate the corresponding image in the other. GANs work by pitting a “generator” against a “discriminator”: the generator tries to create realistic-looking images that fool the discriminator, and when the discriminator “catches” the generator, it exposes the internal reasoning behind its decision, which allows the generator to repeatedly improve itself.
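
To make this idea concrete, below is a minimal, pix2pix-style conditional GAN sketch in PyTorch that translates a visual image into a tactile-style image. The layer sizes, image resolution and training loop are assumptions for illustration, not the architecture described in the paper.

```python
# Conditional GAN sketch for visual -> tactile translation (illustrative only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a 3-channel visual image to a 3-channel 'tactile' image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Judges whether a (visual, tactile) pair looks real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch-level real/fake scores
        )
    def forward(self, visual, tactile):
        return self.net(torch.cat([visual, tactile], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(visual, tactile_real):
    # Discriminator step: real pairs should score 1, generated pairs 0.
    with torch.no_grad():
        tactile_fake = G(visual)
    real_score = D(visual, tactile_real)
    fake_score = D(visual, tactile_fake)
    d_loss = bce(real_score, torch.ones_like(real_score)) + \
             bce(fake_score, torch.zeros_like(fake_score))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator call generated pairs real.
    tactile_fake = G(visual)
    fake_score = D(visual, tactile_fake)
    g_loss = bce(fake_score, torch.ones_like(fake_score))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# One step on random tensors just to show the expected shapes:
# visual, tactile = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
# train_step(visual, tactile)
```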

The Concept of Vision to Touch

Human eyes can often judge how an object will feel just by looking at it. To give machines this ability, the system first has to locate the position of the touch and then deduce information about the shape and feel of that region. During training, reference images taken without any robot-object interaction helped the system encode details about the environment and the objects. Later, when the robotic arm was operating, the model compared the current frame against its reference image to ascertain the location and scale of the touch.
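
As a rough illustration of the reference-frame idea, the hand-coded sketch below simply flags the pixels that changed relative to a reference image; in the actual system this comparison is learned inside the network rather than hard-coded.

```python
# Simplified illustration: a plain image difference against the reference frame
# highlights where the arm is contacting the scene.
import numpy as np

def locate_touch_region(reference, current, threshold=30):
    """Return a boolean mask of pixels that changed noticeably vs. the reference."""
    diff = np.abs(current.astype(np.int16) - reference.astype(np.int16))
    changed = diff.max(axis=-1) > threshold  # per-pixel change across colour channels
    return changed

# Example with synthetic 64x64 RGB frames (assumed uint8 images):
reference = np.full((64, 64, 3), 120, dtype=np.uint8)
current = reference.copy()
current[20:30, 40:50] = 200          # pretend the arm touches this patch
mask = locate_touch_region(reference, current)
ys, xs = np.nonzero(mask)
print("touch centred near:", (int(ys.mean()), int(xs.mean())))
```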

In practice, this might look like feeding the system an image of a computer mouse and having the AI model predict the area where the object should be touched for pickup, a technique that would help machines plan safer and more efficient actions.

Working on Touch to Vision

For touch to vision, the researchers built an AI model that produces a visual image from tactile data. The model analyzed a tactile image to figure out the shape and material at the contact position, then looked back to the reference image to reconstruct the interaction.

For instance, if during testing the AI model was fed tactile data from a shoe, it could produce an image of where that shoe was most likely being touched.
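
A rough sketch of what this reverse direction could look like at inference time is shown below: a generator is fed the reference frame together with the tactile reading and outputs a predicted visual frame. The 6-channel input and the network shape are assumptions for illustration.

```python
# Touch-to-vision inference sketch (layer sizes and inputs are hypothetical).
import torch
import torch.nn as nn

touch_to_vision = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
)

# Hypothetical inputs: a reference frame of the scene and a GelSight-style tactile image.
reference_frame = torch.randn(1, 3, 64, 64)
tactile_image = torch.randn(1, 3, 64, 64)
with torch.no_grad():
    predicted_view = touch_to_vision(torch.cat([reference_frame, tactile_image], dim=1))
print(predicted_view.shape)  # torch.Size([1, 3, 64, 64])
```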

In the times to come, this kind of ability could be very helpful for accomplishing tasks where there is no visual data, such as reaching into a dark, unlit space, or when a visually impaired person is exploring an object or an unfamiliar area by touch.

The Path Ahead

The current dataset contains only examples of interactions in a controlled environment. In the future, the team hopes to improve its AI research by collecting data in more unstructured settings, or by using a new MIT-designed tactile glove, to add diversity to the dataset.

However, there are still grey areas that can be tricky to infer when switching between modes, such as telling the colour of an object just by touching it, or telling how soft an object is without actually pressing on it. The researchers say these cases could be handled by building more robust models of uncertainty, which expand the distribution of possible outcomes.
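
One common way such uncertainty might be represented (an assumption about a possible extension, not the authors’ stated method) is to feed random noise into the generator and sample several candidate outputs, so the model returns a distribution of plausible predictions rather than a single guess.

```python
# Uncertainty via sampling: inject noise and draw multiple candidate outputs.
# This is an illustrative assumption, not the approach from the paper.
import torch
import torch.nn as nn

class NoisyGenerator(nn.Module):
    def __init__(self, noise_dim=8):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Conv2d(3 + noise_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        b, _, h, w = x.shape
        z = torch.randn(b, self.noise_dim, h, w)      # fresh noise each call
        return self.net(torch.cat([x, z], dim=1))

gen = NoisyGenerator()
visual = torch.randn(1, 3, 64, 64)
samples = torch.stack([gen(visual) for _ in range(8)])  # 8 candidate tactile maps
uncertainty = samples.std(dim=0)                        # high std = low confidence
print(samples.shape, uncertainty.shape)
```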

Brace yourselves for more advanced models that could foster a more harmonious relationship between vision and robotics, especially for object recognition and better scene understanding, and that could help deliver seamless human-robot integration in assistive settings.