Google AI Unveils WIT for Multimodal Multilingual Machine Learning

Google AI’s Wikipedia-Based Image Text dataset can aid multimodal research



Google AI has released a high-quality, multilingual dataset called ‘Wikipedia-Based Image Text’, or WIT. The dataset was built by extracting multiple text selections associated with an image from Wikipedia articles and Wikimedia image links.

Image and text datasets are now widely used in machine learning applications, and multimodal visio-linguistic models rely on large datasets to model the relationship between images and text. Traditionally, these datasets have been created either by manually captioning images or by crawling the web and extracting the alt-text as the caption.

Manual captioning yields higher-quality data, but the annotation effort limits the amount of data that can be produced. Automated extraction can yield much larger datasets, but it requires heuristics and careful filtering to keep data quality in check and achieve robust performance.


What is the WIT Dataset? 

The Google researchers wanted to build a large dataset without sacrificing quality or coverage of concepts. They started with the largest online encyclopedia available today, Wikipedia. They selected images from Wikipedia pages and extracted the various image-text associations and surrounding contexts. This produced about 37.5 million entity-rich image-text examples with nearly 11.5 million unique images across 108 languages.

They then filtered the data further to ensure its quality. The filtering process had three stages:

First, text-based filtering ensured caption availability, quality, and length.

Second, image-based filtering ensured that each image has permissible licensing.

Third, image-and-text-entity-based filtering ensured suitability for research.
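The three filtering stages can be pictured as a chain of simple checks. The Python sketch below is only an illustration and not Google's actual pipeline: the Candidate class, the license list, the minimum caption length, and the is_suitable flag are all hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical values, chosen only for illustration.
ALLOWED_LICENSES = {"cc0", "cc-by", "cc-by-sa", "public-domain"}
MIN_CAPTION_WORDS = 3


@dataclass
class Candidate:
    caption: str       # text associated with the image
    license: str       # license label attached to the image
    is_suitable: bool  # stand-in for the image-and-text-entity check


def passes_text_filter(c: Candidate) -> bool:
    # Stage 1: the caption must exist and be long enough.
    return len(c.caption.split()) >= MIN_CAPTION_WORDS


def passes_image_filter(c: Candidate) -> bool:
    # Stage 2: the image must carry a permissible license.
    return c.license.lower() in ALLOWED_LICENSES


def passes_entity_filter(c: Candidate) -> bool:
    # Stage 3: placeholder for the research-suitability check.
    return c.is_suitable


def filter_candidates(candidates):
    # Keep only the candidates that pass all three stages.
    return [
        c for c in candidates
        if passes_text_filter(c)
        and passes_image_filter(c)
        and passes_entity_filter(c)
    ]
```

Each stage is a separate function, so one rule can be tightened or loosened without touching the others.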

Later, when human editors evaluated a random sample of image-caption pairs, nearly 98% of the samples had good image-caption alignment. The main advantages of the WIT dataset are outlined below.


1 Highly Multilingual 

WIT is the first large-scale, multimodal, multilingual dataset, with data in 108 languages.


2 Image-Text Dataset 

WIT is the first contextual image-text dataset, which is itself an advantage. Most multimodal datasets provide only a single text caption for each image. WIT instead provides several text captions along with contextual information, which can help researchers model the effect of context on image captions and image selection.
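To make the difference concrete, the Python sketch below models one image with several captions plus page and section context, and shows how a researcher might build a caption-only input versus a context-aware input. The field names are illustrative assumptions and should not be read as the exact column names of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageTextExample:
    # Field names are hypothetical, not the official WIT schema.
    language: str
    image_url: str
    reference_caption: Optional[str] = None
    attribution_caption: Optional[str] = None
    alt_text: Optional[str] = None
    page_title: Optional[str] = None
    section_title: Optional[str] = None
    section_context: Optional[str] = None
    page_context: Optional[str] = None


def caption_only_text(ex: ImageTextExample) -> str:
    # Return the first caption that is present, or an empty string.
    for caption in (ex.reference_caption, ex.attribution_caption, ex.alt_text):
        if caption:
            return caption
    return ""


def contextual_text(ex: ImageTextExample, sep: str = " | ") -> str:
    # Join titles, caption, and surrounding context, skipping missing parts.
    parts = [
        ex.page_title,
        ex.section_title,
        caption_only_text(ex),
        ex.section_context,
        ex.page_context,
    ]
    return sep.join(p for p in parts if p)
```

Comparing a model trained on caption_only_text with one trained on contextual_text is one simple way to measure how much the context contributes.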


3 High Quality and a Challenging Benchmark 

WIT serves as a challenging benchmark, even for state-of-the-art methods. On the WIT test set, mean recall scores for image-text retrieval are in the 40s for well-resourced languages and in the 30s for under-resourced languages.
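For readers unfamiliar with the metric, recall at K in retrieval is the fraction of queries whose correct match appears among the top K results, and a mean recall averages this over several values of K. The Python sketch below shows one common way to compute it from a text-to-image similarity matrix. It assumes the correct image for text i is image i, and the K values (1, 3, 5) are an assumption, since the article does not say which values were averaged.

```python
def recall_at_k(similarity, k):
    # similarity[i][j] is the score of text i against image j.
    # Assumption: the correct image for text i is image i.
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def mean_recall(similarity, ks=(1, 3, 5)):
    # Average recall@K over the chosen K values (assumed, not from the source).
    return sum(recall_at_k(similarity, k) for k in ks) / len(ks)
```

A mean recall in the 40s therefore means that, averaged over the K values used, fewer than half of the queries retrieved their correct match.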

To support research, Wikipedia has also made available images at 300-pixel resolution and ResNet-50-based image embeddings for most of the training and test data.