How Far Can You Imagine with “Imagen”? AI and Photorealism

With Google's new AI tool 'Imagen,' a written description can be turned into an image.

‘Imagen’ is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. It relies on the strength of diffusion models in high-fidelity image generation and builds on the power of large transformer language models in understanding text.

AN UNPRECEDENTED DEGREE OF PHOTOREALISM

Google asserts that Imagen has an unrivaled level of photorealism and a level of language comprehension that outperforms its competitors. The model works by reading a prompt, such as "Three glass spheres tumbling into the ocean. The water is splattering. The sun is setting.", and transforming it into an image that depicts just that. The resulting images can be either lifelike or artistic interpretations.

Although Imagen is not publicly available, Google has provided various samples of what the model can do. For the project, Google also established DrawBench, a comprehensive and rigorous benchmark for text-to-image models, and used it to compare Imagen with other approaches such as VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2. According to Google, human raters preferred Imagen over the competition.
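
As a rough illustration of how side-by-side human preferences like these can be summarized, the Python sketch below tallies rater votes between two models. The function name `preference_rates`, the labels "A", "B", and "tie", and the votes themselves are assumptions made up for this example; the article does not describe DrawBench's actual scoring procedure, and the numbers are not real results.

```python
from collections import Counter


def preference_rates(votes):
    """Return the share of votes each label received.

    `votes` holds one label per rater judgment: "A" or "B" for the
    preferred model, or "tie" when the rater had no preference.
    (Hypothetical labels, chosen for this sketch.)
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    total = len(votes)
    return {label: count / total for label, count in counts.items()}


# Placeholder votes invented for illustration only; not DrawBench data.
example_votes = ["A", "A", "B", "tie", "A", "B", "A", "A"]
rates = preference_rates(example_votes)
for label in sorted(rates):
    print(f"{label}: {rates[label]:.1%}")
```

Reporting ties separately, rather than discarding them, keeps the shares honest when raters often see no difference between two images.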

Imagine "a robot couple fine dining with the Eiffel Tower in the background." It is quite easy for us to visualize this in our minds, and those of us who are more artistically inclined can bring these words to life in our work. ‘Imagen,’ a Google AI model, is now capable of doing something similar. In a new announcement, Google has demonstrated how Imagen, a text-to-image diffusion model, can create images from written words.

The most impressive aspect, though, is the accuracy and photorealism of the images the model produces. Google has shown a handful of Imagen-created artworks that accurately portray the words in question. One, for instance, shows an Android mascot made of bamboo. Another depicts an enraged bird. A third depicts a chrome-plated duck with a golden beak arguing with an enraged turtle in a forest.

According to the company, Imagen is based on Google's "large transformer language models," which help the AI understand the text. Generic large language models "are unexpectedly excellent at encoding text for image synthesis," according to Google's research.

However, there are drawbacks, including "many ethical problems facing text-to-image research in general," according to the company. It acknowledges that this could have a "complex influence on society" and that such models could be abused. This is why it is not currently releasing the source code or a public demo.

The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. According to Google's blog, these datasets frequently reflect social prejudices, oppressive viewpoints, and disparaging or otherwise harmful associations with marginalized identity groups.

According to Google, a subset of the training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language. However, the LAION-400M dataset that Google also used is known to contain a wide range of inappropriate content, including pornographic imagery, racist slurs, and harmful social stereotypes. There is a danger that Imagen has encoded harmful stereotypes and representations. This led to the decision not to release Imagen for public use without additional safeguards in place.

Finally, Imagen's ability to create images that depict people is currently limited, and it often produces stereotypical results. The model exhibits social biases and stereotypes, including an overall tendency to generate images of people with lighter skin tones. When asked to illustrate various professions, it also tends to reflect Western gender stereotypes.