Toyota Connected North America | Using Stable Diffusion and…

Case Study

Jun 2023

Prepared by: Raja Shekar Kilaru, Lokesh Kumar Viswavarapu, Jerry George

Text-to-multimedia Generative AI (artificial intelligence) models have gained tremendous popularity since 2022, as technical advances greatly enhanced the fidelity of the art that AI systems can create. Building on these advances, Toyota Connected North America (TCNA) identified Generative AI as a tool that could drive increased customer engagement by placing customers at the center of the conversation.

TCNA’s Mobility AI/ML (machine learning) team leveraged state-of-the-art Generative AI models, such as the open-source Stable Diffusion [1] and ControlNet [2] architectures, to train proprietary text-to-image latent diffusion models that are personalized for Lexus and Toyota vehicles and generate images in various artistic styles. The team also used the DreamBooth [3] framework released by Google Research to fine-tune off-the-shelf text-to-image models. The coolest thing about these models is that they give customers a platform to visualize these vehicles in any dream setting, for example, “Lexus RZ driving on Mars” or even “Digital art of Toyota BZ4X in manga style”.

TCNA developed an AI Art generator product for Lexus Marketing and officially launched the program at the 2023 New York International Auto Show [4].

Thousands of auto show attendees brought Lexus RZ and RX vehicles to life in their ideal settings through digital, painting and futuristic art styles. Below are a few sample images, along with their respective text prompts, that our team generated with the AI art tool we built.

Photorealistic:

Painting:

Digital Art:

Futuristic:


Text-to-image models are computationally heavy: generating a single image takes 3 to 5 seconds on a single NVIDIA GPU, depending on the device type. We used multiple GPUs to parallelize inference, which reduced the overall latency by 75% while generating four outputs for every user input.
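The 75% figure is consistent with simple arithmetic if four images that would otherwise run back to back are instead spread over four GPUs. The sketch below works through that reading. The GPU count of four, the perfect scaling, and the function names are assumptions made for illustration, and model load time and process start-up overhead are ignored.

```python
import math


def sequential_latency(per_image_s: float, num_images: int) -> float:
    """Total time when one GPU produces the images one after another."""
    return per_image_s * num_images


def parallel_latency(per_image_s: float, num_images: int, num_gpus: int) -> float:
    """Total time when each GPU produces one image per round (ideal scaling)."""
    rounds = math.ceil(num_images / num_gpus)
    return per_image_s * rounds


def latency_reduction(before_s: float, after_s: float) -> float:
    """Fraction of the original latency that was removed."""
    return 1.0 - after_s / before_s


# Assumed values: 4 s per image (inside the 3 to 5 s range quoted above),
# four outputs per request, four GPUs.
before = sequential_latency(4.0, num_images=4)          # 16.0 s
after = parallel_latency(4.0, num_images=4, num_gpus=4)  # 4.0 s
print(f"reduction: {latency_reduction(before, after):.0%}")
```

Under these assumptions the reduction depends only on the ratio of images to GPUs, not on the per-image time, so any value in the 3 to 5 second range gives the same percentage.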

Image Generation Workflow:

Distributed Inferencing on Multi-GPU Instances

Traditional Inference Pipeline

Let's consider a simple illustrative example where we generate image variations of the prompt “a photograph of an astronaut riding a horse.” This surreal prompt shows how powerful Stable Diffusion models are at combining unrelated concepts in profoundly creative ways. Let us take a deep dive into the technical details of the inference pipeline implementation.

The traditional model inference pipeline using Stable Diffusion 2.1 takes around 19 seconds to generate the following image.

In addition to the considerable inference time, we observed that the workload was pinned to a single GPU when executing the pipeline with a larger batch size of four images, as shown in the snapshot of GPU usage below:

[Animation: GPU usage snapshot]

We found several opportunities for improvement, such as using PyTorch 2.0 for its optimized, memory-efficient attention implementation and employing half-precision weights to achieve faster model load and inference times. With PyTorch 1.13, we noticed that the memory-efficient attention implementation in the xformers toolbox greatly improved inference times.
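One reason half-precision weights load faster is that each parameter occupies 2 bytes instead of 4, so there is half as much data to read and move to the GPU. The sketch below illustrates this with Python's standard library only. The parameter count is a made-up round number, not the size of any Stable Diffusion checkpoint, and the helper names are hypothetical.

```python
import struct

# Standard sizes of IEEE 754 single ("f") and half ("e") precision floats.
BYTES_FP32 = struct.calcsize("<f")  # 4 bytes per weight
BYTES_FP16 = struct.calcsize("<e")  # 2 bytes per weight


def weights_size_bytes(num_params: int, bytes_per_param: int) -> int:
    """Raw storage needed for the weights alone (no optimizer state, no metadata)."""
    return num_params * bytes_per_param


def to_half(value: float) -> float:
    """Round a Python float to the nearest half-precision value and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical model size, chosen only to make the ratio easy to read.
num_params = 1_000_000
fp32_bytes = weights_size_bytes(num_params, BYTES_FP32)
fp16_bytes = weights_size_bytes(num_params, BYTES_FP16)
print(f"fp32: {fp32_bytes} bytes, fp16: {fp16_bytes} bytes")

# The saving is paid for with precision: 0.1 is no longer stored as closely.
print(f"0.1 stored in half precision: {to_half(0.1)!r}")
```

The second print shows the trade-off: half precision keeps roughly three significant decimal digits, which is usually acceptable for inference but is worth checking against output quality.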

To assess the performance of different model variants, we needed to evaluate model load time and inference time separately. During extensive testing of these variants, we found that the model load time typically ranged between 2.5 and 3.5 seconds. Although inference times varied, the optimization techniques described above significantly sped up inference for us.
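A minimal sketch of how load time and inference time can be measured separately is shown below. The `fake_load_model` and `fake_generate` functions are stand-ins that only sleep, so the numbers they produce are meaningless; in a real benchmark they would be replaced by the actual model load and image generation calls.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    # Note: GPU work is launched asynchronously, so with a real PyTorch model
    # the device should be synchronized before the clock is read here.
    return result, time.perf_counter() - start


def fake_load_model() -> Dict[str, str]:
    """Stand-in for loading model weights (hypothetical)."""
    time.sleep(0.02)
    return {"name": "stand-in model"}


def fake_generate(model: Dict[str, str], prompt: str) -> str:
    """Stand-in for one inference call (hypothetical)."""
    time.sleep(0.01)
    return f"image for {prompt!r} from {model['name']}"


def benchmark(load_fn, infer_fn, prompt: str, repeats: int = 3) -> Dict[str, float]:
    """Time the one-off load once, then time inference over several runs."""
    model, load_s = timed(load_fn)
    infer_times = []
    for _ in range(repeats):
        _, elapsed = timed(infer_fn, model, prompt)
        infer_times.append(elapsed)
    return {
        "load_s": load_s,
        "infer_s_mean": sum(infer_times) / len(infer_times),
        "infer_s_min": min(infer_times),
    }


report = benchmark(fake_load_model, fake_generate, "an astronaut riding a horse")
print(report)
```

Keeping the two numbers apart matters because load time is paid once per process while inference time is paid on every request, so they respond to different optimizations.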

Additionally, PyTorch 2.0 includes an optimized and memory-efficient attention implementation through the torch.nn.functional.scaled_dot_product_attention function, which automatically enables several optimizations depending on the inputs and the GPU type.
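For readers unfamiliar with the operation, the sketch below is a plain-Python reference of what scaled dot-product attention computes: softmax(Q Kᵀ / sqrt(d)) V, with no mask and no dropout. It is not PyTorch's implementation, which reaches the same result through fused GPU kernels. The name `reference_attention` is hypothetical, and the nested-list representation is chosen only so the example runs without any dependencies.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d)) V for small nested-list matrices."""
    dim = len(query[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q_row in query:
        # Similarity of this query with every key, scaled by 1/sqrt(d).
        scores = [scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in key]
        weights = softmax(scores)
        # Weighted average of the value rows.
        mixed = [
            sum(w * v_row[col] for w, v_row in zip(weights, value))
            for col in range(len(value[0]))
        ]
        output.append(mixed)
    return output


# A zero query scores every key equally, so the output is the mean of the values.
out = reference_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(out)
```

The reference version materializes every score explicitly; memory-efficient implementations avoid storing the full score matrix, which is where the savings for long sequences come from.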

Optimized/Distributed Inference Pipeline

Even with these modifications, we needed to generate several different images for each request. This prompted us to optimize further using the torch.multiprocessing package. The multiprocessing module offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. This technique allowed us to use all GPUs in parallel, as shown in the snapshot of GPU usage below:

[Animation: GPU usage snapshot]

Code written with torch multiprocessing is 100% compatible with the native multiprocessing module packaged with Python. This approach further optimized our image-generation workflow and allows us to generate four images in roughly 11 seconds.
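As an illustration of the pattern rather than the team's actual code, the sketch below fans one prompt out to one worker process per GPU. It uses Python's native multiprocessing module, which torch.multiprocessing mirrors. The worker `render_on_device`, the seed scheme, and the device count are hypothetical, and the worker returns a descriptive string where a real one would load the pipeline on its device and return an image.

```python
import multiprocessing as mp  # torch.multiprocessing exposes the same API
from typing import List, Tuple

Job = Tuple[int, str, int]  # (device index, prompt, seed)


def build_jobs(prompt: str, num_devices: int) -> List[Job]:
    """One job per device; a different seed per job gives a different variation."""
    return [(device, prompt, 1000 + device) for device in range(num_devices)]


def render_on_device(job: Job) -> str:
    """Hypothetical worker: stands in for running the pipeline on one GPU."""
    device, prompt, seed = job
    return f"cuda:{device} | seed={seed} | {prompt}"


def generate_in_parallel(
    prompt: str, num_devices: int = 4, start_method: str = "spawn"
) -> List[str]:
    """Run one worker process per device and collect results in job order."""
    # PyTorch documents that CUDA cannot be used in forked subprocesses,
    # so "spawn" (or "forkserver") is the appropriate choice for real GPU work.
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=num_devices) as pool:
        # map() returns results in the order of the jobs, not of completion.
        return pool.map(render_on_device, build_jobs(prompt, num_devices))
```

With the "spawn" start method the call has to sit under an `if __name__ == "__main__":` guard in a script, because each child process re-imports the main module. Since every process holds its own copy of the model, memory use grows with the number of workers, which is the price paid for keeping all devices busy.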

The generation of AI art promoting Toyota and Lexus vehicles is only the beginning. Future applications will enable both companies to create new levels of personalization for customers to experience their brands. After the success of the New York Auto Show, we are continuing to identify potential opportunities across several different functions and products in which we can use Generative AI technology and are looking forward to delivering these cutting-edge solutions.

References:

Case Study

Using Stable Diffusion and Generative AI for Lexus brand marketing at the 2023 New York International Auto Show
