
Have you ever tried asking ChatGPT a question based on an image? The results are often striking, revealing capabilities that go far beyond simple visual recognition. Yet ChatGPT is only the visible part of a much broader transformation.
Visual Language Models (VLMs) are reshaping the way we interact with information by combining computer vision with natural language understanding.
VLMs have emerged from the rapid evolution of computer vision: from signal processing to convolutional neural networks (CNNs), then to transformers and the exploitation of massive image-text datasets collected from the web. OpenAI accelerated the field in 2021 with the release of the CLIP weights, demonstrating that contrastive learning on hundreds of millions of image-text pairs enables highly robust zero-shot recognition and retrieval capabilities.
In other words, it became possible for a data scientist to compare an image and a piece of text without training a dedicated model. Since then, VLMs have rapidly evolved into increasingly large and general-purpose systems.
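To make the idea of comparing an image and a piece of text concrete, here is a minimal sketch of zero-shot classification in a shared embedding space. It is illustrative only: the embedding vectors are made-up toy values (in practice they would come from the image and text encoders of a CLIP-style model), and the function names and the temperature of 0.01 are assumptions chosen for this example, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Score one image embedding against several candidate text embeddings.

    The temperature is an assumed value for this sketch.
    """
    labels = list(text_embeddings)
    similarities = [
        cosine_similarity(image_embedding, text_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(similarities, temperature)))

# Toy embeddings (hypothetical values, NOT produced by a real model).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a photo of a truck": [0.2, 0.1, 0.9],
}

probabilities = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label)
```

No training step appears anywhere in the sketch: changing the task only means changing the candidate text prompts, which is what makes this kind of model convenient for quick proofs of concept.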
These foundation models now enable rapid validation of use cases without building a full pipeline.
For example, eleven developed a proof of concept within a few days for a construction company to detect recycling errors, without the need to collect thousands of images.
These approaches also prove valuable in document understanding for banks and insurance companies, where they help detect fraud patterns, and in the automated interpretation of dashboards, for instance to provide daily summaries to executives in a private equity fund.
Once the value has been demonstrated through a proof of concept, two scenarios emerge: either the use case is narrow and well defined, or it requires a broader level of understanding.
In the first case, traditional computer vision methods can be leveraged to improve precision and robustness. Lightweight CNN models are often sufficient, but pre-trained transformer-based models such as DINOv3 can also be used to rapidly enhance performance.
This type of approach has enabled eleven to achieve a high level of accuracy in identifying Regions of Interest on satellite imagery.
In the second case, the VLM-based approach can be further optimised.
This can involve improving the model itself, either by selecting a better-performing model or, if the model is hosted internally, by applying light fine-tuning. It can also involve iterating on prompts with business users to identify the most effective wording and parameters.
For example, eleven fine-tuned image generation models for a leading luxury player to better reflect the brand’s visual tone of voice, at a very low training cost.
Thanks to the rapid evolution of computer vision techniques, and more specifically VLMs, leveraging images and videos has never been easier. Each industry should consider these approaches to unlock the still underexploited value of its visual assets.
At eleven, we support organisations from proof of concept through to industrialisation of their use cases, whether in visual recognition, document analysis or custom image generation.
📩 Contact our teams to identify high-impact use cases and build an AI roadmap tailored to your business challenges.