AI Image Classification Benchmarking
TL;DR
For image classification at 6-7/10 description quality, use Gemini 2.5 Flash Lite on Google AI (Gemini API). We benchmarked 10 model and platform combinations across open source and proprietary options. Flash Lite averaged 2.4 seconds per image and roughly 4 cents per thousand classifications. Every open source setup we tried either cost more, ran slower, or both.
Overview
We’re conducting discovery for a client who plans to build a complex AI app. The app will work with a lot of images, and one of the problems to solve is image classification. We want an LLM to classify user images for subsequent use by other AI agents within the overall workflow: one that will quickly, reliably, and inexpensively generate good descriptions of the contents of an image.
Problem to Solve
We want to hit the optimal combination of image description quality, cost, and speed. And we want to do the above in an architecture that allows us to change our minds later.
Why would anyone want this?
Since jumping into AI web development, I have almost constantly had a need for an AI to 'see' an image. Sometimes I need an AI to evaluate a screenshot of a web page. Sometimes I need an AI to place text on an image. Sometimes I need an AI to understand what is in an image. The list goes on. In fact, in any web app where AI is going to help a human complete a task that is at all visual, the AI needs to 'see' as well.
What we're not worried about
We're not worried about the readability of the image descriptions. Humans will never read our classifications; only LLMs will. We could send across a list of keywords if we wanted to: 'roses, ball, chickens, sunset'. We just want the LLM receiving and working with our images to know what's in the image.
We're not worried about being scientific. This new app is not even in beta. We can optimize and fine-tune later. For the moment, we just want to spend a morning benchmarking a list of good options, pick one, and move on to the next problem.
We're not worried about perfect. Gone are the days when the technical decisions you make in a web project are permanent. Everything I test below is basically hot-swappable. In minutes, I can have a coding AI spin up 3 new testing options. And in the subsequent app we launch, replacing one option with another will likely take only a few days and not much expense. We're in the AI age now, and web development has drastically changed.
Assumptions going in
We’re assuming that there are a lot of image-to-text LLMs to choose from. We’re assuming that there are a lot of LLM providers. We’re assuming that there are decent open source options. We’re assuming that open source will be a way to save money. We’re assuming that hosting an open source LLM ourselves would be most cost effective once we reach a certain volume. We’re assuming that image-to-text LLMs will vary greatly in quality. We’re assuming that image-to-text LLMs are all pretty slow. We’re assuming that we’ll need an architecture that batches LLM image processing in a queue because it’s so slow.
Solutions tested
We tested open source LLMs hosted on a Mac. We tested open source LLMs hosted on DigitalOcean Droplets (VPS, non-GPU servers). We tested containerized hyper-scaling providers. We tested serverless LLM providers.
Solutions not tested
We did not try to train our own model. That’s crazy. We did not try image classification within the user’s browser, although this is apparently possible. We did not test on AWS, Google Cloud Vertex AI, or other DIY cloud providers. We did not test on DigitalOcean Serverless Inference; they do not currently support vision LLMs, as those features are turned off for the moment.
LLMs tested
We tested Moondream 2, which is an open source LLM for image classification. We tested Llama 3.2 Vision from Meta. We tested Qwen 2.5-VL. We tested Kimi K2.6 (Moonshot AI multimodal LLM). We tested Gemini 2.5 Flash Lite on Google AI (Gemini API). We tested Claude 3 Haiku from Anthropic. We tested GPT-4o mini from OpenAI.
Platforms tested
Since we were first focused on the LLMs rather than hardware, we tested all models on a local MacBook Air. Once we found something that worked fairly well but was still open source, Moondream 2, we launched Ollama and Moondream on a DigitalOcean Droplet. This was just a normal CPU droplet, no GPU in this case. We just wanted to see how bad this simple approach might be.
Next we tried Cloudflare Workers AI. We’ve heard great things about this and couldn’t wait to give it a chance.
Next we tried Ollama Cloud. With such a great experience from the Ollama package running on a local Mac, we thought Ollama’s cloud offering deserved a chance.
We had used Replicate before and continue to do so for Imagiterate, a plugin we released for Strapi, Contentful, and similar platforms. The experience was great. So we tested Replicate for this use case as well. Replicate was acquired by Cloudflare in late 2025 and is still operating as a distinct brand with separate infrastructure as the integration rolls out.
We found RunPod Serverless GPU, which offers fractional GPU deployments for running LLMs.
Baseten provides cloud infrastructure for deploying LLMs and running them behind API endpoints.
OpenAI Platform was tested as well. We’ve run a number of AI web applications through their platform and have been quite pleased. The Anthropic API, which offers models such as Haiku, Sonnet, and Opus, has also been an excellent provider in the past.
Google AI (Gemini API) is also an excellent cloud AI platform. We hadn’t used it before this project, but you’ll see from the results it was quite compelling.
Methodology
We grabbed 20 images from Unsplash. We spun up a Claude Code instance running Sonnet 4.5. We explained to Claude that we were evaluating image classification LLMs and infrastructure. We wanted Claude to create an interface to allow us to see the metrics of our tests at the same time that we saw the subjective output of the LLMs as they created textual descriptions of our images. Claude set up all of the API calls and data management. We then ran the various permutations and had Claude capture static HTML of those results, which we present to you below.
We kept our prompts extremely simple in the hopes that a subsequent system would use AI tokens sparingly. Our prompt for all tests was, "Describe everything you see in this image in detail."
We provided images that had the kinds of objects and settings that our client's app would subsequently be processing. These were largely going to be locations for events. We also included some images that would break the flow and overall theme, just to make sure that the AIs could discern outliers.
We were not obsessed with being properly scientific. We didn't have control groups, dependent variables, independent variables, and the like. We just had a realistic mix of the ways that a web app might handle image classification on launch day. Three years out, the stack might look really different. We just wanted to know what we could launch with.
We were looking for a fair mix of open source and proprietary LLMs running on a fair mix of easy-to-provision, scalable cloud service providers. We were looking for tools that played well with AI coding platforms, since we wanted AI to do most of our coding work.
We absolutely didn't want to train any models. We only wanted to prompt existing capable models.
We wanted to keep track of token usage and financial estimates. We wanted to be able to evaluate the cost of our solution per thousand images so that our client could make an informed decision about the investment. We wanted to keep track of the amount of time each image classification took, and we wanted a way to allow a human to subjectively cross-check the validity of classifications.
We built a system where we could do a quick test run on 3 images just to make sure that for a given platform, everything was running correctly. We then could switch to running a full batch of 20 images. We included an option to size images down to 512 pixels to test whether this would optimize speed and cost.
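To make that setup concrete, here's a minimal sketch of the kind of batch runner we used. Our actual harness was AI-generated, so this is an approximation: the `classify` stand-in and the use of the sharp library for resizing are our assumptions, not a transcript of the real code.

```ts
import fs from "node:fs/promises";
import sharp from "sharp";

// Hypothetical per-provider classifier: takes raw image bytes, returns a description.
type Classifier = (image: Buffer, prompt: string) => Promise<string>;

const PROMPT = "Describe everything you see in this image in detail.";

async function runBatch(
  classify: Classifier,
  paths: string[],
  opts: { quickTest?: boolean; resizeTo512?: boolean } = {}
): Promise<void> {
  // 3-image smoke test vs. the full 20-image batch.
  const batch = opts.quickTest ? paths.slice(0, 3) : paths;
  const timings: number[] = [];

  for (const path of batch) {
    let image: Buffer = await fs.readFile(path);
    if (opts.resizeTo512) {
      // Shrink the longest edge to 512px to probe the speed/cost impact.
      image = await sharp(image).resize(512, 512, { fit: "inside" }).toBuffer();
    }
    const start = performance.now();
    const description = await classify(image, PROMPT);
    timings.push((performance.now() - start) / 1000);
    console.log(path, description.slice(0, 80));
  }

  const avg = timings.reduce((a, b) => a + b, 0) / timings.length;
  console.log(`avg ${avg.toFixed(1)}s, fastest ${Math.min(...timings).toFixed(1)}s`);
}
```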
We did not host the images over a CDN, though. This would have been a good optimization because most vision LLMs can just hit a public URL, and if that URL were close to the server doing the classifying, speed would improve. We instead just sent images through the APIs in whatever form they would be accepted: base64-encoded, blob, etc.
Findings
We’ve captured our findings for this guide as interactive charts. You can click around and see for yourself how things landed.
Moondream 2 running through Ollama on a local Mac
This was surprisingly fast. My Mac does not have a discrete GPU; it’s just running the modern Apple M3 chip. You’ll see later that my results on this setup were nearly as fast as the fastest cloud providers running proper GPUs. A good server guru friend of mine, Nevin Lyne at Arcustech, explained that my results were due to how well a Mac using the M3 chip manages memory and such.
The average image classification took 2.4 seconds. The fastest was 1.8 seconds.
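For reference, hitting the local Ollama server looks roughly like this (a minimal sketch; it assumes you've already run `ollama pull moondream`):

```ts
import fs from "node:fs/promises";

// Ollama exposes a local REST API on port 11434.
async function classifyWithOllama(imagePath: string): Promise<string> {
  const image = await fs.readFile(imagePath);
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "moondream",
      prompt: "Describe everything you see in this image in detail.",
      images: [image.toString("base64")], // vision models take base64-encoded images
      stream: false,
    }),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}
```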
But of course you can’t launch a production app on my poor little Mac. So…
Moondream 2 running on Ollama on a DigitalOcean VPS CPU Droplet
This is the kind of DigitalOcean virtual private server that you would run any normal website on. It's made for PHP apps and classic stuff like that. It's not made for AI.
Why on earth would someone want to host an LLM like this?
You wouldn’t. This is dumb, as you can see from the numbers. But imagine you had a use case where image classification was not time sensitive. Imagine you had tons of images to process every day. If these two conditions held, this straight VPS CPU setup would be quite cost effective.
The average image classification took 49.3 seconds. The fastest was 45.4 seconds. And the cost for 1,000 is projected to be roughly 1.68 bucks. This approach scales very nicely, though. For the same money, process 10,000 if you want, but you’ll wait the better part of a week for results (10,000 images at roughly 49 seconds each is about five and a half days of serial processing).
Qwen on Cloudflare Workers AI
I was quite excited about this one, and the results were very good for the cost. Cloudflare, through its killer Workers edge computing infrastructure, offers numerous LLMs for very agreeable pricing.
The average image classification took 8 seconds. The fastest was 3.5 seconds. And the cost for 1,000 is projected to be roughly 17 cents.
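Here's a minimal sketch of what a Workers AI call looks like from inside a Worker. The model slug below is illustrative (llava rather than the Qwen model we used), since vision model availability on Workers AI changes; check their current catalog before copying this.

```ts
// The Ai type comes from @cloudflare/workers-types.
export interface Env {
  AI: Ai; // Workers AI binding, configured in wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Raw image bytes posted to the Worker.
    const bytes = new Uint8Array(await request.arrayBuffer());

    // Model slug is illustrative; swap in the vision model you actually use.
    const result = await env.AI.run("@cf/llava-hf/llava-1.5-7b-hf", {
      prompt: "Describe everything you see in this image in detail.",
      image: [...bytes], // vision models here take the image as an array of bytes
    });

    return Response.json(result);
  },
};
```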
Ollama Cloud
I apologize for the following torture from Ollama Cloud: it was very slow. I suspect it was my specific use case; my fault. Images and LLMs are hard on poor planet Earth, and a lot must be done to optimize a tech stack to run image-processing AI.
The average image classification took 11.1 seconds. The fastest was 7.2 seconds. And the cost for 1,000 is projected to be roughly 33 cents.
Moondream on Replicate
Replicate is an AI provider that specializes in image processing, which fits our use case exactly. Replicate is crazy easy to use and crazy fast. They offer open source image-processing LLMs, so the costs are decent. Their results were very good.
The average image classification took 4.1 seconds. The fastest was 2.4 seconds. And the cost for 1,000 is projected to be roughly 91 cents.
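Calling a model through Replicate's JavaScript client is about as simple as it gets. A sketch, with an illustrative model slug; pin the exact Moondream model and version from replicate.com before relying on it.

```ts
import Replicate from "replicate";

// Reads REPLICATE_API_TOKEN from the environment.
const replicate = new Replicate();

// Model identifier is illustrative, not a recommendation.
const output = await replicate.run("lucataco/moondream2", {
  input: {
    image: "https://example.com/venue.jpg", // Replicate accepts public URLs or data URIs
    prompt: "Describe everything you see in this image in detail.",
  },
});

console.log(output);
```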
Moondream on RunPod
Sticking with the assumption that an open source LLM would save us money, we moved over to run Moondream 2 on RunPod. RunPod was super easy to use and showed a lot of promise for versatility and overall readiness to manage a bunch of weird use cases.
The average image classification took 1.7 seconds. The fastest was 1.3 seconds. And the cost for 1,000 is projected to be roughly 4.43 bucks. Ouch!
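For the curious, a RunPod serverless endpoint is just an HTTPS call. A minimal sketch; the `input` payload is whatever your deployed handler expects, so the `image` and `prompt` fields here are our assumption, not RunPod's schema.

```ts
// Synchronous call to a RunPod serverless endpoint.
async function classifyWithRunPod(base64Image: string): Promise<unknown> {
  const res = await fetch(
    `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID}/runsync`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.RUNPOD_API_KEY}`,
      },
      body: JSON.stringify({
        input: {
          image: base64Image, // fields defined by your own handler
          prompt: "Describe everything you see in this image in detail.",
        },
      }),
    }
  );
  return res.json();
}
```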
Kimi K2.6 on Baseten
Another cloud LLM services provider that was very convenient and easy to use was Baseten. It was slightly better in price than RunPod for this application, but still too expensive.
The average image classification took 5.5 seconds. The fastest was 3.7 seconds. And the cost for 1,000 is projected to be roughly 3.48 bucks. Ouch!
GPT-4o mini on OpenAI and Claude Haiku on Anthropic
You’re all familiar with these providers and their excellent models; both are suddenly billion-dollar companies now. I’ll combine their results into the overall chart below. They are overkill for our use case.
GPT-4o mini averaged 6.4 seconds per image at roughly 21 cents per thousand. Claude Haiku averaged 4.6 seconds at roughly $1.87 per thousand. Both produce high quality descriptions, but for simple classification work where humans never read the output, you're paying for capability you won't use. See the chart for the full breakdown including the resized image runs.
Gemini 2.5 Flash Lite on Google AI
This was the winner.
We assumed an open source LLM would save a lot of dough. But it turned out that a proprietary model running on proprietary infrastructure provided the optimal results. The average image classification took 2.4 seconds. The fastest was 1.6 seconds. And the cost for 1,000 is projected to be roughly 4 cents. Nice!
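For completeness, here's roughly what a Flash Lite call looks like with the official @google/genai SDK (a sketch; it assumes a `GEMINI_API_KEY` in the environment):

```ts
import { GoogleGenAI } from "@google/genai";
import fs from "node:fs/promises";

// With no explicit key, the SDK reads GEMINI_API_KEY from the environment.
const ai = new GoogleGenAI({});

async function classifyWithGemini(imagePath: string): Promise<string> {
  const image = await fs.readFile(imagePath);
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-lite",
    contents: [
      // Inline the image as base64 alongside the text prompt.
      { inlineData: { mimeType: "image/jpeg", data: image.toString("base64") } },
      { text: "Describe everything you see in this image in detail." },
    ],
  });
  return response.text ?? "";
}
```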
In this interactive chart you’ll see that we tested each of our best performing solutions with both the default image size and a resized image. We scaled the image down to 512 pixels to see if the LLM could process it faster.
This will turn out to be important once we scale. Some of these models and tech stacks charge per image tile instead of per image. A tile is typically 512×512 pixels, so if you can optimize down to that size, you can save some money too, depending on conditions.
Also note that in the interactive charts below you can click around to read the quality of the image descriptions from each LLM we used.