How a $1 Qwen3-VL Fine-Tune Beat Gemini 3

Eloy Martinez · 15 min read

Can a $1 fine-tune beat a state-of-the-art closed-source model?

| Model | Accuracy | Time (98 samples) | Cost/Run |
| --- | --- | --- | --- |
| Base Qwen3-VL-8B | 54.1% | ~10 sec | $0.003 |
| Gemini 3 Flash | 82.7% | 2 min 46 sec | $0.016 |
| FT Qwen3-VL | 88.8% | ~10 sec | $0.003 |

Spoiler: yes, let us show you how!

If you find this content interesting, feel free to join our live Fine-Tuning Friday sessions, where we walk through real-world AI use cases live. Here's a recording from the last session; we unfortunately forgot to smash the record button during the live stream, but Eloy kindly took the time to film it again!

Why VLMs?

Before we dive in: quick little recap. What are Vision-Language Models?

VLMs combine vision with language understanding. They take images + text as input and produce text as output. Great for captioning, labeling, describing, and many other tasks. The applications are everywhere:

  • Insurance: Classify car damage, validate claims, flag fraud
  • Healthcare: Identify abnormalities in X-rays, MRIs, pathology slides
  • Manufacturing: Spot defects on production lines in real time
  • Media & Entertainment: Scan video frames for unwanted brands or logos
  • Retail: Automated quality inspection and inventory counting
  • Agriculture: Detect crop disease from drone imagery

The Setup

It's no surprise that we at Oxen.ai are big fans of open-source models, and we have a lot of fun fine-tuning them. Continuing our tradition (which began when we showed how efficiently Qwen-Image-Edit can be fine-tuned), we wanted to put Alibaba's Qwen3-VL to the test and see how it stacks up against a heavyweight closed-source model from one of the other top labs.

Car damage detection is a real problem, and not just for folks like me browsing Facebook Marketplace. Insurance companies process thousands of damage claims daily, and manually reviewing photos is slow and expensive. A model that can automatically classify damage types could save these companies hours of work (and tons of money). Let's see if we can build one.

The task is simple:

Input: Photo of damaged car

Output: One of three classes: scratch, dent, or crack

This is a pretty interesting challenge for a few reasons. Labeling this type of damage is tough even for humans: the visual differences between a crack and a scratch are subtle, and a dent with scraped paint can easily be mistaken for either.

The sheer scale of the problem makes it very cost-sensitive. Insurance companies process thousands of claims daily, and even saving a few cents per inference request can translate to millions of dollars in savings. On top of that, new accidents happen every day, so the data you're processing will constantly drift outside the training distribution. The model needs to actually learn the task; it can't get away with memorizing samples.

And of course prompt engineering gets tough when you're passing in images as input: the context gets saturated fast and performance tends to be terrible. That's actually a great indicator that fine-tuning is worth exploring. If you can't prompt your way out of a problem, fine-tuning is usually the next lever to pull.

The Dataset

Luckily, I don't have to scrape together a damage dataset from Facebook Marketplace myself. I can use CarDD (Car Damage Detection), a publicly available dataset out of USTC containing ~4,000 high-resolution images with multiple annotation types, originally designed for detection and segmentation.

One small problem: remember how this is supposed to be a quick test? Let's downsample this dataset a bit.

This dataset includes many images with multiple damage types per photo, which would complicate our evaluation. To keep things simple, we filtered down to images with only a single damage category and picked three classes to focus on. Worth noting: we absolutely could train Qwen3-VL on multiple labels (Oxen.ai supports multi-category labeling out of the box), but for this proof of concept, single labels keep things clean.

| Original CarDD | Our Subset |
| --- | --- |
| ~4,000 images | 417 images |
| Detection + segmentation | Classification only |
| Multiple damage types | 3 classes: crack, dent, scratch |
| Complex labels | Single labels |

Here's what our final split looks like:

Small, clean, and ready to train. If you want to explore the data yourself, here's the Oxen.ai repo we created to store it.

Time to answer the question:

Can we teach a small open-source model to do this better than a giant proprietary one? (and how?)

A great way to approach this kind of experiment is to follow the recipe laid out by Karpathy in his blog post A Recipe for Training Neural Networks. If you haven't read it, no worries, we'll go over it step by step during this little exercise.

Become One with the Data

"Spend quality time with your data. Look at examples. Understand the distribution. Your brain is surprisingly good at recognizing patterns."

First step in the recipe: become one with the data. Spend more time than you'd be comfortable with just looking at samples and trying to recognize patterns across the images. What can the model realistically learn? How can you improve your dataset to squeeze even better performance out of your fine-tuned model?

One big advantage of using Oxen.ai is that we get to actually see our data, browse it, query it, spot issues visually. And sure enough, we spotted a few problems almost immediately!

Problem 1: Train/Test Leakage

Remember the cardinal rule? Your eval data should never have been seen during training. Turns out we had near-duplicates sneaking across the split:

These are essentially the same photo, just slightly different crops. On a massive dataset this might be noise, but on a 417-image sample, that's the kind of leakage that lets a model memorize instead of learn (no bueno).

Trusting our eyes to recognize data leakage is great, but we're smart oxen: why do the work ourselves when we can write a Python script to catch it systematically?

The idea is simple: if two images are nearly identical, their embeddings should be nearly identical too. We used CLIP to embed every image in the dataset and computed cosine similarity between all test-train pairs, basically a score from 0 to 1 that tells you how similar two images are.

Anything above 0.95 is a red flag. We found 126 suspicious pairs, removed 19 leaked test images, and ended up with a clean split of 98 test samples ready for evaluation.
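
Here's a minimal sketch of that script; the folder layout, file pattern, and CLIP checkpoint are placeholders rather than our exact code:

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    # Embed images with CLIP and L2-normalize, so a dot product equals cosine similarity.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

train_paths = sorted(Path("train").glob("*.jpg"))  # placeholder folders
test_paths = sorted(Path("test").glob("*.jpg"))
train_emb, test_emb = embed(train_paths), embed(test_paths)

# Cosine similarity between every test image and every train image.
# A few hundred images fit in memory; batch the embedding step for larger sets.
sims = test_emb @ train_emb.T  # shape: (n_test, n_train)

THRESHOLD = 0.95
for i, test_path in enumerate(test_paths):
    best_sim, j = sims[i].max(dim=0)
    if best_sim.item() > THRESHOLD:
        print(f"Possible leak: {test_path} ~ {train_paths[j.item()]} (sim={best_sim.item():.3f})")
```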

The 0.95 threshold is a bit arbitrary; these similarity scores naturally run high since we're comparing images of cars with other images of cars. One way to get more meaningful similarity scores would be to fine-tune CLIP itself to produce better-separated embeddings for this domain.

After shuffling and cleaning, no pairs exceeded our threshold. Clean dataset in hand, we can move on!

Problem 2: Multi-label Reality

Here's the other thing we noticed: the subset of the dataset we chose uses single labels, but real car damage is often multi-class. A crash that dents the metal usually also scratches the paint. We used majority-vote labeling from the original annotations, but keep this in mind when we get to the "impossible" images later.

What Does the Data Look Like?

Here are some examples from each class:

cracks, dents and scratches

Key observation: cracks are the hardest to distinguish. A little intuition we can get from just looking at the data: when a car visibly has several damage types, cracks tend to take precedence in the labeling. This is a quirk of the dataset itself and might be why cracks are so hard to predict.

How VLMs Actually Work

Before we start training, let's quickly understand what we're fine-tuning. A Vision-Language Model has three main components:

Image → [Vision Encoder] → [Projection Layer] → [LLM Backbone] → Text

1. Vision Encoder (ViT): Chops the image into patches (think a 14x14 grid), then produces an embedding vector for each patch. These are your "image tokens."

2. Projection Layer (MLP): The vision encoder and the LLM speak different "languages": their embedding spaces have different dimensions. The projection layer is just an MLP that maps vision vectors into the same dimensional space as text tokens. It's not learning semantics, just translating coordinate systems.

3. LLM Backbone (Qwen): Here's the key insight: image tokens and text tokens get concatenated into a single flat sequence, and standard self-attention runs over the whole thing:

[img_patch_1] [img_patch_2] ... [img_patch_N] [text_tok_1] [text_tok_2] ... [text_tok_M]

The model doesn't care which tokens came from an image and which came from text. Every token attends to every other token. When the model processes "What damage is this?", the text tokens can attend to the image patches showing the crack, and vice versa. No special cross-attention mechanism; the model learns to relate modalities purely through regular self-attention.
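
To make the shapes concrete, here's a toy PyTorch sketch of that flow; the dimensions and the two-layer projector are illustrative, not Qwen3-VL's actual implementation:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096        # illustrative sizes, not Qwen3-VL's real ones
num_patches, num_text_tokens = 196, 6   # 14x14 patch grid + a short prompt

# 1. Vision encoder output: one embedding per image patch ("image tokens").
patch_embeddings = torch.randn(1, num_patches, vision_dim)

# 2. Projection layer: an MLP mapping vision vectors into the LLM's embedding space.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
image_tokens = projector(patch_embeddings)            # (1, 196, 4096)

# 3. Text embeddings for a prompt like "What damage is this?".
text_tokens = torch.randn(1, num_text_tokens, llm_dim)

# 4. One flat sequence; the LLM's ordinary self-attention runs over all of it.
sequence = torch.cat([image_tokens, text_tokens], dim=1)
print(sequence.shape)                                  # torch.Size([1, 202, 4096])
```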

Setting Up Training + Baselines

"Get a simple training loop running. Establish baselines before doing anything fancy."

We need a number to hill-climb against; baselines are essential when it comes to evals.

VRAM Back-of-the-Napkin Math

This is a good rule of thumb we use at Oxen.ai to decide which GPU is best for running a particular model. When you use Oxen, you don't have to think about this: we pick the GPU that gives you the best cost efficiency, so you know you're not leaving money on the table.

Nevertheless, it's a good concept to keep in mind, since it lets you ballpark how much inference will cost.

| Precision | Bytes/Param | 8B Model | 70B Model |
| --- | --- | --- | --- |
| FP16/BF16 | 2 | 16 GB | 140 GB |
| INT8 | 1 | 8 GB | 70 GB |
| INT4 | 0.5 | 4 GB | 35 GB |

Qwen3-VL-8B at BF16 = 16 GB base. Full fine-tuning needs ~3x for optimizer states = 48 GB.
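
As a quick worked example of the rule of thumb (the ~3x multiplier for full fine-tuning is a rough heuristic covering weights, gradients, and optimizer states):

```python
def vram_estimate_gb(params_billion, bytes_per_param, full_finetune=False):
    """Rough VRAM estimate: weights only, or ~3x that for full fine-tuning."""
    base = params_billion * bytes_per_param
    return base * 3 if full_finetune else base

print(vram_estimate_gb(8, 2))                       # Qwen3-VL-8B in BF16 -> 16 GB
print(vram_estimate_gb(8, 2, full_finetune=True))   # full fine-tune -> ~48 GB
print(vram_estimate_gb(8, 0.5))                     # INT4 inference -> ~4 GB
```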

Baselines

Here's how different models performed out of the box:

| Model | Accuracy | Notes |
| --- | --- | --- |
| Random guess | 33.3% | 3 classes |
| Base Qwen3-VL-8B | 54.1% | Zero-shot |
| Gemini 2.0 Flash | 78.6% | API call |
| Gemini 3 Flash | 82.7% | API call |
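
For reference, a zero-shot call to the base model looks roughly like this; the hub id, prompt wording, and chat-template flow are assumptions, not our exact eval harness:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed hub id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("damaged_car.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Classify the car damage as one of: crack, dent, scratch. "
                                 "Answer with a single word."},
    ],
}]

# Render the chat template, then let the processor pair the text with the image.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
answer = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer.strip().lower())
```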

The base model is actually worse than random on cracks, only 19.4% correct! Let's look at the confusion matrix:

Confusion matrix for the base Qwen3-VL-8B model.

The model straight up doesn't know what car cracks look like. It keeps calling them scratches or dents.

Overfit First (The Sanity Check)

"Verify your model can learn. Overfit a tiny dataset to loss ~0."

Before training on the full dataset, we need to prove the model can learn. This is a quick sanity check to make sure the training pipeline has no bugs and the data is in the correct format and properly flowing through.

  1. Take 10 training samples
  2. Train for 100+ steps
  3. Watch loss go to ~0

View the overfitting run on Oxen.ai

If this fails, it's a bug in your code or data, not the model. Ours went to ~0 as expected. We're good to go.
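
In code, the check is just a loop over the same handful of batches until the loss collapses. Here's a minimal sketch where model and batches stand in for your VLM and ten pre-processed samples (hypothetical names, not our actual training script):

```python
import torch

def overfit_check(model, batches, steps=150, lr=2e-4):
    """Train on a fixed handful of samples and watch the loss head toward ~0."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        batch = batches[step % len(batches)]   # cycle over the same ~10 samples
        loss = model(**batch).loss             # HF-style models return a .loss when labels are present
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % 10 == 0:
            print(f"step {step:3d}  loss {loss.item():.4f}")
    # If the loss refuses to collapse, suspect the data pipeline, not the model.
```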

VLM Quirks We Discovered

These are a few quirks we discovered while overfitting the model during our initial implementation.

Quirk 1: Loss stalls and won't go down

We had completion_only_loss=False, which meant the model was penalized for not predicting image tokens. But it's not an image generator! Once we set it to only compute loss on the text output (also known as train_on_input=False or mask_prompt=True in other frameworks), training started working properly.
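
If you're training with TRL, the fix is a one-line change in the SFT config; a sketch with every other argument omitted (flag availability depends on your TRL version):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="qwen3-vl-car-damage",
    completion_only_loss=True,  # only score the model on the text it generates,
                                # not on the prompt or image tokens
)
```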

Quirk 2: Training looks good, inference = base model

We trained LoRA on Qwen3-VL, loss decreased beautifully, everything looked great. Deployed with SGLang, applied LoRA at runtime... and results were nearly identical to the base model.

The culprit? SGLang's LoRA filter only matches LLM layers, not vision layers:

# From sglang/srt/models/qwen3_vl.py
_lora_pattern = re.compile(
    r"^model\.layers\.(\d+)\.(self_attn|mlp)\.(qkv_proj|o_proj|...)$"
)

Only model.layers.{N}.* matches; vision modules like visual.* get silently skipped! We had to merge the LoRA weights into the base model instead of applying them at runtime.
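
Merging is straightforward with PEFT. A sketch, with the hub id and adapter path as placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base_id = "Qwen/Qwen3-VL-8B-Instruct"               # assumed hub id
base = AutoModelForImageTextToText.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()

merged.save_pretrained("qwen3-vl-8b-car-damage-merged")
AutoProcessor.from_pretrained(base_id).save_pretrained("qwen3-vl-8b-car-damage-merged")
# Serve the merged checkpoint directly; no runtime LoRA means nothing gets silently skipped.
```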

Using Oxen.ai means you don't have to spend hours debugging cryptic performance issues like this. We feel the pain so you don't have to.

Tune

"Now train on the full dataset, adjust hyperparameters, and iterate."

Here's where the fun begins!

LoRA (Low-Rank Adaptation):

Instead of updating every weight in the model, we add small low-rank adapter matrices to the attention layers. The original weights stay frozen, and the adapter outputs get summed in.

LoRA Hyperparameters Deep Dive

These are the main hyperparameters you might want to play with when fine-tuning a LoRA for Qwen3-VL. You can change all of these when training on Oxen.ai, though we give you smart defaults so you can get going without thinking too much about it.

| Parameter | Value | What It Does |
| --- | --- | --- |
| LoRA rank (r) | 64 | Size of low-rank matrices |
| LoRA alpha | 16 | Scaling factor |
| Learning rate | 2e-4 | Step size for updates |
| Batch size | 4 | Images per update |
| Epochs | 3 | Passes through data |
| Target modules | q,k,v,o_proj | Which attention layers to adapt |
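
Expressed as a PEFT config, the table above looks roughly like this (the dropout value is our own assumption, it isn't listed in the table):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                       # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,          # assumption: not listed in the table above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```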

We'd recommend experimenting with different values to squeeze out a bit more performance from your tuned model. Once you have a good baseline run you want to improve on, you can trigger as many concurrent experiments as you want by playing around with these values directly on Oxen!

Scaling Up

All you have to do to run these fine-tunes on Oxen is upload your dataset (1–2 clicks), start a new fine-tune, choose which columns of your dataset you want to train on (4–5 clicks), and wait for the magic! The full 319-sample training run took about 8 minutes and cost just $1!

Here we start to see some interesting results. With these two data points we can start plotting an accuracy curve.

| Training Samples | Accuracy | Improvement | Training Cost |
| --- | --- | --- | --- |
| 0 (base) | 54.1% | — | — |
| 78 | 67.3% | +13.2 pp | $0.50 |
| 319 | 88.8% | +21.5 pp more | $1.00 |

Training loss for 78-sample run.

Training loss for 319-sample run.

Turns out the more (clean) data we have, the better results we get, unsurprisingly. But look at where the improvement happened:

| Class | Base | Fine-tuned | Change |
| --- | --- | --- | --- |
| Crack | 19.4% | 96.8% | +77.4 pp |
| Dent | 75.0% | 78.1% | +3.1 pp |
| Scratch | 65.7% | 91.4% | +25.7 pp |

Weak diagonal, the base model struggles.

Strong diagonal, the fine-tuned model nails it.

The model went from 19.4% to 96.8% on cracks, a +77 percentage point swing. From the intuition we got from looking at the data, we can safely say the model is now classifying cars with several types of damage as cracks. This might not be the most intuitive thing to do as a human, but once again this is a dataset quirk and might very well be how car insurers are taught to assign single-label damage to claims.

Squeeze Out the Juice

Learning curves typically follow a power law: each doubling of the data gives a roughly similar improvement. Knowing this, we can use every statistician's favorite party trick, a simple extrapolation, to estimate what our accuracy would look like if we added more data.

This is obviously not deterministic; the accuracy improvement will strongly depend on the quality and type of data you feed into the model and how hard the task is. But from our two data points we can see that accuracy is not yet plateauing, so a surefire way to improve performance would be to add more clean data.
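
Here's a toy version of that extrapolation, fitting error ≈ a·n^(-b) to our two fine-tuned runs; treat whatever it prints as a rough guide, not a promise:

```python
import numpy as np

n = np.array([78, 319])                 # training samples
acc = np.array([0.673, 0.888])          # accuracy from the table above
err = 1 - acc

# error ≈ a * n^(-b) is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)

for n_new in (500, 1000):
    pred_err = np.exp(intercept) * n_new ** slope
    print(f"{n_new} samples -> ~{(1 - pred_err) * 100:.1f}% accuracy (extrapolated)")
```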

We can clearly see the diminishing returns: the closer you get to 100% accuracy, the harder it becomes and the more data you have to add. At some point you need to evaluate the tradeoff of curating and labeling more data (which is very non-negligible) against ever-smaller accuracy improvements.

What we didn't try (yet):

  • More training data (500–1000 samples)
  • Larger model (Qwen3-VL-32B)
  • Hyperparameter sweeps
  • Quantization to lower inference cost

The Results

Let's start with what fine-tuning actually fixed. We tagged every test image in our dataset so you can query them on Oxen.ai:

Eloy/CarDamageVLM/data/results_all_models.parquet at main
| Tag | Count | Description |
| --- | --- | --- |
| both_right | 48 | Both models correct |
| base_wrong_lora_right | 39 | Fine-tuning fixed these |
| both_wrong | 6 | "Impossible" images |
| lora_wrong_base_right | 5 | Regressions |
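
If you pull the parquet locally, slicing by tag is a one-liner (the column names here are assumptions based on the tags above):

```python
import pandas as pd

df = pd.read_parquet("results_all_models.parquet")
print(df["tag"].value_counts())                         # counts per tag, as in the table
print(df[df["tag"] == "base_wrong_lora_right"].head())  # the images fine-tuning fixed
```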

Before/After: What Fine-Tuning Fixed

These are cracks the base model couldn't recognize. As we can see, there are a few samples in there that we as humans would have a hard time classifying as just "cracks", but the model learned what the training dataset had to teach, so we'll count that as a win!

Final Scoreboard

| Model | Accuracy | Inference Time | Cost/Run | Training Cost |
| --- | --- | --- | --- | --- |
| Base Qwen3-VL-8B | 54.1% | 10s | $0.003 | — |
| Gemini 2.0 Flash (batch) | 78.6% | 203s | $0.0015 | — |
| Gemini 3 Flash | 82.7% | 166s | $0.016 | — |
| LoRA Qwen3-VL | 88.8% | 10s | $0.003 | $1 |

We ran Gemini 2.0 Flash through the batch API expecting it to be quicker; turns out that endpoint is optimized for cost, not speed. They give you a 24-hour window to run your inference at a very cheap price. Good for the wallet, terrible for patience.

Our fine-tuned open-source model didn't just match Gemini 3, it beat it by 6.1 percentage points. And it did so 10x faster, running on a single A10G GPU. The cost difference might look small here, but it becomes a lot more obvious as you scale up the amount of inference you need to run.

Head-to-Head vs Gemini 3 Flash

| Outcome | Count |
| --- | --- |
| LoRA wins (we're right, Gemini wrong) | 10 |
| Gemini wins (Gemini right, we're wrong) | 4 |
| Both correct | 77 |
| Both wrong | 7 |

Net advantage: +6 images over Gemini 3 Flash.

The "Impossible" Images

Both models fail on 7 images. Some of these are genuinely ambiguous, even as humans we'd have a hard time assigning a single label to them.

This is exactly why looking at your data matters. When both a fine-tuned specialist and Gemini 3 agree on a prediction that contradicts the label, maybe the label is wrong. By tagging and querying these edge cases on Oxen, you can identify where your dataset needs work: relabel ambiguous samples, add more examples of tricky cases, or rethink whether your labeling scheme reflects the real-world task.

The Honest Caveats

Let's be real about limitations:

| Caveat | Context |
| --- | --- |
| 98 test samples | Small, but balanced across classes |
| 87 vs 81 correct | Only a 6-sample difference |
| Single domain | Car damage only, not general vision |

But the direction is clear:

  • 88.8% vs 82.7% = consistent gap
  • More training data → more improvement (not plateauing)
  • 10x faster inference is real

What this proves: Domain-specific fine-tuning on small open models can compete with general-purpose giants.

What this doesn't prove: That this works for every domain or that you don't need more data for production.

The Takeaway

Following Karpathy's recipe works:

  1. Become one with the data - Found leakage, understood class imbalance
  2. Set up baselines - Base model: 54.1%, Gemini: 82.7%
  3. Overfit - Verified the pipeline works
  4. Regularize - LoRA gives you regularization for free
  5. Tune - More data helped (88.8% with 319 samples)
  6. Squeeze - Batch inference, merged weights

| Benefit | Value |
| --- | --- |
| Accuracy | +6 pp over Gemini 3 Flash |
| Speed | 10x faster (10s vs 166s) |
| Cost | Time is money |
| Privacy | Data stays on-prem |
| Control | Full model ownership |

The Pragmatic Path Forward

Here's what we've learned: it makes total sense to start with a big lab model. When you're prototyping, Gemini or GPT-4V gives you zero setup, instant results, and a high baseline. Don't pre-optimize, validate the idea first.

But eventually you hit walls. Cost at scale ($0.01/image x 1M images/month = $10K/month), latency (API calls adding hundreds of milliseconds), privacy concerns (sensitive data leaving your infrastructure), and an accuracy ceiling you can't break through with prompt engineering alone.

That's when fine-tuning becomes worth the investment. And the best part? It doesn't have to be painful.

Using Oxen.ai means you don't have to wrangle dependencies, fight CUDA driver issues, or debug cryptic inference bugs. Upload your data, click fine-tune, deploy. What took us days of debugging takes minutes on the platform.

A $1 fine-tune on 319 images beat Google's best model by following a simple recipe. And that's now accessible to everyone.

So what are you waiting for? Come fine-tune Qwen3-VL, or any other model, on Oxen.ai. Fine-tuning is powerful, accessible, and (as we just showed) surprisingly cheap. We'll keep proving it in upcoming posts. Stay tuned!