Published on May 30, 2026

Building a Local V100 Multimodal AI Lab

A redacted hardware-to-inference build log

A public, redacted note on turning a V100 32GB server into a local multimodal inference lab with switchable model profiles and private aggregate verification.

v100
local-ai
multimodal
hardware

This is a redacted build note from a local V100 multimodal AI lab. It keeps network, host, service, validation, input, and answer details private while preserving the useful engineering record.

May 30, 2026

Building a Local V100 Multimodal AI Lab

A redacted hardware-to-inference build log

A server-class host was turned into a local multimodal inference lab around a Tesla V100 PCIe 32GB. The public version focuses on architecture, model tradeoffs, operational guardrails, and aggregate verification.

V100 made usable

The GPU moved from a generic driver state to the NVIDIA 550 driver family with ECC clean and CUDA validation complete.

Local multimodal path

Switchable llama.cpp CUDA profiles served image-aware chat completions through a private OpenAI-style interface.

Redacted V100 local multimodal AI lab illustration

What changed

The useful part of the installation was not one command. It was the chain: confirm the PCIe device and link, install the right driver stack, keep the card thermally supervised, pass it through cleanly, and make model switching repeatable.

Lab pieces that mattered

The public version removes internal addresses and service names, but the engineering pattern is intact.

Hardware baseline
R730-class server, Tesla V100 PCIe 32GB, clean x16-class link, ECC enabled.
Virtualized GPU path
The V100 was passed through into a Linux inference environment while the host kept thermal and power supervision.
Model profiles
InternVL3.5 Q4/Q6, Gemma 4 26B Q4, and Qwen3.6 35B-A3B Q4 became switchable local profiles.
Operational guardrails
Thermal shutdown rules, GPU initialization at boot, and scheduled runtime windows reduced unattended risk and idle cost.
Publishing test
This post exercises registered Step components, charts, gallery, and sanitized HTML in the live blog pipeline.

Model profile snapshot

Rounded local observations
Rounded local observations
ProfileContextVRAM postureThroughput posturePrivate visual QA
Qwen3.6 35B-A3B Q416kabout 24 GiB observedaround 60 decode tokens/scompleted privately
Gemma 4 26B Q48kabout 21 GiB observedaround 60 decode tokens/scompleted privately
InternVL3.5 Q616kabout 25 GiB observedaround 80 decode tokens/scompleted privately
InternVL3.5 Q416kcomfortable on 32GBaround 100 decode tokens/scompleted privately
  • Numbers are rounded local snapshots, not public benchmark claims.
  • Private validation details are intentionally withheld.

Tradeoffs on a 32GB V100

Rounded observations help choose a default profile without exposing private validation material.

Observed VRAM use, rounded GiB
bar benchmark
Highlighted
Qwen3.6 35B-A3B Q4
24
Gemma 4 26B Q4
21
InternVL3.5 Q6
25
InternVL3.5 Q4
22
Decode throughput posture
horizontal-bar benchmark
Highlighted
Qwen3.6 35B-A3B Q4
60
Gemma 4 26B Q4
60
InternVL3.5 Q6
80
InternVL3.5 Q4
100

V100 installation gallery

AreaPublic status
Driver and CUDA pathVerified locally
Switchable model profilesVerified locally
Private multimodal validationCompleted; validation details withheld
Operational disclosureNetwork, host, virtualization, service, and path details redacted

What I would repeat

The best pattern was to treat the V100 as an operations project before treating it as an AI project: prove the card is electrically stable, keep thermals and power visible, then make every model switch observable and reversible.