Jun 3, 2026

FLUX.1 Kontext Dev for Local AI in 2026: Image Editing on Consumer GPUs Without the API Bills

By RunAIHome Team · 13 min read

flux · comfyui · image-editing · local-ai · vram · hardware-guide · gpu · diffusion

TL;DR: FLUX.1 Kontext dev is a 12B open-weight image-editing model from Black Forest Labs. The FP8 checkpoint runs in 12GB VRAM at roughly 2× the speed of the raw BF16 model; an aggressive NF4 quantization squeezes it to 7GB. The API is $0.04 per image, so a used RTX 3090 at ~$1,050 breaks even after about 26,250 edits.

	RTX 4090 (FP8)	RTX 4070 / 3060 12GB (FP8)	8GB GPU (NF4/GGUF)
Best for	Full-speed editing, FP4 on RTX 50-series	Sweet spot: quality + hardware you may already own	Budget entry, slower output
VRAM used	12–14 GB (headroom for FP8)	12–14 GB	7–8 GB
Speed	~2.29 iter/s at NF4 / faster at FP8 TensorRT	~1.5–2.0 iter/s at FP8	~0.6–1.0 iter/s
The catch	Hardware cost is steep if you don’t own one	T5 encoder adds ~6–9 GB RAM overhead	Visible quality loss vs FP8

Honest take: If you own a 12GB+ GPU, run the FP8 checkpoint locally. The setup takes 20 minutes, and with the hardware already paid for, each edit costs only electricity instead of $0.04. Below 12GB, the quality compromise from NF4 is real enough to just use the API unless you’re doing hundreds of edits daily.

What Flux Kontext Is (and Isn’t)

Black Forest Labs released FLUX.1 Kontext Pro on June 1, 2025 as the first model in its Kontext suite. The open-weight [dev] variant followed shortly after. The key distinction: Kontext is not a text-to-image model. It is an image-editing model.

You hand it an existing image and a text instruction — “change the jacket color to red”, “replace the background with a forest”, “make her hold an umbrella” — and it applies that edit while preserving everything else: face identity, lighting, background elements, stylistic consistency. That consistency-across-edits capability is what sets it apart from running an inpaint workflow in standard FLUX.1 dev.

The architecture accepts both a text prompt and one or more reference images as conditioning inputs. Internally, it’s a 12B parameter flow-matching diffusion transformer — same family as FLUX.1 dev, but trained on instruction-following editing tasks rather than pure text-to-image generation. The Pro and Max variants are closed API; the [dev] model is open-weight under the FLUX.1 Non-Commercial License, which restricts the model weights to non-commercial use but permits commercial use of the generated outputs under certain conditions.

If you’re already running ComfyUI, on Linux or any other platform, the Kontext dev workflow slots in without a framework change.

The VRAM Reality: 24GB Native, 7GB Quantized

The raw BF16 safetensors file weighs in at approximately 24 GB on disk (12B parameters at 2 bytes each), right at the VRAM ceiling of an RTX 3090 or RTX 4090. In practice, you need a few GB of headroom for activations and the VAE, so BF16 is tight on 24GB cards and requires lowering resolution to stay within bounds.

The practical tiers, all of which Black Forest Labs and the community have released as ready-to-use checkpoints:

FP8 Scaled (12 GB VRAM required) The recommended path for RTX 40/30-series cards. The file flux1-dev-kontext_fp8_scaled.safetensors is ~12 GB. NVIDIA’s own benchmarks show 2× faster inference vs BF16 PyTorch when running on RTX 40-series hardware, which has FP8 tensor core acceleration. This is the sweet spot: near-full quality, half the memory, faster output.

NF4 / Q4 Quantization (7–8 GB VRAM required) Community GGUF and NF4 checkpoints bring the model to ~7 GB on disk. Black Forest Labs benchmarking reported 97% quality retention vs the full BF16 model at NF4 precision. On an RTX 4090 using NF4, real-world edits benchmark at approximately 2.29 iterations/second — roughly 9 seconds per edit at 20 sampling steps.

FP4 via TensorRT (Blackwell RTX 50-series only) RTX 5060 Ti and other Blackwell GPUs with native FP4 tensor cores can load Kontext at 7 GB through NVIDIA’s TensorRT-RTX. The FP4 path hits similar speeds to FP8 on Ada — the model is smaller in memory, the throughput is comparable, and the quality is close to NF4. This requires the TensorRT-RTX library and NVIDIA’s NIM microservice or a ComfyUI-TensorRT node, not the standard safetensors path.
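The tier sizes above follow from simple arithmetic on the parameter count. Here is a minimal sketch, assuming the weights dominate the footprint and ignoring activations, text encoders and the VAE. The function name and the precision table are illustrative, not part of any tool.

```python
# Rough weight-only footprint of a 12B-parameter model at different precisions.
# Assumption: size = parameters x bits per parameter. Real checkpoints carry
# extra metadata (for example quantization scales), so files run a bit larger.

PARAMS_BILLION = 12  # parameter count quoted for FLUX.1 Kontext dev

BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "NF4/FP4": 4}


def weight_gb(params_billion: float, bits: int) -> float:
    """Return the weight-only size in GB (10^9 bytes)."""
    return params_billion * bits / 8


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(PARAMS_BILLION, bits):.0f} GB")
```

The 4-bit result of about 6 GB sits below the ~7 GB quoted for the NF4 and FP4 checkpoints. The gap is plausibly scale metadata and layers kept at higher precision, but that explanation is an assumption.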

GPU Tier Guide

24GB Cards (RTX 3090, RTX 4090): Run BF16 or FP8

Both the RTX 3090 and RTX 4090 comfortably handle the FP8 checkpoint. The RTX 4090 gains the additional TensorRT 2× speedup from FP8 tensor core acceleration; the RTX 3090 runs FP8 at full quality but without the same hardware-accelerated path, so expect speeds comparable to FP8 on a 40-series midrange rather than the flagship.

If you want to run BF16 on a 24GB card, keep your output resolution at 1024×1024 or below and use 20 steps. Above that resolution, you will hit OOM errors. FP8 is the better choice here: near-identical quality, half the memory, faster.

12–16GB Cards (RTX 4070 12GB, RTX 4060 Ti 16GB, RTX 3060 12GB): FP8 Sweet Spot

The RTX 4070 with 12GB and the RTX 4060 Ti 16GB are arguably the most practical targets for Kontext dev. The FP8 checkpoint fits with 0–2 GB headroom. Speed lands somewhere between a 3090 and 4090 depending on architecture. For Kontext’s editing workload, you’re looking at around 1.5–2.0 iterations/second, so roughly 10–13 seconds of sampling per edit at 20 steps.
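The per-edit times quoted in this guide come from dividing the step count by the sampling rate. A small sketch of that conversion follows. It covers sampling time only, so model loading, text encoding and VAE decode would add to it, and the labels are just shorthand for figures quoted in the article.

```python
def seconds_per_edit(steps: int, iterations_per_second: float) -> float:
    """Sampling time for one edit; excludes loading, encoding and decoding."""
    return steps / iterations_per_second


# Rates quoted in this guide, all at 20 sampling steps.
quoted_rates = [
    ("RTX 4090, NF4", 2.29),
    ("12GB card, FP8, fast end", 2.0),
    ("12GB card, FP8, slow end", 1.5),
]

for label, rate in quoted_rates:
    print(f"{label}: {seconds_per_edit(20, rate):.1f} s")
```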

The RTX 3060 12GB is the minimum for running FP8 without offloading. It works; the speed is modest (an estimated ~12–18 seconds per edit at FP8), and you will need to keep output resolution conservative. But it runs.

One practical issue on 12GB cards: the T5-XXL text encoder consumes 4–9 GB of system RAM depending on precision. Loaded at FP16, it adds roughly 9 GB of system RAM usage. Use the FP8-scaled T5 encoder (t5xxl_fp8_e4m3fn_scaled.safetensors) to keep RAM pressure manageable.

8GB Cards (RTX 3060 8GB, RTX 4060 8GB, RTX 5060 8GB): NF4/GGUF Only

An 8GB card requires NF4 or a GGUF quantization. With the 7GB NF4 checkpoint, there’s 1 GB of headroom — fine for small resolution (768×768), tight for 1024×1024. Black Forest Labs reported 97% quality retention at NF4; in practice, you’ll notice softened fine detail in complex scenes and slightly reduced text rendering compared to FP8, but for most portrait and product edits the output is usable.

GGUF variants in the Q4 range (4–7 GB) are available from the QuantStack repository on Hugging Face. Place these in the models/unet/ directory and load them with the ComfyUI-GGUF custom node rather than the standard diffusion model loader.
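The tier guide above boils down to a lookup on VRAM size. A minimal sketch of that decision, using the thresholds from this section, is shown here. The function name is hypothetical, and the cutoffs are the article's rules of thumb rather than hard limits.

```python
def recommend_checkpoint(vram_gb: float) -> str:
    """Map a card's VRAM to the checkpoint tier suggested in this guide."""
    if vram_gb >= 24:
        return "FP8 (BF16 possible at 1024x1024 or below)"
    if vram_gb >= 12:
        return "FP8"
    if vram_gb >= 8:
        return "NF4 or Q4 GGUF"
    return "below the tiers covered here"


for card, vram in [("RTX 4090", 24), ("RTX 4060 Ti 16GB", 16), ("RTX 3060 12GB", 12), ("RTX 4060 8GB", 8)]:
    print(f"{card}: {recommend_checkpoint(vram)}")
```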

ComfyUI Setup: 20 Minutes Start to First Edit

Prerequisites

ComfyUI v0.3.42 or newer — the Kontext workflow nodes were added in this release and are not available in older builds
30–50 GB of free storage (accounting for model files + working cache)
Python 3.11 or 3.12 with PyTorch 2.4+
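The version prerequisites can be checked with a few lines of Python. This is a sketch only: it assumes you supply the ComfyUI version string yourself (how you read it from your install is up to you) and that the string is plain dotted numbers. The helper names are hypothetical.

```python
import sys

MIN_COMFYUI = (0, 3, 42)                  # first release with the Kontext nodes
SUPPORTED_PYTHON = {(3, 11), (3, 12)}     # versions listed in the prerequisites


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v0.3.42' or '0.3.42' into (0, 3, 42). Suffixes like '-dev' are not handled."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def comfyui_ok(version: str) -> bool:
    """True if the given ComfyUI version meets the minimum."""
    return parse_version(version) >= MIN_COMFYUI


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter's major.minor is one of the listed versions."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


print("Python OK:", python_ok())
print("ComfyUI v0.3.41 OK:", comfyui_ok("v0.3.41"))
```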

Download the Model Files

You need four files across three component types:

1. Diffusion model — place in ComfyUI/models/diffusion_models/

For FP8 (recommended for 12GB+ VRAM):

flux1-dev-kontext_fp8_scaled.safetensors  (~12 GB)

Download from the Black Forest Labs Hugging Face repository.

For NF4/GGUF (8–12GB VRAM): Use any Q4–Q8 GGUF from QuantStack’s FLUX.1-Kontext-dev-GGUF repo. Place in ComfyUI/models/unet/ and use the GGUF loader node.

2. VAE — place in ComfyUI/models/vae/

ae.safetensors

This is shared with standard FLUX.1 dev — you likely already have it.

3. Text encoders — place in ComfyUI/models/text_encoders/

clip_l.safetensors
t5xxl_fp8_e4m3fn_scaled.safetensors   (FP8, recommended — saves ~5 GB RAM vs FP16)
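Before loading the workflow, it can help to confirm that every file landed in the right folder. The sketch below checks the FP8 layout described above. It assumes a ComfyUI root directory that you pass in, and the function name is hypothetical. For a GGUF setup, swap the diffusion model entry for your file under models/unet.

```python
from pathlib import Path

# Folder -> expected files, as listed in this guide for the FP8 path.
EXPECTED_FILES = {
    "models/diffusion_models": ["flux1-dev-kontext_fp8_scaled.safetensors"],
    "models/vae": ["ae.safetensors"],
    "models/text_encoders": [
        "clip_l.safetensors",
        "t5xxl_fp8_e4m3fn_scaled.safetensors",
    ],
}


def missing_files(comfyui_root: str) -> list[str]:
    """Return the relative paths of expected files that are not present."""
    root = Path(comfyui_root)
    missing = []
    for folder, names in EXPECTED_FILES.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


# Example: point this at your own install location.
for path in missing_files("ComfyUI"):
    print("missing:", path)
```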

Load the Workflow

The ComfyUI docs provide an official native workflow JSON. Download it, drag it onto your ComfyUI canvas. If nodes appear red after loading:

Open ComfyUI Manager → Install Missing Custom Nodes
Restart ComfyUI
Reload the workflow

The Basic Editing Loop

A Kontext workflow has three key nodes the standard text-to-image pipeline lacks:

Load Image — your source image input
FluxKontextImageConditioning — binds the source image to the conditioning
KSampler with the Kontext diffusion model loaded

Once wired up, your workflow runs as:

Load source image
Enter your edit instruction in the text prompt
Set steps to 20, guidance 3.5–5.0
Sample → decoded output is your edited image

The model respects the non-edited regions well at guidance 3.5–4.0. Higher guidance (5.0+) makes edits more aggressive but can introduce artifacts in background regions.

Typical settings for portrait editing:
  Steps: 20
  CFG/Guidance: 3.5–4.0
  Resolution: 1024×1024
  Sampler: euler, scheduler: simple
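If you drive ComfyUI from a script, the settings above can be applied to a workflow exported in API format and queued over HTTP. The sketch below assumes the exported workflow contains KSampler and FluxGuidance nodes and that ComfyUI is listening on its default local address. Check both against your own export. The function names are hypothetical, and resolution is left untouched because it depends on how your workflow sizes the image.

```python
import json
import urllib.request


def apply_portrait_settings(workflow: dict, guidance: float = 3.5) -> dict:
    """Return a copy of an API-format workflow with the settings above applied."""
    patched = json.loads(json.dumps(workflow))  # deep copy via JSON round-trip
    for node in patched.values():
        inputs = node.setdefault("inputs", {})
        if node.get("class_type") == "KSampler":
            inputs.update(steps=20, sampler_name="euler", scheduler="simple")
        elif node.get("class_type") == "FluxGuidance":
            inputs["guidance"] = guidance
    return patched


def queue_workflow(workflow: dict, host: str = "http://127.0.0.1:8188") -> dict:
    """POST the workflow to a running ComfyUI server's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

A typical use would be to load the exported JSON with `json.load`, pass it through `apply_portrait_settings`, and hand the result to `queue_workflow` while ComfyUI is running.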

Local vs API: The Cost Math

The fal.ai API for Kontext Pro costs $0.04 per image. At 20 images a day — a moderate creative workflow — that’s $0.80/day, $24/month, $288/year. At 100 images a day (a production pipeline or active project), it’s $4/day, $120/month, $1,440/year.

Running locally with an RTX 3090 (used price ~$1,050 as of June 2026 per eBay completed listings):

Break-even on hardware: $1,050 / $0.04 = 26,250 images
At 100 edits/day: ~263 days to break even
Electricity at ~350W: $0.062/hour at the 17.65¢/kWh US average — effectively free compared to the API cost per image
After break-even: ~$0 per edit vs $0.04 indefinitely

At 20 edits/day the break-even stretches past three years, which makes the API more rational if you’re casual. At 100+ edits/day, local pays off in roughly nine months on a used 3090.

The calculus shifts for an RTX 4090 at ~$2,300 used: break-even at 57,500 images. At 100 edits/day, that’s well over a year. Reasonable only for a professional using the model heavily. The 4090 also runs FP8 with full TensorRT acceleration, which cuts generation time nearly in half vs the 3090 — at scale, that speed matters.
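The break-even figures above can be reproduced, or rerun with your own prices, using a few lines of arithmetic. This sketch uses only the numbers quoted in this section and, like the figures above, leaves electricity out of the break-even itself. The function names are illustrative.

```python
import math

API_PRICE = 0.04  # USD per image, as quoted for the API


def break_even_images(hardware_cost: float, api_price: float = API_PRICE) -> int:
    """Number of API images whose cost equals the hardware price."""
    return round(hardware_cost / api_price)


def break_even_days(hardware_cost: float, edits_per_day: int,
                    api_price: float = API_PRICE) -> int:
    """Days of editing needed to reach break-even, rounded up."""
    return math.ceil(break_even_images(hardware_cost, api_price) / edits_per_day)


def electricity_per_hour(watts: float, cents_per_kwh: float) -> float:
    """Running cost in USD per hour at a given power draw and tariff."""
    return watts / 1000 * cents_per_kwh / 100


for card, price in [("used RTX 3090", 1050), ("used RTX 4090", 2300)]:
    print(card, break_even_images(price), "images,",
          break_even_days(price, 100), "days at 100 edits/day")
print(f"electricity: ${electricity_per_hour(350, 17.65):.3f}/hour")
```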

If your editing is sporadic (10–20 images a week), just use the API. The hardware investment doesn’t pencil out at low volume, and Kontext Pro, the closed model behind the API, may have marginally higher output quality than the open-weight dev version. For workflows that also involve cloud GPU rental for training, RunPod offers GPU instances at $1.49–$2.49/hour where you can test the full Kontext dev pipeline before committing to hardware.

What Flux Kontext Does That Standard Inpaint Doesn’t

Standard inpainting in Stable Diffusion or FLUX.1 dev requires you to mask the region you want changed. The model fills the masked area from scratch, which means it frequently conflicts with surrounding lighting, loses fine detail at mask boundaries, or generates elements that don’t match the rest of the scene.

Kontext works differently. It conditions on the entire source image — including the regions you’re not changing — and generates a globally consistent edit rather than a masked patch. In practice this means:

Face identity is preserved across edits without LoRA or IP-Adapter overhead
Background edits don’t bleed into subject regions
Style reference transfer (make this photo look like a watercolor) applies globally rather than creating obvious seam lines
Text within images can be changed without the surrounding content degrading

For home lab use cases — product photography editing, portrait retouching, iterative concept art — this architecture closes the gap between “run locally” and “send to a professional API.” If you’ve seen Kontext Pro output through the fal.ai playground, the [dev] model running locally on FP8 is close; the gap between dev and Pro is meaningful on complex prompts but not decisive for most editing tasks.

Check the comparison of image generation costs if you’re also evaluating Kontext against FLUX.1 dev or SDXL for pure generation rather than editing workflows.

Common Issues and Fixes

OOM on 12GB card with FP8: The T5 encoder is the culprit. Switch to t5xxl_fp8_e4m3fn_scaled.safetensors instead of the FP16 version. If still OOM, reduce output resolution to 768×1024 instead of 1024×1024.

Nodes turn red after loading official workflow: Your ComfyUI build is below v0.3.42. Update via git pull && pip install -r requirements.txt in the ComfyUI directory. The Kontext conditioning nodes were added in this release.

Output ignores the edit instruction: Guidance scale too low. Try 4.5–5.5. Also check that the FluxKontextImageConditioning node is correctly connected between your image loader and the KSampler positive conditioning input.

GGUF model loads but produces noise: Verify you’re using the GGUF Unet loader from ComfyUI-GGUF, not the standard Load Diffusion Model node. GGUF files require the custom loader.

Slow on RTX 40-series: You may be running BF16 or a PyTorch path instead of TensorRT-optimized FP8. Confirm you loaded the _fp8_scaled checkpoint and check whether the ComfyUI-TensorRT extension is installed and active.
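For the GGUF loading issue above, the rule is that the file extension decides both the folder and the loader node. A small sketch of that rule, using the folder and loader names from this guide, follows. The function name and the example filename are hypothetical.

```python
from pathlib import PurePosixPath


def placement_for(filename: str) -> tuple[str, str]:
    """Return (target folder, loader node) for a Kontext diffusion model file."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix == ".gguf":
        return ("ComfyUI/models/unet/", "GGUF Unet loader (ComfyUI-GGUF)")
    if suffix == ".safetensors":
        return ("ComfyUI/models/diffusion_models/", "Load Diffusion Model")
    raise ValueError(f"unrecognized model file type: {filename}")


print(placement_for("flux1-dev-kontext_fp8_scaled.safetensors"))
print(placement_for("example-kontext-Q4.gguf"))  # hypothetical filename
```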

FAQ

Is the FLUX.1 Kontext dev model free for commercial work? The model weights are under the FLUX.1 Non-Commercial License — you cannot use the model itself in commercial products or services without a separate commercial license from Black Forest Labs. However, output images can generally be used commercially under the terms of the license. Read the full license before any commercial deployment. The API versions (Pro, Max) have separate commercial terms via BFL’s licensing portal.

Does Kontext dev work with LoRAs trained on FLUX.1 dev? Most FLUX.1 dev LoRAs are compatible. The Kontext architecture shares the same base diffusion transformer. LoRA strength may need tuning — start at 0.7–0.9 rather than 1.0.

Can I use Kontext for video frame editing? Not natively. Kontext is a single-image editor. For video, you’d process frame-by-frame, which is slow and accumulates temporal inconsistency. Wan 2.1/2.2 is a better fit for AI video generation locally. Kontext is useful for editing keyframes or style references before passing to a video model.

What’s the difference between Kontext dev, Pro, and Max? Dev is open-weight (non-commercial). Pro and Max are closed-API only. Pro targets fast iterative editing with character/identity consistency; Max adds higher-quality rendering, better typography, and reportedly 8× faster output than previous image-editing models. Because the Pro and Max weights are not distributed, you cannot self-host them: if you need that quality, you pay per image through the API, and a rented GPU pod (RunPod, for example) only helps for running the dev model.

Does ComfyUI support multi-image editing (consistency across scenes)? Yes, through chained conditioning. Load multiple reference images and connect them to the conditioning chain. The degree of consistency depends on how similar the subjects are and how much structural change the prompt requires.

My RTX 5060 Ti 8GB only has 8GB — can I run Kontext? The 8GB RTX 5060 Ti can run Kontext NF4/Q4 at 7GB VRAM. With the Blackwell FP4 TensorRT path, it also loads the 7GB FP4 checkpoint natively. The RTX 5060 Ti 16GB version has enough headroom for FP8. See our RTX 5060 Ti 8GB vs 16GB comparison for the full tradeoff analysis.

Sources

Prices as of June 2026. GPU used-market prices fluctuate weekly; verify before purchasing.
