
ComfyUI Local-First
Part 1/2 of the n8n AI Studio Journey Series
The Vision: Why Local-First AI Matters
Let me be honest upfront: this isn't a "weekend project" story. This is months of part-time learning, breaking things, starting over, and slowly building something that actually works. If you've ever tried to self-host GPU-accelerated AI services, you know exactly what I mean.
I'm building what I call an "n8n AI Studio"—a local-first Docker environment that orchestrates multiple AI services to create content pipelines. The goal? Go from a text prompt to a fully produced talking avatar video, entirely on my own hardware, with no API dependencies on external services.
Why local-first?
Privacy: My content, my data, my control
Cost: No per-request charges eating into budgets
Flexibility: Experiment without worrying about rate limits
Learning: Understanding these systems from the ground up
The stack includes ComfyUI for image generation, Chatterbox TTS for voice synthesis, Remotion for video composition, and n8n for workflow automation—all running in Docker containers on a single RTX 3090 24GB GPU.
But before any of that sophisticated orchestration can happen, I need to get ComfyUI working. And that's what this post is about.
The Challenge: Why This Isn't Plug-and-Play
If you've looked at ComfyUI's GitHub and thought "oh, just pull the image and run it," let me share what I've learned: getting the container to start is about 10% of the journey.
Here's what makes this genuinely challenging:
1. GPU Passthrough Complexity
Docker + NVIDIA GPU passthrough + proper CUDA memory management isn't as simple as --gpus all (there's a quick smoke test after this list). You're managing:
NVIDIA Container Toolkit configuration
CUDA memory allocation strategies (my RTX 3090 shares resources across multiple services)
Proper device capabilities (compute,utility,video,graphics)
Security exceptions (GPU access requires no-new-privileges:false)
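Before layering ComfyUI on top, it's worth confirming the toolkit actually hands the card to a container at all. Here's a minimal smoke test, assuming a working NVIDIA driver and Container Toolkit install; the CUDA image tag is just an example:

    # Verify the NVIDIA Container Toolkit exposes the GPU inside a container.
    # The CUDA image tag is illustrative; any recent nvidia/cuda base image works.
    docker run --rm --gpus all \
      -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,video,graphics \
      nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If nvidia-smi prints the RTX 3090 from inside the container, the passthrough layer is healthy and later failures are configuration problems, not driver problems.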
2. Model Management Nightmare
ComfyUI is nothing without models. And "downloading models" sounds simple until you realize (sample download commands follow this list):
Foundational models alone are 150GB+
Many models are gated (requiring Hugging Face authentication and ToS acceptance)
Download speeds vary wildly (even with HF Transfer acceleration)
Organizing models across drives (I'm using a multi-drive setup: code on SSD, models on HDD)
Verifying integrity without built-in checksums
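To make that concrete, here's roughly what one of those gated downloads looks like on my machine; treat it as a sketch, not a canonical recipe. FLUX.1-dev is the real gated repo (you accept its terms on the Hugging Face model page first), while the destination path is just my external-drive layout:

    # One-time setup: hf_transfer extra plus CLI authentication for gated repos
    pip install -U "huggingface_hub[hf_transfer]"
    huggingface-cli login

    # Pull the main checkpoint with hf_transfer acceleration enabled;
    # adjust --local-dir to wherever your models live.
    export HF_HUB_ENABLE_HF_TRANSFER=1
    huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors \
      --local-dir /media/inky/abc1/models/checkpoints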
3. Multi-Service Architecture
ComfyUI doesn't run in isolation. In my setup, it needs to (a quick connectivity check follows this list):
Communicate with n8n for workflow triggers (internal Docker DNS)
Share output directories with downstream services (bind mounts)
Respect shared GPU memory limits (other services need VRAM too)
Proxy through Nginx for local network access
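Once the services share a Docker network, a quick sanity check is to hit ComfyUI's status route by its DNS name from a throwaway container. A minimal sketch, assuming everything sits on one compose network (the network name below is a placeholder):

    # Reachability check over the shared Docker network.
    # "ai-studio_default" stands in for whatever network your compose project creates.
    docker run --rm --network ai-studio_default curlimages/curl -s \
      http://comfyui-main:8188/system_stats

If that returns JSON, n8n can reach ComfyUI the same way; if it times out, the problem is networking, not the workflow.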
4. Documentation Gaps
The ComfyUI community is amazing, but documentation is scattered:
Official docs focus on desktop installation
Docker examples are outdated or incomplete
Custom node installation is trial-and-error
Multi-model-path configuration is barely documented
I'm not complaining—this is open source at its finest. But it means you're reading logs, debugging YAML syntax, and learning by breaking things.
My Approach: Infrastructure First, Capabilities Second
After several false starts, I settled on a philosophy: build the foundation rock-solid before adding complexity.
Phase 1: Infrastructure (Where I Am Now)
Get ComfyUI running with foundational models:
✅ Docker environment configured (v27.5.1)
✅ NVIDIA Container Toolkit working
✅ Multi-drive model storage setup
✅ Web UI accessible
🔄 Model downloads in progress (as of 11 October 2025); see the model list I've decided on and am busy gathering
⏳ First inference test pending
Phase 2: Text-to-Image Foundation
Master the basics before adding video:
FLUX.1-dev for high-quality T2I generation
LoRA integration for style control
Upscaling workflows (SUPIR + RealESRGAN)
Batch generation pipelines
Phase 3: Image-to-Video Expansion
Add temporal dimension:
WAN 2.2 video generation models
Motion consistency workflows
Frame interpolation techniques
Phase 4: Talking Avatar Integration
The final piece:
InfiniteTalk lip-sync generation
Audio-driven animation (Chatterbox TTS integration)
Full pipeline: text → speech → avatar video
Phase 5: Production Hardening
Make it reliable:
Error recovery mechanisms
Queue management optimization
Monitoring and alerting
Documentation for future me
Why this sequence? Each phase builds on the previous one. If text-to-image doesn't work reliably, adding video generation just multiplies the failure points.
Where We Are Now: The Model Download Marathon
Here's the current state of affairs:
✅ What's Working
Docker Container: ComfyUI boots successfully
GPU Detection: RTX 3090 visible and CUDA operational
Web UI: Accessible at http://192.168.1.13:8188 (direct) and http://192.168.1.13:8080 (via Nginx)
Network Integration: Internal Docker DNS resolves comfyui-main from other containers
Volume Mounts: Both internal storage and external model drive mounted correctly
🔄 In Progress: The Model Download Reality Check
This is where theory met reality.
I knew the models were large. I expected to download 150GB+. What I didn't fully appreciate was:
The Scale of Individual Models:
FLUX.1-dev: 24GB (overnight download)
WAN 2.2 T2I: 20GB (another overnight)
InfiniteTalk 14B: 28GB (you get the idea)
Plus dozens of smaller models (encoders, VAEs, LoRAs, ControlNets)
The Waiting Game: When you kick off a 24GB download at 2AM hoping it'll be done by morning, you need to know:
Is it still downloading or did it stall?
How much progress has been made?
What's the transfer rate?
How many files have actually arrived?
I built a simple monitoring script (monitor_download.sh) that checks directory size, file count, and calculates progress. It's not sophisticated, but it's honest feedback: "Yes, it's still working. No, it's not frozen. Yes, you should go to bed."
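The script isn't anything fancy; it's roughly this shape (a sketch with illustrative defaults, since the real thing only needs du, find, and a timestamp):

    #!/usr/bin/env bash
    # monitor_download.sh (sketch): directory size, file count, rough percent complete.
    # The default path and expected total reflect my setup; pass your own as arguments.
    TARGET_DIR="${1:-/media/inky/abc1/models}"
    EXPECTED_GB="${2:-150}"

    CURRENT_BYTES=$(du -sb "$TARGET_DIR" | cut -f1)
    CURRENT_GB=$(( CURRENT_BYTES / 1024 / 1024 / 1024 ))
    FILE_COUNT=$(find "$TARGET_DIR" -type f | wc -l)
    PERCENT=$(( CURRENT_GB * 100 / EXPECTED_GB ))

    echo "$(date '+%Y-%m-%d %H:%M:%S')  ${CURRENT_GB} GB of ~${EXPECTED_GB} GB (${PERCENT}%), ${FILE_COUNT} files"

Run it under watch -n 300 and it answers the 2AM question without babysitting a terminal.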
The Storage Dance: My setup uses:
Primary SSD: Docker volumes, code, configs
External HDD: Model storage (145GB+ and counting)
ComfyUI needs to know about both. Enter extra_model_paths.yaml—a configuration file that maps multiple storage locations so ComfyUI can discover models across drives.
    main_models:
        base_path: /comfy/mnt/ComfyUI/models/
    external_drive:
        base_path: /media/inky/abc1/models/
        checkpoints: checkpoints/
        loras: loras/
        [... more paths ...]
Getting this right was crucial. Get it wrong and ComfyUI can't find your painstakingly downloaded models.
⏳ What's Next
Once the foundational models finish downloading:
First Inference Test: Can I generate a simple image from a text prompt?
Custom Node Installation: Add WanVideoWrapper, InfiniteTalk, SUPIR
Workflow Testing: Build and export basic T2I workflows
n8n Integration: Connect ComfyUI to automation pipelines
Lessons So Far
1. Patience Is a Technical Skill
Large model downloads aren't a coffee break—they're overnight affairs. Plan accordingly. Start downloads before bed, not before important calls.
2. Monitoring Tools Prevent Anxiety
Building that download monitor wasn't just useful—it was necessary for my sanity. When you're downloading 24GB files, "I think it's working" isn't good enough.
3. Documentation Is Your Future Self's Best Friend
I'm writing this blog post partly for you, but mostly for me in three months when I need to rebuild this because I upgraded Docker and broke everything.
4. The Community Is Gold
Every solution I've implemented builds on someone else's work:
Docker image: mmartial/comfyui-nvidia-docker
Model organization strategies from Reddit threads
CUDA optimization tips from GitHub issues
Nginx configurations adapted from production setups
Standing on the shoulders of giants isn't just a phrase—it's how this gets done.
5. Breaking Things Is Part of Building
I've rebuilt this setup three times. Each iteration taught me something:
Iteration 1: Focused on getting it running (any way possible)
Iteration 2: Focused on making it maintainable (proper configs, secrets management)
Iteration 3 (current): Focused on making it shareable (documentation, reproducibility)
What's Next: The Roadmap Forward
Immediate (This Week)
✅ Complete foundational model downloads
⏳ Run first text-to-image inference test
⏳ Verify model discovery in the ComfyUI web UI
⏳ Export working T2I workflow as JSON
Short-Term (Next 2 Weeks)
Install and test custom nodes (WanVideo, InfiniteTalk, SUPIR)
Build image-to-video workflow
Test upscaling pipeline
Document n8n integration patterns
Next Blog Post
"First Inference: From Prompt to Pixels"
The moment of truth: does it actually work?
Troubleshooting the inevitable issues
Performance benchmarks (how fast is generation on RTX 3090?)
What works, what doesn't, what surprised me
Resources & Following Along
If you're building something similar or want to replicate this setup:
📄 Technical Documentation
Deep-dive technical specs, YAML configs, and command references:
ComfyUI Production Setup Roadmap (Google Doc)
🤖 Model Download Reference
Complete list of models, download commands, and monitoring tools:
ComfyUI Foundational Models
💬 Let's Connect
Building something similar? Hit a different issue? I'd love to hear about your setup. Drop a comment or reach out—this journey is better when it's shared.
Final Thoughts
This isn't a "look how easy it is" post. It's a "here's what I'm learning" post.
I'm documenting this journey because:
Future me will need these notes
Someone else is facing the same challenges right now
Transparency builds better solutions
The goal isn't perfection—it's progress. And right now, progress looks like:
A working ComfyUI container ✅
Models slowly accumulating on disk 🔄
A clear path forward ⏳
Next stop: first inference test. That's when we find out if all this infrastructure work actually... works.
Stay tuned for Part 2: "First Inference: From Prompt to Pixels"
This is Part 1 of a multi-part series documenting the build of a local-first AI content studio. Follow along as I share the wins, the failures, and the lessons learned along the way.
Series Navigation:
Part 1: ComfyUI Foundation (You are here)
Part 2: First Inference (Coming soon)
Part 3: Image-to-Video Pipeline (Planned)
Part 4: Talking Avatar Integration (Planned)
Part 5: Production Lessons Learned (Planned)
