NovelAI Inpainting Pipeline
Overview
Section titled “Overview”TomoriBot can edit existing images using NovelAI’s infill API, with Gemini providing automatic region detection. When a user asks the bot to change part of an image (e.g., “make her hair red”), the pipeline:
- Extracts the source image from a Discord message
- Sends it to Gemini’s segmentation API with a natural language target description
- Builds an inpainting mask from the detected bounding boxes
- Calls NovelAI’s infill endpoint with the original image + mask + prompt
The implementation spans three files:
| File | Responsibility |
|---|---|
src/utils/image/segmentationService.ts | Gemini segmentation API call + mask construction |
src/tools/functionCalls/generateImageNaiTool.ts | NAI infill payload, inpaint mode orchestration |
src/utils/image/imageExtractor.ts | Discord message image extraction utility |
Pipeline Flow
Section titled “Pipeline Flow”1. Tool receives message_id + edit_target + prompt tags └─ generateImageNaiTool.ts → enters inpaint mode
2. Source image extracted from referenced Discord message └─ imageExtractor.ts → base64 + mime type + dimensions
3. Gemini segments the target region └─ segmentationService.ts → callGeminiSegmentation() └─ Returns: bounding boxes (normalized 0–1000) + labels
4. Mask built from bounding boxes └─ segmentationService.ts → buildBoundingBoxMask() └─ Elliptical shape → latent grid quantization → RGBA encoding
5. Infill request sent to NovelAI └─ generateImageNaiTool.ts → generateInpaintImage() └─ action: "infill", inpainting model suffix derived at runtime
6. Result image posted to Discord channelDesign Decisions
Section titled “Design Decisions”Bounding Box Approach (Not Per-Pixel Masks)
Section titled “Bounding Box Approach (Not Per-Pixel Masks)”Gemini’s segmentation API returns both bounding boxes (box_2d) and per-pixel masks (mask). We only use the bounding boxes because per-pixel masks produce severe artifacts on complex regions:
- Hair, fur, fabric: Per-pixel masks have holes, rough edges, and incomplete coverage that translate directly into grey splotches in the inpainted output
- Blur destroys thin masks: Applying Gaussian blur to soften per-pixel mask edges turns thin white strands into grey — making the artifacts worse
- Bounding boxes are reliable: Gemini’s bounding box detection is consistent even when per-pixel segmentation fails
The trade-off is precision — a bounding box covers more area than the exact target region. We mitigate this with elliptical shapes (see below).
Elliptical Mask Shape
Section titled “Elliptical Mask Shape”Instead of filling the bounding box as a rectangle, we inscribe an ellipse within the padded bounding box:
- Diffusion models expect organic shapes: NAI’s inpainting model was trained on masks from brush strokes, lasso selections, and other organic tools — not perfect rectangles. A rectangular mask creates an unnatural latent-space discontinuity that the model reproduces as a visible edge.
- Curved boundaries blend naturally: An ellipse’s varying distance from the content center gives the model a more gradual transition to work with during denoising.
- Configurable padding (
NAI_INPAINT_PADDING, default 0.15): Each side of the bounding box is expanded by this fraction to capture content that extends beyond Gemini’s detected region (e.g., wispy hair strands). The padding also compensates for the ellipse’s curved boundary cutting into the corners.
Latent Grid Quantization (Critical)
Section titled “Latent Grid Quantization (Critical)”The mask is quantized to 1/8th resolution before being sent to NAI. This is the single most important step for avoiding halo artifacts:
Full-res mask → downsample to ceil(w/64)*8 × ceil(h/64)*8 (nearest-neighbor) → upsample back to full resolution (nearest-neighbor, V4 only)Why this matters: NAI’s diffusion model operates in latent space at 1/8th pixel resolution. When you send a full-resolution mask, the model downsamples it internally — and any mask edges that don’t align to the 8×8 latent grid produce intermediate grey values. These grey values tell the model to partially redraw, creating a visible halo ring at the mask boundary.
Pre-quantizing with nearest-neighbor interpolation ensures every mask pixel snaps to the latent grid. The resulting mask looks “blocky” at full resolution, but this is correct — it matches exactly what the model sees internally.
This approach was derived from the open-source ComfyUI_NAIDGenerator resize_to_naimask() function and confirmed by Aedial/novelai-api issue discussions.
RGBA Mask Format
Section titled “RGBA Mask Format”The mask is encoded as an RGBA PNG rather than a simple RGB black/white PNG:
| Pixel type | R | G | B | A |
|---|---|---|---|---|
| Redraw (white) | 255 | 255 | 255 | 255 |
| Preserve (black) | 0 | 0 | 0 | 0 |
The alpha channel acts as the actual mask signal. This matches the naimask_to_base64() encoding used by every major open-source NAI client (ComfyUI_NAIDGenerator, novelai-python, novelai-api).
add_original_image: true
Section titled “add_original_image: true”This parameter tells NAI to overlay the original image pixels onto the non-masked area of the output. Without it (false), the entire image can drift slightly during generation, causing the redrawn region to look inconsistent with its surroundings. All open-source NAI implementations default to true.
strength: 1.0
Section titled “strength: 1.0”Full denoising strength ensures the masked region is completely redrawn from the prompt tags. Lower values (e.g., 0.7) preserve some of the original structure but cause the original colors to bleed through — white hair at 0.7 strength + “red hair” prompt = grey output.
Configurable via NAI_INPAINT_STRENGTH env var.
Gemini Segmentation Configuration
Section titled “Gemini Segmentation Configuration”The Gemini API call uses specific settings derived from Google’s spatial understanding reference:
| Setting | Value | Why |
|---|---|---|
| Text before image | Parts order: text first | Gemini processes instructions better when text leads |
| Temperature | 0.5 | Prevents the model from looping on repeated tokens |
| Safety settings | OFF (all categories) | Anime/artistic images trigger false positives, causing silent hangs |
| Thinking | Disabled (thinkingBudget: 0) | Adds latency without quality benefit for structured extraction |
| Model | gemini-2.5-flash (configurable) | Fast enough for real-time use; configurable via NAI_SEGMENTATION_MODEL |
Configuration Reference
Section titled “Configuration Reference”| Env Var | Default | Description |
|---|---|---|
NAI_INPAINT_DEBUG | false | DMs the invoking user the mask + bounding box overlay for debugging |
NAI_INPAINT_STRENGTH | 1.0 | Denoising strength for the masked region (0.0–1.0) |
NAI_INPAINT_PADDING | 0.15 | Padding added to each side of the bounding box as a fraction of box dimension |
NAI_SEGMENTATION_MODEL | gemini-2.5-flash | Gemini model used for segmentation |
NAI_SEGMENTATION_TIMEOUT_MS | 90000 | Timeout for the Gemini segmentation API call |
Debugging
Section titled “Debugging”When NAI_INPAINT_DEBUG=true, the bot DMs the invoking user two images:
- Bounding box overlay: Original image with dashed rectangles (raw Gemini bbox) and semi-transparent ellipses (actual padded mask shape) drawn on top
- Binary mask: The actual mask being sent to NAI — white regions are redrawn, black regions are preserved
Common issues:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Halo/ring at mask boundary | Mask not quantized to latent grid, or grey pixels in mask | Ensure buildBoundingBoxMask quantization is working; check mask has only pure white/black |
| Grey splotches in output | Using per-pixel masks instead of bounding boxes | Verify box_2d coordinates are used, not mask data |
| Original colors bleeding through | Strength too low | Increase NAI_INPAINT_STRENGTH to 1.0 |
| Target region not fully covered | Bounding box too tight | Increase NAI_INPAINT_PADDING (e.g., 0.25) |
| Seam at mask edge with shifted surroundings | add_original_image: false | Set to true |
| Gemini returns empty/hangs | Safety filter triggered | Verify safety settings are OFF |
| 512x512 squished output | Missing width/height in infill payload | Ensure source image dimensions are passed |
Lessons Learned
Section titled “Lessons Learned”These are mistakes made during development that should not be repeated:
- Per-pixel masks from Gemini are unreliable for complex regions (hair, fur, fabric) — holes, rough edges, incomplete coverage cause artifacts
- Blur on masks causes halos, not fixes them — grey pixels from blur create partial-redraw zones
- Mask resolution must match the latent grid — full-resolution masks with arbitrary edges get quantized internally by the model, creating intermediate values at boundaries
- RGBA format matters — RGB-only masks may not be interpreted correctly by the API
add_original_image: falsecauses global image drift — the surrounding area shifts, making the redrawn region look inconsistent- Strength < 1.0 causes color bleed-through — original content leaks into the redrawn area
Future Work
Section titled “Future Work”Better Segmentation Approaches
Section titled “Better Segmentation Approaches”The current bounding-box-to-ellipse approach trades precision for stability. An ellipse inscribed in a bounding box includes surrounding pixels that aren’t part of the target (e.g., background behind hair), which the model must regenerate. Potential improvements:
- Alternative segmentation APIs: Services like SAM 2 (Segment Anything Model) or cloud-based instance segmentation APIs may produce cleaner per-pixel masks than Gemini’s built-in segmentation, especially for complex regions like hair
- Refined Gemini prompting: Experimenting with more specific segmentation prompts or multi-pass segmentation (coarse bbox pass → refined mask pass) could improve Gemini’s per-pixel output quality
- Hybrid approach: Use Gemini’s bounding box for region detection, then run a specialized segmentation model (e.g., SAM 2) on the cropped region for pixel-precise masking — combining Gemini’s natural language understanding with a dedicated segmentation model’s precision
- Mask post-processing: If per-pixel masks improve, applying morphological operations (dilation, erosion, closing) could fill holes and smooth edges before latent grid quantization
- User-guided refinement: Allow users to adjust the mask region via Discord interactions (e.g., “expand the mask”, “include more of the left side”) for cases where automatic detection falls short
File References
Section titled “File References”| File | Purpose |
|---|---|
src/utils/image/segmentationService.ts | Gemini API call, bounding box mask construction, debug overlay |
src/tools/functionCalls/generateImageNaiTool.ts | NAI infill payload, inpaint mode orchestration, model detection |
src/utils/image/imageExtractor.ts | Discord message image extraction |
.env.optional.example | Optional inpainting and related runtime env vars documented |