Remote Embedding: Load Your Embedding Once For GPU Optimization
Remote Embedding runs your embedding model on one server so it is loaded into GPU memory once and shared across applications. Install it from PyPI with the command below, and check out the GitHub repository at Remote-Embedding:
pip install remote-embedding
A new version of remote-embedding is out now:
- remote-embedding v0.3.0: Better Long-Running GPU Stability
remote-embedding packages two simple pieces together:
- a FastAPI embedding server
- a LangChain-compatible remote client
The point is to let multiple applications share one loaded embedding model instead of each process loading its own copy into VRAM.
Version v0.3.0 is focused on one problem: long-running GPU behavior.
What was happening
In real usage, embedding servers do not just need to work for one request. They need to survive hours or days of indexing, retrieval, retries, and mixed client traffic without GPU memory drifting upward.
In earlier versions, this package could gradually hold onto more GPU memory than intended in a few situations:
- concurrent requests could race while loading models
- multiple model configurations could stay resident indefinitely
- large embedding requests could push memory higher than necessary
- operators did not have enough visibility into what the server was holding
That does not always mean a true leak. Sometimes it is allocator high-water behavior. Sometimes it is multiple cached model instances. Either way, it becomes an operational problem.
What changed in v0.3.0
1. Serialized model loading
Model loading is now protected by the same async lock used for GPU inference.
That matters because two requests arriving at the same time should not both decide they need to load the same model. In practice, that race can cause duplicate initialization and unnecessary VRAM growth.
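The pattern can be sketched with a plain asyncio.Lock. This is a minimal illustration of serialized loading, not the package's actual code; the ModelHolder class and its names are hypothetical.

```python
import asyncio

class ModelHolder:
    """Serialize model loading behind the same lock used for inference."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._models = {}
        self.load_count = 0  # counts expensive loads, for demonstration

    async def get_model(self, name):
        # The lock ensures two concurrent requests cannot both decide
        # the model is missing and initialize it twice.
        async with self._lock:
            if name not in self._models:
                self.load_count += 1  # stands in for a real VRAM allocation
                self._models[name] = f"loaded:{name}"
            return self._models[name]

async def main():
    holder = ModelHolder()
    # Ten concurrent requests for the same model trigger exactly one load.
    await asyncio.gather(*(holder.get_model("demo") for _ in range(10)))
    print(holder.load_count)  # -> 1

asyncio.run(main())
```

Without the lock, all ten coroutines could pass the "is it loaded?" check before any of them finished loading, each paying the full initialization cost.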
2. Bounded model cache with eviction
The server now uses an LRU-style cache for loaded embedding models instead of keeping every unique configuration forever.
New settings:
- MAX_LOADED_MODELS (environment variable)
- --max-loaded-models (CLI flag)
The default is 1, which matches the main design goal of the project: share one loaded model across many applications.
When older models are evicted, the server attempts to release them cleanly and clears CUDA cache where appropriate.
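An LRU cache with a hard bound can be sketched in a few lines with collections.OrderedDict. The class and method names below are illustrative, not the package's internal API; the CUDA release step is shown only as a comment.

```python
from collections import OrderedDict

class BoundedModelCache:
    """LRU-style cache: evicts the least recently used model when full."""

    def __init__(self, max_loaded_models=1):
        self.max_loaded_models = max_loaded_models
        self._cache = OrderedDict()
        self.evicted = []  # record evictions, for demonstration

    def get_or_load(self, config_key, loader):
        if config_key in self._cache:
            self._cache.move_to_end(config_key)  # mark as most recently used
            return self._cache[config_key]
        model = loader()
        self._cache[config_key] = model
        if len(self._cache) > self.max_loaded_models:
            old_key, _old_model = self._cache.popitem(last=False)
            self.evicted.append(old_key)
            # In a real server, this is where the old model would be
            # released and torch.cuda.empty_cache() called if appropriate.
        return model

cache = BoundedModelCache(max_loaded_models=1)
cache.get_or_load("model-a", lambda: "A")
cache.get_or_load("model-b", lambda: "B")  # evicts model-a
print(cache.evicted)  # -> ['model-a']
```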
3. Request-size guardrails
The server now limits how many input strings can be embedded in one request.
New settings:
- MAX_INPUTS_PER_REQUEST (environment variable)
- --max-inputs-per-request (CLI flag)
The default is 128.
This is not about being restrictive for its own sake. It is there to prevent a single oversized request from destabilizing a long-running server.
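The guardrail amounts to a simple check before any GPU work happens. In the FastAPI server this would surface as an HTTP 4xx response; the sketch below uses a plain ValueError and an illustrative function name.

```python
MAX_INPUTS_PER_REQUEST = 128  # matches the documented default

def validate_request(inputs, limit=MAX_INPUTS_PER_REQUEST):
    """Reject oversized embedding requests before they reach the GPU."""
    if len(inputs) > limit:
        raise ValueError(
            f"Request has {len(inputs)} inputs; limit is {limit}. "
            "Split the request into smaller chunks."
        )
    return inputs
```

Clients that need to embed large corpora should chunk their inputs to stay under the limit rather than raising it, since the limit exists to protect the shared server.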
4. Stable default embedding batch size
The server now applies a default encoder batch size:
- EMBEDDING_BATCH_SIZE (environment variable)
- --embedding-batch-size (CLI flag)
The default is 32.
This keeps embedding behavior more predictable and reduces the chance that throughput tuning quietly turns into VRAM pressure.
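A fixed batch size means peak VRAM during encoding is bounded by the batch, not by however many inputs arrived in one request. A minimal chunking sketch (the function name is illustrative):

```python
EMBEDDING_BATCH_SIZE = 32  # matches the documented default

def iter_batches(texts, batch_size=EMBEDDING_BATCH_SIZE):
    """Yield fixed-size slices so each encoder call sees a bounded batch."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# 100 inputs become batches of 32, 32, 32, 4 — each encoder call
# allocates activation memory for at most 32 inputs.
sizes = [len(b) for b in iter_batches([f"doc {i}" for i in range(100)])]
print(sizes)  # -> [32, 32, 32, 4]
```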
5. Better observability
The /health endpoint now reports cache and request-limit state, including:
- loaded model count
- maximum loaded models
- maximum inputs per request
- embedding batch size
That makes it much easier to answer operational questions such as:
- Is the server holding more than one model?
- Are requests hitting configured limits?
- Did someone start the server with different runtime settings than expected?
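A response shaped along these lines would answer all three questions at a glance. The field names below are illustrative, not the package's exact /health schema:

```python
def build_health_payload(cache_state):
    """Assemble the cache/limit fields a /health endpoint might report
    (field names are illustrative, not the package's exact schema)."""
    return {
        "status": "ok",
        "loaded_models": cache_state["loaded"],
        "max_loaded_models": cache_state["max_loaded"],
        "max_inputs_per_request": cache_state["max_inputs"],
        "embedding_batch_size": cache_state["batch_size"],
    }

payload = build_health_payload(
    {"loaded": 1, "max_loaded": 1, "max_inputs": 128, "batch_size": 32}
)
# Operational check: the server should never hold more models than its cap.
assert payload["loaded_models"] <= payload["max_loaded_models"]
```

Comparing the reported limits against your deployment configuration is a quick way to catch a server started with unexpected runtime settings.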
Why this release matters
The original goal of remote-embedding was always practical: reduce duplicated VRAM usage by centralizing embedding inference.
v0.3.0 moves that idea closer to production reality.
It is one thing to share a model across apps. It is another thing to keep that shared service stable under real traffic. This release improves the second part.
If you are using remote-embedding for:
- RAG pipelines
- background indexing jobs
- multiple local tools sharing one embedding server
- constrained GPU setups
then v0.3.0 should be a safer default than earlier versions.
Upgrade notes
The package version is now:
version = "0.3.0"
Typical server usage:
remote-embedding-server \
--model-name Qwen/Qwen3-Embedding-0.6B \
--device cuda \
--max-loaded-models 1 \
--max-inputs-per-request 128 \
--embedding-batch-size 32
Or through environment variables:
export EMBEDDING_MODEL_NAME=Qwen/Qwen3-Embedding-0.6B
export DEVICE=cuda
export MAX_LOADED_MODELS=1
export MAX_INPUTS_PER_REQUEST=128
export EMBEDDING_BATCH_SIZE=32
Final note
GPU memory behavior is one of those issues that looks simple until a service runs long enough.
v0.3.0 does not pretend CUDA memory is trivial. It just narrows the failure modes, adds sane defaults, and makes the server easier to reason about when it is under load.
