
Polymodels — Product Overview

Version: 1.3 Date: 2026-03-26

What is Polymodels?

Polymodels is a multi-user platform for creating and deploying personal AI experts — specialized language model capabilities trained on your own data and accessible through a shared, GPU- or NPU-backed inference service.

Instead of fine-tuning an entire language model (expensive, slow, and exclusive), Polymodels lets multiple users each train small expert modules that extend a shared base model. Each expert captures specialized knowledge for a specific task, and users can combine, compare, and route between experts at inference time — all without touching the underlying model or interfering with each other.

The Core Idea

A large language model has many layers. Polymodels adds small, trainable expert blocks to selected layers, creating slots that different users can each occupy with their own trained weights.

When you run inference, you choose which expert to activate at each layer for each prompt. The result is a model that can draw on specialized knowledge without the cost of retraining or the complexity of running multiple separate models.

Base Model (shared, read-only)
  └─ Layer  8  [slot 0: base                    | slot 1: Alice's HVAC expert | slot 2: Bob's legal expert]
  └─ Layer 11  [slot 0: base                    | slot 1: Alice's HVAC expert | slot 2: base              ]
  └─ Layer 15  [slot 0: Claire's physics expert | slot 1: base                | slot 2: Bob's legal expert]
  └─ Layer 18  [slot 0: base                    | slot 1: Alice's HVAC expert | slot 2: Bob's legal expert]

Each user trains their expert in isolation. At inference time, you route a prompt through your expert's slots. Another user routing through their slots gets a completely different result from the same model, on the same hardware, at the same time. Note in the example above that an expert does not have to occupy a slot on every expanded layer. However, a user can reserve only one slot on any given layer at a time.
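As a rough illustration, the diagram above can be written down as a small lookup table. This is a minimal sketch only: the names LAYOUT and route_for, and the use of None to mean "fall through to the base model", are assumptions made for this example and are not part of the Polymodels API.

```python
# Hypothetical in-memory picture of the diagram: layer -> slot -> owner.
# None marks an unoccupied slot, which behaves like the base model.
LAYOUT = {
    8:  {0: None, 1: "alice", 2: "bob"},
    11: {0: None, 1: "alice", 2: None},
    15: {0: "claire", 1: None, 2: "bob"},
    18: {0: None, 1: "alice", 2: "bob"},
}

def route_for(user, layout=LAYOUT):
    """Return {layer: slot} for a user; None means 'use the base model' there."""
    route = {}
    for layer, slots in layout.items():
        owned = [slot for slot, owner in slots.items() if owner == user]
        # A user may hold at most one slot on any given layer.
        if len(owned) > 1:
            raise ValueError(f"{user} holds more than one slot on layer {layer}")
        route[layer] = owned[0] if owned else None
    return route

print(route_for("alice"))   # {8: 1, 11: 1, 15: None, 18: 1}
print(route_for("claire"))  # {8: None, 11: None, 15: 0, 18: None}
```

Alice's expert skips layer 15 and Claire's expert uses only layer 15, which matches the point that an expert need not cover every expanded layer.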

Key Features

Personal Experts on Shared Hardware

Each user gets their own expert slot in the model's expanded layers. Training, loading, and inference are fully isolated: your expert doesn't interfere with anyone else's.

Flexible Expert Routing

At inference time, the inference controller determines which expert handles each prompt at each layer. It can:

Route each prompt from a single user to the expert that corresponds to that prompt.
Run prompts from different users in one batch, each routed to a different expert, so that responses for multiple independent users are generated concurrently.

Automatic Multi-User Batching

The inference API includes a batch queue that collects requests from different users and executes them together as a single GPU forward pass. Per-layer routing decisions are derived automatically from each user's slot reservations, so callers do not need to construct them manually. Both a blocking (/enqueue) and a streaming (/enqueue/stream) variant are available. The streaming variant returns tokens as Server-Sent Events (SSE) with token, done, error, and keepalive event types. This maximises GPU utilisation when many users are active concurrently without any extra coordination effort from the caller.
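A client of the streaming variant has to separate the four event types. The sketch below parses standard Server-Sent Events framing (event: and data: fields, with a blank line ending each event) from any iterable of text lines. The payload shape, a JSON object with a "text" field, is an assumption made for this example; check the API reference for the actual schema.

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                       # a blank line ends one event
            if data or event != "message":
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue                         # SSE comment line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)

def collect_text(lines):
    """Join token payloads until 'done'; raise on 'error'; skip 'keepalive'."""
    parts = []
    for event, data in parse_sse(lines):
        if event == "token":
            parts.append(json.loads(data)["text"])  # assumed payload shape
        elif event == "error":
            raise RuntimeError(data)
        elif event == "done":
            break
    return "".join(parts)

sample = [
    "event: token\n", 'data: {"text": "Hello"}\n', "\n",
    "event: keepalive\n", "data: {}\n", "\n",
    "event: token\n", 'data: {"text": " world"}\n', "\n",
    "event: done\n", "data: {}\n", "\n",
]
print(collect_text(sample))  # Hello world
```

In practice the lines would come from a streaming HTTP response rather than a list; the parsing logic stays the same.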

Expert Upload

Users can upload expert packages containing weights trained outside the platform and load them for use. This allows experts trained on other Polymodels deployments or through offline training pipelines to be developed and managed outside the Polymodels platform and then brought in.

Dataset Generation Built In

Polymodels includes a dataset studio for generating training data. Describe your task, choose a prompt format, and generate hundreds of training examples. You can then analyze the datasets and run experiments and evals, all without writing a single line of code.

Multiple Prompt Formats

Experts can be trained and served in several prompt formats: Classic, Alpaca, Microsoft Phi, QC multi-turn, Function (Gemma), and Custom JSON. Each format targets a different interaction style or downstream use case.

Session State Persistence

Expert slot reservations and loaded weights are saved to the cloud. After a server restart or model reload, users can restore their exact setup in one click — re-reserving slots and reloading all expert weights automatically. The expert loader correctly handles the case where a new expert's slots overlap with a previously loaded expert — it evicts any existing entry that would conflict before writing the new weights, so the last load always wins.
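The "last load always wins" rule can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's loader: the function name load_expert, the dictionary of loaded experts, and the (layer, slot) pair representation are all hypothetical.

```python
def load_expert(loaded, name, slots):
    """Record expert `name` as occupying `slots`, a set of (layer, slot) pairs.

    Any previously loaded expert that shares at least one slot is evicted
    first. Returns the list of evicted expert names.
    """
    slots = set(slots)
    evicted = [other for other, held in loaded.items()
               if other != name and held & slots]
    for other in evicted:
        del loaded[other]
    loaded[name] = slots
    return evicted

loaded = {}
load_expert(loaded, "hvac-v1", {(8, 1), (11, 1)})
print(load_expert(loaded, "hvac-v2", {(11, 1), (18, 1)}))  # ['hvac-v1']
print(sorted(loaded))                                      # ['hvac-v2']
```

Here hvac-v2 overlaps hvac-v1 on layer 11, slot 1, so the whole hvac-v1 entry is removed before the new weights are recorded.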

Server Selection and Identity

The frontend displays all registered backend servers on the post-login landing page. Users pick which server to connect to and can check each server's status and currently loaded model on demand. Admins can configure a default model config per server — users can load it in one click when no model is running.

Admin Controls

Administrators manage the shared model layer: which model is loaded, which layers are expanded, and how many expert slots exist per layer. Admins can snapshot the current model configuration as a named config and apply it later to restore the exact same state. A toggle switch in the header lets admins move between the admin panel and the user-facing app without logging out.

Role-Based Access

Three roles — admin, user, viewer — control access to model management, training, inference, and user administration.

Use Cases

Domain specialist agents. Train experts on domain-specific corpora (legal documents, technical manuals, product catalogs, MCP tools) and run inference on whichever expert is needed on a prompt-by-prompt and/or user-by-user basis.

Comparative evaluation. Run the same prompt through multiple experts in a single batch to compare how different training approaches or datasets affect output quality.

Iterative fine-tuning. Train a quick expert on a small dataset, evaluate, refine the dataset, and retrain, all without restarting the model or affecting other users.

Function calling. Use the FunctionGemma prompt format to train experts that reliably call specific tools or APIs, while keeping the base model available for open-ended generation.

Research and experimentation. Study how targeted layer-level interventions affect model behavior, share a single GPU across a team, and compare results across training configurations.

How It Works

1. Load and Expand (Admin)

An administrator loads a model and expands selected transformer layers with trainable expert blocks, specifying the number of expert slots per layer. This is a one-time setup that persists until the model is unloaded or the layers are changed. The admin optionally saves this configuration as a named model config so it can be reapplied after a restart.
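A named model config needs to capture at least the model, the expanded layers, and the slot count per layer. The sketch below shows one way such a snapshot could be serialized and restored. The class name ModelConfig, its field names, and the JSON layout are assumptions made for illustration; the real config schema may differ.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ModelConfig:
    name: str              # label the admin gives the snapshot
    model_id: str          # which base model to load (placeholder value below)
    slots_per_layer: dict  # expanded layer index -> number of expert slots

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        # JSON object keys are strings, so convert layer indices back to int.
        raw["slots_per_layer"] = {
            int(layer): count for layer, count in raw["slots_per_layer"].items()
        }
        return cls(**raw)

cfg = ModelConfig("team-default", "example-base-model", {8: 3, 11: 3, 15: 3, 18: 3})
restored = ModelConfig.from_json(cfg.to_json())
print(restored == cfg)  # True
```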

2. Reserve Slots (User)

Each user can reserve up to one slot per layer. A slot is an exclusive position in the expanded layer — no two users share a slot. Reservations are tracked globally to prevent collisions.
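The two reservation rules (a slot has exactly one owner, and a user holds at most one slot per layer) can be expressed as a small registry. This is a minimal sketch; SlotRegistry and its method names are hypothetical and only illustrate the checks described above.

```python
class SlotRegistry:
    """Tracks which user owns each (layer, slot) position."""

    def __init__(self, slots_per_layer):
        self.slots_per_layer = dict(slots_per_layer)  # layer -> slot count
        self.owner = {}                               # (layer, slot) -> user

    def reserve(self, user, layer, slot):
        if layer not in self.slots_per_layer:
            raise KeyError(f"layer {layer} is not expanded")
        if not 0 <= slot < self.slots_per_layer[layer]:
            raise IndexError(f"layer {layer} has no slot {slot}")
        if (layer, slot) in self.owner:
            raise ValueError(f"slot {slot} on layer {layer} is already reserved")
        if any(l == layer and u == user for (l, _), u in self.owner.items()):
            raise ValueError(f"{user} already holds a slot on layer {layer}")
        self.owner[(layer, slot)] = user

    def reservations(self, user):
        """Return {layer: slot} for everything the user currently holds."""
        return {l: s for (l, s), u in self.owner.items() if u == user}

registry = SlotRegistry({8: 3, 11: 3})
registry.reserve("alice", 8, 1)
registry.reserve("bob", 8, 2)
registry.reserve("alice", 11, 1)
print(registry.reservations("alice"))  # {8: 1, 11: 1}
```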

3. Generate Data and Train (User)

Users describe what they want their expert to know using the dataset generator. Generated examples are formatted in the chosen prompt type and saved as JSONL files. Users then submit a training job specifying their dataset, layers, and hyperparameters. Jobs run asynchronously in a GPU queue.
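For readers unfamiliar with JSONL, each line of the file is one standalone JSON object. The sketch below writes and reads a small dataset using the instruction, input, and output fields of the widely used Alpaca layout. Whether the Polymodels Alpaca format uses exactly these keys is an assumption, and the helper names are hypothetical.

```python
import json
import os
import tempfile

def to_alpaca(instruction, output, input_text=""):
    """Build one record in the common Alpaca layout (assumed key names)."""
    return {"instruction": instruction, "input": input_text, "output": output}

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

examples = [
    to_alpaca("What does SEER measure?", "Seasonal cooling efficiency of an air conditioner."),
    to_alpaca("Convert to kW.", "3.517 kW", input_text="1 ton of refrigeration"),
]
path = os.path.join(tempfile.mkdtemp(), "hvac.jsonl")
write_jsonl(path, examples)
print(len(read_jsonl(path)))  # 2
```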

4. Load and Route (User)

Once training is complete, users load their expert weights into their reserved slots. The load expert function maps the saved weight placeholders to the user's actual slot IDs and writes the weights directly into the shared model. If an expert with overlapping slots was loaded previously, the old entry is evicted first — the last load always wins for any given slot.
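The placeholder mapping can be pictured as a key rewrite. In this sketch the saved package stores each tensor under a name that contains a placeholder token, and loading substitutes the slot the user actually holds on that layer. The key pattern layers.<layer>.experts.SLOT.<param>, the SLOT token, and the function name remap_keys are all assumptions made for illustration; the real package format is not described here.

```python
PLACEHOLDER = "SLOT"  # hypothetical token standing in for the slot ID

def remap_keys(saved, reservations):
    """Rewrite placeholder keys to concrete slot IDs.

    saved:        {"layers.8.experts.SLOT.weight": tensor, ...}
    reservations: {layer: slot} for the user who is loading the expert
    """
    remapped = {}
    for key, value in saved.items():
        parts = key.split(".")
        layer = int(parts[1])
        if layer not in reservations:
            raise KeyError(f"no slot reserved on layer {layer}")
        parts[parts.index(PLACEHOLDER)] = str(reservations[layer])
        remapped[".".join(parts)] = value
    return remapped

saved = {
    "layers.8.experts.SLOT.weight": [0.1, 0.2],
    "layers.11.experts.SLOT.weight": [0.3, 0.4],
}
print(sorted(remap_keys(saved, {8: 1, 11: 2})))
# ['layers.11.experts.2.weight', 'layers.8.experts.1.weight']
```

Because the package never hard-codes a slot ID, the same expert can be loaded by a user who holds different slots on a different deployment.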

UI: Users set expert routing explicitly in the Inference tab, or use the Multi-Batched Inference page where routing is derived automatically.

API: Users call POST /inference/enqueue (or /enqueue/stream for streaming). The server looks up the user's slot reservations and constructs the gate index matrix automatically — no manual routing required. Multiple concurrent callers are batched together for a single GPU forward pass. After generation, token counts and timing are written to each user's usage record, scaled by the server's multiplier.
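One way to picture the gate index matrix is as a table with one row per expanded layer and one column per request in the batch, where each cell names the slot that request should use on that layer. The sketch below derives such a table from slot reservations. The sentinel value for "no reservation, use the base path" and the function name are assumptions made for this example; the server's actual encoding may differ.

```python
BASE = -1  # hypothetical sentinel: no reservation on this layer

def gate_index_matrix(layers, batch_users, reservations):
    """Build rows = layers, columns = batch entries, cells = slot index.

    reservations maps user -> {layer: slot}.
    """
    return [
        [reservations.get(user, {}).get(layer, BASE) for user in batch_users]
        for layer in layers
    ]

reservations = {
    "alice": {8: 1, 11: 1, 18: 1},
    "bob": {8: 2, 15: 2, 18: 2},
    "claire": {15: 0},
}
matrix = gate_index_matrix([8, 11, 15, 18], ["alice", "bob", "claire"], reservations)
for row in matrix:
    print(row)
# [1, 2, -1]
# [1, -1, -1]
# [-1, 2, 0]
# [1, 2, -1]
```

Each column is one caller's route, so three independent users share a single forward pass without any of them specifying routing by hand.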

5. Save and Restore (User)

Users save named session states to persist their configuration. Saves capture which slots are reserved and which experts are loaded. On restore, the system automatically re-reserves slots and reloads expert weights.
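The save and restore cycle can be sketched as a JSON round trip followed by two passes: re-reserve the slots, then reload the experts. The function names, the JSON layout, and the reserve and load callbacks are hypothetical stand-ins for whatever the platform actually calls.

```python
import json

def save_state(name, reservations, experts):
    """Serialize a named session state; reservations maps layer -> slot."""
    return json.dumps({
        "name": name,
        "reservations": {str(layer): slot for layer, slot in reservations.items()},
        "experts": list(experts),
    })

def restore_state(text, reserve, load):
    """Replay a saved state: reserve every slot first, then load every expert."""
    state = json.loads(text)
    for layer, slot in state["reservations"].items():
        reserve(int(layer), slot)
    for expert in state["experts"]:
        load(expert)
    return state["name"]

calls = []
blob = save_state("hvac-demo", {8: 1, 11: 1}, ["hvac-v2"])
restore_state(
    blob,
    reserve=lambda layer, slot: calls.append(("reserve", layer, slot)),
    load=lambda expert: calls.append(("load", expert)),
)
print(calls)
# [('reserve', 8, 1), ('reserve', 11, 1), ('load', 'hvac-v2')]
```

Reserving before loading matters in this sketch because expert weights are written into slots the user must already hold.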

What Polymodels Is Not

Not a model hosting service. Polymodels is a platform for creating and using experts on top of a model you provide. You bring the model; Polymodels manages the infrastructure around it.
Not a full fine-tuning service. Expert modules are lightweight additions to specific layers, not full model fine-tunes. They are faster to train, cheaper to store, and easy to swap — but they work best for targeted capability improvements, not wholesale model replacement.
Not a RAG system. Experts encode knowledge in trained weights, not in a retrieval index. The trade-offs are different: experts generalize better but require upfront training, while RAG is more flexible but adds retrieval latency and infrastructure.

Glossary

Term	Definition
Expert	A set of trained weights that specializes one or more expanded layers of the base model for a task
Slot	A reserved position within an expanded layer; one slot per user per layer
Expansion	The process of adding expert blocks to transformer layers
