Wei Chen has spent eleven years building production ML systems for healthcare companies that are now mostly defunct. She is on a due diligence call with the BlueMirror technical team because the fund she advises is considering a position. The question she has come to ask is not whether the architecture is interesting. It is whether the architecture is real. Specifically, whether the thirty-seven small language models the BlueMirror specification describes are the system that runs today, or the system the company wishes it had.
The answer she gets is honest. Thirty-seven models is the target portfolio, the engineering destination the system is being built toward over twenty-four to thirty-six months. At launch, the system runs a smaller set of proprietary models alongside a commercial API. The proprietary models handle the tasks where privacy and latency cannot tolerate a cloud round-trip. The API handles the tasks where reasoning quality at launch matters more than self-hosted inference. Over time, proprietary models trained on real subscriber interaction data replace the API, domain by domain, until the thirty-seven-model portfolio operates as designed.
Wei appreciates the honesty. She has seen enough AI companies describe their target architecture as their current architecture. BlueMirror describes both, distinguishes them, and explains the pipeline that connects them.
Why not one model
Five constraints force the decomposition, and they compound.
Latency is the first. The Safety Filter must respond within fifteen milliseconds because it gates every output the system produces. A large general-purpose model cannot meet that latency on edge hardware. A 120-million-parameter model optimized for safety classification can. The Safety Filter is therefore its own model, and the rest of the system is designed around its existence.
Privacy is the second. The Cognitive State Estimator must run in the person’s home because cognitive data must never leave the device. A cloud-hosted general model cannot meet this requirement regardless of contractual protections. The cognitive estimator runs on the Local Pane, with no cloud path, no API path, no replication target. The data is born on the device and stays on the device.
Incrementality is the third. When the Nutrition Advisor needs improvement because new dietary research changes recommendations, the team retrains a single small model on a focused dataset. With one general model, the same improvement requires retraining the entire system. The capacity to update one component without disturbing the rest is what allows the system to evolve continuously rather than in disruptive quarterly releases.
Cost is the fourth. The total development cost for the SLM portfolio is approximately $600,000 to $1 million over twenty-four months, executed through university research partnerships in India (BMT-06.04). A general-purpose model trained from scratch on healthcare-specific data would cost several million dollars and would still need per-task fine-tuning.
Deployability is the fifth. The thirty-seven models distribute across a three-zone compute architecture (BMT-06.03). Privacy-critical models run in Zone 1 (the person’s home). Heavy inference models run in Zone 2 (a regional Community Pane node). Tasks that tolerate a cloud round-trip run in Zone 3 (the cloud). The decomposition allows each model to be deployed where its task requirements dictate. A monolithic model cannot be split across zones. A decomposed portfolio can.
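The placement logic these constraints imply can be sketched as a small rule function. This is an illustrative sketch, not BlueMirror’s deployment code: the `ModelRequirements` fields, the `place` function, the rule order, and the 20 ms cutoff are all assumptions introduced here. Only the zone names and the example latency targets come from the article.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    LOCAL_PANE = 1      # Zone 1: device in the person's home
    COMMUNITY_PANE = 2  # Zone 2: regional Community Pane node
    CLOUD = 3           # Zone 3: cloud


@dataclass(frozen=True)
class ModelRequirements:
    # Hypothetical schema; the source does not define one.
    name: str
    data_must_stay_on_device: bool
    latency_budget_ms: float
    tolerates_cloud_round_trip: bool = False


def place(model: ModelRequirements, local_cutoff_ms: float = 20.0) -> Zone:
    """Pick a zone from task requirements (assumed rule order: privacy first)."""
    if model.data_must_stay_on_device:
        return Zone.LOCAL_PANE
    if model.latency_budget_ms <= local_cutoff_ms:
        return Zone.LOCAL_PANE
    if model.tolerates_cloud_round_trip:
        return Zone.CLOUD
    return Zone.COMMUNITY_PANE


safety_filter = ModelRequirements("Safety Filter", False, 15.0)
# The 30-second window is used as a stand-in latency budget (assumption).
cognitive_estimator = ModelRequirements("Cognitive State Estimator", True, 30_000.0)
response_generator = ModelRequirements("Response Generator", False, 100.0)
legacy_content = ModelRequirements("Legacy Content SLM", False, 200.0, True)

for m in (safety_filter, cognitive_estimator, response_generator, legacy_content):
    print(f"{m.name}: {place(m).name}")
```

A single monolithic model would have to satisfy the strictest requirement of every task at once, which is why the rule only makes sense per model.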
The target portfolio
The thirty-seven models organize into six categories. The first five categories track the functional layers of the system. The sixth covers the two newer concierge domains that the original thirty did not include.
The Core Interaction category handles real-time user-facing language. The Response Generator produces conversational output. The Intent Classifier categorizes incoming requests by domain and sub-domain. The Emotion Detector recognizes emotional state from text and voice. The Empathy Responder generates emotionally calibrated responses. The Clarification Generator produces follow-up questions when requests are ambiguous. These models range from 100 to 400 million parameters, with inference latency targets under 100 milliseconds.
The Memory Care category specializes in cognitive support. The Orientation Assessor performs time, place, and person checks. The Cognitive State Estimator detects lucidity and cognitive fluctuation from behavioral signals. The Confusion Detector identifies disorientation patterns from conversation flow. The Reminiscence Prompter generates life-story engagement prompts. The Simplification Engine adjusts language complexity based on cognitive state. These models range from 70 to 200 million parameters, with the Cognitive State Estimator processing 30-second behavioral windows rather than real-time per-token inference.
The Domain Expert category provides specialized knowledge. The Medication Advisor handles drug interaction checking. The Nutrition Advisor generates dietary recommendations. The Exercise Coach suggests mobility activities. The Sleep Pattern Analyzer assesses rest quality from temporal data. The Financial Advisor and Legal Advisor handle their respective domains. These models range from 100 to 200 million parameters.
The Routing and Safety category gates the system’s behavior. The MoC Router selects context layers per query. The Safety Filter validates outputs for harmful content. The Privacy Filter detects personally identifiable information before any outbound transmission. The Escalation Classifier decides when human intervention is needed. The Trust Evaluator scores external agents in the Blue Pane membrane context. These models range from 80 to 150 million parameters, with the Safety and Privacy Filters targeting sub-15-millisecond inference because they gate every output.
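The gating behavior of the Safety and Privacy Filters can be illustrated with a minimal sketch. The real filters are learned classifiers. The keyword and regular-expression checks below are placeholders, and the names `safety_gate`, `privacy_gate`, and `release` are hypothetical. The sketch shows only the control flow: every candidate output passes through every gate before release.

```python
import re
from typing import Callable, List, Optional, Tuple

# A gate returns None to pass the text, or a short reason string to block it.
Gate = Callable[[str], Optional[str]]

# Placeholder patterns only; the actual filters are trained models.
_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
_BLOCKED_PHRASES = ("double your dose",)


def safety_gate(text: str) -> Optional[str]:
    lowered = text.lower()
    for phrase in _BLOCKED_PHRASES:
        if phrase in lowered:
            return f"safety: contains '{phrase}'"
    return None


def privacy_gate(text: str) -> Optional[str]:
    if _PHONE.search(text):
        return "privacy: possible phone number"
    return None


def release(text: str, gates: List[Gate]) -> Tuple[bool, Optional[str]]:
    """Run every gate in order; the first objection blocks the output."""
    for gate in gates:
        reason = gate(text)
        if reason is not None:
            return False, reason
    return True, None


GATES = [safety_gate, privacy_gate]
print(release("Your appointment is at 3 pm.", GATES))
print(release("Call her at 555-123-4567.", GATES))
```

Because the gates sit on the path of every output, their latency adds to every response, which is the reason the article gives for the sub-15-millisecond target.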
The Specialized Function category handles sensor and analytical tasks. The Speech-to-Intent model converts voice commands to structured intents. The Voice Tone Analyzer extracts emotional tone from speech. The Temporal Pattern Detector finds patterns in time-series behavior. The Anomaly Detector flags deviations from established baselines. The Summary Generator produces conversation and event summaries.
The Learning and End-of-Life category covers the two newest concierge domains. Three models power the learning and literacy concierge. The Knowledge Graph SLM extracts concept relationships from clinical, financial, legal, and digital source text and classifies prerequisite edges in the subscriber’s knowledge graph (120M parameters, Transformer encoder, under 80ms). The Adaptive Content SLM reformats information across learning style modes and generates analogies from subscriber-specific context (250M parameters, Transformer encoder-decoder, under 150ms). The Comprehension Assessment SLM classifies comprehension signals from conversational and behavioral input and produces confidence scores without direct examination of the subscriber (100M parameters, Transformer encoder, under 60ms).
Four models power the end-of-life concierge. The Directive Processing SLM parses advance directive text into structured fields and detects version conflicts (150M parameters, Transformer encoder, under 100ms). The Symptom Pattern SLM classifies symptom reports and detects trend anomalies relative to the subscriber’s baseline without producing clinical assessments (120M parameters, Transformer encoder, under 80ms). The Legacy Content SLM organizes and tags legacy assets and prompts the subscriber toward content she may not have considered (200M parameters, Transformer encoder-decoder, under 200ms; not latency-critical). The Care Circle Communication SLM classifies communication pattern transitions and calibrates notification content for end-of-life context (100M parameters, Transformer encoder, under 60ms).
All seven models in this category deploy to Zone 3 cloud in Phase 1, with the Knowledge Graph SLM and the Directive Processing SLM targeting Zone 1 in Phase 2 for offline availability.
Total target portfolio: approximately 3 billion parameters across thirty-seven models. After INT4 quantization, total storage footprint is approximately 1.5 gigabytes.
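The storage figure follows from simple arithmetic: INT4 stores each weight in 4 bits, or half a byte. The helper below is a back-of-envelope sketch. `quantized_size_gb` is a name introduced here, a gigabyte is taken as 10^9 bytes, and the estimate ignores the scale and zero-point metadata that practical quantization schemes store alongside the weights, so real footprints run somewhat higher.

```python
def quantized_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight storage in gigabytes (10**9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 3e9  # approximate portfolio total from the article

int4_gb = quantized_size_gb(TOTAL_PARAMS, 4)
fp16_gb = quantized_size_gb(TOTAL_PARAMS, 16)  # shown for comparison only

print(f"INT4: {int4_gb:.1f} GB, FP16: {fp16_gb:.1f} GB")
```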
What runs at launch
The launch portfolio is smaller, and describing it honestly matters more than describing the target.
Zone 1 (Local Pane, in the person’s home) runs from day one. It hosts five to eight models totaling approximately 850 million parameters, drawn from the Safety Filter, Privacy Filter, Cognitive State Estimator, Emotion Detector, Speech-to-Intent, Voice Tone Analyzer, Orientation Assessor, and Confusion Detector. These are V0.5 models, pretrained on synthetic data generated through the pipeline described in BMT-06.04. They handle the most privacy-critical inference: cognitive assessment, emotional state, safety screening, and voice processing. This data never leaves the home.
Zone 2 (everything else) runs on a commercial API at launch. The API handles response generation, intent classification, domain expert reasoning, empathy calibration, cross-domain coordination, and all tasks that require the full MoC context and multi-model collaboration. The API operates under a healthcare data processing agreement. The orchestration logic (BMT-02.01) runs identically regardless of whether the inference substrate is a self-hosted regional node or an API endpoint. The H-layer decomposes the task, delegates to L-layer skills, and synthesizes the response through the same code paths.
The person does not see the difference. She sees one AI concierge that responds quickly and knows her well. Whether the inference behind her response runs on a device in her living room or through an API is invisible to her. The orchestration layer (BMT-02.01) abstracts the substrate.
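One common way to make a substrate invisible to the caller is to have the orchestrator depend on an interface instead of a concrete backend. The sketch below is an assumption about shape, not a description of BMT-02.01. `InferenceBackend`, the two backend classes, and `Orchestrator.handle` are hypothetical names, and splitting on " and " is a toy stand-in for H-layer task decomposition.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class RegionalNodeBackend:
    """Stand-in for a self-hosted Zone 2 node."""

    def complete(self, prompt: str) -> str:
        return f"[regional-node] {prompt}"


class CommercialApiBackend:
    """Stand-in for the commercial API used at launch."""

    def complete(self, prompt: str) -> str:
        return f"[api] {prompt}"


class Orchestrator:
    def __init__(self, backend: InferenceBackend) -> None:
        self._backend = backend

    def handle(self, request: str) -> str:
        # Decompose, delegate, synthesize: the same steps for any backend.
        subtasks = [part.strip() for part in request.split(" and ") if part.strip()]
        results = [self._backend.complete(task) for task in subtasks]
        return " | ".join(results)


request = "check my pills and book a ride"
print(Orchestrator(CommercialApiBackend()).handle(request))
print(Orchestrator(RegionalNodeBackend()).handle(request))
```

Swapping the backend changes only the constructor argument. The `handle` code path is identical in both cases, which is the property the article attributes to the orchestration layer.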
The migration path
Over twenty-four to thirty-six months, proprietary SLMs trained on real subscriber interaction data replace the API, domain by domain. The process is described in full in BMT-06.04, but the portfolio-level view matters for this article.
Month 6 to 12: V0.5 Zone 1 models run alongside the API. Real interaction data accumulates. The India university teams (IIIT Hyderabad, IIT Madras) begin fine-tuning V1.0 models on the accumulated data.
Month 12 to 18: Zone 2 regional nodes (Community Pane, BMT-06.03) deploy in the first markets. V1.0 SLMs for routine query classes (medication reminders, appointment scheduling, simple benefits questions) pass A/B quality validation against the API and migrate to Zone 2. The API handles complex queries.
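A migration gate of this kind can be sketched as a paired comparison of quality scores per query class. The article does not specify the validation metric or threshold. The `ready_to_migrate` function, the 0.02 maximum quality gap, the minimum sample count, and the example scores below are all illustrative assumptions.

```python
from statistics import mean
from typing import Sequence


def ready_to_migrate(
    slm_scores: Sequence[float],
    api_scores: Sequence[float],
    max_quality_gap: float = 0.02,  # hypothetical tolerance
    min_samples: int = 5,           # hypothetical minimum evidence
) -> bool:
    """True when the SLM's mean quality is within the allowed gap of the API's."""
    if len(slm_scores) != len(api_scores):
        raise ValueError("scores must be paired per query")
    if len(slm_scores) < min_samples:
        return False
    return mean(api_scores) - mean(slm_scores) <= max_quality_gap


# Made-up scores for two query classes, graded on the same queries.
reminders_slm = [0.91, 0.93, 0.90, 0.92, 0.94]
reminders_api = [0.92, 0.93, 0.91, 0.92, 0.94]
complex_slm = [0.70, 0.65, 0.72, 0.68, 0.66]
complex_api = [0.90, 0.88, 0.91, 0.89, 0.92]

print("medication reminders:", ready_to_migrate(reminders_slm, reminders_api))
print("complex multi-domain:", ready_to_migrate(complex_slm, complex_api))
```

Evaluating each query class separately is what allows routine classes to move to Zone 2 while complex queries stay on the API.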
Month 18 to 30: progressive migration. More query classes move from API to Zone 2 SLMs. The share of API-dependent inference drops from between 80 and 85 percent at launch to between 10 and 15 percent at steady state.
Month 30 to 36: the thirty-seven-model portfolio operates as designed. Zone 1 handles privacy-critical inference (15 to 20 percent). Zone 2 handles heavy inference (55 to 60 percent). The API handles complex multi-domain reasoning and novel query types (10 to 15 percent). Zone 3 (cloud) handles only anonymized aggregates and coordination (5 to 10 percent).
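The four ranges can be checked for consistency: some combination of values inside them must sum to 100 percent. The sketch below assumes the percentages are shares of total inference volume, which the article implies but does not state outright.

```python
# Steady-state share ranges in percent, as quoted in the article.
STEADY_STATE_SHARE = {
    "Zone 1": (15, 20),
    "Zone 2": (55, 60),
    "API": (10, 15),
    "Zone 3": (5, 10),
}

low_total = sum(low for low, _ in STEADY_STATE_SHARE.values())
high_total = sum(high for _, high in STEADY_STATE_SHARE.values())

# The ranges are mutually consistent if 100% lies between the two totals.
feasible = low_total <= 100 <= high_total

print(f"lows sum to {low_total}%, highs sum to {high_total}%, feasible: {feasible}")
```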
The portfolio is not static at month thirty-six. The training pipeline continues. The models improve. New models are added as the platform expands to new domains. But by month thirty-six, the thirty-seven-model target portfolio is deployed, predominantly self-hosted, and generating the inference cost savings and privacy posture described in the target architecture.
The right architecture for the right task
The portfolio uses four architecture types. The choice is per-model and justified by the task.
State space models handle temporal pattern recognition with linear computational complexity. The Anomaly Detector, the Temporal Pattern Detector, the Sleep Pattern Analyzer, and the Health Monitor process time-series data where linear scaling matters. A transformer’s quadratic attention overhead would dominate inference cost on long sequences.
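The scaling difference can be made concrete with a rough operation count. The functions below drop all constant factors and keep only the dominant terms: self-attention grows with the square of sequence length, while a state space recurrence grows linearly. The width of 256 and state dimension of 16 are arbitrary illustrative values, not figures from the portfolio.

```python
def attention_ops(seq_len: int, width: int) -> int:
    """Dominant term for self-attention: every position attends to every other."""
    return seq_len * seq_len * width


def ssm_ops(seq_len: int, width: int, state_dim: int) -> int:
    """Dominant term for a state space scan: one state update per position."""
    return seq_len * width * state_dim


WIDTH, STATE_DIM = 256, 16  # hypothetical sizes

for n in (1_000, 10_000, 100_000):
    ratio = attention_ops(n, WIDTH) / ssm_ops(n, WIDTH, STATE_DIM)
    print(f"seq_len={n:>7}: attention/ssm cost ratio ~ {ratio:,.0f}x")
```

A tenfold increase in sequence length multiplies the attention term by one hundred and the state space term by ten, which is why the gap widens on long time series.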
Mixture of experts provides parameter efficiency for classification and routing. The Intent Classifier, the Safety Filter, and the MoC Router need broad knowledge but activate only relevant expert sub-networks per query. Most parameters are dormant during any single inference.
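Sparse activation is usually implemented with top-k gating: a router scores every expert, but only the k best are evaluated. The sketch below uses plain Python and scalar "experts" to show the mechanism. In a real model the router logits come from a learned layer and the experts are neural sub-networks, and nothing here is taken from the BlueMirror models themselves.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: float, router_logits: Sequence[float], experts: Sequence[Expert], k: int = 2
) -> Tuple[float, List[int]]:
    """Run only the k highest-scoring experts and mix their outputs."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top


calls: List[int] = []  # records which experts actually ran


def make_expert(idx: int, scale: float) -> Expert:
    def expert(x: float) -> float:
        calls.append(idx)
        return scale * x
    return expert


experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
output, active = moe_forward(1.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(f"active experts: {active}, experts run: {sorted(calls)}, output: {output}")
```

Two of the four experts never execute for this input, which is the sense in which most parameters stay dormant during a single inference.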
Transformers deliver attention quality for generation. The Response Generator, the Empathy Responder, and the Summary Generator need the full attention mechanism to produce coherent, contextually appropriate text.
Hybrids combine architectures for tasks that need multiple capabilities. The Cognitive State Estimator combines temporal pattern recognition with discrete state classification because cognitive assessment needs both continuous monitoring and categorical output.
Each choice is a tradeoff. The tradeoff is documented per model in the technical appendix with measured performance comparisons against alternatives that were considered and rejected.
The deployment distribution
At maturity, the thirty-seven models distribute across three zones based on privacy sensitivity, latency requirements, and computational demands.
Zone 1 (Local Pane) hosts all Memory Care models, the Safety and Privacy Filters, and the sensor-processing Specialized Function models. These are the models that process the most sensitive data and require the lowest latency for safety-critical functions. Total Zone 1 parameter budget: approximately 850 million. The Knowledge Graph SLM joins Zone 1 in Phase 2 for in-session comprehension support.
Zone 2 (Community Pane) hosts Core Interaction models, Domain Expert models, the MoC Router, and the remaining Routing, Safety, and Specialized Function models. Zone 2 holds the full MoC context for each subscriber and the P-RLHF individual preference models. Total Zone 2 parameter budget: approximately 1.15 billion.
Zone 3 (Cloud) hosts the Learning and End-of-Life category in Phase 1: all seven models powering the learning and literacy concierge and the end-of-life concierge. These models operate on non-latency-critical tasks (comprehension assessment, legacy content organization, directive processing) where cloud round-trip is acceptable. As Zone 1 capacity expands in Phase 2, the Knowledge Graph SLM and the Directive Processing support pipeline migrate to Zone 1 for offline availability. Zone 3 also handles FSSVA coordination, model update distribution, and anonymized analytics.
The total portfolio of approximately 3 billion parameters, quantized to roughly 1.5 gigabytes, distributes across the three zones with headroom in each. The architecture is not designed to run at capacity. Headroom allows for model size increases as research advances, additional models as new domains are added, and growth in concurrent subscribers beyond the initial design point.
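Summing the per-zone figures quoted in this article gives a quick check on the roughly 3 billion total. One assumption is made: the article states no Zone 3 budget directly, so it is taken here as the sum of the seven Learning and End-of-Life models listed earlier.

```python
# Parameter counts in millions, as listed for the Learning and End-of-Life category.
ZONE3_MODELS_M = {
    "Knowledge Graph SLM": 120,
    "Adaptive Content SLM": 250,
    "Comprehension Assessment SLM": 100,
    "Directive Processing SLM": 150,
    "Symptom Pattern SLM": 120,
    "Legacy Content SLM": 200,
    "Care Circle Communication SLM": 100,
}

ZONE_BUDGET_M = {
    "Zone 1": 850,    # stated budget
    "Zone 2": 1150,   # stated budget (1.15 billion)
    "Zone 3": sum(ZONE3_MODELS_M.values()),  # derived, see assumption above
}

total_m = sum(ZONE_BUDGET_M.values())

for zone, budget in ZONE_BUDGET_M.items():
    print(f"{zone}: {budget} M parameters ({budget / total_m:.0%} of total)")
print(f"Total: {total_m} M parameters")
```

The derived total of about 3.04 billion agrees with the approximate 3 billion figure.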
Cross-references
BMT-02.02 The Thirty-Eight. The infrastructure agents that invoke these models. Each agent’s deployment preference drives which models it calls and from which zone.
BMT-06.01 Why Thirty-Seven Models, Not One. The strategic rationale for the portfolio approach at full depth.
BMT-06.03 Edge Intelligence. The three-zone compute architecture that defines where each model runs and how the edge intelligence envelope expands over time.
BMT-06.04 The Training Philosophy. The synthetic-to-proprietary pipeline that produces the models described here, including the India university partnerships and the API-to-SLM migration timeline.
Technical Appendix BMT-02.03-A is available to partners and investors at partners.bluemirror.tech.
