Specialists Guide
Specialists are the core building blocks of Alveare. Each one is a tuned configuration running on a shared model, optimised for a specific task.
What is a cognitive hive?
A cognitive hive is Alveare's core architecture. Instead of running separate models for each task (classification, summarisation, extraction, etc.), a hive loads one model and creates multiple specialists from it.
Each specialist has its own:
- System prompt — instructions that shape how the model behaves for that task
- Sampling parameters — temperature, top-p, and repetition settings tuned for the task type
- Guardrails — output validation, format enforcement, and safety filters
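Conceptually, a specialist is just configuration layered on top of a shared model. A minimal sketch of that idea in plain Python — the field names here are illustrative stand-ins, not Alveare's actual internal schema:

```python
from dataclasses import dataclass, field


@dataclass
class SpecialistConfig:
    """Illustrative sketch of what a specialist bundles together."""
    name: str
    system_prompt: str          # instructions that shape behaviour for the task
    temperature: float = 0.3    # sampling tuned for the task type
    top_p: float = 0.9
    guardrails: list = field(default_factory=list)  # output validators

# A classification specialist runs cool for consistent labels
classify = SpecialistConfig(
    name="classify",
    system_prompt="You are a text classifier. Respond with a single label.",
    temperature=0.2,
)
print(classify.name, classify.temperature)
```

The point is that nothing here is a second model: only the prompt, sampling, and validation differ between specialists.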
Because all specialists share the same loaded model weights, a hive uses 80-90% less GPU memory than running separate models. That structural advantage is why Alveare costs a fraction of what OpenAI charges for the same workloads.
How specialists share a single model
When a request arrives at the hive:
- The router identifies which specialist to use (from the specialist field or model name)
- The specialist's system prompt is prepended to the user's input
- Sampling parameters (temperature, top-p) are applied per the specialist's config
- The model generates a response using the shared weights
- Guardrails validate the output before returning it
This happens in the same inference process — there is no extra serialisation or network hop between specialists. Switching from one specialist to another is essentially free.
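The dispatch flow above can be sketched in a few lines of plain Python. The shared "model" is a single function here, and the specialist lookup, prompt prepending, and parameter application are hypothetical stand-ins for what the hive does internally:

```python
# Hypothetical sketch of hive dispatch, not Alveare's real implementation
SPECIALISTS = {
    "classify": {"system": "Classify the input. Reply with one label.", "temperature": 0.2},
    "summarise": {"system": "Summarise the input concisely.", "temperature": 0.4},
}


def shared_model(prompt: str, temperature: float) -> str:
    # Stand-in for the single loaded model that all specialists share
    return f"[t={temperature}] {prompt[:40]}"


def infer(specialist: str, prompt: str) -> str:
    cfg = SPECIALISTS[specialist]                         # router picks the specialist
    full_prompt = cfg["system"] + "\n\n" + prompt         # system prompt prepended
    return shared_model(full_prompt, cfg["temperature"])  # shared weights, per-task sampling


print(infer("classify", "Great product, would buy again."))
```

Because switching specialists only swaps the config dictionary, not the model, the "essentially free" claim above follows directly from this structure.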
Available specialists
Text Classification
Categorise text into predefined labels. Useful for sentiment analysis, intent detection, topic routing, and content moderation. Runs at low temperature (0.2-0.3) for consistent results.
result = client.infer(
specialist="classify",
prompt="""Classify the sentiment of this review as positive, negative, or neutral:
"The battery life is incredible but the camera quality
could be better. Overall a solid phone for the price.\"""",
temperature=0.2,
)
print(result.result)
# positive
Text Summarisation
Condense long text into concise summaries. Supports bullet points, executive summaries, TL;DR, and custom formats. Works well with documents up to ~3000 tokens of input.
result = client.infer(
specialist="summarise",
prompt="""Summarise this in 3 bullet points:
The company reported Q4 revenue of $4.2 billion, up 23% year-over-year.
Operating margins expanded to 28%, driven by efficiency improvements in
cloud infrastructure and reduced customer acquisition costs. The enterprise
segment grew 45% and now represents 60% of total revenue. Management
raised full-year guidance to $18B-$18.5B, above analyst expectations of
$17.8B. Free cash flow reached $1.1B in the quarter.""",
max_tokens=200,
)
print(result.result)
# - Q4 revenue hit $4.2B (+23% YoY) with operating margins at 28%
# - Enterprise segment grew 45%, now 60% of total revenue
# - Full-year guidance raised to $18B-$18.5B, beating expectations
Structured Data Extraction
Pull structured data from unstructured text. Outputs JSON by default. Useful for parsing invoices, emails, resumes, receipts, and any document where you need specific fields.
result = client.infer(
specialist="extract",
prompt="""Extract all contact information as JSON:
Hi there,
I'm Sarah Chen, VP of Engineering at TechFlow Inc.
You can reach me at sarah.chen@techflow.io or call
my direct line at (415) 555-0198. Our office is at
123 Market Street, Suite 400, San Francisco, CA 94105.""",
max_tokens=256,
)
print(result.result)
# {
# "name": "Sarah Chen",
# "title": "VP of Engineering",
# "company": "TechFlow Inc.",
# "email": "sarah.chen@techflow.io",
# "phone": "(415) 555-0198",
# "address": "123 Market Street, Suite 400, San Francisco, CA 94105"
# }
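Because extract returns JSON as text, it is worth parsing the result and checking required keys before passing it downstream. A small validation step — the required-keys set is your own choice, not part of the API:

```python
import json

# e.g. result.result from the extract call above
raw = """{
  "name": "Sarah Chen",
  "title": "VP of Engineering",
  "company": "TechFlow Inc.",
  "email": "sarah.chen@techflow.io",
  "phone": "(415) 555-0198",
  "address": "123 Market Street, Suite 400, San Francisco, CA 94105"
}"""

contact = json.loads(raw)            # raises ValueError if the output isn't valid JSON
required = {"name", "email", "phone"}
missing = required - contact.keys()
if missing:
    raise ValueError(f"extraction missing fields: {missing}")
print(contact["email"])
```

This catches both malformed JSON and fields the model silently dropped, which is cheaper than discovering either in a downstream system.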
Question Answering
Answer questions grounded in a provided context passage. The specialist is trained to only use information present in the context, reducing hallucination. Returns "I don't have enough information" when the answer is not in the context.
result = client.infer(
specialist="qa",
prompt="""Context: The Alveare platform uses a cognitive hive architecture
where a single SLM (Small Language Model) is shared across multiple
specialists. Each specialist has its own system prompt and sampling
parameters. The platform supports Mistral 7B and Llama 2 7B/13B models.
Inference latency is typically under 500ms for requests under 1000 tokens.
Question: What models does Alveare support?""",
)
print(result.result)
# Alveare supports Mistral 7B and Llama 2 in both 7B and 13B parameter sizes.
Multi-turn Conversation
General-purpose conversational AI. Maintains context across multiple turns via the messages array. Use with the OpenAI-compatible endpoint for the best multi-turn experience.
response = client.chat.completions.create(
model="alveare-chat",
messages=[
{"role": "system", "content": "You are a helpful customer support agent for an e-commerce store."},
{"role": "user", "content": "I ordered a laptop last week but it hasn't arrived."},
{"role": "assistant", "content": "I'm sorry to hear that. Could you provide your order number so I can look into this?"},
{"role": "user", "content": "It's ORD-98765"},
],
max_tokens=256,
)
print(response.choices[0].message.content)
# Thank you. I can see order ORD-98765 was shipped on March 12th via
# express delivery. The tracking shows it's currently at the regional
# distribution center. It should arrive within 1-2 business days...
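Continuing a conversation is just list management on the client side: append the assistant's reply, then the next user turn, and send the whole history again. A sketch of that bookkeeping, independent of any particular SDK:

```python
messages = [
    {"role": "system", "content": "You are a helpful customer support agent."},
    {"role": "user", "content": "I ordered a laptop last week but it hasn't arrived."},
]


def add_turn(history: list, assistant_reply: str, next_user_msg: str) -> list:
    """Append the model's reply and the user's follow-up to the history."""
    history.append({"role": "assistant", "content": assistant_reply})
    history.append({"role": "user", "content": next_user_msg})
    return history


add_turn(messages, "Could you share your order number?", "It's ORD-98765")
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user']
```

Each request carries the full history, so trimming or summarising old turns is the client's responsibility once conversations grow long.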
Code Generation
Generate, explain, debug, and refactor code. Works with Python, JavaScript, TypeScript, Go, Rust, SQL, and other mainstream languages. Best results with clear, specific prompts.
result = client.infer(
specialist="code",
prompt="""Write a Python function that takes a list of dictionaries
and groups them by a specified key. Include type hints and docstring.""",
max_tokens=512,
)
print(result.result)
# def group_by(items: list[dict], key: str) -> dict[str, list[dict]]:
# """Group a list of dictionaries by a specified key.
#
# Args:
# items: List of dictionaries to group.
# key: The dictionary key to group by.
#
# Returns:
# Dictionary mapping key values to lists of matching items.
# """
# groups: dict[str, list[dict]] = {}
# for item in items:
# value = item.get(key, "unknown")
# groups.setdefault(value, []).append(item)
# return groups
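The function in the sample output above is self-contained enough to run. A quick usage check of that same group_by, reproduced verbatim from the example output:

```python
def group_by(items: list[dict], key: str) -> dict[str, list[dict]]:
    """Group a list of dictionaries by a specified key."""
    groups: dict[str, list[dict]] = {}
    for item in items:
        value = item.get(key, "unknown")
        groups.setdefault(value, []).append(item)
    return groups


orders = [
    {"status": "shipped", "id": 1},
    {"status": "pending", "id": 2},
    {"status": "shipped", "id": 3},
]
print(group_by(orders, "status"))
# groups the orders into {'shipped': [...], 'pending': [...]}
```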
Custom Specialists
Available on Professional and Scale plans. Create your own specialists with custom system prompts, temperature defaults, and output validators. Define them in the dashboard or via the management API.
# Use a custom specialist defined in your dashboard
result = client.infer(
specialist="my-legal-reviewer",
prompt="Review this contract clause for potential issues: ...",
max_tokens=512,
)
Best practices
Choosing the right specialist
- Use classify when you need a single label or category — not free-form text
- Use summarise when compressing long text — not when answering questions about it (use qa for that)
- Use extract when you need structured JSON output from unstructured input
- Use qa when you have a specific context passage and want grounded answers
- Use chat for multi-turn conversations — it handles conversation history
- Use code for anything programming-related: generation, explanation, debugging
Prompt tips
- Be specific: "Classify as positive, negative, or neutral" beats "What's the sentiment?"
- Provide format instructions: "Return a JSON object with keys: name, email, phone" gets better extraction results
- Use examples: Including one or two examples in the prompt (few-shot) significantly improves accuracy
- Keep context focused: For QA, include only the relevant passage, not an entire document
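The few-shot tip above is plain string assembly. A sketch of building a classify prompt with two worked examples — the example reviews and labels here are illustrative:

```python
# Illustrative few-shot examples; swap in samples from your own domain
examples = [
    ("The checkout flow kept crashing on my phone.", "negative"),
    ("Delivery was a day early and packaging was perfect.", "positive"),
]


def build_prompt(text: str) -> str:
    """Assemble a few-shot classification prompt ending at the label to fill in."""
    shots = "\n".join(f'Review: "{t}"\nLabel: {label}' for t, label in examples)
    return (
        "Classify each review as positive, negative, or neutral.\n\n"
        f"{shots}\n"
        f'Review: "{text}"\nLabel:'
    )


print(build_prompt("Battery life is fine, nothing special."))
```

Ending the prompt at "Label:" nudges the model to complete with just the label, which pairs well with a low max_tokens setting.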
Token optimisation
- Set max_tokens to the minimum you need — don't use 4096 when 128 will do
- For classification, max_tokens: 32 is usually sufficient
- Trim unnecessary whitespace and boilerplate from input text
- Use temperature: 0.0-0.3 for deterministic tasks (classify, extract) and 0.5-0.9 for creative tasks (chat, code)
Lower temperature + lower max_tokens = faster responses and lower token usage. For classification workloads, you can often achieve sub-100ms latency with temperature: 0.1 and max_tokens: 16.