Specialists Guide
Specialists are the core building blocks of Alveare. Each one is a tuned configuration running on a shared model, optimised for a specific task.
What is a cognitive hive?
A cognitive hive is Alveare's core architecture. Instead of running separate models for each task (classification, summarisation, extraction, etc.), a hive loads one model and creates multiple specialists from it.
Each specialist has its own:
- System prompt — instructions that shape how the model behaves for that task
- Sampling parameters — temperature, top-p, and repetition settings tuned for the task type
- Guardrails — output validation, format enforcement, and safety filters
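Conceptually, a specialist is just configuration layered on top of a shared model. A minimal sketch of that idea in plain Python — the field names here are illustrative stand-ins, not Alveare's actual internal schema:

```python
from dataclasses import dataclass, field


@dataclass
class SpecialistConfig:
    """Illustrative sketch of what a specialist bundles together."""
    name: str
    system_prompt: str          # instructions that shape behaviour for the task
    temperature: float = 0.3    # sampling tuned for the task type
    top_p: float = 0.9
    guardrails: list = field(default_factory=list)  # output validators

# A classification specialist runs cool for consistent labels
classify = SpecialistConfig(
    name="classify",
    system_prompt="You are a text classifier. Respond with a single label.",
    temperature=0.2,
)
print(classify.name, classify.temperature)
```

The point is that nothing here is a second model: only the prompt, sampling, and validation differ between specialists.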
Because all specialists share the same loaded model weights, a hive uses 80-90% less GPU memory than running separate models. That structural advantage is why Alveare costs a fraction of what OpenAI charges for the same workloads.
How specialists share a single model
When a request arrives at the hive:
- The router identifies which specialist to use (from the specialist field or model name)
- The specialist's system prompt is prepended to the user's input
- Sampling parameters (temperature, top-p) are applied per the specialist's config
- The model generates a response using the shared weights
- Guardrails validate the output before returning it
This happens in the same inference process — there is no extra serialisation or network hop between specialists. Switching from one specialist to another is essentially free.
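The dispatch flow above can be sketched in a few lines of plain Python. The shared "model" is a single function here, and the specialist lookup, prompt prepending, and parameter application are hypothetical stand-ins for what the hive does internally:

```python
# Hypothetical sketch of hive dispatch, not Alveare's real implementation
SPECIALISTS = {
    "classify": {"system": "Classify the input. Reply with one label.", "temperature": 0.2},
    "summarise": {"system": "Summarise the input concisely.", "temperature": 0.4},
}


def shared_model(prompt: str, temperature: float) -> str:
    # Stand-in for the single loaded model that all specialists share
    return f"[t={temperature}] {prompt[:40]}"


def infer(specialist: str, prompt: str) -> str:
    cfg = SPECIALISTS[specialist]                         # router picks the specialist
    full_prompt = cfg["system"] + "\n\n" + prompt         # system prompt prepended
    return shared_model(full_prompt, cfg["temperature"])  # shared weights, per-task sampling


print(infer("classify", "Great product, would buy again."))
```

Because switching specialists only swaps the config dictionary, not the model, the "essentially free" claim above follows directly from this structure.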
Available specialists
Text Classification
Categorise text into predefined labels. Useful for sentiment analysis, intent detection, topic routing, and content moderation. Runs at low temperature (0.2-0.3) for consistent results.
result = client.infer(
specialist="classify",
prompt="""Classify the sentiment of this review as positive, negative, or neutral:
"The battery life is incredible but the camera quality
could be better. Overall a solid phone for the price.\"""",
temperature=0.2,
)
print(result.result)
# positive
Text Summarisation
Condense long text into concise summaries. Supports bullet points, executive summaries, TL;DR, and custom formats. Works well with documents up to ~3000 tokens of input.
result = client.infer(
specialist="summarise",
prompt="""Summarise this in 3 bullet points:
The company reported Q4 revenue of $4.2 billion, up 23% year-over-year.
Operating margins expanded to 28%, driven by efficiency improvements in
cloud infrastructure and reduced customer acquisition costs. The enterprise
segment grew 45% and now represents 60% of total revenue. Management
raised full-year guidance to $18B-$18.5B, above analyst expectations of
$17.8B. Free cash flow reached $1.1B in the quarter.""",
max_tokens=200,
)
print(result.result)
# - Q4 revenue hit $4.2B (+23% YoY) with operating margins at 28%
# - Enterprise segment grew 45%, now 60% of total revenue
# - Full-year guidance raised to $18B-$18.5B, beating expectations
Structured Data Extraction
Pull structured data from unstructured text. Outputs JSON by default. Useful for parsing invoices, emails, resumes, receipts, and any document where you need specific fields.
result = client.infer(
specialist="extract",
prompt="""Extract all contact information as JSON:
Hi there,
I'm Sarah Chen, VP of Engineering at TechFlow Inc.
You can reach me at sarah.chen@techflow.io or call
my direct line at (415) 555-0198. Our office is at
123 Market Street, Suite 400, San Francisco, CA 94105.""",
max_tokens=256,
)
print(result.result)
# {
# "name": "Sarah Chen",
# "title": "VP of Engineering",
# "company": "TechFlow Inc.",
# "email": "sarah.chen@techflow.io",
# "phone": "(415) 555-0198",
# "address": "123 Market Street, Suite 400, San Francisco, CA 94105"
# }
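Because extract returns JSON as text, it is worth parsing the result and checking required keys before passing it downstream. A small validation step — the required-keys set is your own choice, not part of the API:

```python
import json

# e.g. result.result from the extract call above
raw = """{
  "name": "Sarah Chen",
  "title": "VP of Engineering",
  "company": "TechFlow Inc.",
  "email": "sarah.chen@techflow.io",
  "phone": "(415) 555-0198",
  "address": "123 Market Street, Suite 400, San Francisco, CA 94105"
}"""

contact = json.loads(raw)            # raises ValueError if the output isn't valid JSON
required = {"name", "email", "phone"}
missing = required - contact.keys()
if missing:
    raise ValueError(f"extraction missing fields: {missing}")
print(contact["email"])
```

This catches both malformed JSON and fields the model silently dropped, which is cheaper than discovering either in a downstream system.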
Question Answering
Answer questions grounded in a provided context passage. The specialist is trained to only use information present in the context, reducing hallucination. Returns "I don't have enough information" when the answer is not in the context.
result = client.infer(
specialist="qa",
prompt="""Context: The Alveare platform uses a cognitive hive architecture
where a single SLM (Small Language Model) is shared across multiple
specialists. Each specialist has its own system prompt and sampling
parameters. The platform supports Mistral 7B and Llama 2 7B/13B models.
Inference latency is typically under 500ms for requests under 1000 tokens.
Question: What models does Alveare support?""",
)
print(result.result)
# Alveare supports Mistral 7B and Llama 2 in both 7B and 13B parameter sizes.
Multi-turn Conversation
General-purpose conversational AI. Maintains context across multiple turns via the messages array. Use with the OpenAI-compatible endpoint for the best multi-turn experience.
response = client.chat.completions.create(
model="alveare-chat",
messages=[
{"role": "system", "content": "You are a helpful customer support agent for an e-commerce store."},
{"role": "user", "content": "I ordered a laptop last week but it hasn't arrived."},
{"role": "assistant", "content": "I'm sorry to hear that. Could you provide your order number so I can look into this?"},
{"role": "user", "content": "It's ORD-98765"},
],
max_tokens=256,
)
print(response.choices[0].message.content)
# Thank you. I can see order ORD-98765 was shipped on March 12th via
# express delivery. The tracking shows it's currently at the regional
# distribution center. It should arrive within 1-2 business days...
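Continuing a conversation is just list management on the client side: append the assistant's reply, then the next user turn, and send the whole history again. A sketch of that bookkeeping, independent of any particular SDK:

```python
messages = [
    {"role": "system", "content": "You are a helpful customer support agent."},
    {"role": "user", "content": "I ordered a laptop last week but it hasn't arrived."},
]


def add_turn(history: list, assistant_reply: str, next_user_msg: str) -> list:
    """Append the model's reply and the user's follow-up to the history."""
    history.append({"role": "assistant", "content": assistant_reply})
    history.append({"role": "user", "content": next_user_msg})
    return history


add_turn(messages, "Could you share your order number?", "It's ORD-98765")
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user']
```

Each request carries the full history, so trimming or summarising old turns is the client's responsibility once conversations grow long.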
Code Generation
Generate, explain, debug, and refactor code. Works with Python, JavaScript, TypeScript, Go, Rust, SQL, and other mainstream languages. Best results with clear, specific prompts.
result = client.infer(
specialist="code",
prompt="""Write a Python function that takes a list of dictionaries
and groups them by a specified key. Include type hints and docstring.""",
max_tokens=512,
)
print(result.result)
# def group_by(items: list[dict], key: str) -> dict[str, list[dict]]:
# """Group a list of dictionaries by a specified key.
#
# Args:
# items: List of dictionaries to group.
# key: The dictionary key to group by.
#
# Returns:
# Dictionary mapping key values to lists of matching items.
# """
# groups: dict[str, list[dict]] = {}
# for item in items:
# value = item.get(key, "unknown")
# groups.setdefault(value, []).append(item)
# return groups
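The function in the sample output above is self-contained enough to run. A quick usage check of that same group_by, reproduced verbatim from the example output:

```python
def group_by(items: list[dict], key: str) -> dict[str, list[dict]]:
    """Group a list of dictionaries by a specified key."""
    groups: dict[str, list[dict]] = {}
    for item in items:
        value = item.get(key, "unknown")
        groups.setdefault(value, []).append(item)
    return groups


orders = [
    {"status": "shipped", "id": 1},
    {"status": "pending", "id": 2},
    {"status": "shipped", "id": 3},
]
print(group_by(orders, "status"))
# groups the orders into {'shipped': [...], 'pending': [...]}
```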
Custom Specialists
Available on Professional and Scale plans. Create your own specialists with custom system prompts, temperature defaults, and output validators. Define them in the dashboard or via the management API.
# Use a custom specialist defined in your dashboard
result = client.infer(
specialist="my-legal-reviewer",
prompt="Review this contract clause for potential issues: ...",
max_tokens=512,
)
Best practices
Choosing the right specialist
- Use classify when you need a single label or category — not free-form text
- Use summarise when compressing long text — not when answering questions about it (use qa for that)
- Use extract when you need structured JSON output from unstructured input
- Use qa when you have a specific context passage and want grounded answers
- Use chat for multi-turn conversations — it handles conversation history
- Use code for anything programming-related: generation, explanation, debugging
Prompt tips
- Be specific: "Classify as positive, negative, or neutral" beats "What's the sentiment?"
- Provide format instructions: "Return a JSON object with keys: name, email, phone" gets better extraction results
- Use examples: Including one or two examples in the prompt (few-shot) significantly improves accuracy
- Keep context focused: For QA, include only the relevant passage, not an entire document
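The few-shot tip above is plain string assembly. A sketch of building a classify prompt with two worked examples — the example reviews and labels here are illustrative:

```python
# Illustrative few-shot examples; swap in samples from your own domain
examples = [
    ("The checkout flow kept crashing on my phone.", "negative"),
    ("Delivery was a day early and packaging was perfect.", "positive"),
]


def build_prompt(text: str) -> str:
    """Assemble a few-shot classification prompt ending at the label to fill in."""
    shots = "\n".join(f'Review: "{t}"\nLabel: {label}' for t, label in examples)
    return (
        "Classify each review as positive, negative, or neutral.\n\n"
        f"{shots}\n"
        f'Review: "{text}"\nLabel:'
    )


print(build_prompt("Battery life is fine, nothing special."))
```

Ending the prompt at "Label:" nudges the model to complete with just the label, which pairs well with a low max_tokens setting.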
Token optimisation
- Set max_tokens to the minimum you need — don't use 4096 when 128 will do
- For classification, max_tokens: 32 is usually sufficient
- Trim unnecessary whitespace and boilerplate from input text
- Use temperature: 0.0-0.3 for deterministic tasks (classify, extract) and 0.5-0.9 for creative tasks (chat, code)
Lower temperature + lower max_tokens = faster responses and lower token usage. For classification workloads, you can often achieve sub-100ms latency with temperature: 0.1 and max_tokens: 16.