
Local LLMs on CPU

Developers running small local models (like phi4-mini) face a fundamental tension: the trade-off between privacy and reliability. Small models struggle with structural adherence (producing valid JSON) and semantic nuance (category drift). On CPU hardware, every token is expensive, so long, descriptive system prompts become a performance bottleneck.

To build a production-grade system on constrained hardware, we must move from prompt engineering to system engineering.

1. Signatures as contracts

DSPy allows us to stop treating the LLM as a chatbot and start treating it as a software component. We define a signature: a typed interface that the model must satisfy. This reduces the surface area for hallucinations.

import dspy

class SorterSignature(dspy.Signature):
    """Analyze input and determine its type, subject, and topic."""

    input = dspy.InputField()
    reasoning = dspy.OutputField(desc="Chain of thought")
    type = dspy.OutputField(desc="MUST be one of: project, log, entity...")
    confidence = dspy.OutputField(desc="Score 0.0-1.0")

By replacing fragile prompt tweaking with this compiled signature, we moved from ~70% accuracy to a 100% pass rate on our internal BDD test suite.

2. The safety loop (confidence gates)

In a rigorous system, we acknowledge that probabilistic models will fail. The critical mechanism here is the confidence gate. Any classification with a score below 0.8 is routed to a manual review folder.

This turns a potential AI failure into a simple asynchronous triage task, ensuring that low-confidence guesses never pollute the permanent file structure.
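The gate itself is a few lines of plain Python. A minimal sketch (the folder names and the dict shape of the classification result are illustrative assumptions, not part of the original system):

```python
REVIEW_THRESHOLD = 0.8  # below this, a human triages the note

def route(result: dict) -> str:
    """Return the destination folder for a classification result.

    Small models sometimes emit an unparseable confidence string;
    that is treated as a failure and routed to review as well.
    """
    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0  # unparseable score counts as low confidence
    if confidence < REVIEW_THRESHOLD:
        return "review/"  # low-confidence guesses never touch the archive
    return f"{result['type']}/"
```

Note that the parse failure branch matters as much as the threshold: it is exactly the case where a small model has already violated the contract.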

3. The async pattern

Running on CPU-only hardware results in ~9 seconds of latency per classification. In a synchronous UI, this is a failure. Run as a background watchdog process, however, the latency becomes irrelevant: the user captures a thought and returns to their work, and the system handles the heavy lifting out of sight, proving that high-end GPUs are not a requirement for agentic workflows.
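The decoupling can be sketched with a queue and a worker thread (a minimal illustration, not the actual watchdog implementation; `classify` and `route` are injected callables standing in for the DSPy module and the confidence gate):

```python
import queue
import threading

def classify_worker(inbox: queue.Queue, classify, route, stop: threading.Event):
    """Drain captured notes from the inbox and route each classification.

    Capture only has to enqueue a string, which is instant; the slow
    CPU inference happens here, where nobody is waiting on it.
    """
    while not stop.is_set():
        try:
            note = inbox.get(timeout=0.1)  # poll so the stop flag is honored
        except queue.Empty:
            continue
        route(classify(note))  # ~9 s per item is fine: this thread owns the cost
        inbox.task_done()
```

The same shape works with a filesystem watcher in place of the queue; the essential move is that capture and classification never share a latency budget.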