How Ollama Works
Ollama works in the following way:
- You install Ollama on your local machine
- You pull model weights for specific LLMs (like Llama 2, Mistral, or Gemma)
- Ollama sets up a local API server
- You interact with the models through the CLI or via API calls (a minimal example follows this list)
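To make that last step concrete, here is a minimal sketch of calling the local API from Python with the requests library. It assumes the Ollama server is running on its default port and that the llama2 model has already been pulled; the model name and prompt are just placeholders.

```python
import requests

# Ask the local Ollama server for a single, non-streamed completion.
# Assumes the server is running on the default port and that the
# "llama2" model has already been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Explain what a local LLM is in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```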
The Ollama Server
The Ollama server is the core component that manages everything from model loading to inference.
When you install Ollama, it creates a background service that runs on your machine. This service handles:
- Model management (downloading, storing, and updating models)
- Exposing a REST API (accessible at http://localhost:11434 by default, and used in the examples below)
- Handling inference requests
- Managing model configurations
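As a small illustration of the model-management side, the sketch below asks the background service which models it currently has on disk. It assumes the default address and the /api/tags listing endpoint; response field names may vary slightly between versions.

```python
import requests

# Query the background service for the models it currently stores locally.
resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size_gb = model["size"] / 1e9  # reported size is in bytes
    print(f"{model['name']}: {size_gb:.1f} GB")
```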
Models and Model Library
Ollama provides access to a growing library of open-source LLMs that can be run locally.
Some popular models available through Ollama include:
- Llama 2 (7B, 13B)
- Mistral (7B)
- Gemma (2B, 7B)
- Phi-2
- Falcon
Each model has different capabilities, parameter sizes, and hardware requirements.
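To use one of these models, you first pull it by name. On the command line that is simply `ollama pull mistral`; the sketch below does the same thing through the local API, assuming a /api/pull endpoint on the default port (the request field names may differ between Ollama versions).

```python
import requests

# Download a model from the Ollama library through the local server.
# Roughly equivalent to running `ollama pull mistral` on the command line.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "mistral", "stream": False},  # wait until the pull finishes
    timeout=None,  # large downloads can take a while
)
resp.raise_for_status()
print(resp.json().get("status"))  # typically "success" once the model is on disk
```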
Modelfiles and Customization
Ollama allows you to customize models through Modelfiles, similar to how Docker uses Dockerfiles.
A Modelfile lets you:
- Create custom model configurations
- Set a specific system prompt
- Adjust inference parameters such as temperature or context length
- Create specialized model variants (see the sketch after this list)
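As a rough sketch of what that looks like in practice (the base model, parameter value, and variant name below are placeholders), you can write a small Modelfile and register it with the `ollama create` command:

```python
import pathlib
import subprocess

# A minimal Modelfile: start from a base model, set a sampling parameter,
# and bake in a system prompt.
modelfile = """\
FROM llama2
PARAMETER temperature 0.3
SYSTEM You are a concise assistant that answers in plain English.
"""

pathlib.Path("Modelfile").write_text(modelfile)

# Register the customized variant with the local server under a new name.
subprocess.run(["ollama", "create", "concise-llama", "-f", "Modelfile"], check=True)
```

Once created, the new variant appears alongside the other local models and can be run like any of them.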
Interacting with Models
There are two main ways to interact with models in Ollama:
- Command Line Interface (CLI): for quick interactions and model management
- REST API: for integration with applications and more complex use cases (sketched below)
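The CLI is the quickest way to get started: `ollama run llama2` drops you into an interactive prompt, and commands like `ollama pull` and `ollama list` handle model management. For application code, the REST API is the better fit. Below is a sketch of a chat-style request, assuming the default port and the /api/chat endpoint.

```python
import requests

# Send a chat-style request to the local server. The endpoint takes a
# message history, which makes multi-turn conversations straightforward.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama2",
        "messages": [
            {"role": "user", "content": "Give me one tip for running LLMs locally."},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```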
Hardware Requirements
Running LLMs locally requires decent hardware. The minimum requirements depend on the model size:
- Small models (1-3B parameters): At least 8GB RAM
- Medium models (7B parameters): 16GB RAM recommended
- Large models (13B+ parameters): 32GB+ RAM recommended
GPU acceleration significantly improves performance, but many models can also run on CPU-only setups, albeit more slowly.
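To see roughly where these numbers come from, you can estimate the memory the weights alone need as parameter count times bytes per stored parameter. The bytes-per-parameter figures below are generic ballpark assumptions (heavily compressed weights at one end, 16-bit weights at the other), not Ollama-specific measurements, and they ignore working memory for the context.

```python
# Back-of-envelope memory estimate: weights need roughly
# (parameter count) x (bytes per stored parameter).
def estimate_weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

for params in (2, 7, 13):
    low = estimate_weight_memory_gb(params, 0.5)   # heavily compressed weights
    high = estimate_weight_memory_gb(params, 2.0)  # 16-bit weights
    print(f"{params}B parameters: roughly {low:.0f}-{high:.0f} GB for the weights alone")
```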