How to Run Llama 3 Locally with Ollama
Running large language models locally gives you complete privacy, zero API costs, and the freedom to experiment without rate limits. If you want to run Llama 3 locally with Ollama, you can have a working setup in under 10 minutes.
Ollama wraps the complexity of model management into a simple CLI tool. You pull models like Docker images, run them from the terminal, and optionally expose them as a local REST API. This guide walks you through the entire process — from installation to calling Llama 3 from your Python or PHP code.
Step 1: Install Ollama on Your System
Ollama supports macOS, Linux, and Windows. The installation is straightforward.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com and run it.
After installation, verify it works:
ollama --version
You should see something like ollama version 0.1.x. If you’re on Linux, the installer also sets up Ollama as a systemd service that starts automatically.
Step 2: Pull and Run Llama 3
Ollama manages models through a simple pull command. To download Llama 3:
ollama pull llama3
This downloads the 8B parameter version (about 4.7GB). For the larger 70B model:
ollama pull llama3:70b
Once downloaded, run it interactively:
ollama run llama3
You now have a chat interface in your terminal. Type your prompts and get responses. Type /bye or press Ctrl+D to exit.
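You can also skip the interactive session and pass a prompt as an argument for a one-shot answer:

ollama run llama3 "Explain recursion in one sentence"

This prints the response and exits, which is handy in shell scripts.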
To list all downloaded models:
ollama list
To remove a model you no longer need:
ollama rm llama3:70b
Step 3: Expose Llama 3 as a Local API
Ollama automatically runs an API server on http://localhost:11434. If the server isn’t running, start it manually:
ollama serve
Test the API with curl:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain recursion in one sentence",
  "stream": false
}'
The response comes back as JSON with the generated text in the response field. Setting stream: false returns the complete response at once. For real-time streaming, set it to true and handle newline-delimited JSON chunks.
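If you want streaming in code rather than curl, here is a minimal Python sketch (assuming the default localhost endpoint) that reads the newline-delimited chunks as they arrive:

import json
import requests

# Stream tokens from the local Ollama server as they are generated.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain recursion in one sentence", "stream": True},
    stream=True,
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # each non-empty line is one JSON object
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):  # the final chunk sets done to true
            print()
            break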
To check which models are available via the API:
curl http://localhost:11434/api/tags
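The response is a JSON object with a models array; each entry's name field matches what ollama list shows in the terminal.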
Step 4: Call Llama 3 from Python
Python integration is clean and requires no special SDK — just the requests library.
import requests

def ask_llama(prompt: str) -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Example usage
answer = ask_llama("What are the SOLID principles in software design?")
print(answer)
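The generate endpoint also accepts an optional options object for sampling parameters. A quick sketch (the values here are illustrative, not recommendations):

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the CAP theorem in two sentences",
        "stream": False,
        "options": {
            "temperature": 0.2,  # lower values make output more deterministic
            "num_predict": 128   # cap the number of generated tokens
        }
    }
)
print(response.json()["response"])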
For chat-style conversations with message history:
import requests

def chat_with_llama(messages: list) -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": messages,
            "stream": False
        }
    )
    return response.json()["message"]["content"]

# Example with context
conversation = [
    {"role": "user", "content": "My name is Yuvraj"},
    {"role": "assistant", "content": "Nice to meet you, Yuvraj!"},
    {"role": "user", "content": "What's my name?"}
]
reply = chat_with_llama(conversation)
print(reply)  # Should remember "Yuvraj"
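The messages list also accepts a system role for instructions that should apply to the whole conversation. Reusing chat_with_llama from above:

conversation = [
    {"role": "system", "content": "You are a concise assistant. Answer in one sentence."},
    {"role": "user", "content": "What is idempotency in HTTP?"}
]
print(chat_with_llama(conversation))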
If you’ve worked with the Claude API in Python, you’ll find this pattern familiar. The main difference is you’re hitting localhost instead of a remote endpoint.
Step 5: Call Llama 3 from PHP
For PHP projects, use cURL to call the local API. This works in Laravel, plain PHP, or any framework.
<?php

function askLlama(string $prompt): string
{
    $ch = curl_init('http://localhost:11434/api/generate');
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
        CURLOPT_POSTFIELDS => json_encode([
            'model' => 'llama3',
            'prompt' => $prompt,
            'stream' => false
        ])
    ]);

    $response = curl_exec($ch);
    curl_close($ch);

    if ($response === false) {
        return ''; // curl_exec() returns false when the server is unreachable
    }

    $data = json_decode($response, true);

    return $data['response'] ?? '';
}

// Example usage
$answer = askLlama('Write a PHP function to validate an email address');
echo $answer;
In Laravel, you can wrap this in a service class and use the HTTP facade:
<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;

class OllamaService
{
    public function generate(string $prompt, string $model = 'llama3'): string
    {
        $response = Http::post('http://localhost:11434/api/generate', [
            'model' => $model,
            'prompt' => $prompt,
            'stream' => false,
        ]);

        return $response->json('response');
    }
}
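Type-hint OllamaService in a controller constructor and Laravel's container will inject it, or resolve it on the fly with app(OllamaService::class)->generate($prompt). For long generations, consider Http::timeout(120)->post(...), since Laravel's HTTP client times out after 30 seconds by default.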
If you’re building APIs that interact with LLMs, check out the guide on building RESTful APIs with PHP and Laravel for structuring your endpoints properly.
Performance Tips and Troubleshooting
Slow responses? Without a supported GPU, Llama 3 runs on the CPU. Ollama uses an NVIDIA GPU automatically when one is available (and Apple Silicon's GPU via Metal on macOS). Check GPU usage with nvidia-smi while running a prompt.
Out of memory? The 8B model needs roughly 8GB RAM. Close other applications or use a quantized version:
ollama pull llama3:8b-instruct-q4_0
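The q4_0 suffix denotes 4-bit quantization, which shrinks the memory footprint at some cost in output quality.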
API not responding? Make sure the Ollama service is running:
# Linux
sudo systemctl status ollama
# macOS/Windows
ollama serve
Want to expose the API to other machines? Set the host before starting:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
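Keep in mind that the API has no built-in authentication, so only bind to 0.0.0.0 on a trusted network, or put an authenticating reverse proxy in front of it.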
Key Takeaways
- Install Ollama with one command, pull Llama 3, and start chatting in minutes
- The local API runs on port 11434 and accepts standard HTTP requests — no SDK required
- Python and PHP integration is simple: POST JSON, get JSON back
- GPU acceleration happens automatically if available; use quantized models if RAM is tight