How to Run Llama 3 Locally with Ollama

Running large language models locally gives you complete privacy, zero API costs, and the freedom to experiment without rate limits. If you want to run Llama 3 locally with Ollama, you can have a working setup in under 10 minutes.

Ollama wraps the complexity of model management into a simple CLI tool. You pull models like Docker images, run them from the terminal, and optionally expose them as a local REST API. This guide walks you through the entire process — from installation to calling Llama 3 from your Python or PHP code.

Step 1: Install Ollama on Your System

Ollama supports macOS, Linux, and Windows. The installation is straightforward.

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com and run it.

After installation, verify it works:

ollama --version

You should see something like ollama version 0.1.x. If you’re on Linux, the installer also sets up Ollama as a systemd service that starts automatically.

Step 2: Pull and Run Llama 3

Ollama manages models through a simple pull command. To download Llama 3:

ollama pull llama3

This downloads the 8B parameter version (about 4.7GB). For the larger 70B model:

ollama pull llama3:70b

Once downloaded, run it interactively:

ollama run llama3

You now have a chat interface in your terminal. Type your prompts and get responses. Type /bye or press Ctrl+D to exit.

To list all downloaded models:

ollama list

To remove a model you no longer need:

ollama rm llama3:70b

Step 3: Expose Llama 3 as a Local API

Ollama automatically runs an API server on http://localhost:11434. If the server isn’t running, start it manually:

ollama serve

Test the API with curl:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain recursion in one sentence",
  "stream": false
}'

The response comes back as JSON with the generated text in the response field. Setting stream: false returns the complete response at once. For real-time streaming, set it to true and handle newline-delimited JSON chunks.
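The streaming case can be sketched in Python: each line of the response body is a standalone JSON object with a response fragment, and the final chunk has "done": true. This is a minimal sketch against the same /api/generate endpoint shown above; the helper name iter_pieces is ours, not part of Ollama.

```python
import json
import requests

def iter_pieces(lines):
    """Yield the 'response' text from newline-delimited JSON chunk lines."""
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        yield chunk.get("response", "")
        if chunk.get("done"):
            return

def stream_generate(prompt, model="llama3"):
    """Print a completion piece-by-piece as the local server streams it."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for piece in iter_pieces(resp.iter_lines()):
            print(piece, end="", flush=True)
    print()
```

Call stream_generate("Explain recursion in one sentence") and the answer appears incrementally instead of all at once.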

To check which models are available via the API:

curl http://localhost:11434/api/tags

Step 4: Call Llama 3 from Python

Python integration is clean and requires no special SDK — just the requests library.

import requests

def ask_llama(prompt: str) -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Example usage
answer = ask_llama("What are the SOLID principles in software design?")
print(answer)
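The generate endpoint also accepts an options object for sampling parameters such as temperature and num_predict (the token limit). Here's a variant of the helper with those exposed; the default values are illustrative, and build_payload is just a name we chose to keep the request construction testable.

```python
import requests

def build_payload(prompt, model="llama3", temperature=0.7, max_tokens=256):
    """Build an /api/generate request body with explicit sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,  # higher = more random output
            "num_predict": max_tokens,   # cap on generated tokens
        },
    }

def ask_llama(prompt, **kwargs):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json=build_payload(prompt, **kwargs),
    )
    response.raise_for_status()
    return response.json()["response"]
```

Lower the temperature (e.g. 0.2) for more deterministic answers to factual questions.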

For chat-style conversations with message history:

import requests

def chat_with_llama(messages: list) -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": messages,
            "stream": False
        }
    )
    return response.json()["message"]["content"]

# Example with context
conversation = [
    {"role": "user", "content": "My name is Yuvraj"},
    {"role": "assistant", "content": "Nice to meet you, Yuvraj!"},
    {"role": "user", "content": "What's my name?"}
]

reply = chat_with_llama(conversation)
print(reply)  # Should remember "Yuvraj"
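Rebuilding the messages list by hand on every turn gets tedious, so you may want a small wrapper that accumulates history for you. This Conversation class is our own sketch, not part of Ollama; the send callable is injectable so the history logic can be tested without a running server.

```python
import requests

class Conversation:
    """Accumulates chat history and sends it to /api/chat on every turn."""

    def __init__(self, model="llama3", send=None):
        self.model = model
        self.messages = []
        # send is injectable for testing; defaults to the local HTTP call
        self.send = send or self._send_http

    def _send_http(self, payload):
        response = requests.post("http://localhost:11434/api/chat", json=payload)
        response.raise_for_status()
        return response.json()["message"]["content"]

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        payload = {"model": self.model, "messages": self.messages, "stream": False}
        reply = self.send(payload)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With this, chat = Conversation(); chat.ask("My name is Yuvraj"); chat.ask("What's my name?") carries context across turns automatically.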

If you’ve worked with the Claude API in Python, you’ll find this pattern familiar. The main difference is you’re hitting localhost instead of a remote endpoint.

Step 5: Call Llama 3 from PHP

For PHP projects, use cURL to call the local API. This works in Laravel, plain PHP, or any framework.

<?php

function askLlama(string $prompt): string
{
    $ch = curl_init('http://localhost:11434/api/generate');
    
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
        CURLOPT_POSTFIELDS => json_encode([
            'model' => 'llama3',
            'prompt' => $prompt,
            'stream' => false
        ])
    ]);
    
    $response = curl_exec($ch);
    curl_close($ch);
    
    $data = json_decode($response, true);
    return $data['response'] ?? '';
}

// Example usage
$answer = askLlama('Write a PHP function to validate an email address');
echo $answer;

In Laravel, you can wrap this in a service class and use the HTTP facade:

<?php

namespace App\Services;

use Illuminate\Support\Facades\Http;

class OllamaService
{
    public function generate(string $prompt, string $model = 'llama3'): string
    {
        $response = Http::post('http://localhost:11434/api/generate', [
            'model' => $model,
            'prompt' => $prompt,
            'stream' => false,
        ]);
        
        return $response->json('response') ?? '';
    }
}

If you’re building APIs that interact with LLMs, check out the guide on building RESTful APIs with PHP and Laravel for structuring your endpoints properly.

Performance Tips and Troubleshooting

Slow responses? Without a supported GPU, Llama 3 runs on the CPU. Ollama uses NVIDIA GPUs (and Apple Silicon via Metal) automatically when available. Check GPU usage with nvidia-smi while running a prompt.

Out of memory? The 8B model needs roughly 8GB RAM. Close other applications or use a quantized version:

ollama pull llama3:8b-instruct-q4_0

API not responding? Make sure the Ollama service is running:

# Linux
sudo systemctl status ollama

# macOS/Windows
ollama serve
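From application code, you can check both conditions at once before making your first request. This sketch uses the /api/tags endpoint shown earlier; the function name ollama_ready is ours.

```python
import requests

def ollama_ready(model="llama3", base_url="http://localhost:11434"):
    """Return True if the Ollama server responds and the model is pulled."""
    try:
        response = requests.get(f"{base_url}/api/tags", timeout=2)
        response.raise_for_status()
    except requests.RequestException:
        return False  # server down, unreachable, or returned an error
    names = [m.get("name", "") for m in response.json().get("models", [])]
    # tags come back fully qualified, e.g. "llama3:latest"
    return any(name.split(":")[0] == model.split(":")[0] for name in names)
```

Calling ollama_ready() at startup lets you fail fast with a clear error instead of a connection timeout mid-request.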

Want to expose the API to other machines? Set the host before starting:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Key Takeaways

  • Install Ollama with one command, pull Llama 3, and start chatting in minutes
  • The local API runs on port 11434 and accepts standard HTTP requests — no SDK required
  • Python and PHP integration is simple: POST JSON, get JSON back
  • GPU acceleration happens automatically if available; use quantized models if RAM is tight