The Swarms API supports vision-enabled agents that can analyze and understand images. This guide shows you how to send a base64-encoded image to an agent and ask it to identify the location.
In this example, we’ll send an image of Hong Kong to an agent and ask “What city is this?”

Step 1: Get Your API Key

Before you can use the Swarms API, you need to obtain an API key.
  1. Visit https://swarms.world/platform/api-keys
  2. Sign in or create an account
  3. Generate a new API key
  4. Copy and save your API key securely
Keep your API key secure and never commit it to version control. Use environment variables to store it.
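As a minimal sketch of the environment-variable approach, the helper below reads the key and fails fast when it is missing. The function name `load_api_key` is hypothetical and not part of any Swarms client; only the `SWARMS_API_KEY` variable name comes from this guide.

```python
import os


def load_api_key(env_var: str = "SWARMS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises RuntimeError if the variable is unset or empty, so a missing key
    is reported immediately instead of surfacing later as an auth error.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell before running."
        )
    return api_key
```

You would set the variable once in your shell (for example `export SWARMS_API_KEY=...`) and call `load_api_key()` at the top of your script.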

Step 2: Prepare Your Image (Base64 Encoding)

The Swarms API accepts images as base64-encoded strings. Here’s how to convert an image to base64:
import base64
import requests

# Method 1: From a URL
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
response = requests.get(hong_kong_url, timeout=30)
response.raise_for_status()  # Fail early instead of encoding an HTTP error page
base64_image = base64.b64encode(response.content).decode('utf-8')

print(f"Base64 string length: {len(base64_image)}")
print(f"First 100 characters: {base64_image[:100]}...")

# Method 2: From a local file
with open("path/to/image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
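The two methods above can be wrapped in small reusable helpers. This is a sketch using only the standard library; the names `encode_image_bytes`, `encode_image_file`, and `is_valid_base64` are hypothetical and chosen for illustration.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Base64-encode raw image bytes and return the result as text."""
    if not data:
        raise ValueError("image data is empty")
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path: str) -> str:
    """Read a local image file and return its base64 encoding."""
    return encode_image_bytes(Path(path).read_bytes())


def is_valid_base64(text: str) -> bool:
    """Return True if the text decodes as strict base64."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (ValueError, TypeError):
        return False
```

Checking `is_valid_base64(base64_image)` before sending a request is a cheap way to catch strings that were truncated or corrupted along the way.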

Step 3: Send the Image to the Agent

Now that you have your API key and base64-encoded image, you can send it to the Swarms API.
import requests
import base64
import os

# Step 1: Set your API key
API_KEY = os.getenv("SWARMS_API_KEY")
BASE_URL = "https://api.swarms.world"

# Step 2: Encode the Hong Kong image to base64
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
img_response = requests.get(hong_kong_url, timeout=30)
img_response.raise_for_status()  # Stop here if the image download failed
base64_image = base64.b64encode(img_response.content).decode('utf-8')

# Step 3: Prepare the API request
headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}

payload = {
    "agent_config": {
        "agent_name": "Vision Analyst",
        "description": "AI agent with image analysis capabilities",
        "system_prompt": "You are a vision analyst that can identify and describe images accurately.",
        "model_name": "gpt-4o",
        "max_tokens": 2048,
        "temperature": 0.5
    },
    "task": "What city is this?",
    "img": base64_image
}

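# Make the API request
response = requests.post(f"{BASE_URL}/v1/agent/completions", headers=headers, json=payload, timeout=60)
response.raise_for_status()  # Surface HTTP errors before parsing the body
result = response.json()

# Display the result
print("Agent Response:")
print(result['outputs'][0]['content'])
# Example output (exact wording varies between runs): "This city is Hong Kong."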

Expected Response

{
  "job_id": "agent-09a9c0f9ba19419abf64f5538b4b7d59",
  "success": true,
  "name": "Vision Analyst",
  "description": "AI agent with image analysis capabilities",
  "temperature": 0.5,
  "outputs": [
    {
      "role": "Vision Analyst",
      "content": "This image is of Hong Kong. The skyline, Victoria Harbour, and the distinctive tall buildings such as the International Finance Centre (IFC) and International Commerce Centre (ICC) are prominent features of Hong Kong.",
      "timestamp": "2026-02-05T00:58:15.121915",
      "message_id": "85b70f9d-e281-4a98-b9ad-5a2d39f8ffcb"
    }
  ],
  "usage": {
    "input_tokens": 57,
    "output_tokens": 101,
    "total_tokens": 158,
    "img_cost": 0.25,
    "total_cost": 0.252239
  },
  "timestamp": "2026-02-05T00:58:15.252392+00:00"
}
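To pull the useful fields out of a response shaped like the one above, a small helper can guard against missing keys. This is a sketch: `summarize_response` is a hypothetical name, and the field names are taken from the example response rather than from a formal schema.

```python
def summarize_response(result: dict) -> dict:
    """Extract the agent's text and usage totals from a response dict.

    Missing fields fall back to empty or None values instead of raising,
    so the helper is safe to call on error payloads too.
    """
    outputs = result.get("outputs") or []
    text = outputs[0].get("content", "") if outputs else ""
    usage = result.get("usage") or {}
    return {
        "success": bool(result.get("success")),
        "text": text,
        "total_tokens": usage.get("total_tokens"),
        "total_cost": usage.get("total_cost"),
    }
```

Calling `summarize_response(response.json())` returns a flat dictionary that is easier to log or store than the full payload.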

Complete Working Example

Here’s a complete Python script you can run:
import requests
import base64
import os

# Step 1: Set your API key
API_KEY = os.getenv("SWARMS_API_KEY")

if not API_KEY:
    # exit() is only guaranteed in interactive sessions; SystemExit works everywhere
    raise SystemExit(
        "Error: Please set the SWARMS_API_KEY environment variable.\n"
        "Get your API key at: https://swarms.world/platform/api-keys"
    )

# Step 2: Encode the Hong Kong image to base64
print("Encoding image...")
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
img_response = requests.get(hong_kong_url, timeout=30)
img_response.raise_for_status()  # Stop here if the image download failed
base64_image = base64.b64encode(img_response.content).decode('utf-8')
print(f"✓ Image encoded (length: {len(base64_image)} characters)")

# Step 3: Send to Swarms API
print("\nAnalyzing image with AI agent...")

BASE_URL = "https://api.swarms.world"

headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}

payload = {
    "agent_config": {
        "agent_name": "Vision Analyst",
        "description": "AI agent with image analysis capabilities",
        "system_prompt": "You are a vision analyst that can identify and describe images accurately.",
        "model_name": "gpt-4o",
        "max_tokens": 2048,
        "temperature": 0.5
    },
    "task": "What city is this?",
    "img": base64_image
}

response = requests.post(f"{BASE_URL}/v1/agent/completions", headers=headers, json=payload, timeout=60)
response.raise_for_status()  # Surface HTTP errors before parsing the body
result = response.json()

# Display results
print("\n" + "="*60)
print("AGENT RESPONSE:")
print("="*60)
print(result['outputs'][0]['content'])
print("\n" + "="*60)
print("USAGE STATS:")
print("="*60)
print(f"Total tokens: {result['usage']['total_tokens']}")
print(f"Image cost: ${result['usage']['img_cost']}")
print(f"Total cost: ${result['usage']['total_cost']}")

Image Format Support

The API supports common image formats:
  • JPEG/JPG: Standard photo format
  • PNG: Images with transparency
  • GIF: Static GIFs (first frame)
  • WebP: Modern web image format
All images must be base64-encoded strings. The API automatically detects the image format.
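Although the API detects the format itself, you may want a client-side check before uploading. The sketch below inspects the well-known file signatures of the four formats listed above. It is a local pre-check under that assumption, not a description of how the API performs its detection, and `detect_image_format` is a hypothetical name.

```python
import base64
import binascii
from typing import Optional


def detect_image_format(base64_image: str) -> Optional[str]:
    """Guess the image format from the first bytes of a base64 string.

    Returns "jpeg", "png", "gif", "webp", or None if nothing matches.
    """
    try:
        # 32 base64 characters decode to 24 bytes, enough for every signature.
        header = base64.b64decode(base64_image[:32])
    except (binascii.Error, ValueError):
        return None
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None
```

A `None` result usually means the bytes are not an image at all, for example an HTML error page returned by a failed download.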

Vision-Capable Models

Not all models support vision. Use these models for image analysis:
  • gpt-4o: Best for complex visual analysis
  • gpt-4o-mini: Cost-effective for basic vision tasks
  • claude-sonnet-4-20250514: High-quality vision understanding
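A simple guard can catch a non-vision model before the request is sent. This sketch hard-codes the three model names listed above; the list may not be exhaustive, and `require_vision_model` is a hypothetical helper.

```python
# Model names copied from this guide; the API may support others.
VISION_MODELS = frozenset({
    "gpt-4o",
    "gpt-4o-mini",
    "claude-sonnet-4-20250514",
})


def require_vision_model(model_name: str) -> str:
    """Return model_name unchanged, or raise ValueError if it is not listed."""
    if model_name not in VISION_MODELS:
        allowed = ", ".join(sorted(VISION_MODELS))
        raise ValueError(
            f"{model_name!r} is not a listed vision model. Use one of: {allowed}"
        )
    return model_name
```

You could use it when building the payload, for example `"model_name": require_vision_model("gpt-4o")`.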

Best Practices

  1. Image Size: Optimize images before encoding (recommended max: 4096x4096 pixels)
  2. Compression: Use JPEG for photos, PNG for screenshots/graphics
  3. Quality: Balance image quality with file size for faster processing
  4. Specific Questions: Ask clear, specific questions for better results
  5. Token Usage: Larger or higher-resolution images consume more tokens
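For the image-size recommendation, the sketch below computes target dimensions that fit within the 4096-pixel limit while preserving the aspect ratio. It only does the arithmetic; the resize itself is left to an imaging library of your choice (for example Pillow's `Image.resize`). The name `fit_within` is hypothetical.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 4096) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images that already fit are returned unchanged; they are never upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, an 8192x4096 image maps to 4096x2048, while a 1024x768 image is left as it is.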

Cost Considerations

Vision tasks consume additional tokens based on image resolution:
| Image Size | Approximate Tokens |
|------------|--------------------|
| 512x512    | ~85 tokens         |
| 1024x1024  | ~170 tokens        |
| 2048x2048  | ~340 tokens        |
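If you want a rough budget before sending an image, the figures above can be turned into a lookup. This is only a sketch: rounding an image up to the next listed size is an assumption made here for illustration, not documented API behaviour, and sizes beyond the table return `None` rather than a guess. The name `approximate_image_tokens` is hypothetical.

```python
from typing import Optional

# (longest side in pixels, approximate tokens), copied from the table above.
DOCUMENTED_TOKENS = ((512, 85), (1024, 170), (2048, 340))


def approximate_image_tokens(width: int, height: int) -> Optional[int]:
    """Return the approximate token count for the smallest listed size
    that covers the image, or None if the image exceeds every listed size.

    ASSUMPTION: in-between sizes are rounded up to the next listed tier.
    """
    longest = max(width, height)
    for side, tokens in DOCUMENTED_TOKENS:
        if longest <= side:
            return tokens
    return None
```

Treat the result as a planning estimate; the `usage` block in the actual response remains the authoritative figure.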

Troubleshooting

Common Issues

Issue: “Invalid image format”
  • Solution: Ensure your image is properly base64-encoded using base64.b64encode()
Issue: “Image too large”
  • Solution: Resize the image to under 4096x4096 pixels or reduce quality
Issue: “Model doesn’t support vision”
  • Solution: Use gpt-4o, gpt-4o-mini, or claude-sonnet-4-20250514
Issue: “High token usage”
  • Solution: Reduce image resolution or use gpt-4o-mini for basic tasks

Error Handling

# Assumes requests is imported and BASE_URL, headers, and payload are defined as in Step 3
try:
    response = requests.post(
        f"{BASE_URL}/v1/agent/completions",
        headers=headers,
        json=payload,
        timeout=60
    )

    if response.status_code == 200:
        result = response.json()
        print("Success:", result['outputs'][0]['content'])
    elif response.status_code == 400:
        print("Error: Invalid image format or base64 encoding")
    elif response.status_code == 413:
        print("Error: Image too large - please reduce size")
    else:
        print(f"Error {response.status_code}: {response.text}")

except requests.exceptions.Timeout:
    print("Request timed out - image may be too large")
except requests.exceptions.RequestException as e:
    # Connection failures and other transport-level errors
    print(f"Request failed: {e}")
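The status-code branches above can also be factored into a pure function, which makes them easy to unit test without network access. This is a sketch: `explain_status` is a hypothetical helper, and the meanings attached to 400 and 413 are taken from the example above rather than from a formal error specification.

```python
def explain_status(status_code: int, body: str = "") -> str:
    """Translate an HTTP status code into a short, actionable message."""
    known = {
        400: "Invalid image format or base64 encoding",
        413: "Image too large - please reduce size",
    }
    if status_code == 200:
        return "Success"
    if status_code in known:
        return f"Error: {known[status_code]}"
    # Fall back to the raw status and response body for anything else.
    return f"Error {status_code}: {body}"
```

In the request code you would call `print(explain_status(response.status_code, response.text))` for any non-200 response.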

Next Steps

  • Try analyzing your own images by replacing the image URL
  • Experiment with different questions and prompts
  • Check out the API Reference for more details