The Swarms API supports vision-enabled agents that can analyze and understand images. This guide shows you how to send a base64-encoded image to an agent and ask it to identify the location.
In this example, we’ll send an image of Hong Kong to an agent and ask “What city is this?”
Step 1: Get Your API Key
Before you can use the Swarms API, you need to obtain an API key.
- Visit https://swarms.world/platform/api-keys
- Sign in or create an account
- Generate a new API key
- Copy and save your API key securely
Keep your API key secure and never commit it to version control. Use environment variables to store it.
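As a minimal sketch of that advice, the snippet below reads the key from the `SWARMS_API_KEY` environment variable (the name used later in this guide) and fails fast if it is missing. The helper name `build_headers` is hypothetical and not part of any Swarms SDK.

```python
import os


def build_headers(env=None):
    """Return request headers built from the SWARMS_API_KEY environment variable.

    `build_headers` is a hypothetical helper used for illustration only.
    """
    env = os.environ if env is None else env
    api_key = env.get("SWARMS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("SWARMS_API_KEY is not set")
    return {"x-api-key": api_key, "Content-Type": "application/json"}
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.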
Step 2: Prepare Your Image (Base64 Encoding)
The Swarms API accepts images as base64-encoded strings. Here’s how to convert an image to base64:
The examples below are given in Python, JavaScript (Node.js), and Bash/cURL, in that order.
import base64
import requests
# Method 1: From a URL
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
response = requests.get(hong_kong_url)
base64_image = base64.b64encode(response.content).decode('utf-8')
print(f"Base64 string length: {len(base64_image)}")
print(f"First 100 characters: {base64_image[:100]}...")
# Method 2: From a local file
with open("path/to/image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
// Method 1: From a URL (Node.js)
async function encodeImageFromUrl(imageUrl) {
  const response = await fetch(imageUrl);
  const buffer = await response.arrayBuffer();
  return Buffer.from(buffer).toString('base64');
}
// Method 2: From a local file (Node.js)
const fs = require('fs');
function encodeImageFromFile(imagePath) {
  const imageBuffer = fs.readFileSync(imagePath);
  return imageBuffer.toString('base64');
}
// Example: Encode Hong Kong image
const hongKongUrl = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto";
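// Wrapped in an async function: top-level await is not available in CommonJS scripts that use require()
async function main() {
  const base64Image = await encodeImageFromUrl(hongKongUrl);
  console.log(`Base64 string length: ${base64Image.length}`);
  console.log(`First 100 characters: ${base64Image.substring(0, 100)}...`);
}
main().catch(console.error);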
# From a URL
curl -s "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto" | base64 > hongkong_base64.txt
# From a local file
base64 -i /path/to/image.jpg > image_base64.txt
# View first 100 characters
head -c 100 hongkong_base64.txt
Step 3: Send the Image to the Agent
Now that you have your API key and base64-encoded image, you can send it to the Swarms API.
import requests
import base64
import os
# Step 1: Set your API key
API_KEY = os.getenv("SWARMS_API_KEY")
BASE_URL = "https://api.swarms.world"
# Step 2: Encode the Hong Kong image to base64
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
img_response = requests.get(hong_kong_url)
base64_image = base64.b64encode(img_response.content).decode('utf-8')
# Step 3: Prepare the API request
headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}
payload = {
    "agent_config": {
        "agent_name": "Vision Analyst",
        "description": "AI agent with image analysis capabilities",
        "system_prompt": "You are a vision analyst that can identify and describe images accurately.",
        "model_name": "gpt-4o",
        "max_tokens": 2048,
        "temperature": 0.5
    },
    "task": "What city is this?",
    "img": base64_image
}
# Make the API request
response = requests.post(f"{BASE_URL}/v1/agent/completions", headers=headers, json=payload)
result = response.json()
# Display the result
print("Agent Response:")
print(result['outputs'][0]['content'])
# Example output (exact wording will vary): "This city is Hong Kong."
const API_KEY = process.env.SWARMS_API_KEY;
const BASE_URL = "https://api.swarms.world";
const headers = {
"x-api-key": API_KEY,
"Content-Type": "application/json"
};
// Encode the Hong Kong image from Step 2
const hongKongUrl = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto";
async function analyzeImage() {
  // Fetch and encode the image
  const imageResponse = await fetch(hongKongUrl);
  const arrayBuffer = await imageResponse.arrayBuffer();
  const base64Image = Buffer.from(arrayBuffer).toString('base64');
  // Prepare the payload
  const payload = {
    agent_config: {
      agent_name: "Vision Analyst",
      description: "AI agent with image analysis capabilities",
      system_prompt: "You are a vision analyst that can identify and describe images accurately.",
      model_name: "gpt-4o",
      max_tokens: 2048,
      temperature: 0.5
    },
    task: "What city is this?",
    img: base64Image
  };
  // Make the API request
  const response = await fetch(`${BASE_URL}/v1/agent/completions`, {
    method: 'POST',
    headers: headers,
    body: JSON.stringify(payload)
  });
  const result = await response.json();
  console.log("Agent Response:");
  console.log(result.outputs[0].content);
  // Expected output: "This image is of Hong Kong. The skyline, Victoria Harbour..."
}
analyzeImage();
# First, encode the image to base64 (strip the line breaks some base64 tools insert, which would make the JSON invalid)
BASE64_IMAGE=$(curl -s "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto" | base64 | tr -d '\n')
# Make the API request
curl -X POST "https://api.swarms.world/v1/agent/completions" \
  -H "x-api-key: $SWARMS_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"agent_config\": {
      \"agent_name\": \"Vision Analyst\",
      \"description\": \"AI agent with image analysis capabilities\",
      \"system_prompt\": \"You are a vision analyst that can identify and describe images accurately.\",
      \"model_name\": \"gpt-4o\",
      \"max_tokens\": 2048,
      \"temperature\": 0.5
    },
    \"task\": \"What city is this?\",
    \"img\": \"${BASE64_IMAGE}\"
  }"
Expected Response
{
"job_id": "agent-09a9c0f9ba19419abf64f5538b4b7d59",
"success": true,
"name": "Vision Analyst",
"description": "AI agent with image analysis capabilities",
"temperature": 0.5,
"outputs": [
{
"role": "Vision Analyst",
"content": "This image is of Hong Kong. The skyline, Victoria Harbour, and the distinctive tall buildings such as the International Finance Centre (IFC) and International Commerce Centre (ICC) are prominent features of Hong Kong.",
"timestamp": "2026-02-05T00:58:15.121915",
"message_id": "85b70f9d-e281-4a98-b9ad-5a2d39f8ffcb"
}
],
"usage": {
"input_tokens": 57,
"output_tokens": 101,
"total_tokens": 158,
"img_cost": 0.25,
"total_cost": 0.252239
},
"timestamp": "2026-02-05T00:58:15.252392+00:00"
}
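Indexing straight into `result['outputs'][0]['content']` raises a `KeyError` or `IndexError` if the request failed. The sketch below pulls out the same fields defensively. It assumes the field names shown in the sample response above; `summarize_response` is a hypothetical helper, not part of the API.

```python
def summarize_response(result):
    """Extract the agent's text and usage figures from a parsed response dict.

    Field names follow the sample response in this guide; the helper itself
    is hypothetical.
    """
    outputs = result.get("outputs") or []
    if not result.get("success") or not outputs:
        raise ValueError(f"Unexpected response: {result!r}")
    usage = result.get("usage", {})
    return {
        "content": outputs[0].get("content", ""),
        "total_tokens": usage.get("total_tokens"),
        "total_cost": usage.get("total_cost"),
    }
```

A call such as `summarize_response(response.json())` then either returns a small dict or raises a single, descriptive error.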
Complete Working Example
Here’s a complete Python script you can run:
import requests
import base64
import os
# Step 1: Set your API key
API_KEY = os.getenv("SWARMS_API_KEY")
if not API_KEY:
    print("Error: Please set SWARMS_API_KEY environment variable")
    print("Get your API key at: https://swarms.world/platform/api-keys")
    raise SystemExit(1)
# Step 2: Encode the Hong Kong image to base64
print("Encoding image...")
hong_kong_url = "https://ik.imgkit.net/3vlqs5axxjf/external/ik-seo/http://images.ntmllc.com/v4/destination/Hong-Kong/Hong-Kong-city/112086_SCN_HongKong_iStock466733790_Z8C705/Hong-Kong-Scenery.jpg?tr=w-680%2Ch-404%2Cfo-auto"
img_response = requests.get(hong_kong_url)
base64_image = base64.b64encode(img_response.content).decode('utf-8')
print(f"✓ Image encoded (length: {len(base64_image)} characters)")
# Step 3: Send to Swarms API
print("\nAnalyzing image with AI agent...")
BASE_URL = "https://api.swarms.world"
headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}
payload = {
    "agent_config": {
        "agent_name": "Vision Analyst",
        "description": "AI agent with image analysis capabilities",
        "system_prompt": "You are a vision analyst that can identify and describe images accurately.",
        "model_name": "gpt-4o",
        "max_tokens": 2048,
        "temperature": 0.5
    },
    "task": "What city is this?",
    "img": base64_image
}
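response = requests.post(f"{BASE_URL}/v1/agent/completions", headers=headers, json=payload, timeout=60)
response.raise_for_status()  # stop on HTTP errors instead of failing later on a missing key
result = response.json()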
# Display results
print("\n" + "="*60)
print("AGENT RESPONSE:")
print("="*60)
print(result['outputs'][0]['content'])
print("\n" + "="*60)
print("USAGE STATS:")
print("="*60)
print(f"Total tokens: {result['usage']['total_tokens']}")
print(f"Image cost: ${result['usage']['img_cost']}")
print(f"Total cost: ${result['usage']['total_cost']}")
The API supports common image formats:
- JPEG/JPG: Standard photo format
- PNG: Images with transparency
- GIF: Static GIFs (first frame)
- WebP: Modern web image format
All images must be base64-encoded strings. The API automatically detects the image format.
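Although the API detects the format itself, a quick client-side check can catch a wrong file before any request is made. The sketch below inspects the well-known leading signature bytes of each listed format. It is an illustration only and does not claim to mirror how the API performs its detection; `sniff_image_format` is a hypothetical name.

```python
import base64


def sniff_image_format(b64_string):
    """Guess the image format from the first decoded bytes (client-side check only)."""
    # 24 base64 characters decode to 18 bytes, enough for every signature below.
    head = base64.b64decode(b64_string[:24])
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    return None
```

A return value of `None` suggests the data is either not an image in one of these four formats or was not encoded correctly.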
Vision-Capable Models
Not all models support vision. Use these models for image analysis:
- gpt-4o: Best for complex visual analysis
- gpt-4o-mini: Cost-effective for basic vision tasks
- claude-sonnet-4-20250514: High-quality vision understanding
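One way to avoid sending an image to a text-only model is to check the model name before building the payload. The sketch below hard-codes only the three models named in this guide; the API may accept others, so treat the set as an assumption to keep up to date. `require_vision_model` is a hypothetical helper.

```python
# Models this guide lists as vision-capable (assumption: not an exhaustive list).
VISION_MODELS = {"gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"}


def require_vision_model(model_name):
    """Return model_name unchanged, or raise if it is not in VISION_MODELS."""
    if model_name not in VISION_MODELS:
        allowed = ", ".join(sorted(VISION_MODELS))
        raise ValueError(f"{model_name!r} is not a listed vision model; try one of: {allowed}")
    return model_name
```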
Best Practices
- Image Size: Optimize images before encoding (recommended max: 4096x4096 pixels)
- Compression: Use JPEG for photos, PNG for screenshots/graphics
- Quality: Balance image quality with file size for faster processing
- Specific Questions: Ask clear, specific questions for better results
- Token Usage: Larger/higher resolution images consume more tokens
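File size matters twice here, because base64 encoding itself inflates the data: every 3 bytes become 4 characters. The short sketch below computes the encoded length for a given file size, which can help when deciding how far to compress an image. `base64_length` is a hypothetical helper name.

```python
import base64


def base64_length(num_bytes):
    """Length in characters of the padded base64 encoding of num_bytes bytes."""
    return 4 * ((num_bytes + 2) // 3)


# Sanity check against the standard library for a small input.
assert base64_length(10) == len(base64.b64encode(b"x" * 10))
```

For example, `base64_length(300_000)` gives 400,000 characters, so a 300 KB JPEG adds roughly 400 KB to the request body.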
Cost Considerations
Vision tasks consume additional tokens based on image resolution:
| Image Size | Approximate Tokens |
|---|---|
| 512x512 | ~85 tokens |
| 1024x1024 | ~170 tokens |
| 2048x2048 | ~340 tokens |
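The three rows above happen to scale linearly with the side length (about 85 tokens per 512 pixels). The sketch below extrapolates from that pattern for other square sizes. This is an assumption drawn only from the table, not a documented pricing formula, and `approx_image_tokens` is a hypothetical helper; check the `usage` block of a real response for actual figures.

```python
def approx_image_tokens(side_px):
    """Rough token estimate for a square image, extrapolated from the table.

    Assumption: tokens grow linearly with side length (85 per 512 px). This
    reproduces the table rows but is not an official formula.
    """
    return round(85 * side_px / 512)
```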
Troubleshooting
Common Issues
Issue: “Invalid image format”
- Solution: Ensure your image is properly base64-encoded using base64.b64encode()
Issue: “Image too large”
- Solution: Resize the image to under 4096x4096 pixels or reduce quality
Issue: “Model doesn’t support vision”
- Solution: Use gpt-4o, gpt-4o-mini, or claude-sonnet-4-20250514
Issue: “High token usage”
- Solution: Reduce image resolution or use gpt-4o-mini for basic tasks
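To catch the "Image too large" case before sending a request, the pixel dimensions can be read locally. The sketch below does this for PNG files only, using the fixed layout of the PNG header (an 8-byte signature followed by the IHDR chunk) and nothing beyond the standard library. The 4096-pixel limit is the figure quoted in this guide; the function names are hypothetical, and other formats would need an imaging library.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_SIDE = 4096  # limit quoted in this guide


def png_dimensions(data):
    """Return (width, height) from a PNG's IHDR chunk, or None if data is not a PNG."""
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE) or data[12:16] != b"IHDR":
        return None
    # Width and height are big-endian unsigned 32-bit integers at bytes 16-23.
    return struct.unpack(">II", data[16:24])


def png_within_limit(data, max_side=MAX_SIDE):
    """True only if data is a PNG whose width and height are both <= max_side."""
    dims = png_dimensions(data)
    return dims is not None and max(dims) <= max_side
```

Only the first 24 bytes are inspected, so it is enough to pass `image_file.read(24)` rather than the whole file.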
Error Handling
try:
    response = requests.post(
        f"{BASE_URL}/v1/agent/completions",
        headers=headers,
        json=payload,
        timeout=60
    )
    if response.status_code == 200:
        result = response.json()
        print("Success:", result['outputs'][0]['content'])
    elif response.status_code == 400:
        print("Error: Invalid image format or base64 encoding")
    elif response.status_code == 413:
        print("Error: Image too large - please reduce size")
    else:
        print(f"Error {response.status_code}: {response.text}")
except requests.exceptions.Timeout:
    print("Request timed out - image may be too large")
except Exception as e:
    print(f"Error: {e}")
Next Steps
- Try analyzing your own images by replacing the image URL
- Experiment with different questions and prompts
- Check out the API Reference for more details