
Visual Intelligence & Perception

Seamlessly integrate image-based reasoning into your AI workflows via public URLs or base64-encoded payloads.

Multi-Modal Orchestration: Visual Intelligence

osmAPI empowers your applications to interact with high-performance vision models, enabling complex analysis and reasoning based on visual stimuli. Whether you are processing architectural diagrams, medical imaging, or interactive UI mocks, our gateway provides a unified interface for delivering visual data to any perception-enabled endpoint.

Integration Methodologies

1. Remote Asset Reference (Public URLs)

The most efficient way to deliver images is to provide a publicly accessible HTTPS URL. Our gateway securely proxies this reference to the target model provider.

curl -X POST "https://api.osmapi.com/v1/chat/completions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Perform a structural analysis of this composition."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://assets.osmapi.com/demo/architecture.jpg"
            }
          }
        ]
      }
    ]
  }'

2. Embedded Payloads (Base64 Encoding)

For private assets or localized data, you can transmit images directly as inline base64-encoded strings within the request object.

curl -X POST "https://api.osmapi.com/v1/chat/completions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Identify the primary elements in this scene."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD..."
            }
          }
        ]
      }
    ]
  }'
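For the inline approach, the base64 string and its `data:` prefix have to be assembled on your side before the request is sent. The sketch below shows one way to do that in Python; the helper name `to_data_url` is ours for illustration, not part of any osmAPI SDK.

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read a local image file and wrap it as a base64 data URL
    suitable for the "url" field of an image_url block."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string drops directly into the `"url"` field shown in the curl example above, e.g. `{"type": "image_url", "image_url": {"url": to_data_url("scan.jpg")}}`.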

Multi-Modal Structuring

Perception-enabled requests utilize a specialized content array format. Each element in the array represents a specific data modality:

  • Linguistic Block: {"type": "text", "text": "Prompt text..."}
  • Perceptual Block: {"type": "image_url", "image_url": {"url": "..."}}

Operational Ease: For standard text-only interactions, the legacy string format is still fully supported. The array structure is only mandatory when introducing visual assets.
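When building requests programmatically, the two block types above compose into a single `content` array. As a minimal sketch (the `build_vision_message` helper is hypothetical, not an osmAPI function):

```python
def build_vision_message(prompt: str, image_urls: list[str]) -> dict:
    """Compose a user message pairing one linguistic block
    with any number of perceptual blocks."""
    content = [{"type": "text", "text": prompt}]
    content += [
        {"type": "image_url", "image_url": {"url": u}}
        for u in image_urls
    ]
    return {"role": "user", "content": content}
```

Passing an empty list yields a valid text-only message in array form, so the same helper works whether or not images are attached.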


Advanced Visual Composition

osmAPI facilitates multi-image comparative analysis. You can include multiple perceptual blocks within a single user message to enable cross-image reasoning.

curl -X POST "https://api.osmapi.com/v1/chat/completions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Contrast the design aesthetics between these two interface versions."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/v1.png"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/v2.png"
            }
          }
        ]
      }
    ]
  }'

Technical Specifications & Resilience

Universal Capability Matrix

A comprehensive list of vision-ready models is available via our Dynamic Model Browser.

Supported Visual Standards

While provider support varies, osmAPI standardizes handling of all major formats:

  • Photography Standards: JPEG (.jpg, .jpeg)
  • Lossless Interchange: PNG
  • Modern Web Architecture: WebP
  • Sequential Animations: GIF (typically only the first frame is analyzed; exact behavior depends on the provider)
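If you accept arbitrary files from users, it can be worth checking the format client-side before encoding and sending. A minimal sketch, assuming the four formats listed above (the helper and constant names are ours):

```python
import mimetypes

# MIME types corresponding to the supported visual standards above.
SUPPORTED_MIME = {"image/jpeg", "image/png", "image/webp", "image/gif"}

def is_supported(path: str) -> bool:
    """Guess a file's MIME type from its extension and check it
    against the supported formats."""
    mime, _ = mimetypes.guess_type(path)
    return mime in SUPPORTED_MIME
```

Note this checks only the file extension; a provider may still reject a file whose actual contents do not match its extension.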

Error Tolerance & Fault Handling

If a provided URL is unreachable or the payload exceeds provider-specific size limits, the osmAPI gateway catches the failure rather than surfacing a raw provider error. The engine handles the exception gracefully and returns a structured error response describing what went wrong, so you can diagnose and adjust your implementation.
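On the client side, that error telemetry is easiest to use if you extract the message before logging or retrying. The sketch below assumes the error body is JSON shaped like `{"error": {"message": "..."}}`; that shape is our assumption, so the fallback path keeps the raw body when it does not match.

```python
import json

def parse_gateway_error(status: int, body: str) -> str:
    """Turn an error response into a readable diagnostic string.

    Assumed body shape: {"error": {"message": "..."}}.
    Falls back to the raw body for non-JSON or unexpected shapes.
    """
    try:
        message = json.loads(body).get("error", {}).get("message") or body
    except (ValueError, AttributeError):
        message = body
    return f"HTTP {status}: {message}"
```

Wrap your HTTP call in a try/except and feed the response status and body through this helper before deciding whether to retry, shrink the image, or surface the message to the user.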


Fallback Modality: Direct Text

You can continue to use simple strings for standard chat interactions even when targeting multi-modal models:

curl -X POST "https://api.osmapi.com/v1/chat/completions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{ "role": "user", "content": "How is the weather in Kyoto?" }]
  }'
