How This Helps

Duplicate detection helps you identify redundant images or frames across your dataset. Use it to streamline cleanup, reduce storage, and improve data quality before training or export.

Prerequisites

  • A dataset in READY status.
  • A dataset ID (visible in the browser URL when viewing a dataset: https://app.visual-layer.com/dataset/<dataset_id>/data).
  • A valid JWT token. See Authentication.

Find Duplicates Using VQL

The preferred approach uses the Visual Query Language (VQL) filter on the Explore endpoint. The duplicates filter groups visually similar media into duplicate clusters.
GET /api/v1/explore/{dataset_id}?vql=[...]&entity_type=IMAGES&threshold=0
Authorization: Bearer <jwt>

VQL Duplicates Filter

Pass a duplicates filter in the vql array:
[{"duplicates": {"op": "duplicates", "value": 0.95}}]
The value field sets the similarity threshold (0.0–1.0). A value of 0.95 returns only clusters where images are at least 95% similar to each other. Lower values return more permissive groupings.
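As a sketch, the filter can be serialized and percent-encoded in Python before being placed in the query string. Compact JSON separators (no spaces) produce the same encoded form used in the curl examples in this guide:

```python
import json
from urllib.parse import quote

# Duplicates filter at a 0.95 similarity threshold.
vql = [{"duplicates": {"op": "duplicates", "value": 0.95}}]

# Compact separators avoid spaces; quote() percent-encodes the JSON
# so it can be placed directly in the vql query parameter.
encoded = quote(json.dumps(vql, separators=(",", ":")), safe="")
print(encoded)
# %5B%7B%22duplicates%22%3A%7B%22op%22%3A%22duplicates%22%2C%22value%22%3A0.95%7D%7D%5D
```

Note that HTTP client libraries such as requests encode query parameters automatically, so manual encoding is only needed when building a raw URL (for example, for curl).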

Example

curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?vql=%5B%7B%22duplicates%22%3A%7B%22op%22%3A%22duplicates%22%2C%22value%22%3A0.95%7D%7D%5D&entity_type=IMAGES&threshold=0"
Decoded VQL:
[{"duplicates": {"op": "duplicates", "value": 0.95}}]

Response

{
  "clusters": [
    {
      "cluster_id": "d0470097-0c77-4a9c-9edf-289680df7f71",
      "type": "IMAGES",
      "n_images": 3,
      "similarity_threshold": "0",
      "relevance_score": null,
      "previews": [
        {
          "type": "IMAGE",
          "media_id": "300dad2c-1234-11f1-8483-5a879df30de4",
          "media_uri": "https://cdn.example.com/.../image.jpg",
          "media_thumb_uri": "https://cdn.example.com/.../thumb.webp",
          "file_name": "car_001.jpg",
          "width": 1920,
          "height": 1080
        }
      ],
      "labels": null,
      "user_tags": null
    }
  ],
  "metadata": {
    "used_duckdb": true
  }
}
Each cluster in the response contains a group of near-duplicate images. The previews array shows representative images from the group.
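As a quick sanity check, the cluster fields can be pulled out of the parsed JSON directly. This sketch hard-codes a trimmed copy of the sample response above, limited to the fields it uses:

```python
# Trimmed copy of the sample response above.
sample = {
    "clusters": [
        {
            "cluster_id": "d0470097-0c77-4a9c-9edf-289680df7f71",
            "n_images": 3,
            "previews": [{"file_name": "car_001.jpg"}],
        }
    ]
}

# Summarize each duplicate group as (short id, image count, preview names).
summary = [
    (c["cluster_id"][:8], c["n_images"], [p["file_name"] for p in c["previews"]])
    for c in sample["clusters"]
]
print(summary)  # [('d0470097', 3, ['car_001.jpg'])]
```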

Find Duplicates Using duplicate_threshold

You can also use the duplicate_threshold query parameter directly on the Explore endpoint as a simpler alternative to VQL.
curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?duplicate_threshold=0.95&entity_type=IMAGES&threshold=0"
duplicate_threshold (float): Similarity cutoff (0.0–1.0). Returns only clusters containing near-duplicates at this threshold or higher.
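The same call in Python might look like the following sketch using the requests library; the base URL is taken from this guide, while the function name and its arguments are placeholders:

```python
import requests

VL_BASE_URL = "https://app.visual-layer.com"

def find_duplicates_by_threshold(dataset_id: str, jwt: str, cutoff: float = 0.95):
    """Query the Explore endpoint with duplicate_threshold instead of VQL."""
    resp = requests.get(
        f"{VL_BASE_URL}/api/v1/explore/{dataset_id}",
        headers={"Authorization": f"Bearer {jwt}"},
        params={
            "duplicate_threshold": cutoff,  # similarity cutoff, 0.0-1.0
            "entity_type": "IMAGES",
            "threshold": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```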

Filter by Uniqueness

To find the most unique images (the opposite of duplicates), use the uniqueness VQL filter.
[{"uniqueness": {"op": "uniqueness", "value": 0.8}}]
This returns images with a uniqueness score above the specified threshold. Higher values mean more unique content.
curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?vql=%5B%7B%22uniqueness%22%3A%7B%22op%22%3A%22uniqueness%22%2C%22value%22%3A0.8%7D%7D%5D&entity_type=IMAGES&threshold=0"
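To double-check an encoded filter, the query-string value can be decoded back into JSON. This sketch round-trips the uniqueness filter from the curl call above:

```python
import json
from urllib.parse import unquote

# Encoded uniqueness filter from the curl example above.
encoded = "%5B%7B%22uniqueness%22%3A%7B%22op%22%3A%22uniqueness%22%2C%22value%22%3A0.8%7D%7D%5D"

# Percent-decode, then parse the JSON back into Python objects.
decoded = json.loads(unquote(encoded))
print(decoded)  # [{'uniqueness': {'op': 'uniqueness', 'value': 0.8}}]
```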

Python Example

The following example retrieves all duplicate clusters and prints a summary.
import requests
import json

VL_BASE_URL = "https://app.visual-layer.com"
JWT_TOKEN = "<your-jwt-token>"
DATASET_ID = "<your-dataset-id>"

headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

def find_duplicates(similarity_threshold: float = 0.95, page: int = 0):
    vql = json.dumps([{"duplicates": {"op": "duplicates", "value": similarity_threshold}}])
    resp = requests.get(
        f"{VL_BASE_URL}/api/v1/explore/{DATASET_ID}",
        headers=headers,
        params={
            "vql": vql,
            "entity_type": "IMAGES",
            "threshold": 0,
            "page_number": page,
        },
    )
    resp.raise_for_status()
    return resp.json()

results = find_duplicates(similarity_threshold=0.95)
clusters = results.get("clusters", [])
print(f"Found {len(clusters)} duplicate cluster(s)")

total_duplicates = sum(c.get("n_images", 0) for c in clusters)
print(f"Total duplicate images: {total_duplicates}")

for cluster in clusters:
    n = cluster.get("n_images")
    cid = cluster.get("cluster_id", "")[:8]
    print(f"  Cluster {cid}... — {n} near-duplicate images")
    for preview in cluster.get("previews", [])[:3]:
        print(f"    {preview['file_name']}")
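The Explore endpoint is paginated via page_number, so a single call may not return every cluster. The following sketch collects all pages; it assumes a page with no clusters marks the end of the results (a stopping condition chosen here, not documented behavior):

```python
def collect_all_clusters(fetch_page, max_pages=100):
    """Gather clusters across pages.

    fetch_page: callable taking a page number and returning the parsed
    response dict, e.g. lambda p: find_duplicates(0.95, page=p).
    Stops at the first empty page, or after max_pages as a safety limit.
    """
    all_clusters = []
    for page in range(max_pages):
        clusters = fetch_page(page).get("clusters", [])
        if not clusters:
            break
        all_clusters.extend(clusters)
    return all_clusters
```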

Response Codes

See Error Handling for the error response format and Python handling patterns.
  • 200: Results returned successfully.
  • 401: Unauthorized — check your JWT token.
  • 404: Dataset not found.
  • 422: Invalid query parameters — check VQL syntax or threshold value.
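These codes can be mapped to actionable messages in a small wrapper. The helper below is an assumed pattern, not part of any Visual Layer SDK; the hint strings paraphrase the table above:

```python
import requests

# Assumed helper: map common Explore error codes to actionable messages.
ERROR_HINTS = {
    401: "Unauthorized: JWT token missing or expired",
    404: "Dataset not found: check the dataset ID in the URL",
    422: "Invalid query parameters: check VQL syntax or threshold value",
}

def explore_or_raise(url: str, headers: dict, params: dict) -> dict:
    """Call the Explore endpoint and raise a descriptive error on failure."""
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    if resp.status_code != 200:
        hint = ERROR_HINTS.get(resp.status_code, f"HTTP {resp.status_code}")
        raise RuntimeError(f"Explore request failed: {hint}")
    return resp.json()
```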