How This Helps

Duplicate detection helps you identify redundant images or frames across your dataset. Use it to streamline cleanup, reduce storage, and improve data quality before training or export.

Prerequisites

  • A dataset in READY status.
  • A dataset ID (visible in the browser URL when viewing a dataset: https://app.visual-layer.com/dataset/<dataset_id>/data).
  • A valid JWT token. See Authentication.

Find Duplicates Using VQL

The preferred approach uses the Visual Query Language (VQL) filter on the Explore endpoint. The duplicates filter groups visually similar media into duplicate clusters.
GET /api/v1/explore/{dataset_id}?vql=[...]&entity_type=IMAGES&threshold=0
Authorization: Bearer <jwt>

VQL Duplicates Filter

Pass a duplicates filter in the vql array:
[{"duplicates": {"op": "duplicates", "value": 0.95}}]
The value field sets the similarity threshold (0.0–1.0). A value of 0.95 returns only clusters where images are at least 95% similar to each other. Lower values return more permissive groupings.
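As a sketch, the filter can be serialized and percent-encoded in Python before being placed in the query string. Compact JSON separators (no spaces) produce the same encoded form used in the curl examples in this guide:

```python
import json
from urllib.parse import quote

# Duplicates filter at a 0.95 similarity threshold.
vql = [{"duplicates": {"op": "duplicates", "value": 0.95}}]

# Compact separators avoid spaces; quote() percent-encodes the JSON
# so it can be placed directly in the vql query parameter.
encoded = quote(json.dumps(vql, separators=(",", ":")), safe="")
print(encoded)
# %5B%7B%22duplicates%22%3A%7B%22op%22%3A%22duplicates%22%2C%22value%22%3A0.95%7D%7D%5D
```

Note that HTTP client libraries such as requests encode query parameters automatically, so manual encoding is only needed when building a raw URL (for example, for curl).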

Example

curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?vql=%5B%7B%22duplicates%22%3A%7B%22op%22%3A%22duplicates%22%2C%22value%22%3A0.95%7D%7D%5D&entity_type=IMAGES&threshold=0"
Decoded VQL:
[{"duplicates": {"op": "duplicates", "value": 0.95}}]

Response

{
  "clusters": [
    {
      "cluster_id": "d0470097-0c77-4a9c-9edf-289680df7f71",
      "type": "IMAGES",
      "n_images": 3,
      "similarity_threshold": "0",
      "relevance_score": null,
      "previews": [
        {
          "type": "IMAGE",
          "media_id": "300dad2c-1234-11f1-8483-5a879df30de4",
          "media_uri": "https://cdn.example.com/.../image.jpg",
          "media_thumb_uri": "https://cdn.example.com/.../thumb.webp",
          "file_name": "car_001.jpg",
          "width": 1920,
          "height": 1080
        }
      ],
      "labels": null,
      "user_tags": null
    }
  ],
  "metadata": {
    "used_duckdb": true
  }
}
Each cluster in the response contains a group of near-duplicate images. The previews array shows representative images from the group.
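As a quick sanity check, the cluster fields can be pulled out of the parsed JSON directly. This sketch hard-codes a trimmed copy of the sample response above, limited to the fields it uses:

```python
# Trimmed copy of the sample response above.
sample = {
    "clusters": [
        {
            "cluster_id": "d0470097-0c77-4a9c-9edf-289680df7f71",
            "n_images": 3,
            "previews": [{"file_name": "car_001.jpg"}],
        }
    ]
}

# Summarize each duplicate group as (short id, image count, preview names).
summary = [
    (c["cluster_id"][:8], c["n_images"], [p["file_name"] for p in c["previews"]])
    for c in sample["clusters"]
]
print(summary)  # [('d0470097', 3, ['car_001.jpg'])]
```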

Find Duplicates Using duplicate_threshold

You can also use the duplicate_threshold query parameter directly on the Explore endpoint as a simpler alternative to VQL.
curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?duplicate_threshold=0.95&entity_type=IMAGES&threshold=0"
duplicate_threshold (float): Similarity cutoff (0.0–1.0). Returns only clusters containing near-duplicates at this threshold or higher.
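The same call in Python might look like the following sketch using the requests library; the base URL is taken from this guide, while the function name and its arguments are placeholders:

```python
import requests

VL_BASE_URL = "https://app.visual-layer.com"

def find_duplicates_by_threshold(dataset_id: str, jwt: str, cutoff: float = 0.95):
    """Query the Explore endpoint with duplicate_threshold instead of VQL."""
    resp = requests.get(
        f"{VL_BASE_URL}/api/v1/explore/{dataset_id}",
        headers={"Authorization": f"Bearer {jwt}"},
        params={
            "duplicate_threshold": cutoff,  # similarity cutoff, 0.0-1.0
            "entity_type": "IMAGES",
            "threshold": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```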

Filter by Uniqueness

To find the most unique images (the opposite of duplicates), use the uniqueness VQL filter.
[{"uniqueness": {"op": "uniqueness", "value": 0.8}}]
This returns images with a uniqueness score above the specified threshold. Higher values mean more unique content.
curl -H "Authorization: Bearer <jwt>" \
  "https://app.visual-layer.com/api/v1/explore/<dataset_id>?vql=%5B%7B%22uniqueness%22%3A%7B%22op%22%3A%22uniqueness%22%2C%22value%22%3A0.8%7D%7D%5D&entity_type=IMAGES&threshold=0"
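To double-check an encoded filter, the query-string value can be decoded back into JSON. This sketch round-trips the uniqueness filter from the curl call above:

```python
import json
from urllib.parse import unquote

# Encoded uniqueness filter from the curl example above.
encoded = "%5B%7B%22uniqueness%22%3A%7B%22op%22%3A%22uniqueness%22%2C%22value%22%3A0.8%7D%7D%5D"

# Percent-decode, then parse the JSON back into Python objects.
decoded = json.loads(unquote(encoded))
print(decoded)  # [{'uniqueness': {'op': 'uniqueness', 'value': 0.8}}]
```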

Python Example

The following example retrieves all duplicate clusters and prints a summary.
import requests
import json

VL_BASE_URL = "https://app.visual-layer.com"
JWT_TOKEN = "<your-jwt-token>"
DATASET_ID = "<your-dataset-id>"

headers = {"Authorization": f"Bearer {JWT_TOKEN}"}

def find_duplicates(similarity_threshold: float = 0.95, page: int = 0):
    vql = json.dumps([{"duplicates": {"op": "duplicates", "value": similarity_threshold}}])
    resp = requests.get(
        f"{VL_BASE_URL}/api/v1/explore/{DATASET_ID}",
        headers=headers,
        params={
            "vql": vql,
            "entity_type": "IMAGES",
            "threshold": 0,
            "page_number": page,
        },
    )
    resp.raise_for_status()
    return resp.json()

results = find_duplicates(similarity_threshold=0.95)
clusters = results.get("clusters", [])
print(f"Found {len(clusters)} duplicate cluster(s)")

total_duplicates = sum(c.get("n_images", 0) for c in clusters)
print(f"Total duplicate images: {total_duplicates}")

for cluster in clusters:
    n = cluster.get("n_images")
    cid = cluster.get("cluster_id", "")[:8]
    print(f"  Cluster {cid}... — {n} near-duplicate images")
    for preview in cluster.get("previews", [])[:3]:
        print(f"    {preview['file_name']}")
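The Explore endpoint is paginated via page_number, so a single call may not return every cluster. The following sketch collects all pages; it assumes a page with no clusters marks the end of the results (a stopping condition chosen here, not documented behavior):

```python
def collect_all_clusters(fetch_page, max_pages=100):
    """Gather clusters across pages.

    fetch_page: callable taking a page number and returning the parsed
    response dict, e.g. lambda p: find_duplicates(0.95, page=p).
    Stops at the first empty page, or after max_pages as a safety limit.
    """
    all_clusters = []
    for page in range(max_pages):
        clusters = fetch_page(page).get("clusters", [])
        if not clusters:
            break
        all_clusters.extend(clusters)
    return all_clusters
```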

Response Codes

See Error Handling for the error response format and Python handling patterns.
  • 200: Results returned successfully.
  • 401: Unauthorized — check your JWT token.
  • 404: Dataset not found.
  • 422: Invalid query parameters — check VQL syntax or threshold value.
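These codes can be mapped to actionable messages in a small wrapper. The helper below is an assumed pattern, not part of any Visual Layer SDK; the hint strings paraphrase the table above:

```python
import requests

# Assumed helper: map common Explore error codes to actionable messages.
ERROR_HINTS = {
    401: "Unauthorized: JWT token missing or expired",
    404: "Dataset not found: check the dataset ID in the URL",
    422: "Invalid query parameters: check VQL syntax or threshold value",
}

def explore_or_raise(url: str, headers: dict, params: dict) -> dict:
    """Call the Explore endpoint and raise a descriptive error on failure."""
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    if resp.status_code != 200:
        hint = ERROR_HINTS.get(resp.status_code, f"HTTP {resp.status_code}")
        raise RuntimeError(f"Explore request failed: {hint}")
    return resp.json()
```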