Overview

Creating ACL-aware embeddings for files means generating vector embeddings of file content while preserving access control context. This ensures that search and retrieval of embedded content respects each file’s permissions (ACLs). If you’re working with large file storage systems (including millions of files), it’s important to traverse Merge’s File Storage API efficiently and handle permission inheritance appropriately. You’ll also want a strategy for keeping embeddings updated as files, folders, or group permissions change over time.

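By following the steps in this guide, you’ll be able to traverse groups and files efficiently, attach ACL metadata to every embedding you generate, and keep both the embeddings and their ACLs in sync as the source data changes.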

With this approach, your search and AI systems can maintain an index that only returns documents a user is authorized to view—even as the underlying structure evolves.

Data model and ACL inheritance

Before jumping into implementation, it’s important to understand how we represent files and permissions in our Unified API.

Files

Each file includes an id, a name, a reference to its parent folder, and a list of permissions.

These permissions reflect the complete set of access controls—including those inherited from folders—because this is how third-party platforms return the data to us. You do not need to separately traverse the folder hierarchy to aggregate inherited permissions.

Here’s a simplified example:

{
  "id": "file-123",
  "name": "Q1 roadmap.pdf",
  "folder": "folder-456",
  "permissions": [
    { "type": "USER", "user": "user-1", "roles": ["OWNER"] },
    { "type": "GROUP", "group": "group-1", "roles": ["READ"] }
  ]
}

Folders

Folders can contain subfolders and files. While folders may also have permissions, you do not need to aggregate them for file access control—files already include the full ACL based on their folder context.

Groups

Groups in Merge’s model can contain users, and they can contain other groups as nested child groups.

When you expand group-based permissions, make sure to account for any nested child groups.
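Expanding nested groups can be sketched as a graph walk. The example below is a minimal illustration, not Merge’s own implementation: the two dictionaries `group_users` (group ID to direct member user IDs) and `group_children` (group ID to child group IDs) are hypothetical structures you would build yourself from the group records, and all IDs are made up.

```python
from typing import Dict, List, Set


def expand_group_members(
    group_id: str,
    group_users: Dict[str, List[str]],
    group_children: Dict[str, List[str]],
) -> Set[str]:
    """Return every user ID reachable from group_id, including nested child groups."""
    members: Set[str] = set()
    visited: Set[str] = set()
    stack = [group_id]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # guard against cyclic group references
        visited.add(current)
        members.update(group_users.get(current, []))
        stack.extend(group_children.get(current, []))
    return members


# Illustrative data only; note the deliberate cycle group-2 -> group-1.
group_users = {"group-1": ["user-1"], "group-2": ["user-2"], "group-3": ["user-3"]}
group_children = {"group-1": ["group-2"], "group-2": ["group-3", "group-1"]}
all_members = expand_group_members("group-1", group_users, group_children)
```

The `visited` set is there so that a misconfigured, circular hierarchy cannot cause an infinite loop.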

Users

User objects contain fields like name, email_address, and id. These IDs are used in file and folder ACLs, either directly or through group membership.

Learn more about ACLs here: File Storage Access Control List (ACLs)

Traversing the API efficiently (files, folders, groups)

When working with large file repositories, it’s crucial to avoid loading too much into memory at once. Our API is paginated, so you’ll be able to iterate over your customer’s entire file system in manageable chunks.

We recommend processing your data in the following order:

  1. Fetch all Groups – Build a map of group memberships and nested hierarchies.

  2. Fetch all Files – Files include the complete permissions, so no need to traverse folders.

  3. Fetch file contents – Download content only when you're ready to embed.

  4. Generate embeddings and store them with ACL metadata – Do this incrementally, page by page.

Each of these is described in detail below.

1. Retrieve groups and users for ACL context

Start by calling: GET /groups

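You’ll receive a paginated list of Group objects. For each group, extract its id, the IDs of its member users, and the IDs of any child groups.

From this, build two maps: one from each group ID to its direct member user IDs, and one from each group ID to its child group IDs.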

Make sure to handle pagination by checking the next cursor on each response. Use page_size=100 to reduce the number of requests needed.
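The pagination loop can be sketched as a generator that yields one page at a time. This is a minimal sketch under assumptions: `fetch_page` is a hypothetical callable wrapping your HTTP client, and the response is assumed to be a dictionary with a `results` list and a `next` cursor that is empty on the last page. Check the API reference for the exact response shape.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = Dict[str, Any]


def iter_pages(
    fetch_page: Callable[[Optional[str]], Page],
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one page of records at a time until the `next` cursor is empty."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield page.get("results", [])  # assumed field name
        cursor = page.get("next")
        if not cursor:
            break


# Stand-in for a real HTTP call such as GET /groups with page_size=100.
_fake_pages: Dict[Optional[str], Page] = {
    None: {"results": [{"id": "group-1"}, {"id": "group-2"}], "next": "c1"},
    "c1": {"results": [{"id": "group-3"}], "next": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return _fake_pages[cursor]


group_ids = [group["id"] for page in iter_pages(fake_fetch) for group in page]
```

Because the generator holds only one page in memory, the same loop works for groups, users, and files regardless of how large the account is.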

Next, call: GET /users

This step is optional, as group records already contain user IDs. If you need user emails or names—for display or filtering—you can fetch these here. Otherwise, you can rely on the IDs collected from the group data.

Why fetch groups first? When you later analyze file permissions, you’ll encounter group IDs. Having the full group map cached avoids needing to resolve them in real time for each file.

2. Retrieve files and extract permissions

Now retrieve files using: GET /files

Each file record includes an id, a name, a reference to its parent folder, and a list of permissions.

You do not need to climb the folder hierarchy to aggregate permissions. As described in the data model section, the permissions on each file already include those inherited from its folders.

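Your job is to read each entry in the file’s permissions list and record the user IDs and group IDs that it grants access to.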

This gives you a complete list of users and groups who can access each file. That list becomes your ACL metadata attached to the embedding.
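As a rough sketch, the extraction could look like the function below. It follows the simplified file example shown earlier (`type`, `user`, `group`, `roles`). Two things are assumptions rather than documented behavior: that every listed role grants at least read access, and that a permission of type `ANYONE` should be treated as public. Adjust both to match the actual API reference.

```python
from typing import Any, Dict, List


def build_acl_metadata(file_record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a file's permissions list into ACL metadata for the vector index."""
    allowed_users: List[str] = []
    allowed_groups: List[str] = []
    public = False
    for perm in file_record.get("permissions", []):
        perm_type = perm.get("type")
        if perm_type == "USER" and perm.get("user"):
            allowed_users.append(perm["user"])
        elif perm_type == "GROUP" and perm.get("group"):
            allowed_groups.append(perm["group"])
        elif perm_type == "ANYONE":
            # Assumption: this type means the file is visible to everyone.
            public = True
    return {
        "file_id": file_record["id"],
        "allowed_users": sorted(set(allowed_users)),
        "allowed_groups": sorted(set(allowed_groups)),
        "public": public,
    }


example_file = {
    "id": "file-123",
    "name": "Q1 roadmap.pdf",
    "folder": "folder-456",
    "permissions": [
        {"type": "USER", "user": "user-1", "roles": ["OWNER"]},
        {"type": "GROUP", "group": "group-1", "roles": ["READ"]},
    ],
}
acl = build_acl_metadata(example_file)
```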

3. Fetch file content and generate embeddings

To access a file’s contents, use our direct download endpoints: Direct File Download

Once you have the file:

  1. Extract the raw text using the appropriate parser (PDF, DOCX, etc.)

  2. Chunk the text if necessary (e.g., every 1,000–2,000 tokens)

  3. Use your embedding model (e.g., OpenAI, Cohere, in-house) to generate vector representations

  4. Attach ACL metadata

Example:

{
  "file_id": "abc123",
  "allowed_users": ["user-1"],
  "allowed_groups": ["group-3", "group-4"],
  "public": false
}

Process files in a streaming fashion (one page at a time) to avoid memory issues.
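The chunk-embed-attach steps above can be sketched as follows. This is illustrative only: whitespace-separated words stand in for tokens (a real tokenizer will count differently), `embed` is a hypothetical callable wrapping whichever embedding model you use, and the record layout (`id`, `vector`, `metadata`) is an assumed shape rather than a requirement of any particular vector store.

```python
from typing import Any, Callable, Dict, Iterator, List


def chunk_words(text: str, max_words: int = 1000) -> Iterator[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


def build_records(
    text: str,
    acl: Dict[str, Any],
    embed: Callable[[str], List[float]],
    max_words: int = 1000,
) -> List[Dict[str, Any]]:
    """Embed each chunk and attach a copy of the file's ACL metadata to it."""
    records = []
    for index, chunk in enumerate(chunk_words(text, max_words)):
        records.append({
            "id": f"{acl['file_id']}#{index}",  # hypothetical chunk ID scheme
            "vector": embed(chunk),
            "metadata": dict(acl),
        })
    return records


def toy_embed(chunk: str) -> List[float]:
    """Placeholder embedding so the sketch runs without a model."""
    return [float(len(chunk))]


acl = {
    "file_id": "abc123",
    "allowed_users": ["user-1"],
    "allowed_groups": ["group-3", "group-4"],
    "public": False,
}
records = build_records("one two three four five", acl, toy_embed, max_words=2)
```

Every chunk carries the same ACL metadata as its parent file, so a permission change means updating the metadata on all chunks that share that `file_id`.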

4. Store embeddings with ACL context

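When inserting each embedding into your vector index (e.g. Pinecone, Weaviate), store the vector together with its ACL metadata (file_id, allowed_users, allowed_groups, public) so that results can be filtered later.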

During query time, filter results based on the querying user’s ID and group memberships.

Example filter logic:

{
  "$or": [
    { "allowed_users": { "$in": ["user-123"] } },
    { "allowed_groups": { "$in": ["group-456", "group-789"] } },
    { "public": true }
  ]
}

This ensures only authorized users can retrieve content from the index.
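A small helper can build that filter for the querying user. The sketch below reproduces the filter shape shown above; the exact filter syntax depends on your vector store, so treat the operators as an example rather than a universal format. The `is_visible` function is a hypothetical in-memory equivalent, useful for unit tests or as a second check on returned results.

```python
from typing import Any, Dict, Iterable, List


def build_acl_filter(user_id: str, group_ids: List[str]) -> Dict[str, Any]:
    """Build a metadata filter matching the user, any of their groups, or public files."""
    return {
        "$or": [
            {"allowed_users": {"$in": [user_id]}},
            {"allowed_groups": {"$in": list(group_ids)}},
            {"public": True},
        ]
    }


def is_visible(metadata: Dict[str, Any], user_id: str, group_ids: Iterable[str]) -> bool:
    """Apply the same rule in memory to a single record's ACL metadata."""
    return bool(
        metadata.get("public", False)
        or user_id in metadata.get("allowed_users", [])
        or set(group_ids) & set(metadata.get("allowed_groups", []))
    )


query_filter = build_acl_filter("user-123", ["group-456", "group-789"])
```

When groups are nested, `group_ids` should include the parent groups the user belongs to indirectly; otherwise a file shared with a parent group will not match.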

Keeping embeddings updated as ACLs change

Once your initial indexing is complete, it’s critical to keep your embeddings and their ACL metadata up to date with any changes in the underlying data. Files may be moved or re-permissioned, groups may change membership, and new content may be created at any time. If your embeddings are stale, users may retrieve content they should no longer see, or miss content they should be able to see.

We recommend a single, structured approach for staying in sync:

Wait for Merge to finish syncing, then choose an update strategy

Every time Merge completes a sync with the third-party provider, we can notify you via a sync completion webhook. Waiting for this webhook means the data is fresh and consistent before you act on it.

Once you receive this sync completion event, choose one of the following methods to process updates:

Option 1: Use modified_after to sync all changes at once

After you receive the sync completion webhook, query Merge’s API endpoints with modified_after filters to pull all the changes since your last checkpoint.

You'll want to poll the endpoints you used during initial indexing (GET /groups, GET /files and, if you fetched them, GET /users) using the timestamp from your last processed sync.

Process each returned record accordingly (re-embed, re-ACL, delete, etc.) and update your checkpoint.
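One way to structure the checkpoint loop is sketched below. The names are hypothetical: `fetch_modified` stands for a function that queries an endpoint with the modified_after filter and returns the changed records, while `upsert` and `delete` stand for your own index operations. The `remote_was_deleted` field used to detect deletions is an assumption about the record shape; confirm it against the API reference.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Iterable, List


def sync_changes(
    fetch_modified: Callable[[str], Iterable[Dict[str, Any]]],
    upsert: Callable[[Dict[str, Any]], None],
    delete: Callable[[str], None],
    checkpoint: str,
) -> str:
    """Apply all changes since `checkpoint` and return the next checkpoint."""
    # Capture the new checkpoint before fetching, so changes that arrive
    # while this run is in progress are picked up by the next run.
    next_checkpoint = datetime.now(timezone.utc).isoformat()
    for record in fetch_modified(checkpoint):
        if record.get("remote_was_deleted"):  # assumed deletion marker
            delete(record["id"])
        else:
            upsert(record)  # re-embed and/or rewrite ACL metadata
    return next_checkpoint


# Illustrative run with in-memory stand-ins.
upserted: List[Dict[str, Any]] = []
deleted: List[str] = []


def fake_fetch(modified_after: str) -> List[Dict[str, Any]]:
    return [{"id": "file-1"}, {"id": "file-2", "remote_was_deleted": True}]


new_checkpoint = sync_changes(
    fake_fetch, upserted.append, deleted.append, "2024-01-01T00:00:00+00:00"
)
```

Persist the returned checkpoint only after the loop finishes without errors, so a failed run is retried from the previous timestamp.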

Option 2: Use real-time webhooks for every individual change

Alternatively, after receiving the sync completion webhook, you can rely on Merge’s “data changed” webhooks to receive a stream of granular events in near real time. These webhooks fire for:

This approach is ideal if you want low-latency updates and are comfortable handling a higher volume of webhook traffic.

Conclusion

With Merge’s File Storage API, you can build ACL-aware embeddings that scale to very large file systems. By combining the permissions already resolved on each file with efficient traversal and a consistent update strategy, your search and AI workflows can keep respecting who should have access to what.