Creating ACL-aware embeddings for files means generating vector embeddings of file content while preserving access control context. This ensures that search and retrieval of embedded content respects each file’s permissions (ACLs). If you’re working with large file storage systems (including millions of files), it’s important to traverse Merge’s File Storage API efficiently and handle permission inheritance appropriately. You’ll also want a strategy for keeping embeddings updated as files, folders, or group permissions change over time.
By following the steps in this guide, you’ll be able to:
Understand how Merge’s File Storage data model handles files, folders, groups, users, and ACL inheritance
Efficiently traverse Merge’s API using pagination to extract file content and permission metadata without overloading memory
Generate embeddings from file content and associate each with a complete set of access controls
Keep ACLs and embedding metadata up to date using either polling or webhook notifications
With this approach, your search and AI systems can maintain an index that only returns documents a user is authorized to view—even as the underlying structure evolves.
Before jumping into implementation, it’s important to understand how we represent files and permissions in our Unified API.
Each file includes an id, name, a reference to its parent folder, and a list of permissions.
These permissions reflect the complete set of access controls—including those inherited from folders—because this is how third-party platforms return the data to us. You do not need to separately traverse the folder hierarchy to aggregate inherited permissions.
Here’s a simplified example:
{
  "id": "file-123",
  "name": "Q1 roadmap.pdf",
  "folder": "folder-456",
  "permissions": [
    { "type": "USER", "user": "user-1", "roles": ["OWNER"] },
    { "type": "GROUP", "group": "group-1", "roles": ["READ"] }
  ]
}
Folders can contain subfolders and files. While folders may also have permissions, you do not need to aggregate them for file access control; files already include the full ACL based on their folder context.
Groups in Merge’s model can:
Include individual users
Include child groups that inherit from their parent
When you expand group-based permissions, make sure to account for any nested child groups.
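As a minimal sketch of that expansion, the helper below walks a group and all of its nested child groups, then collects their members. It assumes you have already built two plain dictionaries from the Group objects: child_groups (group ID to child group IDs) and group_users (group ID to user IDs). The function and variable names are illustrative, not part of Merge's API.

```python
from typing import Dict, List, Set


def descendant_groups(group_id: str, child_groups: Dict[str, List[str]]) -> Set[str]:
    """Return group_id plus every group nested beneath it, at any depth."""
    seen: Set[str] = set()
    stack = [group_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also protects against cyclic hierarchies
        seen.add(current)
        stack.extend(child_groups.get(current, []))
    return seen


def expand_group_users(
    group_id: str,
    child_groups: Dict[str, List[str]],
    group_users: Dict[str, List[str]],
) -> Set[str]:
    """Collect the users of a group and of all its descendant groups."""
    users: Set[str] = set()
    for gid in descendant_groups(group_id, child_groups):
        users.update(group_users.get(gid, []))
    return users
```

An iterative walk with a seen set avoids recursion limits on deep hierarchies and terminates even if a provider returns a circular parent/child relationship.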
User objects contain fields like name, email_address, and id. These IDs are used in file and folder ACLs, either directly or through group membership.
Learn more about ACLs here: File Storage Access Control List (ACLs)
When working with large file repositories, it’s crucial to avoid loading too much into memory at once. Our API is paginated, so you’ll be able to iterate over your customer’s entire file system in manageable chunks.
We recommend processing your data in the following order:
Fetch all Groups – Build a map of group memberships and nested hierarchies.
Fetch all Files – Files include the complete permissions, so you don't need to traverse folders.
Fetch file contents – Download content only when you're ready to embed.
Generate embeddings and store them with ACL metadata – Do this incrementally, page by page.
Each of these is described in detail below.
Start by calling: GET /groups
You’ll receive a paginated list of Group objects. For each group, extract:
The group ID and name
The list of users
The list of child_groups
From this, build the following maps:
Group → Users: Which users are direct members of a group
Group → Child Groups: Which other groups inherit from this one (and vice versa if needed)
User → Groups (optional): Can be derived from the above for access filtering later
Make sure to handle pagination by checking the next cursor on each response. Use page_size=100 to reduce the number of requests needed.
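The pagination loop and the three maps can be sketched as follows. This is an illustration under stated assumptions: fetch_page is a hypothetical callable you supply that performs GET /groups (with page_size=100 and the given cursor) and returns the decoded JSON body; the sketch also assumes each page exposes its records under a results key next to the next cursor, and that users and child_groups are lists of IDs.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield records page by page, following the `next` cursor until it is empty."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)  # hypothetical: one HTTP request per call
        yield from page.get("results", [])  # assumed response key
        cursor = page.get("next")
        if not cursor:
            break


def build_group_maps(
    groups: Iterable[dict],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]], Dict[str, List[str]]]:
    """Build Group -> Users, Group -> Child Groups, and User -> Groups maps."""
    group_users: Dict[str, List[str]] = {}
    child_groups: Dict[str, List[str]] = {}
    user_groups: Dict[str, List[str]] = {}
    for group in groups:
        gid = group["id"]
        group_users[gid] = list(group.get("users") or [])
        child_groups[gid] = list(group.get("child_groups") or [])
        for uid in group_users[gid]:
            user_groups.setdefault(uid, []).append(gid)
    return group_users, child_groups, user_groups
```

Because iter_pages is a generator, only one page of groups is held in memory at a time; build_group_maps(iter_pages(fetch_page)) consumes it lazily.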
Next, call: GET /users
This step is optional, as group records already contain user IDs. If you need user emails or names—for display or filtering—you can fetch these here. Otherwise, you can rely on the IDs collected from the group data.
Why fetch groups first? When you later analyze file permissions, you’ll encounter group IDs. Having the full group map cached avoids needing to resolve them in real time for each file.
Now retrieve files using: GET /files
Each file record includes:
Its folder reference
File-specific metadata
A complete list of permissions, including both direct and inherited
You do not need to climb the folder hierarchy to aggregate permissions: each file's permission list already includes whatever it inherits from its folders.
Your job is to:
Expand GROUP permissions to include all users from that group and its descendant groups
Track USER permissions directly
Mark COMPANY and ANYONE access appropriately
This gives you a complete list of users and groups who can access each file. That list becomes your ACL metadata attached to the embedding.
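The three tasks above could look roughly like this. The sketch assumes permission objects shaped like the simplified example earlier in this guide (type, plus user or group), and takes the group maps as plain dictionaries. The build_acl name and the company flag are illustrative choices, not part of Merge's API.

```python
from typing import Dict, List, Set


def build_acl(
    file_record: dict,
    group_users: Dict[str, List[str]],
    child_groups: Dict[str, List[str]],
) -> dict:
    """Flatten a file's permissions into ACL metadata for an embedding."""
    allowed_users: Set[str] = set()
    allowed_groups: Set[str] = set()
    public = False
    company = False
    for perm in file_record.get("permissions") or []:
        ptype = perm.get("type")
        if ptype == "USER" and perm.get("user"):
            allowed_users.add(perm["user"])
        elif ptype == "GROUP" and perm.get("group"):
            # Expand the group and every descendant group, skipping repeats.
            stack = [perm["group"]]
            while stack:
                gid = stack.pop()
                if gid in allowed_groups:
                    continue
                allowed_groups.add(gid)
                allowed_users.update(group_users.get(gid, []))
                stack.extend(child_groups.get(gid, []))
        elif ptype == "COMPANY":
            company = True
        elif ptype == "ANYONE":
            public = True
    return {
        "file_id": file_record["id"],
        "allowed_users": sorted(allowed_users),
        "allowed_groups": sorted(allowed_groups),
        "public": public,
        "company": company,
    }
```

Storing both the expanded users and the expanded groups is a trade-off: user lists make query-time filtering simple, but they go stale when group membership changes, so you may prefer to keep only allowed_groups and resolve a user's groups at query time.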
To access a file’s contents, use our direct download endpoints: Direct File Download
Once you have the file:
Extract the raw text using the appropriate parser (PDF, DOCX, etc.)
Chunk the text if necessary (e.g., every 1,000–2,000 tokens)
Use your embedding model (e.g., OpenAI, Cohere, in-house) to generate vector representations
Attach ACL metadata
Example:
{
  "file_id": "abc123",
  "allowed_users": ["user-1"],
  "allowed_groups": ["group-3", "group-4"],
  "public": false
}
Process files in a streaming fashion (one page at a time) to avoid memory issues.
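Here is one way the chunk, embed, and attach steps might fit together once text has been extracted. It is a sketch with loud assumptions: whitespace-separated words stand in for tokens (a real tokenizer will count differently), embed is a hypothetical callable wrapping whichever embedding model you use, and the record layout (id, vector, metadata) is illustrative rather than any specific vector store's schema.

```python
from typing import Callable, Iterator, List


def chunk_text(text: str, max_tokens: int = 1000) -> Iterator[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()  # rough stand-in for model tokens
    for start in range(0, len(words), max_tokens):
        yield " ".join(words[start:start + max_tokens])


def records_for_file(
    text: str,
    acl: dict,
    embed: Callable[[str], List[float]],
    max_tokens: int = 1000,
) -> Iterator[dict]:
    """Yield one record per chunk, each carrying a copy of the file's ACL metadata."""
    for index, chunk in enumerate(chunk_text(text, max_tokens)):
        yield {
            "id": f"{acl['file_id']}:{index}",  # illustrative chunk ID scheme
            "vector": embed(chunk),
            "metadata": dict(acl),  # copy, so records do not share one dict
        }
```

Both functions are generators, so a large file is embedded chunk by chunk instead of being held in memory as a full list of vectors.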
When inserting each embedding into your vector index (e.g. Pinecone, Weaviate):
Store the embedding vector
Store metadata: file_id, allowed_users, allowed_groups, and any flags like public or company
During query time, filter results based on the querying user’s ID and group memberships.
Example filter logic:
{
  "$or": [
    { "allowed_users": { "$in": ["user-123"] } },
    { "allowed_groups": { "$in": ["group-456", "group-789"] } },
    { "public": true }
  ]
}
This ensures only authorized users can retrieve content from the index.
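A filter like this can be built per query from the user's ID and group memberships. The sketch below reuses the $or / $in shape from the example above; whether your vector store accepts exactly this syntax is an assumption you should check against its documentation. The is_authorized helper is an illustrative in-memory equivalent, useful as a second check on returned results.

```python
from typing import Iterable, List


def build_acl_filter(user_id: str, user_groups: Iterable[str]) -> dict:
    """Build a metadata filter matching files the user may see."""
    clauses: List[dict] = [{"allowed_users": {"$in": [user_id]}}]
    groups = sorted(set(user_groups))
    if groups:  # skip the clause entirely for users with no groups
        clauses.append({"allowed_groups": {"$in": groups}})
    clauses.append({"public": True})
    return {"$or": clauses}


def is_authorized(metadata: dict, user_id: str, user_groups: Iterable[str]) -> bool:
    """Apply the same rule in memory to one record's ACL metadata."""
    if metadata.get("public"):
        return True
    if user_id in metadata.get("allowed_users", []):
        return True
    return bool(set(user_groups) & set(metadata.get("allowed_groups", [])))
```

Building the filter server-side from your own group map, rather than accepting group IDs from the client, keeps a user from claiming memberships they do not have.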
Once your initial indexing is complete, it’s critical to keep your embeddings and their ACL metadata up to date with any changes in the underlying data. Files may be moved or re-permissioned, groups may change membership, and new content may be created at any time. If your embeddings are stale, users may retrieve content they should no longer see, or miss content they are entitled to.
We recommend a single, structured approach for staying in sync:
Every time Merge completes a sync with the third-party provider, we can notify you via a sync completion webhook. This webhook ensures that the data is fresh and consistent before you act on it.
Once you receive this sync completion event, choose one of the following methods to process updates:
modified_after to sync all changes at once
After you receive the sync completion webhook, query Merge’s API endpoints with modified_after filters to pull all the changes since your last checkpoint.
You'll want to poll the following endpoints using the timestamp from your last processed sync:
GET /files?modified_after=...
GET /folders?modified_after=...
GET /groups?modified_after=...
GET /users?modified_after=... (optional, for COMPANY-level access)
Process each returned record accordingly (re-embed, re-ACL, delete, etc.) and update your checkpoint.
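A checkpointed polling pass might be sketched like this. The assumptions are significant: fetch_page and handle_record are hypothetical callables you provide, the cursor query parameter and the results response key are assumed, and the new checkpoint is simply the time the run started, which presumes your clock is close to the API's. You may prefer a timestamp taken from the webhook payload instead.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

# Groups first, so group maps are fresh before file ACLs are rebuilt.
ENDPOINTS = ["/groups", "/folders", "/files"]


def sync_changes(
    fetch_page: Callable[[str, dict], dict],
    handle_record: Callable[[str, dict], None],
    checkpoint: str,
    now: Optional[datetime] = None,
) -> str:
    """Pull everything modified since `checkpoint`; return the next checkpoint."""
    started = now or datetime.now(timezone.utc)  # captured before any request
    for endpoint in ENDPOINTS:
        cursor = None
        while True:
            params = {"modified_after": checkpoint, "page_size": 100}
            if cursor:
                params["cursor"] = cursor  # assumed parameter name
            page = fetch_page(endpoint, params)
            for record in page.get("results", []):  # assumed response key
                handle_record(endpoint, record)
            cursor = page.get("next")
            if not cursor:
                break
    return started.isoformat()
```

Persist the returned checkpoint only after the pass finishes without errors; if a run fails midway, repeating it from the old checkpoint reprocesses some records but does not skip any, so handle_record should be idempotent.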
Alternatively, after receiving the sync completion webhook, you can rely on Merge’s “data changed” webhooks to receive a stream of granular events in near real time. These webhooks fire for:
file.created, file.updated, file.deleted
folder.updated, folder.created
group.updated, group.created
And more…
This approach is ideal if you want low-latency updates and are comfortable handling a higher volume of webhook traffic.
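If you take the webhook route, a small routing function keeps the handling logic in one place. The sketch below maps the event names listed above to an action label; the action names are purely illustrative, and how you extract the event name from the webhook payload is left out because the payload shape is not covered here.

```python
def plan_action(event_type: str) -> str:
    """Map an event name such as 'file.updated' to an illustrative action label."""
    model, _, action = event_type.partition(".")
    if model == "file":
        # Deleted files should lose their embeddings; anything else is re-embedded
        # with a freshly built ACL.
        return "delete_embeddings" if action == "deleted" else "reembed_file"
    if model == "group":
        # Membership or nesting may have changed: rebuild maps and affected ACLs.
        return "refresh_group_maps"
    if model == "folder":
        # Files carry their full ACL, so re-check files under this folder.
        return "recheck_folder_files"
    return "ignore"
```

Returning a label instead of doing the work inline makes it easy to push events onto a queue and process them in batches when webhook volume spikes.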
With Merge’s File Storage API, you can confidently build ACL-aware embeddings that scale with even the largest file systems. By combining permission inheritance with efficient traversal and update strategies, your search or AI workflows will always respect who should have access to what.