Implementing ACL-aware embeddings using Merge’s Knowledge Base API

Last updated: April 6, 2026

This guide walks through implementing Merge's Knowledge Base ACLs. For a higher-level overview, see:

📄 Knowledge Base Access Control Lists (ACLs)

Overview

Creating ACL-aware embeddings for articles means generating vector embeddings of article content while preserving access control context. This ensures that search and retrieval of embedded content respects each article’s permissions (ACLs). If you’re working with large knowledge base systems (including millions of articles), it’s important to traverse Merge’s Knowledge Base API efficiently and handle permission inheritance appropriately. You’ll also want a strategy for keeping embeddings updated as articles, containers, or group permissions change over time.

By following the steps in this guide, you’ll be able to:

Understand how Merge’s Knowledge Base data model handles articles, containers, groups, users, and ACL inheritance
Efficiently traverse Merge’s API using pagination to extract article content and permission metadata without overloading memory
Generate embeddings from article content and associate each with a complete set of access controls
Keep ACLs and embedding metadata up to date using either polling or webhook notifications

With this approach, your search and AI systems can maintain an index that only returns documents a user is authorized to view, even as the underlying structure evolves.

Integration support for ACLs

Not all Knowledge Base integrations support ACLs. Notably, Notion does not natively support ACLs via its API.

For Notion, we recommend either a user-based (non-admin) authentication approach or managing ACLs on your side. For more guidance on managing ACLs on your side, review our guide:

📄 Implementing Customer-Managed Access Control Lists (ACLs) for Knowledge Base

Data model and ACL inheritance

Before jumping into implementation, it’s important to understand how we represent articles and containers in our Unified API.

Articles

Each article includes an id, name, a reference to its parent article, a reference to its parent container, a reference to its root container, and a list of permissions.

These permissions reflect the complete set of access controls - including those inherited from containers - because this is how third-party platforms return the data to us. You do not need to separately traverse the container hierarchy to aggregate inherited permissions.

Here’s a simplified example:

{
  "id": "Article-123",
  "name": "Q1 roadmap.pdf",
  "parent_article": "parent-article-1",
  "parent_container": "parent-container-1",
  "root_container": "root-container-123",
  "permissions": [
    { "type": "USER", "user": "user-1", "roles": ["OWNER"] },
    { "type": "GROUP", "group": "group-1", "roles": ["READ"] }
  ]
}
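As a minimal sketch, the record above could be parsed into typed objects before further processing. The class names `Permission` and `Article` and the helper `parse_article` are hypothetical; only the field names come from the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Permission:
    type: str                       # e.g. "USER" or "GROUP"
    user: Optional[str] = None      # set when type == "USER"
    group: Optional[str] = None     # set when type == "GROUP"
    roles: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Article:
    id: str
    name: str
    parent_article: Optional[str]
    parent_container: Optional[str]
    root_container: Optional[str]
    permissions: Tuple[Permission, ...] = ()


def parse_article(record: dict) -> Article:
    """Convert a raw article record (shaped like the example) into an Article."""
    permissions = tuple(
        Permission(
            type=p["type"],
            user=p.get("user"),
            group=p.get("group"),
            roles=tuple(p.get("roles", [])),
        )
        for p in record.get("permissions", [])
    )
    return Article(
        id=record["id"],
        name=record.get("name", ""),
        parent_article=record.get("parent_article"),
        parent_container=record.get("parent_container"),
        root_container=record.get("root_container"),
        permissions=permissions,
    )


example = {
    "id": "Article-123",
    "name": "Q1 roadmap.pdf",
    "parent_article": "parent-article-1",
    "parent_container": "parent-container-1",
    "root_container": "root-container-123",
    "permissions": [
        {"type": "USER", "user": "user-1", "roles": ["OWNER"]},
        {"type": "GROUP", "group": "group-1", "roles": ["READ"]},
    ],
}
article = parse_article(example)
```

Because the permissions list already includes inherited access, nothing here needs to look up the parent article or containers.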

Attachments

The Attachment object is used to represent an attachment to an article. Attachment ACLs are effectively inherited from the parent article and governed at the Article level.

If a user or group can access the article, they can access the attachments referenced by that article’s attachments array.
If they can’t access the article, the attachment should be treated as inaccessible as well.

For Notion specifically, attachments don’t support distinct ACL/permissions separate from the article.

Containers

Containers can contain subcontainers and articles. While containers may also have permissions, you do not need to aggregate them for article access control - articles already include the full ACL based on their parent article and container context.

Groups

Groups in Merge’s model can:

Include individual users
Reference a parent group (nested groups)

Users

User objects contain fields like name, email_address, and id. These IDs are used in article and container ACLs, either directly or through group membership.

Learn more about ACLs here: 📄 Knowledge Base Access Control Lists (ACLs)

Traversing the API efficiently (articles, containers, groups)

When working with large article repositories, it’s crucial to avoid loading too much into memory at once. Our API is paginated, so you’ll be able to iterate over your customer’s entire knowledge system in manageable chunks.

This process requires Groups, Users, and Permissions support.

Certain integrations, like Notion, do not natively support Groups and Users via their API. For these integrations, you will need to implement Customer-Managed ACLs and adjust this process to fit your ACL implementation.

We recommend processing your data in the following order:

Fetch all Groups – Build a map of group memberships and nested hierarchies.
Fetch all Articles – Articles include their complete permissions, so there is no need to traverse the Container hierarchy.
Fetch Article and Attachment contents – Download content only when you're ready to embed.
Generate embeddings and store them with ACL metadata – Do this incrementally, page by page.

Each of these is described in detail below.

1. Retrieve groups and users for ACL context

Start by calling: GET /groups

You’ll receive a paginated list of Group objects. For each group, extract:

The group ID and name
The list of users
The parent_group

From this, build the following maps:

Group → Users: Which users are direct members of a group
Group → Parent Groups: Which other groups inherit from this one (and vice versa if needed)
User → Groups (optional): Can be derived from the above for access filtering later

Make sure to handle pagination by checking the next cursor on each response. Use page_size=100 to reduce the number of requests needed.
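A cursor-following loop might look like the sketch below. The response shape (`results` plus a `next` cursor) is an assumption, and `fetch_page` is a hypothetical callable that you would back with your HTTP client; here it is backed by in-memory pages so the example runs on its own.

```python
from typing import Callable, Iterator, Optional

# Assumed response shape: {"next": <cursor or None>, "results": [...]}.
Page = dict


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield records lazily, following the `next` cursor until it is empty."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("results", [])
        cursor = page.get("next")
        if not cursor:
            return


def make_fake_fetch(pages: dict) -> Callable[[Optional[str]], Page]:
    """Stand-in for an HTTP call such as GET /groups?page_size=100&cursor=..."""
    def fetch_page(cursor: Optional[str]) -> Page:
        return pages[cursor]
    return fetch_page


fake_pages = {
    None: {"next": "cursor-2", "results": [{"id": "group-1"}, {"id": "group-2"}]},
    "cursor-2": {"next": None, "results": [{"id": "group-3"}]},
}
group_ids = [g["id"] for g in iter_records(make_fake_fetch(fake_pages))]
```

Because `iter_records` is a generator, only one page is held in memory at a time.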

Next, call: GET /users

This step is optional, as group records already contain user IDs. If you need user emails or names - for display or filtering - you can fetch these here. Otherwise, you can rely on the IDs collected from the group data.

Why fetch groups first? When you later analyze article permissions, you’ll encounter group IDs. Having the full group map cached avoids needing to resolve them in real time for each article.
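One way to build that cached group map is sketched below. It assumes each group record exposes `id`, `users` (a list of user IDs), and `parent_group` (a group ID or nothing); the function name `build_group_maps` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

GroupMap = Dict[str, Set[str]]


def build_group_maps(groups: Iterable[dict]) -> Tuple[GroupMap, GroupMap, GroupMap]:
    """Return (group -> direct users, parent group -> child groups, user -> groups)."""
    group_users: GroupMap = {}
    child_groups: GroupMap = defaultdict(set)
    user_groups: GroupMap = defaultdict(set)
    for group in groups:
        members = set(group.get("users") or [])
        group_users[group["id"]] = members
        parent = group.get("parent_group")
        if parent:
            # Record the nesting so a parent can later be expanded downward.
            child_groups[parent].add(group["id"])
        for user_id in members:
            user_groups[user_id].add(group["id"])
    return group_users, dict(child_groups), dict(user_groups)


sample_groups = [
    {"id": "group-1", "users": ["user-2"], "parent_group": None},
    {"id": "group-2", "users": ["user-2", "user-3"], "parent_group": "group-1"},
]
group_users, child_groups, user_groups = build_group_maps(sample_groups)
```

The `user_groups` map holds direct memberships only; walking `child_groups` is what resolves nested access.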

2. Retrieve articles and extract permissions

Now retrieve articles using: GET /articles

Each article record includes:

Its container reference
Article-specific metadata
A complete list of permissions, including both direct and inherited

You do not need to climb the article and container hierarchy to aggregate permissions. Permissions on articles reflect the complete set of access controls, including those inherited from parent articles and containers, because this is how third-party platforms return the data to us.

Your job is to:

Expand GROUP permissions to include all users from that group and its descendant groups
Track USER permissions directly
Mark COMPANY and ANYONE access appropriately

This gives you a complete list of users and groups who can access each article. That list becomes your ACL metadata attached to the embedding.
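The expansion could be sketched as follows. The inputs `group_users` (group ID to direct user IDs) and `child_groups` (parent group ID to child group IDs) are assumed to have been built from the groups endpoint beforehand, and the sketch assumes that any listed role grants read access; `build_acl` and the `company` flag name are hypothetical.

```python
from typing import Dict, Set


def descendant_groups(group_id: str, child_groups: Dict[str, Set[str]]) -> Set[str]:
    """Return the group itself plus every nested group below it (cycle-safe)."""
    seen: Set[str] = set()
    stack = [group_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(child_groups.get(current, ()))
    return seen


def build_acl(article: dict, group_users: Dict[str, Set[str]],
              child_groups: Dict[str, Set[str]]) -> dict:
    """Flatten an article's permissions into metadata for the vector index."""
    allowed_users: Set[str] = set()
    allowed_groups: Set[str] = set()
    public = company = False
    for perm in article.get("permissions", []):
        kind = perm.get("type")
        if kind == "USER" and perm.get("user"):
            allowed_users.add(perm["user"])
        elif kind == "GROUP" and perm.get("group"):
            for group_id in descendant_groups(perm["group"], child_groups):
                allowed_groups.add(group_id)
                allowed_users |= group_users.get(group_id, set())
        elif kind == "COMPANY":
            company = True
        elif kind == "ANYONE":
            public = True
    return {
        "article_id": article["id"],
        "allowed_users": sorted(allowed_users),
        "allowed_groups": sorted(allowed_groups),
        "public": public,
        "company": company,
    }


acl = build_acl(
    {"id": "Article-123", "permissions": [
        {"type": "USER", "user": "user-1", "roles": ["OWNER"]},
        {"type": "GROUP", "group": "group-1", "roles": ["READ"]},
    ]},
    group_users={"group-1": {"user-2"}, "group-2": {"user-3"}},
    child_groups={"group-1": {"group-2"}},
)
```

Storing both the expanded users and the expanded groups is a design choice: it lets you filter by either at query time, at the cost of re-indexing metadata when memberships change.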

3. Fetch article and attachment content and generate embeddings

Once you have the article:

Extract the text from the raw HTML using an appropriate HTML parser
Chunk the text if necessary (e.g., every 1,000–2,000 tokens)
For each attachment referenced in the article’s attachments array, optionally download and extract text (e.g., PDFs, docs) only if you plan to make it searchable.
Use your embedding model (e.g., OpenAI, Cohere, in-house) to generate vector representations
Attach ACL metadata to each chunk. The Attachment object does not include its own permissions, so attachments should be treated as accessible exactly when the parent article is accessible (i.e., attachments inherit the article’s access model).

Example:

{
  "article_id": "abc123",
  "allowed_users": ["user-1"],
  "allowed_groups": ["group-3", "group-4"],
  "public": false
}

Process articles in a streaming fashion to avoid memory issues.
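A streaming version of these steps might look like the sketch below, which uses only Python's standard-library HTML parser. Splitting on whitespace is a rough stand-in for token counting, and the input shape (tuples of article ID, HTML string, and ACL metadata) and function names are hypothetical. The embedding call itself is left out because it depends on your provider.

```python
from html.parser import HTMLParser
from typing import Iterable, Iterator, Tuple


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_words(text: str, max_words: int = 800) -> Iterator[str]:
    """Yield chunks of at most max_words whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


def iter_chunk_records(items: Iterable[Tuple[str, str, dict]],
                       max_words: int = 800) -> Iterator[dict]:
    """Lazily yield one record per chunk, each carrying the article's ACL."""
    for article_id, html, acl in items:
        for index, chunk in enumerate(chunk_words(html_to_text(html), max_words)):
            yield {"article_id": article_id, "chunk_index": index,
                   "text": chunk, **acl}
```

Each yielded record can be embedded and written to the index before the next article is read, so memory use stays bounded by a single article.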

4. Store embeddings with ACL context

When inserting each embedding into your vector index (e.g. Pinecone, Weaviate):

Store the embedding vector
Store metadata: article_id, allowed_users, allowed_groups, and any flags like public or company

At query time, filter results based on the querying user’s ID and group memberships.

Example filter logic:

{
  "$or": [
    { "allowed_users": { "$in": ["user-123"] } },
    { "allowed_groups": { "$in": ["group-456", "group-789"] } },
    { "public": true }
  ]
}

This ensures only authorized users can retrieve content from the index.
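As a sketch, the filter above could be generated per request from the querying user's ID and groups. The `$or`/`$in` operators mirror the example; check your vector store's own filter syntax before relying on them. The helpers `build_acl_filter` and `is_visible` are hypothetical, and the second is a plain-Python equivalent that is handy for re-checking results after retrieval.

```python
from typing import Iterable


def build_acl_filter(user_id: str, group_ids: Iterable[str]) -> dict:
    """Build a metadata filter matching documents the user may see."""
    clauses = [{"allowed_users": {"$in": [user_id]}}]
    groups = sorted(set(group_ids))
    if groups:
        # Skipping the clause for users with no groups avoids an empty $in list.
        clauses.append({"allowed_groups": {"$in": groups}})
    clauses.append({"public": True})
    return {"$or": clauses}


def is_visible(metadata: dict, user_id: str, group_ids: Iterable[str]) -> bool:
    """Apply the same rule in plain Python as a defense-in-depth check."""
    return (
        bool(metadata.get("public"))
        or user_id in metadata.get("allowed_users", [])
        or bool(set(group_ids) & set(metadata.get("allowed_groups", [])))
    )


query_filter = build_acl_filter("user-123", ["group-456", "group-789"])
```

Pass every group the user belongs to, so that the group clause matches however the allowed groups were recorded at indexing time.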

Keeping embeddings updated as ACLs change

Once your initial indexing is complete, it’s critical to keep your embeddings and their ACL metadata up to date with any changes in the underlying data. Articles may be moved or re-permissioned, groups may change membership, and new content may be created at any time. If your embeddings are stale, users may retrieve content they shouldn’t see or miss content they should be able to see.

We recommend a single, structured approach for staying in sync:

Wait for Merge to finish syncing, then choose an update strategy

Every time Merge completes a sync with the third-party provider, we can notify you via a sync completion webhook. This webhook ensures that the data is fresh and consistent before you act on it.

Once you receive this sync completion event, choose one of the following methods to process updates:

Option 1: Use `modified_after` to sync all changes at once

After you receive the sync completion webhook, query Merge’s API endpoints with modified_after filters to pull all the changes since your last checkpoint.

You'll want to poll the following endpoints using the timestamp from your last processed sync:

GET /articles?modified_after=...
GET /attachments?modified_after=...
GET /containers?modified_after=...
GET /groups?modified_after=...
GET /users?modified_after=... (optional, for COMPANY-level access)

Process each returned record accordingly (re-embed, re-ACL, delete, etc.) and update your checkpoint.
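A checkpointed polling pass might be sketched like this. `fetch_all` and `handle` are hypothetical callables: the first would page through an endpoint with the given query parameters, and the second would apply your re-embed, re-ACL, or delete logic. Processing groups before articles is a design choice so that ACL expansion sees fresh memberships.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional

# Endpoints from the list above, reordered so ACL context is refreshed first.
ENDPOINTS = ("/groups", "/users", "/containers", "/articles", "/attachments")


def sync_changes(
    fetch_all: Callable[[str, dict], Iterable[dict]],
    handle: Callable[[str, dict], None],
    checkpoint: datetime,
    now: Optional[datetime] = None,
) -> datetime:
    """Process everything modified since `checkpoint`; return the next checkpoint."""
    # Capture the start time before fetching so changes that land while this
    # pass is running are picked up again by the next pass.
    started_at = now or datetime.now(timezone.utc)
    params = {"modified_after": checkpoint.isoformat()}
    for endpoint in ENDPOINTS:
        for record in fetch_all(endpoint, params):
            handle(endpoint, record)
    return started_at
```

Persist the returned timestamp only after the pass succeeds. Since a record may then be seen twice across passes, `handle` should be idempotent.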

Option 2: Use real-time webhooks for every individual change

Alternatively, after receiving the sync completion webhook, you can rely on Merge’s “data changed” webhooks to receive a stream of granular events in near real time. These webhooks fire for:

articles.created, articles.updated, articles.deleted
containers.updated, containers.created
group.updated, group.created
And more…

This approach is ideal if you want low-latency updates and are comfortable handling a higher volume of webhook traffic.
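How each event maps to index maintenance is up to you; the sketch below shows one possible routing. The action names are hypothetical, the event names are taken from the list above, and the webhook payload format is deliberately not modeled here.

```python
def plan_action(event: str) -> str:
    """Map a webhook event name to a hypothetical index-maintenance action."""
    model, _, change = event.partition(".")
    if model == "articles":
        # Deleted articles should have their vectors removed; anything else
        # means the content or its permissions may have changed.
        return "delete_vectors" if change == "deleted" else "reembed_article"
    if model == "containers":
        # Container changes can alter inherited access, so refresh ACL metadata.
        return "refresh_article_acls"
    if model == "group":
        # Membership or nesting changed: rebuild the cached group maps.
        return "rebuild_group_maps"
    return "ignore"


actions = [plan_action(e) for e in (
    "articles.created", "articles.deleted", "containers.updated", "group.updated",
)]
```

Returning `"ignore"` for unrecognized events keeps the handler tolerant of event types it does not know about.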

Conclusion

With Merge’s Knowledge Base API, you can confidently build ACL-aware embeddings that scale with even the largest knowledge base systems. By combining permission inheritance with efficient traversal and update strategies, your search or AI workflows will always respect who should have access to what.