Confluence Data Center

The Glean connector for Confluence Data Center indexes Confluence pages, blog posts, attachments (metadata), footer comments, spaces, and related metadata from customer‑managed deployments, while mirroring source permissions.

Key features

Glean captures Confluence pages (including their hierarchical parent/child structure), blog posts, attachment metadata, comments, and more.
Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Confluence web application, which enforces the permission.
The connector ensures comprehensive data coverage, including metadata, identity data, permissions data, and activity data. It provides near-real-time synchronization, so updates and permission changes are reflected in search results shortly after they occur.
All data is stored in the cloud project within the customer’s cloud account (Glean or customer hosted), ensuring no data leaves the customer’s environment.
Glean uses Atlassian’s standard REST API for Confluence to ingest all data.
Webhooks provide near-real-time freshness; the Glean plugin provides view activity signals that improve ranking.

Versions supported

Confluence Data Center/Server versions 7.4 and above are supported. You can use the Applinks manifest endpoint, /rest/applinks/1.0/manifest, to get the version and other information about the instance.
Glean also supports Confluence Cloud. For more information, see the Confluence Cloud Connector.
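
As a minimal sketch of the version check described above, the following Python reads the version from a manifest document and compares it with the 7.4 minimum. It assumes the manifest is XML with a top-level version element; the helper names (fetch_manifest, parse_version, is_supported) and the sample document are hypothetical and not part of the Glean connector.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

MIN_SUPPORTED = (7, 4)  # minimum Data Center/Server version stated above


def fetch_manifest(base_url: str) -> str:
    """Download the Applinks manifest from a Confluence instance (not called here)."""
    url = base_url.rstrip("/") + "/rest/applinks/1.0/manifest"
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_version(manifest_xml: str) -> tuple:
    """Return (major, minor), assuming a top-level <version> element."""
    text = ET.fromstring(manifest_xml).findtext("version")
    match = re.match(r"(\d+)\.(\d+)", (text or "").strip())
    if match is None:
        raise ValueError("manifest has no usable <version> element")
    return int(match.group(1)), int(match.group(2))


def is_supported(manifest_xml: str) -> bool:
    """True when the reported version is 7.4 or later."""
    return parse_version(manifest_xml) >= MIN_SUPPORTED


# Hypothetical manifest, trimmed to the fields used here.
SAMPLE = "<manifest><typeId>confluence</typeId><version>7.19.4</version></manifest>"
supported = is_supported(SAMPLE)
```

In practice you would pass the result of fetch_manifest("https://confluence.mydomain.com") to is_supported instead of the sample string.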

Indexed content and data

The Glean Confluence connector crawls three distinct types of data—Content, Identity, and Activity—to ensure a fast, comprehensive, and securely managed index.

Content

Pages, blog posts, spaces, attachments (metadata), footer comments.
Blog posts have no hierarchy; crawled via standard content listing APIs.
Restricted pages can be indexed if the connector is granted access; permissions remain enforced in results.
Archived pages are not applicable for Confluence Data Center; archived spaces are crawled by default and are configurable.

Activity data and webhooks

Processes create, update, delete, move/restore, and permission change events for fast updates.
Plugin surfaces view activity to improve ranking signals.
Webhook events for permission/restriction changes trickle down to the sub-documents of the document tree with a minimum delay of 20 minutes (the frequency of our cache).
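
To illustrate how a restriction change on one page can trickle down to its sub-documents, here is a minimal sketch that collects every descendant of a page with a breadth-first walk. The children_of mapping and the page IDs are hypothetical; this is not Glean's implementation, only a picture of which documents would need a permission refresh.

```python
from collections import deque


def descendants(children_of: dict, page_id: str) -> list:
    """Return all pages below page_id, in breadth-first order."""
    ordered = []
    visited = set()
    queue = deque(children_of.get(page_id, []))
    while queue:
        current = queue.popleft()
        if current in visited:
            continue  # guard against duplicate entries in the mapping
        visited.add(current)
        ordered.append(current)
        queue.extend(children_of.get(current, []))
    return ordered


# Hypothetical page tree: parent ID -> list of child IDs.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": []}
to_refresh = descendants(TREE, "root")
```

A change to the restrictions on "root" would therefore require re-evaluating permissions for every ID in to_refresh.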

Identity data

Crawls users, emails, groups, and memberships; visibility is limited to configured product access groups.
Known DC issue: group members API is unstable in some versions; plugin can serve group memberships as a workaround.

Limitations

Only footer comments are indexed (not inline). If indexing of inline comments is critical to your workflow, consider copying the content of important inline comments into a page comment to ensure it is indexed by Glean.
Confluence mutator crawls do not work.
Image attachment content isn’t indexed; attachment metadata is captured.
Blog posts have no hierarchy.
Content restrictions read API in Server/DC returns all restrictions in one response (no pagination).
The users listing API on Server can return 5xx errors when invalid users are present; fix the product access groups to resolve this.
New comments on pages older than 1 day aren’t crawled or indexed via webhooks or incremental crawls. They’re only updated in full crawls.

Rate limits

Queries per Second (QPS): QPS depends on customer server capacity and is configurable; a common default is ~16 aggregate QPS (admin: 3, content: 7, identity: 6).
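
The following sketch shows one way a client could hold each category of requests to its own QPS budget, using the default split quoted above. The Throttle class is hypothetical and only illustrates the idea of spacing requests evenly; it is not how the connector itself enforces the limit.

```python
import time

# Default split quoted above: 3 + 7 + 6 = 16 aggregate QPS.
DEFAULT_QPS = {"admin": 3, "content": 7, "identity": 6}


class Throttle:
    """Space calls so that at most `qps` of them start per second."""

    def __init__(self, qps, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / qps
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> float:
        """Block until the next call is allowed; return the time waited."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Schedule the following slot one interval after this one.
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return delay


# One throttle per request category.
throttles = {name: Throttle(qps) for name, qps in DEFAULT_QPS.items()}
```

Calling throttles["content"].wait() before each content request keeps that category at or below 7 requests per second.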

Update frequency

Identity: full every 10 minutes; no incremental.
Content (pages/blog posts): full every 7 days; incremental hourly (updates since start of day).
Space permissions: full every 3 hours.
Webhooks apply changes in near real time between scheduled crawls.
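
As an illustration of the "updates since start of day" window, the sketch below builds a CQL search URL that lists pages and blog posts modified since midnight. The lastmodified field and the startOfDay() function are standard CQL, and content/search is the endpoint listed under API Endpoints; whether the connector issues this exact query is an assumption, and incremental_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def incremental_search_url(base_url, content_types=("page", "blogpost"), limit=100):
    """Build a content/search URL for items modified since the start of today."""
    types = ", ".join(content_types)
    cql = f"type in ({types}) and lastmodified >= startOfDay()"
    query = urlencode({"cql": cql, "limit": limit})
    # The /rest/api prefix is the usual Confluence REST root (assumed here).
    return f"{base_url.rstrip('/')}/rest/api/content/search?{query}"


url = incremental_search_url("https://confluence.mydomain.com")
```

Because the window always begins at the start of the day, each hourly run sees the same items as the previous run plus anything changed since.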

How the crawl works

The connector follows a traditional crawl strategy, using the API and the following mechanisms to fetch and update data:

Identity crawl: adds and updates People data, including users, groups, and other information.
Webhooks: messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content crawls: a full crawl covers the entire defined scope of the application, whereas an incremental crawl captures only the changes made since the previous full or incremental crawl.
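
The difference between a full and an incremental content crawl can be sketched as a filter on modification time. The document records and the select_for_crawl helper below are hypothetical, used only to make the distinction concrete.

```python
from datetime import datetime, timezone


def select_for_crawl(documents, last_crawl=None):
    """Return document IDs to fetch.

    With no previous crawl time this behaves like a full crawl; otherwise
    only documents modified after the previous crawl are selected.
    """
    if last_crawl is None:
        return [doc["id"] for doc in documents]
    return [doc["id"] for doc in documents if doc["modified"] > last_crawl]


# Hypothetical documents with their last-modified timestamps.
DOCS = [
    {"id": "page-1", "modified": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"id": "page-2", "modified": datetime(2024, 1, 3, 15, 30, tzinfo=timezone.utc)},
    {"id": "blog-1", "modified": datetime(2024, 1, 4, 8, 0, tzinfo=timezone.utc)},
]

full = select_for_crawl(DOCS)
incremental = select_for_crawl(DOCS, datetime(2024, 1, 2, tzinfo=timezone.utc))
```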

Required permissions

The user setting up this data source must have administrator permissions. You can reach out to Glean Support for any network configuration requirements.

Setup instructions

Perform the following steps to connect Confluence Data Center with Glean.

1. Create a service account for Glean

Sign into Confluence as an admin.
Go to User Management.
Create a user with any name, email, and password.
Click Edit Groups.
Add the service account to confluence-administrators. Alternatively, ensure the user is a space administrator for all spaces that should be crawled.

2. Provide basic information about your Confluence instance

Enter the server’s base URL in the Glean setup page. For example, https://confluence.mydomain.com.
If a network proxy is used to route requests (contact Glean support to confirm if you are not sure about this), enter the Confluence Server Host or IP in the Server Host or Server IP input fields.
In case there are multiple domains in your Confluence instance, enter all the URLs except the base URL in the Additional domains field. For multiple URLs use commas and no spaces to separate the URLs.
Enter the product access group(s). This should be the group(s) containing all Confluence users. Often, this is confluence-users. For multiple groups use commas and no spaces to separate the group names.
Enter the service account details created earlier into Glean.
Enter the number of API calls per second supported by your Confluence instance.
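
Since the Additional domains and product access group fields expect commas with no spaces between entries, a quick check before pasting values can catch formatting mistakes. The parse_list_field helper below is a hypothetical sketch of that check, not part of the Glean setup page.

```python
def parse_list_field(value: str) -> list:
    """Split a comma-separated field, rejecting spaces around the commas."""
    items = value.split(",")
    for item in items:
        if not item:
            raise ValueError("empty entry (stray comma?)")
        if item != item.strip():
            raise ValueError(f"remove the space around {item.strip()!r}")
    return items


groups = parse_list_field("confluence-users,engineering")
```

A value such as "confluence-users, engineering" raises ValueError because of the space after the comma.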

3. Configure Webhook / Plugin activity

Check the Admin-privileged service account checkbox if the service account is part of the confluence-administrators group. This automates setting up the webhook and configuring the Glean plugin after installation.
Install the Glean activity plugin, which is available on the Atlassian Marketplace. The Marketplace page provides the installation instructions.

Note: The following sections can be skipped if the service account has admin privileges. If your service account does not have admin privileges, please navigate to your newly created instance in Admin Apps Setup page and follow the rest of the instructions from there.

3a. Configuring the Glean activity plugin

We need to configure the Glean activity plugin to send the events to the correct endpoint.
Go to Manage Apps in Confluence Admin UI.
Open the glean_search app and click on Configure.
Copy the target URL from the Plugin Target URL box shown in the Glean UI, paste it into the plugin configuration, and click Submit. The URL must be a valid URL in the format: https://domain-be.glean.com/instance/CONFLUENCE_ABC1234/scio_event
The activity plugin should now be configured successfully.
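
A simple shape check can catch a mistyped target URL before it is submitted. The sketch below infers the expected shape from the single example above (https, a host, then /instance/<instance name>/scio_event), so treat the rule as an assumption; looks_like_plugin_target_url is a hypothetical helper.

```python
from urllib.parse import urlsplit


def looks_like_plugin_target_url(url: str) -> bool:
    """Check for https://<host>/instance/<instance name>/scio_event."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    return (
        parts.scheme == "https"
        and bool(parts.netloc)
        and len(segments) == 3
        and segments[0] == "instance"
        and bool(segments[1])
        and segments[2] == "scio_event"
    )


ok = looks_like_plugin_target_url(
    "https://domain-be.glean.com/instance/CONFLUENCE_ABC1234/scio_event"
)
```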

3b. Connect the webhook

Go to General Configuration in Confluence Admin UI.
Click Create webhook.
Configure as follows:

Config	Value
Name	Glean Search
URL	Copy the URL from Webhook URL box
Webhook Shared Secret	Use any value. Enter it in Glean setup page and click Save!
Events	Select all
Status	active
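
For context on what the shared secret is for: a receiver can use it to check that an incoming event was really sent by your Confluence instance. The sketch below assumes the payload is signed with HMAC-SHA256 and delivered in a header of the form sha256=<hex digest> (the X-Hub-Signature convention); that header format is an assumption to confirm against your Confluence version, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Compute the signature header value for a request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(secret: str, body: bytes, header_value: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))


# Hypothetical event body and secret.
BODY = b'{"event": "page_created"}'
header = sign("my-shared-secret", BODY)
valid = verify("my-shared-secret", BODY, header)
```

A body that was altered in transit, or a request signed with a different secret, fails the check.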

API Endpoints

Purpose	DC Endpoint	DC Method	DC Permission
List users	search/user	GET	READ
List groups	group	GET	READ
List group members	group/%s/member	GET	READ
List groups of user	user/memberof	GET	READ
Get current user	N/A
Get email of users	user/non-system	GET	ADMIN
List spaces	space	GET	SPACE_ADMIN
CQL based list spaces	search	GET	READ
List pages in space	space/%s/content/page	GET	READ
List blogposts in space	space/%s/content/blogpost	GET	READ
Get space permissions	spaces/spacepermissions.action	GET	Confluence Administrator
List content	content	GET	READ
Get content	content/%s	GET	READ
CQL based list content	content/search	GET	READ
List children of page	pages/%s/children	GET	READ
Fetch applinks	rest/applinks/1.0/listApplicationlinks	GET	Confluence Administrator
Create webhook	rest/api/webhooks	POST	Confluence Administrator
Get content restrictions	content/%s/restriction/byOperation/read	GET	READ
Update content restriction	N/A
Configure plugin	scio_search/1.0/configure	POST
Get installed plugin version	scio_search/1.0/version	GET
Get space permissions via plugin	scio_search/1.0/space_permissions	GET
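
The %s placeholders in the table stand for an identifier such as a space key, content ID, or group name. As a sketch, the helper below fills a template and joins it to the base URL. It assumes the short paths sit under the usual /rest/api prefix, which the table does not state; endpoint_url and the ENDPOINTS subset are hypothetical.

```python
from urllib.parse import quote

REST_PREFIX = "rest/api"  # assumed prefix for the short paths in the table

# A few templates copied from the table above.
ENDPOINTS = {
    "list_group_members": "group/%s/member",
    "list_pages_in_space": "space/%s/content/page",
    "get_content": "content/%s",
    "get_read_restrictions": "content/%s/restriction/byOperation/read",
}


def endpoint_url(base_url: str, name: str, *args) -> str:
    """Fill a template's %s placeholders and build the full URL."""
    encoded = tuple(quote(str(arg), safe="") for arg in args)
    path = ENDPOINTS[name] % encoded
    return f"{base_url.rstrip('/')}/{REST_PREFIX}/{path}"


pages_url = endpoint_url("https://confluence.mydomain.com/", "list_pages_in_space", "ENG")
```

Percent-encoding each argument keeps group names or keys that contain spaces or slashes from breaking the path.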

Content configuration

Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority. The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders. Exclusion rules are applied automatically after the next full crawl, which can vary by corpus size. If a recrawl is needed, please reach out to your Glean representative.

Exclusion (Red-listing) options

Glean provides several options for excluding content from the data crawl, which also removes that content from search and chat results.

Space: Exclude certain Confluence spaces from being crawled by Glean by specifying space keys
Pages with specific labels: Exclude pages and blog posts with specific labels from being crawled by Glean
Pages with content matching specific regex: Exclude pages and blog posts with content matching specific regex from being crawled by Glean
Creators: Exclude content created by certain creators from being crawled by Glean.

Confluence Cloud Connector Exclusion Options

Inclusion (Green-listing) options

Glean provides options for limiting the data crawl to specific content, which also limits search and chat results to that content.

Spaces: Only allow Glean to crawl certain Confluence spaces. Glean will crawl all spaces except those in the Exclusion rules if no spaces are specified.

Confluence Cloud Connector Inclusion Options

Note: Only content specified for inclusion will appear in search results, chat, or any other Glean application. Content that is not specified will not appear.
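
The precedence between the two rule types can be summarized in a few lines. This is a minimal sketch covering only the space and label options; is_indexed and its parameters are hypothetical names, not Glean configuration keys.

```python
def is_indexed(space_key, labels, included_spaces=(), excluded_spaces=(), excluded_labels=()):
    """Apply exclusion first, then inclusion, as described in this section."""
    # Exclusion rules take priority over inclusion rules.
    if space_key in excluded_spaces:
        return False
    if set(labels) & set(excluded_labels):
        return False
    # With inclusion rules set, only the listed spaces are crawled.
    if included_spaces and space_key not in included_spaces:
        return False
    # With no inclusion rules, everything not excluded is crawled.
    return True


both_rules = is_indexed("HR", [], included_spaces=["HR"], excluded_spaces=["HR"])
```

Here both_rules is False: the HR space appears in both lists, and the exclusion rule wins.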
