
- Crawling - Glean fetches data from your connected apps.
- Indexing - Glean creates a model of the data that was fetched and incorporates it into your organization’s search index.
- Learning - Glean processes the data that was fetched using Machine Learning (ML) to create a search and ranking algorithm tailored to your organization’s data and users.
Timelines for Completion
The time required to complete all 3 processes will vary depending on the size of your organization and the volume of content that Glean needs to process. The combined crawling and indexing processes can take approximately:
- 2-3 days to complete for a typical small organization, or small volume of content.
- 10-14 days for a typical large organization, or large volume of content.
The time required for the learning (ML) process depends on:
- The GCP or AWS region that your Glean tenant was deployed to (and the tier of TPU/GPU hardware available in that region).
- The amount of content that needs to be processed as part of each ML workflow.
About Crawling & Indexing
When you initiate a crawl for a data source for the first time, the crawling and indexing processes are initiated. During this time, Glean will:
- Crawl the content (and associated permissions & activity metadata) for the selected data source.
- Create the Glean Knowledge Graph by indexing the crawled content, mapping it together, and creating a real-time model that can be referred to in response to a user’s query.
Checking the Crawling & Indexing Status
You can check the status of your in-progress crawls at any time by going to Admin Console > Data sources and reviewing the table of configured apps. When a data source is undergoing its initial sync, it will appear under the Initial sync in progress section, which is split into two phases:
- Crawling (step 1/2) - Glean is actively fetching content and metadata from the data source.
- Indexing (step 2/2) - Glean is processing the crawled content and incorporating it into the Knowledge Graph.
For each data source, the table also reports:
- Items synced - The total number of items (documents, messages, files, etc.) that have been crawled and indexed.
- Change rate (items/day) - The number of changes (edits, additions, deletions) synced in the past 24 hours. This reflects ongoing freshness after the initial sync completes.
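As a rough illustration of how a figure like Change rate (items/day) can be read, the sketch below counts change events that fall within the past 24 hours. The function name and the sample timestamps are hypothetical; this is not Glean's implementation, only an example of the definition above.

```python
from datetime import datetime, timedelta, timezone


def change_rate_per_day(change_times, now):
    """Count changes (edits, additions, deletions) synced in the 24 hours before `now`."""
    window_start = now - timedelta(hours=24)
    return sum(1 for t in change_times if window_start < t <= now)


# Hypothetical sync timestamps, for illustration only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
changes = [
    now - timedelta(hours=1),   # inside the 24-hour window
    now - timedelta(hours=23),  # inside the 24-hour window
    now - timedelta(hours=30),  # older than 24 hours, not counted
]
print(change_rate_per_day(changes, now))  # prints 2
```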

Crawling & Indexing FAQ
How long does the initial crawl and index process take to complete?
Can multiple data sources be crawled at the same time without impact?
What happens if a document is modified while a crawl is in progress?
- Webhooks: Most data sources support webhooks, which Glean uses to receive notifications of content changes. A received webhook is processed within 1-5 minutes, depending on the data source.
- Incremental Crawls: Glean performs an incremental crawl of each data source every 24 hours. These crawls identify and incorporate changes made since the last crawl that were not captured via webhooks, so that all recent modifications are picked up.
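To make the two mechanisms concrete, here is a minimal sketch of the worst-case wait before a change is picked up, using only the figures stated above (up to 5 minutes for a webhook, 24 hours between incremental crawls). The function and constant names are hypothetical, and the model is deliberately simplified: it ignores any indexing time after the change is detected.

```python
from datetime import timedelta

# Figures taken from the text above; the names are illustrative only.
WEBHOOK_MAX_DELAY = timedelta(minutes=5)
INCREMENTAL_CRAWL_INTERVAL = timedelta(hours=24)


def worst_case_pickup_delay(webhook_delivered: bool) -> timedelta:
    """Upper bound on how long a change can wait before it is picked up."""
    if webhook_delivered:
        return WEBHOOK_MAX_DELAY
    # A change missed by webhooks waits for the next incremental crawl.
    return INCREMENTAL_CRAWL_INTERVAL


print(worst_case_pickup_delay(True))   # 0:05:00
print(worst_case_pickup_delay(False))  # 1 day, 0:00:00
```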
How do I verify that a document has been crawled and indexed?

- Last crawled – when Glean last scanned the file for changes.
- Last indexed – when Glean last added those changes to search.
- Document visibility on Glean – whether the document is currently visible to or hidden from users on Glean.
- Document access – you can use the teammate selector to check an individual user’s access to the document: “Full access” (the user can open it) or “No access” (the user cannot).



How do I interpret slow or stalled progress during crawling and indexing?
About Machine Learning
Once the crawling and indexing processes have been completed, Glean will initiate several Machine Learning (ML) workflows that will run on all indexed content. The ML process is critically important and is responsible for:
- Optimizing search query understanding and spellcheck.
- Understanding synonyms, acronyms, and semantics used in documents and between employees within your organization.
- Enhancing relevance rankings for search results and people suggestions.
- Enabling query suggestions, predictive text, and autocomplete.
- Training the unique language model for your organization, which is essential for the operation of Glean Chat and Glean Assistant.
Checking the Machine Learning (ML) Status
The ML workflows are background processes - it is not currently possible to check the status of these inside the Glean UI.
Machine Learning (ML) FAQ
How long does the ML process take to complete?
- The amount of content that needs to be processed as part of each ML workflow.
- The GCP or AWS region that your Glean tenant was deployed to (and the tier of TPU/GPU hardware available in that region).
- For example, using an Nvidia T4 GPU (if that is all that is supported in your selected deployment region) instead of a dedicated TPU typically increases the time required to run all ML workflows by a factor of 4-6.
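The 4-6x factor can be turned into a rough range estimate. The sketch below assumes a hypothetical baseline duration on a dedicated TPU (the 2-day figure is invented purely for illustration; no baseline is stated here) and applies the multiplier from the example above. The function name is also hypothetical.

```python
from typing import Tuple


def estimate_t4_duration(tpu_days: float, low: float = 4.0, high: float = 6.0) -> Tuple[float, float]:
    """Scale a TPU baseline by the 4-6x slowdown cited for an Nvidia T4 GPU."""
    return (tpu_days * low, tpu_days * high)


# Hypothetical baseline of 2 days on a dedicated TPU (assumption, not a Glean figure).
low_days, high_days = estimate_t4_duration(2.0)
print(f"{low_days:.0f}-{high_days:.0f} days")  # prints "8-12 days"
```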
Does the ML process run in parallel with the crawl/index processes?
What is the impact if the ML process is not completed?
- Search results will be significantly degraded.
- Glean Assistant and Glean Chat will not respond correctly.
- Spellcheck will be erroneous.
- Autocomplete will not function.
- Any synonyms or acronyms used within the organization will not be understood if included in a search query.