Throughput during an Azure Cosmos DB for MongoDB backup is determined by several variables, including provisioned throughput. The provisioned throughput of each collection or database, as configured in the Azure portal, caps the rate at which Commvault software can read data from that collection. For more information, go to Introduction to provisioned throughput in Azure Cosmos DB on Microsoft's website.
Provisioned throughput also indirectly influences the number of physical partitions for a collection. For more information, go to Partitioning and horizontal scaling in Azure Cosmos DB on Microsoft's website.
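As a rough illustration of that relationship, you can estimate a lower bound on the physical partition count from the documented per-partition limits (up to 10,000 RU/s and 50 GB of storage per physical partition at the time of writing); the actual count that Azure Cosmos DB chooses can be higher. The sketch below is illustrative only and is not part of any Commvault tooling.

```python
import math

def min_physical_partitions(provisioned_rus: int, data_size_gb: float) -> int:
    """Lower bound on physical partitions, from documented per-partition
    limits: up to 10,000 RU/s and 50 GB of storage per physical partition."""
    return max(math.ceil(provisioned_rus / 10_000), math.ceil(data_size_gb / 50))

# A collection with 20,000 RUs and 120 GB of data spans at least 3 partitions.
print(min_physical_partitions(20_000, 120))  # -> 3
```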
The parallelism obtained during backup operations for a collection is the minimum of the following values (see the sketch after this list):
- The number of streams available to back up the collection. The number of streams you configure for the storage policy is shared among the backups of all collections configured as subclient content.
- The number of physical partitions for the collection.
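As a minimal sketch with hypothetical stream and partition counts (neither value is a default), the effective parallelism reduces to a simple minimum:

```python
# Hypothetical values for illustration.
storage_policy_streams = 4   # streams configured on the storage policy
physical_partitions = 2      # physical partitions backing the collection

# Backup parallelism is capped by whichever resource is scarcer.
effective_parallelism = min(storage_policy_streams, physical_partitions)
print(effective_parallelism)  # -> 2
```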
Currently, for an Azure Cosmos DB for MongoDB API (RU) account with no throughput limit on the account, the expected maximum backup throughput for a collection with 10,000 provisioned RUs, two physical partitions, and an average document size of around 1 KB is about 10 GB per hour, per partition.
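For planning purposes, those figures translate into a simple duration estimate, assuming enough streams are available to drive both partitions; the 100 GB collection size below is a hypothetical value:

```python
# Rough duration estimate using the figures above.
throughput_per_partition_gb_per_hr = 10   # observed per-partition rate
physical_partitions = 2
collection_size_gb = 100                  # hypothetical collection size

total_throughput = throughput_per_partition_gb_per_hr * physical_partitions  # 20 GB/hr
estimated_hours = collection_size_gb / total_throughput
print(f"{estimated_hours:.1f} hours")  # -> 5.0 hours
```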
For large collections that require a high backup throughput, use an access node/backup gateway that is not heavily loaded with other activity and has sufficient RAM and CPU cores. Using the CommServe computer as an access node/backup gateway can negatively impact backup performance.
You can use multiple access nodes/backup gateways to distribute the collections for backup operations. When a backup operation starts, the first access node/backup gateway listed in the cloud account acts as the coordinator, directing backup traffic across all access nodes/backup gateways. This load distribution improves performance and scalability.
A backup operation uses the Change Stream APIs to fetch the documents of a given collection. A full backup fetches documents from the beginning of the collection, whereas an incremental backup fetches only the documents changed since the last backup. The RU consumption of backup operations depends on the size of the changed data and the frequency of full backups. You can use Azure Monitor to track RU consumption during the backup window and verify that the provisioned throughput is sufficient.
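For context, the sketch below shows how a client reads changed documents from a collection through a change stream with PyMongo; the connection string, database, and collection names are placeholders, and the sketch illustrates the underlying API rather than Commvault's implementation:

```python
from pymongo import MongoClient

# Placeholder connection string and names; replace with real values.
client = MongoClient("mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true")
collection = client["mydb"]["mycollection"]

# Azure Cosmos DB for MongoDB change streams support insert, update, and
# replace events, and the full document must be requested explicitly.
pipeline = [
    {"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}},
    {"$project": {"_id": 1, "fullDocument": 1, "ns": 1, "documentKey": 1}},
]

with collection.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        # Each event carries the changed document, which is what a backup reads.
        print(change["fullDocument"]["_id"])
```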
A restore operation involves inserting the documents into a newly created collection. The operation consumes the RUs provisioned for that collection.
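Because a restore is effectively a bulk insert, it is subject to the same rate limiting as any other writer. As a minimal sketch (PyMongo, placeholder names): Azure Cosmos DB for MongoDB signals throttling with error code 16500, and a client can back off and retry only the throttled documents:

```python
import time
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

# Placeholder connection string and names; replace with real values.
client = MongoClient("mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true")
target = client["mydb"]["restored_collection"]  # newly created target collection

def insert_with_backoff(docs, max_retries=5):
    """Insert a batch, backing off when Azure Cosmos DB throttles (code 16500)."""
    for attempt in range(max_retries):
        try:
            target.insert_many(docs, ordered=False)
            return
        except BulkWriteError as exc:
            errors = exc.details.get("writeErrors", [])
            if not errors or any(e.get("code") != 16500 for e in errors):
                raise  # not a pure throttling failure; do not retry
            docs = [docs[e["index"]] for e in errors]  # retry throttled docs only
            time.sleep(2 ** attempt)                   # exponential backoff
    raise RuntimeError("Batch still throttled after retries")

# Example: insert_with_backoff([{"_id": 1, "name": "doc1"}, {"_id": 2, "name": "doc2"}])
```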