High availability and disaster recovery overview
- Updated: 2023/11/22
High availability (HA) provides a failover mechanism if an IQ Bot service or server fails. Disaster recovery (DR) enables recovery across a geographically separated distance if a disaster causes an entire data center to fail.
IQ Bot uses a minimum of 3 nodes and a maximum of 5 nodes in a cluster for high availability (HA).
IQ Bot HA and DR solution
In the context of IQ Bot, implementation of High Availability (HA) and Disaster Recovery (DR) reduces downtime and maintains continuity of business (CoB) for your bot activities.
- High availability (HA): An architectural design that safeguards a system against certain failure scenarios, so that the system as a whole remains available and usable even if parts of it fail. High availability solutions typically protect against specific scenarios such as server failures, single component failures, dependency failures, variable load increases, and network splits in which system components that others depend on become unreachable.
- Disaster recovery (DR): A set of policies and procedures that enable the recovery or continuation of vital infrastructure and systems following a natural or human-induced disaster. Disaster recovery addresses many different causes of failure, whereas high availability typically accounts for a predictable few. Disaster recovery focuses on re-establishing services after an incident, not just on failover. Recovering a system includes scenarios such as restarting a service or system and restoring configuration files or a database from backups.
Required HA and DR infrastructure elements
- Distributed approach: In addition to clustering IQ Bot related data center components, we recommend that you deploy IQ Bot on multiple physical servers, virtual servers, or both.
- Load balancing: A load balancer distributes application or network traffic across multiple servers. This spreads the workload and protects service activity, so that bot activity continues on the clustered servers.
- Databases: Databases use their own built-in failover to protect the data and to ensure that it can be recovered.
- Between the HA clusters, configure synchronous replication between the primary (active) and secondary (passive) clustered Microsoft SQL Server instances in the data center. This ensures consistency in the event of a database node failure.
For the required HA synchronous replication, configure one of the following:
- Backup replica to Synchronous-Commit mode of SQL Server Always On availability groups
- SQL Server Database Mirroring
- Between the DR sites, configure your database to provide asynchronous replication from the primary (production) DR site to the secondary (recovery) DR site that is at a geographically separated location from the primary DR site.
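The load balancing behavior described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy and hypothetical server names (`iqbot-1`, `iqbot-2`, `iqbot-3`); it is not how any particular load balancer product is implemented, only a way to see how traffic keeps flowing to the remaining servers when one is marked as down.

```python
class RoundRobinBalancer:
    """Illustrative round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._index = 0  # position of the next server to try

    def mark_down(self, server):
        # Called when a server is considered unavailable.
        self._healthy.discard(server)

    def mark_up(self, server):
        # Called when a known server is considered available again.
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        # Try each server at most once, starting where the last call stopped.
        for _ in range(len(self._servers)):
            server = self._servers[self._index]
            self._index = (self._index + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    # Hypothetical host names, used only for illustration.
    balancer = RoundRobinBalancer(["iqbot-1", "iqbot-2", "iqbot-3"])
    balancer.mark_down("iqbot-2")
    print([balancer.next_server() for _ in range(4)])
```

With one server marked as down, requests alternate between the two remaining servers, which is the continuity property the HA cluster relies on.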
Sample scenario
Point all IQ Bot instances within the same cluster to the same database and repository files. This is required to share data across multiple servers and to maintain data integrity across the IQ Bot servers within a cluster.
HA and DR deployment models
To ensure that IQ Bot is protected by HA, DR, or both, configure your data centers according to the supported deployment models.
HA implementation requirements
- Install IQ Bot on multiple servers.
- Provide access to IQ Bot through a load balancer.
- Open a RabbitMQ v3.8.18 synchronization port between IQ Bot servers.
- Configure the Microsoft SQL Server in high availability mode.
Installation HA and DR configuration requirements
- The IQ Bot installer does not directly support cluster installation. To set up a cluster, do the following:
- Run the installer on each application server node.
- Share the output folder using the access role Everyone.
- Post installation, execute messagequeue_cluster_configuration.bat with appropriate command line arguments.
- Configure IQ Bot in a high availability configuration.
- Open firewall ports: 4369 and 25672.
- Install RabbitMQ v3.8.18 on every IQ Bot node in the cluster.
The first node where IQ Bot is installed becomes the primary RabbitMQ v3.8.18 node. The host name of the primary node is used to configure the RabbitMQ v3.8.18 cluster.
- A load balancer is required to distribute traffic to all IQ Bot server nodes.
- Configure Microsoft SQL Server for high availability. Use the Microsoft SQL Server Always On option.
- For RabbitMQ v3.8.18 installation details, see the RabbitMQ v3.8.18 documentation.
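As a planning aid for the requirements above, the following sketch checks a list of node host names before installation. It assumes the list is given in installation order, so the first entry is treated as the primary RabbitMQ node, and it applies the 3 to 5 node limit and the firewall ports stated in this document. The function name, the `ClusterPlan` type, and the host names are hypothetical and are not part of the IQ Bot installer.

```python
from dataclasses import dataclass

MIN_NODES = 3  # minimum cluster size stated in this document
MAX_NODES = 5  # maximum cluster size stated in this document
CLUSTER_PORTS = (4369, 25672)  # firewall ports listed in this document


@dataclass(frozen=True)
class ClusterPlan:
    primary: str        # first node installed; its host name configures the cluster
    secondaries: tuple  # remaining nodes, in installation order
    ports: tuple        # ports that must be open between the nodes


def plan_cluster(hostnames):
    """Validate host names (assumed to be in installation order) and build a plan."""
    nodes = [name.strip() for name in hostnames]
    if any(not name for name in nodes):
        raise ValueError("host names must be non-empty")
    if len(set(nodes)) != len(nodes):
        raise ValueError("host names must be unique")
    if not MIN_NODES <= len(nodes) <= MAX_NODES:
        raise ValueError(
            f"cluster needs {MIN_NODES} to {MAX_NODES} nodes, got {len(nodes)}"
        )
    return ClusterPlan(
        primary=nodes[0],
        secondaries=tuple(nodes[1:]),
        ports=CLUSTER_PORTS,
    )


if __name__ == "__main__":
    # Hypothetical host names, used only for illustration.
    print(plan_cluster(["iqbot-node1", "iqbot-node2", "iqbot-node3"]))
```

The sketch only validates the plan; the actual cluster setup is still performed by running the installer on each node and then running messagequeue_cluster_configuration.bat.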
HA and DR known limitations
- To discover the availability of IQ Bot instances, a load balancer periodically sends pings, attempts connections, or sends requests to test the IQ Bot instances. These tests are called health checks.
- Health checks do not verify the availability of RabbitMQ v3.8.18 instances.
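Because load balancer health checks do not cover RabbitMQ, a separate probe can help operators notice an unreachable message queue node. The sketch below is one possible approach, not a product feature: it assumes that a successful TCP connection to the ports listed in this document (4369 and 25672) is an acceptable first signal. A successful connection shows only that the port accepts connections, not that the RabbitMQ cluster is healthy. The function names are hypothetical.

```python
import socket

# Ports listed in this document. In RabbitMQ, 4369 is used by epmd (peer
# discovery) and 25672 by inter-node communication.
RABBITMQ_PORTS = (4369, 25672)


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def rabbitmq_ports_status(host, ports=RABBITMQ_PORTS, timeout=2.0):
    """Map each port to whether it accepted a TCP connection on the given host."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # "localhost" is a placeholder; replace it with an IQ Bot node host name.
    for port, is_open in rabbitmq_ports_status("localhost").items():
        print(f"port {port}: {'open' if is_open else 'unreachable'}")
```

Running the probe from each node against every other node also confirms that the firewall ports are open in both directions.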