Scalability
Scalability refers to the ability of a system, network, or process to handle a growing amount of work or its potential to be enlarged in order to accommodate that growth. In the context of computing and technology, scalability often pertains to the capability of a system to increase its total output under an increased load when resources (typically hardware or software) are added.
1. Horizontal Scaling (Scaling Out/In): Involves adding more nodes to (or removing nodes from) a system, such as adding more servers to a distributed software application.
2. Vertical Scaling (Scaling Up/Down): Involves adding more power (like CPU, RAM) to an existing machine or enhancing its capabilities.
Single Point of Failure
A single point of failure (SPOF) refers to a component or aspect of a system that, if it fails, will stop the entire system from operating.
Characteristics of a Single Point of Failure
- Critical Component: An SPOF is typically an essential component or node in a system without which the system cannot function.
- Lack of Redundancy: There is no backup or alternative pathway in place for this component. If it fails, there is no failover mechanism to maintain the system’s operation.
- Potential for System-wide Impact: Failure of this single component can lead to the breakdown of the entire system, not just a part of it.
Mitigating Single Points of Failure
- Redundancy: Implementing redundant components or systems so that if one fails, another can take over.
- Diversification: Using different routes or methods for critical processes (e.g., multiple internet service providers).
- Regular Maintenance and Testing: Ensuring that all components are in good working order and testing backup systems regularly.
- Failover Mechanisms: Automatic switching to backup systems in case of failure.
- Decentralization: Spreading out resources and dependencies to prevent a single failure from affecting the entire system.
- Monitoring and Alerting Systems: Implementing tools to quickly identify and respond to failures.
CAP Theorem
The CAP theorem, also known as Brewer’s theorem, states that a distributed computer system cannot simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance (CAP).
- Consistency: Every read receives the most recent write or an error. In other words, all nodes see the same data at the same time. Consistency here is defined in the sense of linearizability, a strong form of consistency.
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
- Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
API Gateway
An API Gateway is a server that acts as an intermediary for processing requests from clients, directing them to the appropriate microservices in a backend system. It serves as the single entry point for all client requests and can simplify the client’s interaction with the system by aggregating multiple service calls into one.
How It Works:
- Request Routing: The API Gateway receives requests from clients, then routes these requests to the appropriate microservice based on the request’s URL, method, and other attributes.
- Load Balancing: It distributes incoming requests evenly across multiple instances of a microservice, enhancing the system’s scalability and reliability.
- Authentication and Authorization: The Gateway often handles authentication, ensuring that the client is allowed to make the request, and can also manage authorization levels.
- Aggregation: It can aggregate responses from multiple microservices and send a unified response back to the client, reducing the number of round trips required.
- Rate Limiting and Throttling: The Gateway can limit the number of requests a client can send in a given time frame, preventing overuse or abuse of the backend services. (Rate limiting refers to preventing the frequency of an operation from exceeding a defined limit)
- Caching: It can cache responses to reduce the load on microservices and improve response time for frequently requested data.
- Logging and Monitoring: The Gateway provides a central point for logging and monitoring all incoming requests and outgoing responses, aiding in debugging and performance monitoring.
In essence, an API Gateway simplifies the interaction between clients and backend services, providing a centralized point for managing request routing, security, rate limiting, and other cross-cutting concerns.
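As a rough sketch of the request-routing and rate-limiting responsibilities above, here is a minimal gateway built only on Node.js’s built-in http module. The upstream addresses (localhost:3001 and localhost:3002), the /users and /orders prefixes, and the 100-requests-per-minute limit are assumptions invented for the example, not part of any particular gateway product.

const http = require('http');
// Map path prefixes to the (assumed) upstream microservices.
const routes = {
  '/users': { host: 'localhost', port: 3001 },
  '/orders': { host: 'localhost', port: 3002 },
};
// Naive in-memory rate limiter: at most 100 requests per client IP per minute.
const hits = new Map();
function allowed(ip) {
  const windowStart = Date.now() - 60 * 1000;
  const recent = (hits.get(ip) || []).filter((t) => t > windowStart);
  recent.push(Date.now());
  hits.set(ip, recent);
  return recent.length <= 100;
}
http.createServer((req, res) => {
  if (!allowed(req.socket.remoteAddress)) {
    res.writeHead(429);
    return res.end('Too Many Requests');
  }
  // Route by URL prefix to the matching microservice.
  const prefix = Object.keys(routes).find((p) => req.url.startsWith(p));
  if (!prefix) {
    res.writeHead(404);
    return res.end('Not Found');
  }
  const { host, port } = routes[prefix];
  // Forward the request and stream the upstream response back to the client.
  const upstream = http.request({ host, port, path: req.url, method: req.method, headers: req.headers }, (upRes) => {
    res.writeHead(upRes.statusCode, upRes.headers);
    upRes.pipe(res);
  });
  upstream.on('error', () => {
    res.writeHead(502);
    res.end('Bad Gateway');
  });
  req.pipe(upstream);
}).listen(8080);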
Load Balancers
A load balancer is a device or software that distributes network or application traffic across multiple servers. Its primary function is to enhance the reliability, capacity, and performance of a server environment. By spreading the workload evenly, load balancers ensure no single server becomes overwhelmed.
There are several algorithms used by load balancers to distribute traffic, each suitable for different scenarios (the first two are sketched after this list):
- Round Robin: This is one of the simplest methods. Traffic is distributed sequentially to each server in a pool. Once the last server is reached, the load balancer starts again with the first server. This method works well with servers of equal capacity.
- Least Connections: The load balancer directs traffic to the server with the fewest active connections. This method is beneficial when there are significant and unpredictable session load differences.
- Least Response Time: This algorithm sends requests to the server with the lowest average response time and fewest active connections.
- Hash-Based (IP Hash): The load balancer uses a unique identifier of the client (like the IP address) to direct traffic. This method ensures that a client is consistently connected to the same server, which can be important for session persistence.
- Weighted Round Robin/Weighted Least Connections: These are variations of the Round Robin and Least Connections algorithms, respectively, where servers are assigned a weight based on their capacity. Servers with a higher weight receive more connections.
- Random: Requests are distributed randomly among the servers. This method is less commonly used but can be effective in certain scenarios.
- Source/Destination IP Hash: The load balancer selects a server based on a hash of the source and destination IP addresses, providing persistence and ensuring that a client is consistently served by the same server.
- URL Hash: The load balancer uses the URL of the request to determine which server to use. This can be useful for caching as requests for a particular resource are always directed to the same server.
- Geographic Location-Based: The decision is based on the geographic location of the client. This can be effective for delivering content with lower latency.
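As referenced above, here is a small sketch of the first two algorithms, Round Robin and Least Connections. The server names are placeholders, and the connection counts are assumed to be maintained elsewhere as requests start and finish.

// Round Robin: hand out servers in a fixed rotation.
function roundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}
// Least Connections: pick the server currently handling the fewest requests.
function leastConnections(servers, connections) {
  return () =>
    servers.reduce((best, s) =>
      (connections.get(s) || 0) < (connections.get(best) || 0) ? s : best
    );
}
// Example usage with placeholder server names.
const servers = ['app-1', 'app-2', 'app-3'];
const nextRR = roundRobin(servers);
console.log(nextRR(), nextRR(), nextRR(), nextRR()); // app-1 app-2 app-3 app-1
const connections = new Map([['app-1', 4], ['app-2', 1], ['app-3', 2]]);
const nextLC = leastConnections(servers, connections);
console.log(nextLC()); // app-2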
To avoid a load balancer becoming a single point of failure and causing downtime, you can implement several strategies:
- Redundancy: Use multiple load balancers in a redundant configuration. This setup typically involves at least two load balancers running in parallel. If one fails, the other can take over.
- High Availability (HA) Configuration: Deploy load balancers in a high availability mode. This means having a primary load balancer and a secondary (standby) load balancer. The secondary load balancer automatically takes over if the primary fails.
- Load Balancer Clustering: Some load balancing solutions support clustering, where multiple load balancers work together as a single unit. This approach can distribute the load even in case of individual load balancer failures.
- Geographical Redundancy: For critical applications, consider using load balancers deployed in different geographical locations. This protects against regional outages and ensures continuous availability.
- Scalable and Elastic Architecture: Use cloud-based or virtualized load balancers that can scale dynamically with the traffic. This approach helps in managing sudden traffic spikes without overloading a single load balancer.
- Regular Health Checks: Implement regular health checks for the load balancers. These checks can monitor various metrics to ensure the load balancer is functioning correctly. If an issue is detected, automatic failover to a backup load balancer can be triggered.
- Other ongoing practices: regular updates and maintenance, performance monitoring of the load balancer itself, and periodic testing of the failover mechanisms (a small health-check sketch follows).
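Here is a minimal sketch of the health-check idea, assuming each backend exposes a /health endpoint over HTTP; the addresses, port, probe interval, and endpoint path are invented for the example. Unhealthy backends are simply dropped from the pool until they recover, which is the basic building block behind automatic failover.

const http = require('http');
const backends = ['10.0.0.1', '10.0.0.2'];   // placeholder backend addresses
const healthy = new Set(backends);
// Every 5 seconds, probe each backend's /health endpoint.
setInterval(() => {
  for (const host of backends) {
    const req = http.get({ host, port: 8080, path: '/health', timeout: 2000 }, (res) => {
      if (res.statusCode === 200) healthy.add(host);
      else healthy.delete(host);
      res.resume(); // drain the response body
    });
    req.on('error', () => healthy.delete(host));  // unreachable: take it out of rotation
    req.on('timeout', () => req.destroy());       // treat slow responses as failures
  }
}, 5000);
// The traffic-routing code only ever picks from the healthy pool.
function pickBackend() {
  const pool = [...healthy];
  return pool[Math.floor(Math.random() * pool.length)];
}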
Message Queue
A message queue is a form of asynchronous service-to-service communication used in serverless and microservices architectures. It’s essentially a queue of messages that are waiting to be processed. A producer sends a message to the queue, and then a consumer processes the message later, at its own pace. This separation of the message’s sender and receiver decouples the components of a system.
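To make the producer/consumer decoupling concrete, here is a toy in-memory sketch; the class and message shape are invented for illustration, and a real system would use a broker such as RabbitMQ, Amazon SQS, or Kafka.

// Toy queue: producers push messages, consumers pull them at their own pace.
class MessageQueue {
  constructor() { this.messages = []; }
  send(message) { this.messages.push(message); }   // producer side
  receive() { return this.messages.shift(); }      // consumer side (undefined if empty)
}
const queue = new MessageQueue();
// Producer: enqueues work without knowing who will process it, or when.
queue.send({ type: 'order_created', orderId: 42 });
queue.send({ type: 'order_created', orderId: 43 });
// Consumer: polls the queue and processes messages at its own pace.
setInterval(() => {
  const msg = queue.receive();
  if (msg) console.log('processing', msg);
}, 1000);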
Benefits of Having a Message Queue in a Distributed System
- Decoupling of Services: It allows services to operate independently without needing direct connections to each other. This means one service can be updated or fail without directly impacting others.
- Asynchronous Processing: Enhances performance by allowing tasks to be handled at different times. This is particularly useful for operations that don’t need to be processed immediately.
- Load Balancing: Helps in balancing the load among different workers. When there’s a heavy load, the queue can grow without affecting the incoming requests.
- Reliability: Messages can be persisted in a queue, ensuring that they aren’t lost even if a service crashes or there’s a failure in processing.
- Scalability: Easier to scale services since adding more consumers can process more messages. The system can adapt to higher loads by simply scaling out the number of workers.
- Fault Tolerance: Provides a way to retry failed operations. If a message processing fails, it can be retried without losing the message or needing manual intervention.
- Ordering and Guarantees: Some message queues can guarantee the order of message processing, which is crucial for certain applications.
Persistence for System Failover
- Message Durability: Most message queue systems offer durability options, meaning that messages are stored safely until they are confirmed to be processed by a consumer. This is typically achieved by writing messages to disk or a distributed storage system.
- Dead Letter Queues: In cases where messages fail to process repeatedly, they can be moved to a Dead Letter Queue (DLQ) for later analysis, ensuring the main processing queue isn’t clogged with unprocessable messages.
Preserving High Availability and Scalability
- Redundant Infrastructure: By deploying the message queue over a cluster of servers, high availability is ensured. If one node fails, others can take over the processing.
- Distributed Architecture: Modern message queue systems are designed to be distributed, meaning they can run across multiple machines, data centers, or even geographies.
- Load Distribution: The ability to distribute messages across multiple consumers inherently supports scalability. As the load increases, more consumers (or worker processes) can be added.
- Partitioning/Sharding: Some advanced message queues support partitioning of messages, which means the queue can be divided across different nodes or clusters, allowing for concurrent processing and scalability.
In a message queue system, the following mechanisms are commonly used to ensure that two consumers don’t process the same message (the first two are sketched after this list):
- Visibility Timeout: When a consumer retrieves a message, it becomes invisible to other consumers for a set period. This prevents duplicate processing.
- Acknowledgment and Deletion: The consumer must acknowledge after processing the message. If the queue receives this acknowledgment, the message is deleted. If not, the message becomes visible again after the timeout.
- Locking Mechanism: Some systems use a lock on the message while it’s being processed, preventing access by other consumers.
- Pull Model: Consumers request messages from the queue, reducing the chance of duplication.
- Idempotency in Processing: Ensuring that even if a message is processed more than once, the end result remains consistent.
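As referenced above, here is a sketch of the visibility-timeout and acknowledgment mechanics using an invented in-memory queue; real brokers such as Amazon SQS implement the same idea (receiving a message hides it, deleting it acts as the acknowledgment).

class VisibilityQueue {
  constructor(visibilityTimeoutMs) {
    this.timeout = visibilityTimeoutMs;
    this.messages = [];   // each entry: { id, body, invisibleUntil }
    this.nextId = 1;
  }
  send(body) {
    this.messages.push({ id: this.nextId++, body, invisibleUntil: 0 });
  }
  // A consumer receives the oldest *visible* message; it then becomes
  // invisible to other consumers until the visibility timeout expires.
  receive() {
    const now = Date.now();
    const msg = this.messages.find((m) => m.invisibleUntil <= now);
    if (msg) msg.invisibleUntil = now + this.timeout;
    return msg;
  }
  // Acknowledgment: only after successful processing is the message deleted.
  // If the consumer crashes and never acknowledges, the message becomes
  // visible again after the timeout and can be retried by another consumer.
  acknowledge(id) {
    this.messages = this.messages.filter((m) => m.id !== id);
  }
}
const q = new VisibilityQueue(30000);   // 30-second visibility timeout
q.send('record ad click 123');
const msg = q.receive();                // hidden from other consumers for 30s
// ...process msg.body...
q.acknowledge(msg.id);                  // processed successfully, so remove it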
ACID (Atomicity, Consistency, Isolation, Durability)
- Atomicity – all statements in a transaction (to read, write, update or delete data) are treated as a single unit. Either the entire transaction is executed, or none of it is. This property prevents data loss and corruption from occurring if, for example, your streaming data source fails mid-stream (a transaction sketch follows this list).
- Consistency – ensures that transactions only make changes to tables in predefined, predictable ways. Transactional consistency ensures that corruption or errors in your data do not create unintended consequences for the integrity of your table.
- Isolation – when multiple users are reading and writing from the same table all at once, isolation of their transactions ensures that the concurrent transactions don’t interfere with or affect one another. Each request can occur as though they were occurring one by one, even though they’re actually occurring simultaneously.
- Durability – ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure.
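As referenced above, here is a sketch of atomicity in practice: a money transfer wrapped in a transaction. It assumes a PostgreSQL database reached through the node-postgres (pg) client and an accounts table with id and balance columns; all of these are assumptions for the example.

const { Pool } = require('pg');
const pool = new Pool();   // connection settings come from environment variables
async function transfer(fromId, toId, amount) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
    await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
    await client.query('COMMIT');     // both updates become durable together
  } catch (err) {
    await client.query('ROLLBACK');   // atomicity: neither update is applied
    throw err;
  } finally {
    client.release();
  }
}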
Databases & Database Clusters
A database is a structured system for storing, managing, and retrieving data. It is an essential component of many software systems, ranging from simple applications to complex enterprise systems. Databases allow for the efficient organization, storage, and retrieval of data, making them a cornerstone of modern computing.
Types of Databases
- Relational Databases: These databases store data in tables, with relationships between different tables. Examples include MySQL, PostgreSQL, and Oracle.
- NoSQL Databases: Designed for unstructured or semi-structured data, they offer more flexible data models and scalability. Examples include MongoDB, Cassandra, and Redis.
- In-Memory Databases: These databases primarily store data in RAM for high-speed data access and processing. Example: Redis.
- Distributed Databases: Spread across multiple locations or computing nodes, they are designed for large-scale data storage and high availability. Example: Apache Cassandra.
- Object-Oriented Databases: These databases store data in the form of objects, as used in object-oriented programming.
- Graph Databases: Optimized for storing and navigating complex relationships between data points. Example: Neo4j.
Database Cluster
A database cluster refers to a group of databases that are managed by a single database management system (DBMS) and work together to provide high availability, reliability, and scalability. In a cluster, databases can be distributed across multiple physical or virtual machines to ensure system resilience and performance. (Data security in such a system spans prevention, compliance, detection, and response.)
- Vertical Scaling: This involves upgrading the existing hardware of the server where the database resides. It could mean more CPUs, RAM, or faster storage.
- Sharding: A type of database architecture where a large database is broken down into smaller, faster, more easily managed parts called shards.
- Range-Based Sharding: Data is partitioned according to a range (e.g., user IDs 1-1000 on one shard, 1001-2000 on another).
- Hash-Based Sharding: Data is distributed based on the hash value of a shard key (see the sketch after this list).
- Directory-Based Sharding: A lookup service determines where data is stored.
- Replication: Involves creating copies of databases or database objects for backup, scalability, and high availability. Replicated databases can be used for read-heavy operations, spreading the load.
- Partitioning: Involves dividing a database into smaller, more manageable parts, but unlike sharding, these parts are not independent. Partitioning can be horizontal (by rows) or vertical (by columns).
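As referenced above, here is a small sketch of hash-based sharding: the shard key (here, a user ID) is hashed and mapped onto one of N shards. The shard addresses are placeholders, and a production system would typically use consistent hashing so that adding a shard does not remap most keys.

const crypto = require('crypto');
// Placeholder shard locations.
const shards = ['db-shard-0:5432', 'db-shard-1:5432', 'db-shard-2:5432'];
// Hash the shard key and map the result onto one of the shards.
function shardFor(userId) {
  const hash = crypto.createHash('md5').update(String(userId)).digest();
  return shards[hash.readUInt32BE(0) % shards.length];
}
console.log(shardFor(12345)); // all reads and writes for user 12345 go to this shard
console.log(shardFor(67890));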
Writing Data to a Database:
- Data Input: An application or user sends a request to the database to store data. This is usually done through SQL commands (like INSERT or UPDATE) in relational databases, or appropriate API calls in NoSQL databases.
- Processing: The DBMS processes the request, ensuring data integrity and consistency according to predefined schemas, constraints, and rules.
- Storage: The data is then written to the database’s storage system, which can be disk-based, in-memory, or a combination.
Synchronization in a Database Cluster:
- Replication: Data written to one database in the cluster is replicated to other databases. This can be done synchronously (data is written to all nodes simultaneously) or asynchronously (data is written to other nodes after being committed on the primary node).
- Consistency: The cluster maintains data consistency across nodes, ensuring that all copies of the data are identical. This might involve conflict resolution strategies in cases where the same data is modified in different nodes.
- Failover and Load Balancing: In case of a node failure, the cluster can automatically redirect data requests to other nodes (failover). It can also distribute the load among different nodes to optimize performance and resource utilization (load balancing).
Database synchronization is the process of ensuring that two or more databases (often located in different places or on different servers) contain consistent and up-to-date data. The main goal is to replicate changes made in one database to another, maintaining data consistency across systems.
- Data Replication:
- Primary Method: The most common method of database synchronization is replication, where changes made in one database (the master or primary database) are automatically replicated to one or more other databases (secondary or replicas).
- Types of Replication:
- Synchronous Replication: Changes must be written to all replicas before the transaction is considered complete. This ensures strong consistency but can impact performance.
- Asynchronous Replication: Changes are replicated after the transaction completes in the primary database, which can lead to temporary inconsistencies but offers better performance.
- Conflict Resolution:
- In systems where updates can occur in multiple databases (multi-master replication), a conflict resolution mechanism is required to handle cases where the same data is modified in different locations at the same time.
- Conflict resolution strategies might include “last write wins,” timestamp-based resolution, or more complex application-specific logic.
- Bidirectional Sync: Both databases can exchange data, with each database capable of receiving and transmitting changes to the other.
- Change Tracking: Databases track changes using mechanisms like transaction logs, timestamps, or triggers.
- Synchronization Process: Involves comparing data between databases, identifying differences, and then transferring necessary data to synchronize.
Database Scaling
1. Vertical Scaling: This involves upgrading the existing hardware of the server where the database resides. It could mean more CPUs, RAM, or faster storage.
2. Horizontal Scaling (Sharding): Sharding involves dividing the database into smaller, more manageable parts, known as shards. Each shard can be hosted on a different server.
- Range-Based Sharding: Data is partitioned according to a range (e.g., user IDs 1-1000 on one shard, 1001-2000 on another).
- Hash-Based Sharding: Data is distributed based on the hash value of a shard key.
- Directory-Based Sharding: A lookup service determines where data is stored.
3. Read Replicas: Implement read replicas to handle read-heavy loads. In this setup, all write operations are handled by the primary database, while read operations are distributed among multiple replicas (a routing sketch follows this list).
4. Caching: Use caching systems like Redis or Memcached to store frequently accessed data in memory.
5. Database Federation: Federation involves splitting databases by function (e.g., user data in one database, posts in another).
6. Using NoSQL Databases: For certain types of data and access patterns, NoSQL databases like MongoDB, Cassandra, or DynamoDB can provide better performance and scalability.
7. Cloud-Based Solutions: cloud database services (like Amazon RDS, Google Cloud SQL, Azure SQL Database) offer easy scalability options.
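As referenced in point 3, here is a sketch of read/write splitting with read replicas. The connections are reduced to placeholder objects; in practice this routing is often handled by the database driver, an ORM, or a proxy in front of the database.

// Placeholder connections; in reality these would be real database clients.
const primary = { query: (sql) => console.log('primary:', sql) };
const replicas = [
  { query: (sql) => console.log('replica-1:', sql) },
  { query: (sql) => console.log('replica-2:', sql) },
];
let next = 0;
function run(sql) {
  // Reads are spread round-robin across the replicas.
  if (/^\s*select/i.test(sql)) {
    return replicas[next++ % replicas.length].query(sql);
  }
  // All writes (INSERT/UPDATE/DELETE) go to the primary.
  return primary.query(sql);
}
run('SELECT * FROM posts WHERE id = 1');            // served by a replica
run("INSERT INTO posts (title) VALUES ('hello')");  // served by the primary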
Cache
A cache is a high-speed data storage layer that stores a subset of data, typically transient in nature, so that future requests for that data can be served faster than accessing the data’s primary storage location. Caches are used to efficiently store temporary data that is likely to be requested again, thereby reducing data access times.
Caching Strategy
- Read
- Cache-aside (Lazy Loading): Data is loaded into the cache on demand. This approach avoids filling the cache with unused data but can lead to cache misses (sketched after this list).
- Prefetching: Anticipating the data needs of the application and loading data into the cache in advance.
- Write
- Write-through Cache: Data is written to the cache and the database simultaneously. This ensures data in the cache is always up-to-date but can introduce latency in write operations.
- Write-Around Cache: Data is written directly to the database and not immediately to the cache.
- Write-back Cache: Data is initially written only to the cache and then written to the database after a certain condition is met (like after a specific time interval or when the cache is full).
- Remove
- Time-to-Live (TTL): Implement TTL policies to ensure data freshness and prevent stale data in the cache.
- Cache Invalidation: It’s crucial to invalidate cache entries when the underlying data changes. There are various strategies for this, including time-based expiry, event-based invalidation, and manual invalidation.
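As referenced above, here is a sketch of the cache-aside pattern combined with a TTL, using an in-memory Map as the cache and a placeholder loadFromDatabase function (both assumptions for the example); the same flow applies when the cache is Redis or Memcached.

const cache = new Map();   // key -> { value, expiresAt }
const TTL_MS = 60 * 1000;  // assumed 60-second freshness window
// Placeholder for the real database call.
async function loadFromDatabase(key) { /* ...query the primary data store... */ }
async function getUser(id) {
  const key = `user:${id}`;
  const entry = cache.get(key);
  // Cache hit and still fresh: serve from memory.
  if (entry && entry.expiresAt > Date.now()) return entry.value;
  // Cache miss (or stale entry): load from the database, then populate the cache.
  const value = await loadFromDatabase(key);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
// Invalidation: when the underlying data changes, drop the cached copy.
function onUserUpdated(id) { cache.delete(`user:${id}`); }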
How Does a Cache Work?
1. Data Retrieval: When data is first requested, the system checks if this data is in the cache (a process known as a cache hit). If the data is not in the cache (a cache miss), it is retrieved from the primary storage (like a database) and then stored in the cache.
2. Data Storage: Caches typically store data in memory (RAM), which is much faster than disk-based or network storage.
3. Subsequent Access: When the same data is requested again, it can be quickly retrieved from the cache instead of the slower primary storage.
4. Cache Eviction: Due to limited size, caches use eviction policies (like Least Recently Used, LRU) to remove less frequently accessed data (a tiny LRU sketch follows this list).
5. Synchronization: In cases where data changes frequently, synchronization mechanisms ensure that the cache is updated or invalidated appropriately to prevent stale data.
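As referenced in step 4, here is a tiny LRU eviction sketch. It relies on the fact that a JavaScript Map remembers insertion order; the capacity of 3 is arbitrary.

class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map();   // Map preserves insertion order: oldest entry first
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    // Re-insert to mark the key as most recently used.
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    // Over capacity: evict the least recently used entry (the first key).
    if (this.map.size > this.capacity) {
      this.map.delete(this.map.keys().next().value);
    }
  }
}
const lru = new LRUCache(3);
lru.set('a', 1); lru.set('b', 2); lru.set('c', 3);
lru.get('a');      // 'a' becomes most recently used
lru.set('d', 4);   // evicts 'b', the least recently used entry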
Benefits of using a cache
- Faster Data Processing and Improved Performance: Caching significantly reduces the time it takes to access data. By storing copies of frequently accessed data in a cache, which is typically faster to access than the original data source, applications can serve this data more quickly.
- Decreased Load on Database Systems: By serving data from the cache, the number of queries to the backend database or server is reduced. This lowers the operational load and can prolong the lifespan of existing hardware.
- Improved Reliability and Availability: Caching can enhance the reliability of a system. In scenarios where the backend system is temporarily unavailable, the cache might still be able to serve data, thus ensuring continuous availability.
- Smoother User Experience: For end-users, the most noticeable benefit is a smoother, faster, and more seamless experience, as data loads quicker and applications feel more responsive.
Redis As Cache
Distributing Redis across multiple nodes to work as a cache involves setting up a Redis Cluster, which allows you to automatically split your dataset among multiple nodes, providing both data sharding and high availability.
1. Understanding Redis Cluster
- Sharding: Redis Cluster automatically shards data across multiple Redis nodes.
- High Availability: It uses a primary-replica (master-replica) model to provide failover and data redundancy.
- Automatic Partitioning: Redis Cluster handles partitions (shards) automatically.
2. Using the Cluster for Caching
- Data Distribution: Data is automatically sharded across the different master nodes in the cluster. Redis uses a hash slot mechanism for distributing keys. It’s important to understand this for efficient caching and avoiding hotspots.
- Client Interaction: Use a Redis client that supports cluster mode to interact with the Redis Cluster. The client will automatically handle the routing of commands to the appropriate node (see the sketch after this list).
- Adding and Removing Nodes: You can dynamically add or remove nodes from the cluster as your caching needs change.
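As referenced above, here is a sketch of a client talking to a Redis Cluster used as a cache. It assumes the ioredis Node.js client, placeholder node addresses, and an invented loadProductFromDatabase function; the client hashes each key to a slot and sends the command to the node that owns it.

const Redis = require('ioredis');
// Placeholder cluster node addresses; the client discovers the remaining nodes.
const cluster = new Redis.Cluster([
  { host: '10.0.0.11', port: 6379 },
  { host: '10.0.0.12', port: 6379 },
]);
// Placeholder for the real database call.
async function loadProductFromDatabase(id) { /* ... */ }
async function getProduct(id) {
  const key = `product:${id}`;
  // The key is hashed to a slot, and the command is routed to the owning node.
  const cached = await cluster.get(key);
  if (cached) return JSON.parse(cached);
  const product = await loadProductFromDatabase(id);
  // Cache-aside write with a 5-minute TTL.
  await cluster.set(key, JSON.stringify(product), 'EX', 300);
  return product;
}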
Redis replication handles the expiration of keys in a specific way to ensure consistency between the primary server and its replicas. Here’s how it works:
- Expiration Propagation: When a key with an expiration (TTL, Time To Live) set on the primary Redis server expires or is deleted, the primary server sends a DEL command to the replicas. This ensures that the key is removed from all replicas at the same time it’s removed from the primary server.
- Expiration Persistence: If a Redis primary server is restarted, the expiration times of keys are saved and reloaded. This ensures that the expiration of keys remains consistent across restarts.
- Replicas’ Expiration Handling: Replicas do not independently expire keys based on their TTL. Instead, they wait for the explicit DEL command from the primary server. This prevents situations where network latency or clock drift could cause inconsistencies in key expiration between the primary and its replicas.
- Active Expiry vs Passive Expiry: Redis uses two mechanisms for expiring keys – active and passive. In active expiry, Redis periodically checks a random set of keys to see if they have expired. In passive expiry, Redis checks the expiration of a key only when it is accessed. Both methods are used on the primary server, but replicas rely on the DEL commands from the primary server.
- Effects on Read Operations: Since replicas do not independently expire keys, a key with an expired TTL might still be accessible on a replica until the DEL command is received from the primary. This could lead to brief inconsistencies in read operations.
- Persistence of Expirations: When using disk persistence (like RDB snapshots), Redis stores the expiration times of keys. During the loading process after a restart, Redis will delete any keys that should have expired during the downtime.
- Handling Failovers: During a failover, where a replica becomes the new primary, it will start handling key expirations. Any keys that should have expired while it was a replica will be expired promptly after the failover.
WebSockets
WebSockets is a communication protocol that provides a full-duplex communication channel over a single, long-lived connection. It is commonly used for real-time web applications because it allows for interactive communication between a client (like a web browser) and a server.
How WebSockets Work:
- Opening Handshake:
- It starts with a regular HTTP request from the client, which includes an “Upgrade” header indicating the request to switch to WebSockets.
- If the server supports WebSockets, it responds with an HTTP 101 status code (Switching Protocols), confirming the protocol switch.
- Data Transfer:
- After the handshake, the connection is upgraded from HTTP to WebSockets.
- Data can now be sent back and forth between the client and server in real-time. This is done through “frames” of data, which can be either text or binary.
- Closing the Connection:
- Either the client or server can initiate a close handshake.
- This is a graceful way to close the connection, ensuring that any remaining data is transmitted before the connection is terminated.
Using WebSockets:
Basic Example:
- Client-Side (JavaScript):
var socket = new WebSocket('ws://example.com/socket');
socket.onopen = function(event) {
  console.log('Connection established');
  socket.send('Hello Server!');
};
socket.onmessage = function(event) {
  console.log('Message from server:', event.data);
};
socket.onclose = function(event) {
  console.log('Connection closed');
};
- Server-Side (Node.js with ws library):
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', function connection(ws) {
  ws.on('message', function incoming(message) {
    console.log('received: %s', message);
    ws.send('Hello Client!');
  });
  ws.send('Welcome to the WebSocket server!');
});
Practical Use Cases:
- Chat Applications: WebSockets are ideal for building real-time chat applications where messages need to be exchanged quickly and efficiently between users.
- Gaming: Real-time multiplayer online games can use WebSockets for fast, bi-directional communication.
- Live Notifications: For applications that require live updates or notifications (like social media or news feeds).
- Financial Trading Platforms: Real-time updates of stock prices or trading data are efficiently handled with WebSockets.
Advantages:
- Reduced Latency: Because the connection is kept open, there is less overhead and latency in communication.
- Bidirectional Communication: Both client and server can initiate communication, making it suitable for real-time applications.
- Efficient: Better use of server resources compared to traditional HTTP polling.
Considerations:
- Browser Compatibility: While most modern browsers support WebSockets, older versions may not.
- Security: As with any web technology, proper security measures (like using wss:// for encrypted communication) should be implemented.
- Fallbacks: For environments where WebSockets aren’t supported, fallback mechanisms like long-polling can be used.
Reconciliation Service
A reconciliation service is a crucial component of an ad-click system, especially when dealing with a high volume of transactions and multiple sources of data. Here are the key reasons why a reconciliation service is necessary and the benefits it offers:
1. Accuracy and Consistency
- Purpose: Ensures the accuracy and consistency of data across different components of the system (like ad impressions, clicks, and billing).
- Benefit: Prevents discrepancies in ad delivery reports, click counts, and billing, thus maintaining trust with advertisers and partners.
2. Fraud Detection
- Purpose: Identifies unusual patterns or discrepancies that could indicate fraudulent activity, like click fraud.
- Benefit: Protects revenue by detecting and addressing fraudulent activities promptly.
3. Billing and Payment Verification
- Purpose: Ensures the correctness of billing and payments by reconciling clicks, impressions, and other chargeable events with billing records.
- Benefit: Reduces billing errors and disputes with advertisers, leading to smoother financial operations.
4. Data Integration from Multiple Sources
- Purpose: Merges and reconciles data from various sources, such as different ad servers, click trackers, and analytics platforms.
- Benefit: Provides a unified view of data, crucial for accurate reporting and decision-making.
Implementation Considerations
- Real-time vs Batch Processing: Depending on the volume and requirements, reconciliation can be done in real-time or via batch processing.
- Automated Alerts and Reports: Implement automated alerts for discrepancies and regular reporting for ongoing monitoring.
- Scalability: Ensure the reconciliation service can scale to handle the increasing volume of data as the ad-click system grows.
Other Terminologies
Race condition
A race condition occurs when two or more processes access shared data at the same time and at least one of them modifies it. This can lead to unpredictable results and bugs because the outcome depends on the sequence or timing of the processes’ execution. Preventing race conditions typically involves proper synchronization mechanisms that ensure only one thread or process accesses the shared resource at a time, or designing the system so that shared state is minimized or eliminated (see the sketch below).
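Here is a sketch of a lost-update race and one way to avoid it, using an invented withLock helper that serializes access to a shared balance; in a single Node.js process the harmful interleaving happens across await points rather than across OS threads.

let balance = 100;
// Racy: two concurrent withdrawals both read the same starting balance,
// so one update is lost.
async function withdrawRacy(amount) {
  const current = balance;                       // read
  await new Promise((r) => setTimeout(r, 10));   // simulated I/O between read and write
  balance = current - amount;                    // write based on a stale read
}
// Minimal promise-based lock: each caller waits for the previous one to finish.
let lock = Promise.resolve();
function withLock(fn) {
  const run = lock.then(fn);
  lock = run.catch(() => {});   // keep the chain alive even if fn throws
  return run;
}
async function withdrawSafe(amount) {
  return withLock(async () => {
    const current = balance;
    await new Promise((r) => setTimeout(r, 10));
    balance = current - amount;
  });
}
(async () => {
  await Promise.all([withdrawRacy(30), withdrawRacy(30)]);
  console.log(balance);   // 70, not 40: one withdrawal was lost
  balance = 100;
  await Promise.all([withdrawSafe(30), withdrawSafe(30)]);
  console.log(balance);   // 40: the updates were serialized
})();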
Normalization
Normalization is a process aimed at organizing data to minimize redundancy and improve data integrity. For example, a single wide table might be split into separate customer and order tables.
Denormalization
Denormalization is the process of intentionally introducing redundancy into a database to improve performance, usually by merging two or more tables.
Shared Responsibility Model
Security and compliance is a shared responsibility between AWS and the customer: AWS is responsible for “security of the cloud,” while the customer is responsible for “security in the cloud.”
SLO, SLA, SLI
- SLO (Service-Level Objective): a precise numerical target for system availability; we term this target the availability SLO of the system.
- SLA (Service-Level Agreement): normally a promise to someone using your service that its availability SLO will meet a certain level over a certain period, and that some kind of penalty will be paid if it fails to do so.
- SLI (Service-Level Indicator): a direct measurement of a service’s behavior, such as the frequency of successful probes of the system.