Designing a system similar to Twilio, which is a cloud communications platform, requires careful planning and architecture to ensure high availability, scalability, and fault tolerance.
Some basic requirements:
- Send API: Provide a secure and authenticated endpoint to enable the initiation of notification dispatch from various backend systems and microservices.
- Supported Channels: Ensure compatibility with a diverse range of notification channels that offer an API, including but not limited to Email, SMS, and Push Notifications.
- User Preferences Management: Implement a feature that allows users to personalize their notification preferences, including the choice of notification types and preferred channels.
- Compliance with Downstream Service Limits: Design the system to respect rate limits and usage policies of downstream services, such as email and SMS providers, to prevent throttling or suspension.
- Scalability: Design the architecture to support horizontal scaling, facilitating virtually unlimited expansion to accommodate growing demand and traffic.
Below is an overview of the key components and design considerations for such a system:
Core Components
a. API Gateway
- Function: Serves as the entry point for clients to access the services (SMS, Voice, Video, etc.) and to send and manage messages.
- Scalability: Load balancing to distribute incoming requests evenly across servers.
- Security: Authentication and rate limiting to prevent abuse.
b. Application Servers
- Function: Handle business logic for different services (e.g., sending SMS, voice messages, and notifications).
- Scalability: Stateless design allows easy horizontal scaling.
c. Database Cluster
- Function: Stores data like user accounts, message logs, billing information.
- Scalability: Use a distributed database system for horizontal scaling.
- Fault Tolerance: Replication across multiple nodes and data centers.
d. Message Queue
- Function: Decouples the application servers from work that can run asynchronously (e.g., sending notifications) and buffers messages for delivery, so that message submission is independent of sending. Separate queues handle different priorities, for example p0, p1, and p2.
- Scalability and Fault Tolerance: Distributed queues with redundancy.
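As a rough illustration of the p0/p1/p2 idea, the sketch below keeps one in-memory queue per priority and always drains the highest-priority non-empty queue first. The class name `PriorityDispatcher` and the in-memory deques are assumptions made for illustration; a real deployment would sit on a durable, distributed queue.

```python
from collections import deque
from typing import Optional

PRIORITIES = ("p0", "p1", "p2")  # p0 is the most urgent


class PriorityDispatcher:
    """Hypothetical in-memory stand-in for three priority queues."""

    def __init__(self) -> None:
        self.queues = {name: deque() for name in PRIORITIES}

    def enqueue(self, priority: str, message: str) -> None:
        if priority not in self.queues:
            raise ValueError(f"unknown priority: {priority}")
        self.queues[priority].append(message)

    def dequeue(self) -> Optional[str]:
        # Scan from p0 down; within one queue, messages stay in FIFO order.
        for name in PRIORITIES:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None


dispatcher = PriorityDispatcher()
dispatcher.enqueue("p2", "weekly newsletter")
dispatcher.enqueue("p0", "login OTP")
dispatcher.enqueue("p1", "order shipped")
order = [dispatcher.dequeue() for _ in range(3)]
```

Strict priority like this can starve the p2 queue under sustained p0 load; weighted polling across the queues is one possible alternative.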
e. Content Delivery Network (CDN)
- Function: Serves static content and reduces latency for global users.
- Fault Tolerance: Automatic rerouting in case of node failure.
f. Carrier Integration Service
- Function: Provides interfaces to various carriers and SMS gateways for message delivery.
- Scalability: Deploy in multiple regions to handle traffic near users.
g. SID Generation Service
- Function: Generates unique identifiers (SIDs) for each resource (e.g., SMS, call).
- Implementation: Utilize a high-performance, distributed ID generation system (like Twitter’s Snowflake algorithm) to ensure uniqueness and scalability.
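A minimal sketch of a Snowflake-style generator, assuming the common layout of a millisecond timestamp, a 10-bit worker ID, and a 12-bit per-millisecond sequence. The custom epoch and the class name `SnowflakeGenerator` are assumptions; this is not Twilio's actual SID format.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000  # arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id: int) -> None:
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = time.time_ns() // 1_000_000
            if now <= self.last_ms:
                # Same millisecond (or the clock moved backwards): reuse the
                # last timestamp and bump the sequence instead.
                now = self.last_ms
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted: borrow the next millisecond.
                    now += 1
            else:
                self.sequence = 0
            self.last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self.sequence
            )


generator = SnowflakeGenerator(worker_id=7)
first_id = generator.next_id()
second_id = generator.next_id()
```

Uniqueness across machines depends on every instance getting a distinct `worker_id`, which has to be coordinated outside this code.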
h. Notification Template System
- Function: Manages and stores templates for various types of notifications (SMS, email, etc.).
- Features: Support for dynamic content insertion and template versioning.
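Dynamic content insertion and versioning could look roughly like the sketch below, which uses Python's standard `string.Template`. The in-memory `TEMPLATES` dictionary, the template name `otp_sms`, and the helper names are hypothetical stand-ins for the Templates table.

```python
from string import Template
from typing import Mapping, Optional

# (template name, version) -> template body; stand-in for the Templates table
TEMPLATES = {
    ("otp_sms", 1): Template("Your OTP is $code"),
    ("otp_sms", 2): Template(
        "Hi $name, your OTP is $code. It expires in $minutes minutes."
    ),
}


def latest_version(name: str) -> int:
    # Raises ValueError if no template with this name exists.
    return max(version for (n, version) in TEMPLATES if n == name)


def render(name: str, variables: Mapping[str, str],
           version: Optional[int] = None) -> str:
    if version is None:
        version = latest_version(name)
    # substitute() raises KeyError when a placeholder has no value,
    # which surfaces incomplete data before anything is sent.
    return TEMPLATES[(name, version)].substitute(variables)


pinned = render("otp_sms", {"code": "123456"}, version=1)
latest = render("otp_sms", {"name": "Ada", "code": "123456", "minutes": "5"})
```

Pinning a version lets callers keep using an older template while a new one is rolled out.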
i. Notification Log
- Function: Records the history of all notifications sent, including details like time, recipient, content, and delivery status.
- Implementation: A scalable logging system, potentially using a time-series database for efficient storage and retrieval of log data.
Database Schema
- Users Table: Stores user information and preferences.
- Fields: UserID, UserName, Email, PhoneNumber, Preferences, CreatedAt, UpdatedAt
- Messages Table: Records details of each message.
- Fields: MessageID (SID), UserID, MessageType, Content, Status, CreatedAt, UpdatedAt
- Templates Table: Holds message templates.
- Fields: TemplateID, TemplateName, Content, Variables, CreatedAt, UpdatedAt
- Message Logs Table: Logs each message’s lifecycle.
- Fields: LogID, MessageID, Status, Timestamp, CarrierInfo
- Carrier Information Table: Contains data related to carriers.
- Fields: CarrierID, CarrierName, APIEndpoint, SupportedFormats, ReliabilityScore
Messages Table Example
Field Name | Data Type | Description | Example Values |
---|---|---|---|
MessageID | VARCHAR(255) | Unique identifier for the message | ‘MSG001’, ‘MSG002’ |
UserID | VARCHAR(255) | Identifier for the user who sent/received the message | ‘USR001’, ‘USR002’ |
MessageType | VARCHAR(50) | Type of the message (e.g., SMS, Email, OTP) | ‘SMS’, ‘Email’, ‘OTP’ |
Content | TEXT | The actual content of the message | ‘Your OTP is 123456’ |
Status | VARCHAR(50) | Current status of the message (e.g., Sent, Delivered) | ‘Sent’, ‘Delivered’, ‘Failed’ |
Timestamp | DATETIME | Date and time when the message was created/sent | ‘2023-01-01 12:30:00’ |
DeliveredTimestamp | DATETIME | Date and time when the message was delivered | ‘2023-01-01 12:32:00’ |
CarrierInfo | VARCHAR(255) | Information about the carrier (for SMS) | ‘Verizon’, ‘AT&T’ |
ErrorCode | VARCHAR(50) | Error code if the message failed to send | ‘Invalid Number’, ‘Network Error’ |
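
The table above can be tried out with an in-memory SQLite database, as sketched below. SQLite is only an assumption made to keep the example self-contained; the column types are simplified to `TEXT`, and a production system would use the distributed database described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Messages (
        MessageID          TEXT PRIMARY KEY,
        UserID             TEXT NOT NULL,
        MessageType        TEXT NOT NULL,
        Content            TEXT NOT NULL,
        Status             TEXT NOT NULL,
        Timestamp          TEXT NOT NULL,
        DeliveredTimestamp TEXT,
        CarrierInfo        TEXT,
        ErrorCode          TEXT
    )
    """
)

# Record a message when it is sent ...
conn.execute(
    "INSERT INTO Messages (MessageID, UserID, MessageType, Content, Status, "
    "Timestamp, CarrierInfo) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("MSG001", "USR001", "SMS", "Your OTP is 123456", "Sent",
     "2023-01-01 12:30:00", "Verizon"),
)

# ... and update it when the delivery report arrives.
conn.execute(
    "UPDATE Messages SET Status = ?, DeliveredTimestamp = ? WHERE MessageID = ?",
    ("Delivered", "2023-01-01 12:32:00", "MSG001"),
)

row = conn.execute(
    "SELECT Status, DeliveredTimestamp FROM Messages WHERE MessageID = ?",
    ("MSG001",),
).fetchone()
```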
Example APIs
POST https://api.notificationservice.com/v3/mail/send
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
Body:
{
  "personalizations": [
    {
      "to": [
        {
          "email": "recipient@example.com"
        }
      ],
      "subject": "Hello, World!"
    }
  ],
  "from": {
    "email": "sender@example.com"
  },
  "content": [
    {
      "type": "text/plain",
      "value": "Hello, World!"
    }
  ]
}
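A client could assemble the request above along the lines of the following sketch, which uses only the Python standard library. The helper name `build_send_request` and the placeholder key are assumptions, and the request is built but not sent because the host is illustrative.

```python
import json
import urllib.request

# Illustrative host taken from the example above; it is not a real service.
API_URL = "https://api.notificationservice.com/v3/mail/send"


def build_send_request(api_key: str, sender: str, recipient: str,
                       subject: str, text: str) -> urllib.request.Request:
    payload = {
        "personalizations": [
            {"to": [{"email": recipient}], "subject": subject}
        ],
        "from": {"email": sender},
        "content": [{"type": "text/plain", "value": text}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_send_request(
    "YOUR_API_KEY", "sender@example.com", "recipient@example.com",
    "Hello, World!", "Hello, World!",
)
# Passing `request` to urllib.request.urlopen would send it; that call is
# left out here because the host is illustrative.
```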
Message Data Flow
Assuming messages have a life cycle of:
- Created: the message is created by a user or service.
- Pending: the message has been processed and placed in the message queue, waiting to be sent.
- Sent: the message has been handed to the carrier’s short message service center (SMSC).
- Delivered: the carrier’s SMSC has delivered the message to the recipient’s phone.
- Failed: the message could not be delivered to the recipient’s device.
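The life cycle above can be enforced with a small state machine, sketched below. The transition table is an assumption: it treats Delivered and Failed as terminal and allows Failed from any earlier state (for example, when validation rejects a newly created message).

```python
from enum import Enum


class Status(Enum):
    CREATED = "Created"
    PENDING = "Pending"
    SENT = "Sent"
    DELIVERED = "Delivered"
    FAILED = "Failed"


# Assumed transition table; Delivered and Failed are terminal.
ALLOWED = {
    Status.CREATED: {Status.PENDING, Status.FAILED},
    Status.PENDING: {Status.SENT, Status.FAILED},
    Status.SENT: {Status.DELIVERED, Status.FAILED},
    Status.DELIVERED: set(),
    Status.FAILED: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


status = Status.CREATED
for step in (Status.PENDING, Status.SENT, Status.DELIVERED):
    status = transition(status, step)
```

Rejecting illegal moves also guards against out-of-order status updates, such as a late "Sent" report arriving after "Delivered".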
1. Message Generation and Submission
- User/Application Request: A user or application initiates a request to send a message (SMS, voice, etc.) via the system’s API.
- API Gateway: The request is received by the API Gateway, which handles authentication and rate limiting.
- SID Generation Service: A unique SID (String Identifier) is generated for the new message, ensuring each message can be individually tracked and managed.
- Content Validation Service: The message content is validated against company policies and regulatory standards to ensure compliance.
- User List Database & Opt-In/Opt-Out Check: The system checks the recipient’s preferences in the User List Database to ensure they have opted in to receive the notification.
2. Message Processing and Templating
- Notification Template System: If the message uses a template, the Notification Template System dynamically populates the message content based on the specified template and user data.
- Application Servers: The business logic is processed, including any necessary data manipulation or additional logic specific to the type of message.
3. Message Queuing and Handling
- Message Queue: The processed message is placed into a Message Queue to be sent. This decouples message sending from processing, enhancing scalability and fault tolerance.
- Not all messages are equal: some transactional messages are more important than marketing messages.
- Incoming load needs to be prioritized so that one type of message does not delay the others. Each user also needs a rate limit.
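One common way to apply a per-user rate limit is a token bucket per user, sketched below. The class names, the refill rate, and the injectable clock are assumptions for illustration; in a multi-server deployment the bucket state would need to live in a shared store.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerUserLimiter:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = self.buckets[user_id] = TokenBucket(
                self.rate, self.capacity, now
            )
        return bucket.allow(now)


limiter = PerUserLimiter(rate=5.0, capacity=10)
accepted = limiter.allow("USR001")
```

The same bucket shape can be keyed by downstream provider instead of by user to stay within an email or SMS provider's limits.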
4. Sending the Message
- Connection Handlers: The message is picked up from the queue by Connection Handlers, which manage the connections to various carriers and external services.
- Carrier/ISP Integration: The message is sent to the recipient’s carrier or Internet Service Provider (ISP).
5. Status Tracking and Logging
- Status Updates: Throughout this process, the system receives status updates (e.g., sent, delivered, failed) from the carrier/ISP. These updates are processed in real time.
- Notification Log: All activities, including the message content, recipient, status, and timestamps, are logged in the Notification Log, associated with the message’s SID.
6. Delivery and Confirmation
- Delivery to Recipient: The carrier/ISP delivers the message to the end recipient’s device.
- Delivery Confirmation: The system receives a delivery confirmation from the carrier/ISP, updating the message status accordingly.
Monitoring and Alerting
- Real-time Monitoring: Set up monitoring to detect unusual spikes in email sending, which could indicate duplicate sends.
- Alert Systems: Implement alerting mechanisms to notify administrators when potential duplications are detected.
Questions
Q: How would you ensure the scalability of the messaging system?
Answer:
- Horizontal Scaling: Design the system to scale horizontally by adding more servers or instances as the load increases.
- Load Balancing: Implement load balancers to distribute traffic evenly across servers, preventing any single server from becoming a bottleneck.
- Stateless Design: Ensure that the application servers are stateless, allowing them to scale without dependency on local states.
- Distributed Message Queues: Use distributed message queues (like Kafka or RabbitMQ) to handle high volumes of incoming message requests without overloading the processing servers.
- Database Scalability: Employ a distributed database system that can be scaled out to handle large amounts of data with efficient read/write operations.
Q: How would you handle data consistency and reliability in the system?
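Answer: (1) Database Transactions: Use database transactions so that related writes, such as recording a message and updating its status, are atomic and consistent. The external send itself cannot be part of a database transaction, so it has to be paired with acknowledgments and retries.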
(2) Replication: Implement replication in database systems to ensure data is duplicated across multiple nodes, preventing data loss.
(3) Acknowledgment Mechanisms: Use acknowledgment mechanisms from downstream systems (like carriers or email servers) to confirm message delivery and update the system status accordingly.
(4) Retry Logic: Design intelligent retry logic that can distinguish between transient and permanent failures and only retries when appropriate.
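Retry logic that separates transient from permanent failures might look like the sketch below, which retries only transient errors with exponential backoff and jitter. The exception classes, the function name `send_with_retry`, and the default limits are assumptions; how a real carrier error maps to either class depends on that carrier's API.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or a throttling response."""


class PermanentError(Exception):
    """Not worth retrying, e.g. an invalid phone number."""


def send_with_retry(send: Callable[[], T], max_attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except PermanentError:
            raise  # retrying cannot help
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay))
    raise RuntimeError("unreachable")


# Usage with a stubbed sender that fails twice, then succeeds.
calls = {"n": 0}
delays: List[float] = []


def flaky_send() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "ok"


result = send_with_retry(flaky_send, sleep=delays.append)
```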
Q: How would you handle a high volume of messages?
Answer: (1) A Load Balancer should distribute incoming requests evenly across all available Application Server instances. This ensures no single server is overwhelmed.
(2) Application Servers should be designed to scale out horizontally automatically. This means adding more server instances to handle increased load.
(3) Servers should be stateless, meaning they don’t retain user session information. This allows any server in the pool to handle any request.
(4) A robust Message Queue system like Kafka or RabbitMQ can handle high throughput. It temporarily stores messages before they are processed, ensuring that sudden spikes in traffic don’t overwhelm the Application Servers.
(5) Rate limiting prevents a single user or service from overloading the system by capping the number of requests they can make in a given timeframe.
Q: How would you handle application server downtime?
Answer: (1) Maintain redundant instances of Application Servers in different data centers to ensure availability even if one server or data center goes down.
(2) State Recovery and Persistence: After a server failure, the system should be able to recover the state of in-progress operations. This can be achieved through:
a. Persistent Message Queue: Messages waiting to be processed are not lost during a server failure. They remain in the queue because it is persistent, and processing resumes from the queue once the Application Server is back online or failover completes.
b. Database Transactions: Use database transactions to maintain data integrity. If a process is interrupted, the system can roll back to the last consistent state.
(3) Message Acknowledgment and Retries: The system should acknowledge the receipt and processing of messages. In the event of a failure before acknowledgment, the message can be retried.
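The acknowledgment-and-retry behaviour can be illustrated with the toy queue below, where a message leaves the queue for good only after it is acknowledged. `AckQueue` and its method names are hypothetical; brokers such as Kafka or RabbitMQ provide their own mechanisms for this.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class AckQueue:
    """Hypothetical in-memory queue with explicit acknowledgments."""

    def __init__(self) -> None:
        self.ready: Deque[Tuple[str, str]] = deque()
        self.in_flight: Dict[str, str] = {}

    def publish(self, msg_id: str, body: str) -> None:
        self.ready.append((msg_id, body))

    def receive(self) -> Optional[Tuple[str, str]]:
        if not self.ready:
            return None
        msg_id, body = self.ready.popleft()
        self.in_flight[msg_id] = body  # held until acknowledged
        return msg_id, body

    def ack(self, msg_id: str) -> None:
        self.in_flight.pop(msg_id, None)

    def requeue_unacked(self) -> None:
        # Called when a consumer is declared dead: everything it had
        # received but not acknowledged becomes available again.
        for msg_id, body in self.in_flight.items():
            self.ready.append((msg_id, body))
        self.in_flight.clear()


queue = AckQueue()
queue.publish("MSG001", "Your OTP is 123456")

first = queue.receive()      # worker A takes the message, then crashes
queue.requeue_unacked()      # failover: the message is made ready again
second = queue.receive()     # worker B takes the same message
queue.ack(second[0])         # worker B finishes and acknowledges
```

This gives at-least-once delivery: if worker A had already sent the message before crashing, worker B sends it again, which is why consumers also need duplicate detection.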
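Q: How would you avoid sending duplicate messages after a service failure or for any other reason?
Answer: The core idea is to attach an idempotency key to each message and record it in a globally visible distributed log or database. A Bloom filter can serve as a fast pre-check for whether a key has been seen: a negative answer is definitive, while a positive answer can be a false positive and should be confirmed against the log or database.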
(1) Assign a unique identifier (UID) to each message. This UID should be generated at the origin point (e.g., by the client or as soon as the message hits your system). Before sending a message, check if the UID has already been processed. If it has, skip resending.
(2) Avoid duplicate recipients at the application server: before resending, check whether an email to a specific customer for a specific purpose has already been sent, so that the operation stays idempotent.
(3) Utilize database transactions when updating the state of a message. This ensures that if a process fails midway, the system can roll back to the previous consistent state.
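(4) After sending a message, wait for an acknowledgment from the downstream system (carrier, email server, etc.) before marking it as sent. Retry only when no acknowledgment arrives, reusing the same UID so that the retry can be deduplicated.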
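The UID check in point (1) needs to be atomic, otherwise two workers can both see "not processed" and both send. One way to sketch this is a table with a uniqueness constraint, shown below with in-memory SQLite as a stand-in for the shared store. The table name `processed` and the helpers `claim` and `send_once` are hypothetical.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")


def claim(key: str) -> bool:
    """Return True only for the first caller that presents this key."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed (idempotency_key) VALUES (?)",
            (key,),
        )
    # rowcount is 1 if the row was inserted, 0 if the key already existed.
    return cursor.rowcount == 1


sent: List[str] = []  # stands in for the real send call


def send_once(key: str, message: str) -> bool:
    if not claim(key):
        return False  # already handled; skip resending
    sent.append(message)
    return True


first_attempt = send_once("MSG001", "Your OTP is 123456")
retry_attempt = send_once("MSG001", "Your OTP is 123456")
```

Claiming the key before sending means a crash between the claim and the send loses the message; claiming after sending risks a duplicate instead. Which trade-off is acceptable depends on the message type.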
Q: How would you prevent multiple application servers from sending duplicate emails within a short time?
Answer: (1) Assign a unique identifier (UID) to each message, and give each email send request a unique transaction ID. Before sending an email, check whether an email with the same parameters (such as recipient, subject, and content hash) has already been sent.
(2) Implement logic in the queue system to detect and remove duplicate requests based on unique identifiers within a given time window.
(3) Once the message is delivered, the downstream system (carrier/SMTP server) sends a delivery status back to the service provider (Twilio/SendGrid). The application can obtain it via a synchronous API response, asynchronous webhooks, or by querying the service provider’s API. Once an application instance receives the acknowledgment, it processes it accordingly, for example by updating the message status in the database.
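Points (1) and (2) can be combined into a time-windowed duplicate check keyed on a hash of the email's parameters, as in the sketch below. The class name `DedupeWindow`, the choice of SHA-256, and the in-memory dictionary are assumptions; with multiple application servers the fingerprints would have to live in a shared store with expiry.

```python
import hashlib
import time
from typing import Callable, Dict


class DedupeWindow:
    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window_seconds
        self.clock = clock
        self.seen: Dict[str, float] = {}  # fingerprint -> first-seen time

    @staticmethod
    def fingerprint(recipient: str, subject: str, content: str) -> str:
        # The unit separator keeps field boundaries unambiguous.
        raw = "\x1f".join((recipient.lower(), subject, content))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def is_duplicate(self, recipient: str, subject: str, content: str) -> bool:
        now = self.clock()
        # Forget fingerprints that have aged out of the window.
        self.seen = {
            key: seen_at for key, seen_at in self.seen.items()
            if now - seen_at < self.window
        }
        key = self.fingerprint(recipient, subject, content)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False


window = DedupeWindow(window_seconds=300)
first_seen = window.is_duplicate("recipient@example.com", "Hello", "Hi there")
seen_again = window.is_duplicate("recipient@example.com", "Hello", "Hi there")
```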
Q: What happens if the phone is offline or not connected to the internet?
Answer: When a push notification is sent through the Apple Push Notification service (APNs) to an iPhone that is turned off or not connected to the internet, the handling of the notification depends on several factors.
(1) Notification Storage by APNs: APNs holds the notification for a limited period if the device is offline or unavailable. When sending a notification, the sender can specify a Time-to-Live (TTL) value, which indicates how long APNs should keep attempting delivery. If the device does not come online within this period, the notification is discarded.
(2) Delivery Once the Device Is Online: When the iPhone is turned back on and reconnects to the internet, it re-establishes its connection with APNs. If the notification is still within its TTL, APNs attempts to deliver it to the device.
(3) Coalescing Multiple Notifications: If multiple notifications were sent while the device was offline, APNs may coalesce them to reduce load. This means only the latest notification from each app might be delivered, depending on the app’s notification settings.
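In the APNs HTTP/2 provider API, the TTL is expressed through the `apns-expiration` request header (a UNIX timestamp in seconds, where `0` means "try once and do not store"), and `apns-collapse-id` lets newer notifications replace older ones. The sketch below only builds those headers; the helper name `apns_headers`, the topic value, and the choice of the `alert` push type are assumptions, and authentication and the actual HTTP/2 call are left out.

```python
import time
from typing import Dict, Optional


def apns_headers(topic: str, ttl_seconds: int,
                 collapse_id: Optional[str] = None,
                 now: Optional[float] = None) -> Dict[str, str]:
    current = time.time() if now is None else now
    headers = {
        "apns-topic": topic,
        "apns-push-type": "alert",
        # Expiry as an absolute UNIX time; "0" asks APNs not to store
        # the notification if the first delivery attempt fails.
        "apns-expiration": (
            str(int(current) + ttl_seconds) if ttl_seconds > 0 else "0"
        ),
    }
    if collapse_id is not None:
        # Notifications sharing a collapse ID are merged into one.
        headers["apns-collapse-id"] = collapse_id
    return headers


headers = apns_headers(
    "com.example.app", ttl_seconds=3600, collapse_id="order-42",
    now=1_700_000_000,
)
```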
Q: For a Twilio SMS notification service, if the receiver’s phone is turned off, what happens to the messages waiting to be delivered?
Answer: When a Twilio SMS is sent to a phone that is turned off or out of network coverage, the delivery of the message is managed by the Short Message Service Center (SMSC) of the recipient’s mobile carrier.
The carrier’s SMSC attempts to deliver the message to the recipient’s phone. If the phone is off or out of network coverage, the SMSC cannot deliver the message immediately. It typically stores the message for a certain period, known as the “validity period” (ranging from a few hours to several days), and periodically retries delivery during that time. Twilio’s system waits for a delivery report from the carrier and does not directly control the retry attempts. If the validity period expires before the phone becomes available, the carrier typically discards the message. Throughout this process, the status of the message can be tracked through Twilio’s platform: the sender can see whether the message is queued, sent, delivered, or failed.
It’s important to note that SMS delivery, especially in cases where the recipient’s phone is off, is subject to the recipient’s carrier policies and is not guaranteed.
Q: How would you secure the messaging system?
Answer: (1) Encryption: Encrypt sensitive data both in transit (using TLS) and at rest.
(2) Authentication and Authorization: Implement strong authentication and authorization mechanisms for securing APIs.
(3) Regular Audits: Conduct regular security audits and vulnerability assessments.
(4) Compliance: Ensure compliance with relevant data protection regulations like GDPR or HIPAA.
(5) Rate Limiting and Throttling: Implement rate limiting to prevent abuse and DDoS attacks.
Q: How would you monitor and ensure the high availability of the system?
Answer:
(1) Monitoring and Logging: Implement comprehensive monitoring and logging to track system health, performance metrics, and to detect issues.
(2) Health Checks and Alerts: Set up health checks for all services and configure alerts for any service disruptions or degradations.
(3) Redundancy: Design the system with redundancy, having multiple instances of critical components in different availability zones or data centers.
(4) Failover Mechanisms: Implement automated failover mechanisms to switch to backup systems in case of a failure.
(5) Load Testing: Regularly perform load testing to understand how the system behaves under peak loads and adjust configurations accordingly.
Q: How would you implement logging and monitoring in this system? What metrics would you monitor?
Answer: Logging and monitoring can be implemented by:
- Centralized Logging: Use a centralized logging system to collect logs from all components of the system.
- Real-Time Monitoring Tools: Employ monitoring tools to track system health, performance metrics, and user activities.
- Alerts and Notifications: Set up alerts for anomalies, such as spikes in error rates or performance degradation.
- Key Metrics: Monitor metrics like throughput (messages sent per second), delivery rates, error rates, response times, and system resource utilization (CPU, memory, disk I/O).
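As a small illustration of the key metrics, the sketch below derives throughput, delivery rate, and error rate from a batch of log records. The record shape (a dictionary with a `status` field using the status values from the Messages table) and the function name `summarize` are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def summarize(events: Iterable[Mapping[str, str]],
              window_seconds: float) -> Dict[str, float]:
    """Compute basic delivery metrics for events seen in one time window."""
    counts = Counter(event["status"] for event in events)
    total = sum(counts.values())
    return {
        "throughput_per_second": total / window_seconds,
        "delivery_rate": counts["Delivered"] / total if total else 0.0,
        "error_rate": counts["Failed"] / total if total else 0.0,
    }


events = [
    {"message_id": "MSG001", "status": "Delivered"},
    {"message_id": "MSG002", "status": "Delivered"},
    {"message_id": "MSG003", "status": "Failed"},
    {"message_id": "MSG004", "status": "Delivered"},
]
metrics = summarize(events, window_seconds=2)
```

An alert rule could then compare `error_rate` against a threshold chosen from historical baselines.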