Designing a high-performance, high-throughput, and accurate distributed ad-show/ad-click system involves several key components, including system architecture, database selection, scalability solutions, and fault tolerance mechanisms.
Life Cycle of Log Lines
Moving logs from the machines that generate them all the way to log reporting involves a series of steps: log generation, collection/transportation, aggregation, storage, analysis, and reporting.
- Log Generation:
- Each machine, application, or device generates logs. These logs are textual records of events that happen within the system, such as user actions, system errors, or other significant events.
- The format and detail level of these logs can vary widely depending on the system and the configuration.
- Log Collection:
- Logs are collected from various sources. This can be done using agents installed on the machines or through centralized log collectors that pull or receive logs over the network.
- Collection methods might vary based on the type of log source (e.g., server, application, network device) and the log format.
- Log Aggregation:
- Log aggregation is the process of consolidating logs from different sources into a single, centralized system.
- This step often involves normalizing log data into a consistent format, making it easier to analyze.
- Aggregation can be performed by tools like log servers or log management systems, which can handle logs in various formats and from different sources.
- Log Storage:
- Once aggregated, logs are stored in a database or a file system. This storage needs to be scalable and efficient, as log data can grow rapidly.
- Some systems also index the log data to enable fast searching and querying.
- Log Analysis:
- The stored logs are analyzed to extract useful information. This can include identifying patterns, detecting anomalies, or generating insights.
- Analysis can be real-time or batch-processed, depending on the requirements. It might use complex algorithms, including machine learning, for more sophisticated analysis.
- Log Reporting and Visualization:
- The results of the log analysis are then reported to users. This can be in the form of dashboards, email alerts, or detailed reports.
- Visualization tools help in presenting the data in an easily understandable format, often using charts, graphs, and tables.
- Log Archiving:
- Finally, for long-term storage and compliance, logs may be archived. This involves moving older log data to less expensive storage while ensuring it can still be accessed if needed.
High Level Design
Data Collection Service
A Data Collection Service is a system component primarily responsible for gathering and ingesting data from various sources.
- Harvesting Data: Collects data from multiple sources like user interactions, system logs, external APIs, and IoT devices.
- Standardization: Ensures that incoming data is formatted and structured uniformly for easy processing and analysis.
- Initial Processing: Performs preliminary data cleaning and validation to ensure data quality.
- Data Integration: Combines the collected data into a centralized repository or forwards it to other system components for further processing.
Data Aggregation Service
A Data Aggregation Service plays a critical role in collecting, combining, and managing data from various sources.
- Integrates data from different sources to create a comprehensive dataset. For example, combining click data with user demographic information.
- Applies transformations to raw data to make it more suitable for analysis. This could include cleaning, deduplication, or restructuring of data.
Database Writer
A Database Writer service in a system like an ad-click platform is responsible for writing data to a database. It acts as a bridge between the application and the database, ensuring that data from various sources (like clicks, impressions, and user interactions) is efficiently and reliably stored in the database.
Reconciliation Service
A reconciliation service plays an important role in ensuring data integrity, detecting fraud, maintaining compliance, and enabling accurate billing and reporting. Its implementation should be planned to match the scale and specific requirements of the system.
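To make this concrete, here is a minimal reconciliation sketch in Python: it compares per-ad click counts recomputed from the raw logs against counts read from the aggregated store and reports any mismatch. The function and source names are illustrative, not part of a specific implementation.

# Hypothetical reconciliation check: compare per-ad click counts from two sources.
def reconcile(raw_log_counts, aggregated_counts, tolerance=0):
    """Return ads whose counts disagree by more than `tolerance`."""
    discrepancies = {}
    for ad_id in set(raw_log_counts) | set(aggregated_counts):
        raw = raw_log_counts.get(ad_id, 0)
        agg = aggregated_counts.get(ad_id, 0)
        if abs(raw - agg) > tolerance:
            discrepancies[ad_id] = {"raw": raw, "aggregated": agg}
    return discrepancies

# Example: counts recomputed from raw logs vs. counts read from the analytics store.
raw = {"12345": 150, "54321": 75, "11223": 50}
agg = {"12345": 150, "54321": 74, "11223": 50}
print(reconcile(raw, agg))   # {'54321': {'raw': 75, 'aggregated': 74}}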
Cache
A cache around the database serves to temporarily store frequently accessed data, reducing the number of direct queries to the database. This enhances system performance by speeding up data retrieval times, decreasing database load, and improving overall efficiency, especially for read-heavy operations.
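A minimal cache-aside sketch, using an in-memory dict with a TTL as a stand-in for Redis/Memcached; load_ad_from_db is a hypothetical placeholder for the real database query.

import time

_cache = {}              # ad_id -> (expires_at, value); stand-in for Redis/Memcached
CACHE_TTL_SECONDS = 60

def load_ad_from_db(ad_id):
    # Hypothetical placeholder for the real database query.
    return {"ad_id": ad_id, "image_url": f"https://cdn.example.com/{ad_id}.png"}

def get_ad(ad_id):
    entry = _cache.get(ad_id)
    if entry and entry[0] > time.time():       # cache hit, still fresh
        return entry[1]
    value = load_ad_from_db(ad_id)             # cache miss: read through to the DB
    _cache[ad_id] = (time.time() + CACHE_TTL_SECONDS, value)
    return value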
Database Clusters
A database cluster is a group of database servers working together to improve performance, availability, and scalability. Clusters distribute the database load across multiple nodes (servers), ensure high availability through redundancy, and allow horizontal scaling to handle increased traffic and data volume efficiently.
Scale Estimation
- Number of Ads Served Per Day: Assume 100 million ads are served daily.
- Click-Through Rate (CTR): Assume an average CTR of 1%. So, 1% of the ads served result in clicks.
- Daily Clicks: 1% of 100 million = 1 million clicks per day.
- Ad-Show Events: Besides clicks, assume 100 million daily users, each generating 10 ad-show events per day.
- Daily Ad-Show Events: 10 x 100 million users = 1 billion events per day.
- Total Daily Events (Clicks + Shows): 1 billion ad-shows + 1 million clicks = 1.001 billion events per day.
- Event Rate Per Second: 1.001 billion events / 86,400 seconds ≈ 11,580 events per second.
Estimating Read QPS:
- Real-Time Dashboard: 100 dashboards refreshing every 30 seconds = 1/30 QPS per dashboard x 100 ≈ 3.3 QPS.
- Aggregate Reports: 100 reports generated per hour = 100 reports / 3600 seconds ≈ 0.03 QPS.
- Ad-Hoc Queries: 500 queries per hour = 500 / 3600 ≈ 0.14 QPS.
- In total, that is roughly 3.5 QPS of read traffic, negligible compared with the write path.
In short,
- Roughly 1 billion events per day means about 11.6k events/second, call it 12k QPS on average.
- Peak QPS: assuming 5x the average, peak ≈ 60k QPS.
- If one event generates 0.5 KB of data, daily storage is 10^9 x 0.5 KB = 500 GB, so monthly storage is about 15 TB. A quick sanity check of these numbers follows below.
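As a sanity check of the estimates above (pure arithmetic, same assumptions):

ads_per_day = 100_000_000            # ads served daily
ctr = 0.01                           # 1% click-through rate
clicks_per_day = ads_per_day * ctr   # 1,000,000 clicks
shows_per_day = 1_000_000_000        # 1 billion ad-show events
events_per_day = shows_per_day + clicks_per_day

avg_qps = events_per_day / 86_400                      # ~11,586 events/second
peak_qps = avg_qps * 5                                 # ~58k, rounded to ~60k QPS
daily_storage_gb = events_per_day * 0.5 / 1_000_000    # 0.5 KB per event -> ~500 GB
monthly_storage_tb = daily_storage_gb * 30 / 1_000     # ~15 TB

print(f"{avg_qps:,.0f} avg QPS, {peak_qps:,.0f} peak QPS")
print(f"{daily_storage_gb:,.0f} GB/day, {monthly_storage_tb:,.1f} TB/month")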
API Design
- Ad Delivery API
This API is responsible for delivering ads to users.
GET /ads
Description: Retrieve a list of ads based on user context.
Parameters: user_id, page_context, location, device_type.
Response: List of ad objects with details like ad_id, image_url, click_url, impression_count.
- Click Tracking API
This API tracks when a user clicks on an ad.
POST /clicks
Description: Record a click on an ad.
Body: { "user_id": string, "ad_id": string, "timestamp": datetime, "location": string, "device_info": string }
Response: Acknowledgment with a unique click_id.
- Analytics API
This API provides analytics and reporting capabilities.
GET /analytics/clicks
Description: Get click statistics for a specified period.
Parameters: start_date, end_date, ad_id (optional).
Response: Statistics like total clicks, click-through rate, geographical distribution.
GET /analytics/impressions
Description: Get impression data for ads.
Parameters: start_date, end_date, ad_id (optional).
Response: Data like total impressions, unique users, impression rate.
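A minimal sketch of the click-tracking endpoint above, assuming Flask as the web framework; the field names mirror the API sketch, and enqueue_click_event is a hypothetical stand-in for the real write path.

import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/clicks", methods=["POST"])
def record_click():
    payload = request.get_json(force=True)
    required = {"user_id", "ad_id", "timestamp"}
    if not required.issubset(payload):
        return jsonify({"error": "missing fields"}), 400

    click_id = str(uuid.uuid4())
    # In the real system this would be enqueued / written by the Database Writer.
    enqueue_click_event({**payload, "click_id": click_id})   # hypothetical helper
    return jsonify({"click_id": click_id}), 201

def enqueue_click_event(event):
    print("queued:", event)   # placeholder for a message-queue producer

if __name__ == "__main__":
    app.run()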
Data Model
1. Ads Table
| Field | Type | Description |
|---|---|---|
| ad_id | Primary Key | Unique identifier for each ad |
| advertiser_id | Foreign Key | Identifier for the advertiser |
| content | Text | Content of the ad (e.g., image URL, text) |
| target_audience | Text | Intended audience for the ad |
| start_date | Date | Start date of the ad campaign |
| end_date | Date | End date of the ad campaign |
| budget | Decimal | Total budget for the ad |
2. Clicks Table
| Field | Type | Description |
|---|---|---|
| click_id | Primary Key | Unique identifier for each click |
| ad_id | Foreign Key | ID of the clicked ad |
| user_id | Foreign Key | ID of the user who clicked |
| timestamp | DateTime | Time of the click |
| location | Text | Geographic location of the click |
| device_info | Text | Information about the user's device |
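As an illustration, the two tables can be expressed as SQL DDL; the sketch below runs it through Python's built-in sqlite3 purely for demonstration (a production system would use a distributed SQL/NoSQL store instead).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ads (
    ad_id           TEXT PRIMARY KEY,
    advertiser_id   TEXT,
    content         TEXT,
    target_audience TEXT,
    start_date      DATE,
    end_date        DATE,
    budget          DECIMAL
);
CREATE TABLE clicks (
    click_id    TEXT PRIMARY KEY,
    ad_id       TEXT REFERENCES ads(ad_id),
    user_id     TEXT,
    timestamp   DATETIME,
    location    TEXT,
    device_info TEXT
);
""")
conn.execute(
    "INSERT INTO clicks VALUES (?, ?, ?, ?, ?, ?)",
    ("c-001", "12345", "67890", "2024-01-06T12:00:00.123Z", "North America", "Mobile"),
)
print(conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0])   # 1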
Ad System Components
1. Load Balancers
- Purpose: To distribute incoming traffic evenly across servers, ensuring reliability and availability.
- Key Feature: Supports auto-scaling and prevents any single point of failure.
2. Web Servers
- Purpose: To handle user requests, serve ads, and record clicks.
- Key Feature: Should be stateless to allow easy scaling.
3. Ad Delivery Engine
- Purpose: Selects the most relevant ads to display to users based on various criteria.
- Key Feature: Utilizes algorithms and machine learning for targeted ad delivery.
4. Click Tracking System
- Purpose: Records all clicks on ads and captures relevant data (like timestamp, user ID).
- Key Feature: High throughput and low latency for real-time tracking.
- Reconciliation Service: A reconciliation service is a crucial component of an ad-click system, especially when dealing with a high volume of transactions and multiple sources of data.
5. Data Storage Layer
- Purpose: Stores ad content, user data, click logs, and other related information.
- Key Feature: Scalable and high-performance databases (NoSQL for operational data, SQL or Data Warehousing for analytics).
6. Cache System
- Purpose: Temporarily stores frequently accessed data to reduce database load.
- Key Feature: Typically in-memory data stores like Redis or Memcached for fast access.
7. Data Processing and Analytics Engine
- Purpose: Processes large volumes of data for insights, reporting, and optimization.
- Key Feature: Utilizes Big Data technologies (like Hadoop, Spark) and supports real-time analytics.
8. Content Delivery Network (CDN)
- Purpose: Speeds up the delivery of ad content to users globally.
- Key Feature: Geographically distributed network of servers.
9. Monitoring and Logging System
- Purpose: Monitors the system’s health and performance; logs events and errors.
- Key Feature: Real-time monitoring and alerting capabilities (using tools like Prometheus, ELK stack).
10. User Authentication and Authorization System
- Purpose: Manages user access and security.
- Key Feature: Secure authentication mechanisms and role-based access control.
11. API Gateway
- Purpose: Serves as a single entry point for all API requests.
- Key Feature: Manages API requests and can handle rate limiting, authentication.
12. Backup and Disaster Recovery System
- Purpose: Ensures data integrity and system availability in case of failures.
- Key Feature: Regular backups and strategies for quick recovery.
13. Scalability and Auto-Scaling Mechanisms
- Purpose: Allows the system to handle varying loads efficiently.
- Key Feature: Dynamic scaling of resources based on traffic patterns.
TSDB
A time series database is a specialized type of database optimized for storing and managing time-stamped or time-series data — data points that are indexed in time order. It’s particularly well-suited for tracking changes over time and is widely used in various applications like financial markets, sensor data monitoring, application performance metrics, and IoT devices. Key characteristics include:
- Time-Stamped Data: Each record is associated with a timestamp, representing a specific point in time.
- Efficient Data Storage: Optimized to handle large volumes of sequential data points.
- Time-Based Queries: Offers fast retrieval based on time intervals, which is crucial for time series analysis.
- Data Aggregation: Supports aggregation over time periods (e.g., averages, sums, min/max) for analysis and reporting.
- Scalability: Can handle high volumes of write and read operations typical in time-based data.
- Real-Time Processing: Often capable of real-time data processing, making them ideal for live monitoring and alerting systems.
In a time series database used for an ad-click/ad-shown system, data is typically structured to efficiently support time-based queries. Time series databases are designed to handle large volumes of sequential data, making them ideal for tracking ad events that are timestamped.
Transforming raw ad-click and ad-shown log lines into a format suitable for a Time Series Database (TSDB) involves parsing the logs, extracting relevant information, and structuring it according to the TSDB’s schema. This process also includes timestamp normalization, tag extraction, and field assignment.
1. Raw Log Lines
Consider the following raw log lines for ad-click and ad-shown events:
2024-01-06T12:00:00.123Z | CLICK | AdId:12345 | UserId:67890 | Device:Mobile | Region:North America
2024-01-06T12:00:00.321Z | SHOW | AdId:12345 | UserId:67890 | Device:Mobile | Region:North America
2. Parsing and Extraction
Parse each line to extract the timestamp, event type, and other details:
- Timestamp: 2024-01-06T12:00:00.123Z (for the click event)
- Event Type: CLICK
- AdId: 12345
- UserId: 67890
- Device: Mobile
- Region: North America
3. Structuring for TSDB
Structure the parsed data into a format suitable for the TSDB:
Transformed Data for Ad-Click Event:
{
"timestamp": "2024-01-06T12:00:00.123Z",
"event_type": "CLICK",
"tags": {
"ad_id": "12345",
"user_id": "67890",
"device": "Mobile",
"region": "North America"
},
"fields": {
"count": 1
}
}
Transformed Data for Ad-Shown Event:
{
"timestamp": "2024-01-06T12:00:00.321Z",
"event_type": "SHOW",
"tags": {
"ad_id": "12345",
"user_id": "67890",
"device": "Mobile",
"region": "North America"
},
"fields": {
"count": 1
}
}
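A minimal parser sketch that turns a raw log line in the format above into the structure shown; field positions follow the example log format, and the names are illustrative.

def parse_log_line(line):
    """Turn '2024-01-06T12:00:00.123Z | CLICK | AdId:12345 | ...' into a TSDB point."""
    parts = [p.strip() for p in line.split("|")]
    timestamp, event_type = parts[0], parts[1]
    attrs = dict(p.split(":", 1) for p in parts[2:])   # {'AdId': '12345', ...}
    return {
        "timestamp": timestamp,
        "event_type": event_type,
        "tags": {
            "ad_id": attrs["AdId"],
            "user_id": attrs["UserId"],
            "device": attrs["Device"],
            "region": attrs["Region"],
        },
        "fields": {"count": 1},
    }

line = "2024-01-06T12:00:00.123Z | CLICK | AdId:12345 | UserId:67890 | Device:Mobile | Region:North America"
print(parse_log_line(line))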
Considerations in the Transformation Process:
- Timestamp Normalization: Ensuring that timestamps are in a uniform format, often UTC, for consistent time-based querying.
- Tag Extraction: Identifying key-value pairs that provide context to the event, such as ad_id, user_id, device, and region. These tags are indexed for efficient querying.
- Field Assignment: Allocating measurable values to fields, such as a count of 1 for each event. Fields usually store the data points that are queried and aggregated.
- Batch Processing vs. Stream Processing: Depending on the system architecture, this transformation can happen in batch (processing logs at intervals) or in real time (stream processing as logs are generated).
- Data Enrichment (Optional): Sometimes, additional processing might be done to enrich the data, like adding information about the ad campaign, categorizing devices, etc.
- Error Handling: Implementing robust error handling and validation to ensure data integrity.
Once transformed, these data points are sent to the TSDB, typically via an API or a data ingestion pipeline, where they are stored and made available for time-based querying and analysis. This transformation enables efficient data handling and analytics in real-time, which is crucial for dynamic ad performance monitoring and optimization.
Range Queries Against a TSDB
Time series databases excel in handling range queries, especially those based on time intervals. To retrieve data over specific time ranges (like the last 30 seconds, 1 minute, 5 minutes, etc.), you would typically use a query language specific to the time series database in use.
Example Queries
- Last 30 Seconds
SELECT ad_id, COUNT(*) as event_count
FROM ad_events
WHERE event_type = 'CLICK' AND
timestamp > NOW() - INTERVAL '30 seconds'
GROUP BY ad_id
ORDER BY event_count DESC
Considerations
- Indexing: Efficient indexing on timestamps and tags is crucial for fast query execution.
- Aggregation: Many time series databases support built-in aggregation functions (like COUNT, SUM, AVG) for summarizing data over time intervals.
- Scalability: Ensure the database can handle the high write throughput and query load typical in ad-tracking systems.
- Data Retention Policies: Implementing data retention policies to manage data growth and storage costs.
- Time Zone Handling: Ensure that the time zone of the TSDB and the query are aligned, especially if the TSDB stores timestamps in UTC.
- Performance Optimization: For large datasets, consider strategies to optimize query performance, such as indexing on timestamps and event types.
- Data Freshness: Ensure that the TSDB is receiving and processing data in real-time for accurate, up-to-date query results.
How Log Aggregation Works Exactly
Let’s start by looking at examples of ad-click and ad-shown log lines again. These log lines typically contain key information about each event, such as the time of the event, the user ID, the ad ID, and other relevant details.
Ad-Click Log Line Examples
2023-11-06T12:00:00.123Z | CLICK | AdId:12345 | UserId:67890 | Device:Mobile | Region:North America
2023-11-06T12:00:01.456Z | CLICK | AdId:54321 | UserId:98765 | Device:Desktop | Region:Europe
2023-11-06T12:00:02.789Z | CLICK | AdId:11223 | UserId:22334 | Device:Tablet | Region:Asia
Ad-Shown Log Line Examples
2023-11-06T12:00:00.321Z | SHOW | AdId:12345 | UserId:67890 | Device:Mobile | Region:North America
2023-11-06T12:00:01.654Z | SHOW | AdId:54321 | UserId:98765 | Device:Desktop | Region:Europe
2023-11-06T12:00:02.987Z | SHOW | AdId:11223 | UserId:22334 | Device:Tablet | Region:Asia
Log aggregation typically involves the following steps:
- Collection: Logs are collected from various sources (servers, devices, applications).
- Normalization: Logs are converted into a standardized format for easier processing.
- Transportation: Logs are transferred to a central system or database.
- Aggregation: Logs are combined, often by time stamps or event types, to provide a unified view.
- Analysis: Aggregated logs are analyzed for patterns, trends, or anomalies.
After aggregation, the data might be summarized in a way that provides insights into user behavior, ad performance, etc. Some examples of aggregated data:
Aggregated by Ad ID (Hourly)
Hour: 2024-01-06T12:00 - 2024-01-06T13:00
AdId: 12345 | Clicks: 150 | Shows: 300 | Click-Through Rate (CTR): 50%
AdId: 54321 | Clicks: 75 | Shows: 200 | Click-Through Rate (CTR): 37.5%
AdId: 11223 | Clicks: 50 | Shows: 250 | Click-Through Rate (CTR): 20%
Aggregated by Region (Daily)
Date: 2024-01-06
Region: North America | Total Clicks: 1000 | Total Shows: 3000 | CTR: 33.3%
Region: Europe | Total Clicks: 800 | Total Shows: 2500 | CTR: 32%
Region: Asia | Total Clicks: 900 | Total Shows: 2700 | CTR: 33.3%
Aggregated by Device Type (Monthly)
Month: January 2024
Device: Mobile | Total Clicks: 5000 | Total Shows: 15000 | CTR: 33.3%
Device: Desktop | Total Clicks: 4500 | Total Shows: 14000 | CTR: 32.1%
Device: Tablet | Total Clicks: 4000 | Total Shows: 12000 | CTR: 33.3%
These examples illustrate how log aggregation can turn raw log data into actionable insights, helping advertisers and publishers understand ad performance across different dimensions such as time, region, or device type.
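To make the aggregation step concrete, here is a minimal sketch that computes per-ad clicks, shows, and CTR from already-parsed events — the same kind of output as the hourly table above, with time bucketing omitted for brevity.

from collections import defaultdict

def aggregate_by_ad(events):
    """events: iterable of (event_type, ad_id) pairs; returns per-ad clicks, shows, CTR."""
    stats = defaultdict(lambda: {"clicks": 0, "shows": 0})
    for event_type, ad_id in events:
        if event_type == "CLICK":
            stats[ad_id]["clicks"] += 1
        elif event_type == "SHOW":
            stats[ad_id]["shows"] += 1
    for ad_id, s in stats.items():
        s["ctr"] = s["clicks"] / s["shows"] if s["shows"] else 0.0
    return dict(stats)

events = [("SHOW", "12345"), ("SHOW", "12345"), ("CLICK", "12345"), ("SHOW", "54321")]
print(aggregate_by_ad(events))
# {'12345': {'clicks': 1, 'shows': 2, 'ctr': 0.5}, '54321': {'clicks': 0, 'shows': 1, 'ctr': 0.0}}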
Considerations to Ensure High Availability
1. Ensuring that servers are always up and maintaining high availability in an ad-click system involves implementing a robust and resilient infrastructure.
- Add Redundancy: Use duplicate components to avoid single points of failure.
- Load Balancing: Distribute traffic evenly across servers.
- Geographical Distribution: Host servers in different locations to mitigate regional outages.
- Auto-Scaling: Dynamically adjust resources based on demand.
- Disaster Recovery Plan: Have backups and a failover strategy for major incidents.
2. Avoiding the re-processing of a message by multiple data aggregation services in a message queue system involves implementing strategies for idempotent message handling and acknowledgment.
- Unique Message Identification: Each message in the queue should have a unique identifier, so the processing status of each message can be tracked individually.
- Message Acknowledgment: Once a data aggregation service processes a message, it should send an acknowledgment to the message queue, informing the queue that the message has been successfully processed and can be safely removed or marked as processed.
- Visibility Timeout: Set a visibility timeout for each message; when a service picks it up, the message becomes invisible to other services during this period, preventing duplicate processing.
- Dead Letter Queue: Use a dead letter queue for messages that repeatedly fail processing, to separate problematic messages and prevent them from clogging the system.
- Concurrency Control: Control the number of messages a service processes at a time, to prevent overwhelming the service and reduce the chance of duplicate processing.
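A minimal sketch of idempotent consumption based on the unique message identifier described above; the queue, acknowledgment, and aggregation calls are hypothetical stand-ins.

processed_ids = set()   # in production this would be a shared store (e.g., Redis) with a TTL

def handle_message(message):
    """Process a message at most once, then acknowledge it."""
    if message["message_id"] in processed_ids:
        ack(message)                  # duplicate delivery: acknowledge without reprocessing
        return
    aggregate(message["payload"])     # hypothetical aggregation step
    processed_ids.add(message["message_id"])
    ack(message)                      # tell the queue the message is done

def aggregate(payload):
    print("aggregating", payload)

def ack(message):
    print("acked", message["message_id"])

handle_message({"message_id": "m-1", "payload": {"ad_id": "12345", "event": "CLICK"}})
handle_message({"message_id": "m-1", "payload": {"ad_id": "12345", "event": "CLICK"}})  # duplicate, ignored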
3. Managing hotspot issues in a high-traffic system
- Identify Hotspots and Benchmarking: Use monitoring tools to detect uneven load distribution and conduct benchmarking tests to understand the system’s performance limits.
- Distributed Database Design: Choose a distributed database capable of handling balanced workloads across all nodes and implement intelligent query routing for efficient traffic management.
- Message Queue Management: Apply strategies like pipelining for parallel processing, batching for efficient message handling, and carefully managed partitioning to distribute load evenly across the system.
- Dynamic Scaling of Data Aggregators: Implement auto-scaling for data aggregators based on real-time load, allowing the system to adapt to changing traffic conditions and prevent overloading of individual components.
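One common way to relieve a hot partition (for example, a single very popular ad_id) is to salt the partition key; a sketch, where the number of salts is an assumed tuning knob.

import random

NUM_SALTS = 8   # assumed tuning knob: how many sub-partitions a hot key is spread across

def partition_key(ad_id, hot_ad_ids):
    """Append a random salt to hot keys so their traffic spreads over several partitions."""
    if ad_id in hot_ad_ids:
        return f"{ad_id}#{random.randrange(NUM_SALTS)}"
    return ad_id

# At aggregation time, the salt is stripped so per-ad results can be re-combined.
def unsalted(key):
    return key.split("#", 1)[0]

print(partition_key("12345", hot_ad_ids={"12345"}))   # e.g. '12345#3'
print(unsalted("12345#3"))                            # '12345'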
4. Ensuring accurate timestamps from different systems in an ad-click system requires synchronization and standardization across the entire infrastructure.
- Use Network Time Protocol (NTP): Synchronizes clocks of all devices in the network to a common time source.
- Standardize Time Zones: Use a standard time zone across all systems, preferably UTC (Coordinated Universal Time).
- Consistent Timestamp Format: Adopt a consistent timestamp format across all systems (e.g., ISO 8601).
- Server Clock Drift Management: Regularly monitor and manage clock drift in server hardware, with automated corrections and periodic checks to ensure clocks remain accurate.
- Timestamp at the Source: Ensure that timestamps are generated as close to the event source as possible, ideally at the time the event occurs.
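A small sketch of timestamping at the source, in UTC, with a consistent ISO 8601 format:

from datetime import datetime, timezone

def event_timestamp():
    """Generate the event timestamp at the source, in UTC, ISO 8601 with millisecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")

print(event_timestamp())   # e.g. '2024-01-06T12:00:00.123+00:00'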
Global Local Aggregation and Split Distinct Aggregation
Global Local Aggregation and Split Distinct Aggregation are techniques used in distributed computing and data processing, particularly in scenarios involving large-scale data aggregation. These methods are designed to handle massive datasets efficiently by breaking down the aggregation process into manageable parts. Here’s a brief overview of each:
Global Local Aggregation
The goal is to aggregate data in a distributed system where data is scattered across multiple nodes or servers.
- Local Aggregation: Each node performs an aggregation operation on its local data. This step reduces the volume of data by summarizing it locally.
- Global Aggregation: The results from all nodes are then aggregated globally to produce the final result. This step might involve combining sums, averages, counts, etc., from each node.
- Use Case: Useful in distributed databases or big data platforms where data is not stored centrally but needs to be aggregated across the network.
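A minimal sketch of local-then-global aggregation, with plain Python lists standing in for the data held on each node:

from collections import Counter

def local_aggregate(partition):
    """Step 1: each node counts clicks per ad over its own slice of the data."""
    return Counter(event["ad_id"] for event in partition if event["type"] == "CLICK")

def global_aggregate(local_results):
    """Step 2: a coordinator merges the per-node counts into the final result."""
    total = Counter()
    for counts in local_results:
        total.update(counts)
    return total

partitions = [
    [{"type": "CLICK", "ad_id": "12345"}, {"type": "SHOW", "ad_id": "12345"}],
    [{"type": "CLICK", "ad_id": "12345"}, {"type": "CLICK", "ad_id": "54321"}],
]
print(global_aggregate(local_aggregate(p) for p in partitions))
# Counter({'12345': 2, '54321': 1})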
Split Distinct Aggregation
The goal is to efficiently process aggregation queries that involve distinct counts or operations in a distributed system.
- Splitting: The distinct aggregation operation is split into smaller, more manageable tasks. This might involve dividing the dataset based on certain criteria or distributing the workload across multiple nodes.
- Partial Aggregation: Each part of the split operation is processed separately, often in parallel. This might involve counting distinct values within each subset of the data.
- Combining Results: The partial results from each task are then combined to form the final aggregation result.
- Use Case: Particularly valuable in scenarios where you need to count distinct values (like unique visitors or customers) across a large, distributed dataset.
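A minimal sketch of split distinct aggregation for counting unique users: each split produces a partial set of user IDs and the final step unions them. At scale, an approximate structure such as HyperLogLog would typically replace the exact sets.

def partial_distinct(partition):
    """Each split/node collects the distinct user IDs it has seen."""
    return {event["user_id"] for event in partition}

def combine_distinct(partials):
    """The partial sets are unioned to get the exact distinct count."""
    seen = set()
    for p in partials:
        seen |= p
    return len(seen)

partitions = [
    [{"user_id": "67890"}, {"user_id": "22334"}],
    [{"user_id": "67890"}, {"user_id": "98765"}],
]
print(combine_distinct(partial_distinct(p) for p in partitions))   # 3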