Building Traffalyzer: A Step-by-Step Guide to Web Traffic Log Analysis

Introduction
Traffalyzer is a Python-based tool that processes web traffic logs, aggregates data per IP address, and generates detailed reports in CSV or JSON format. In this article, we'll walk through the creation of Traffalyzer, discussing why certain design choices were made and how each component works.
Project Overview
The project is divided into several key modules:
- log_generator.py: Simulates web traffic logs by generating entries with random data.
- log_processor.py: Parses each log line and extracts the necessary fields.
- report_generator.py: Aggregates data per IP, calculates totals and percentages, and creates a report.
- report_analyzer.py: Reads the report and prints a nicely formatted top-N ranking of IP addresses.
Each module plays a critical role in building a professional log analysis tool.
Module 1: Log Generator
The log_generator.py module simulates one day of web traffic. Each log line is formatted as:
TIMESTAMP;BYTES;STATUS;REMOTE_ADDR
For example:
2025-02-15T00:00:17;3708;OK;105.148.108.58
Generating IP Addresses
Initially, we generate random IP addresses:
    import random

    def generate_random_ip() -> str:
        """Generate a random IPv4 address."""
        return ".".join(str(random.randint(0, 255)) for _ in range(4))
However, using a uniform random generator for every request leads to an overly homogeneous distribution. In real-world traffic, only a subset of IPs (users) generate most of the traffic. To simulate this, we create a pool of IPs with weights:
    from typing import Tuple, List

    def generate_ip_pool(pool_size: int = 100) -> Tuple[List[str], List[float]]:
        """
        Generate a pool of random IP addresses along with corresponding weights.
        Some IPs will be 'more popular' than others.
        """
        ip_pool = [generate_random_ip() for _ in range(pool_size)]
        weights = [random.expovariate(1.0) for _ in range(pool_size)]
        total = sum(weights)
        weights = [w / total for w in weights]
        return ip_pool, weights
Then, when generating each log line, we use these weights:
    def generate_log_line(timestamp: str, ip_pool: List[str], weights: List[float]) -> str:
        """
        Generate a single log line using weighted random selection for the IP.
        Format: TIMESTAMP;BYTES;STATUS;REMOTE_ADDR
        """
        bytes_value = random.randint(100, 5000)
        status = "OK" if random.random() < 0.9 else random.choice(["ERROR", "NOT_FOUND", "FAIL"])
        remote_addr = random.choices(ip_pool, weights=weights, k=1)[0]
        return f"{timestamp};{bytes_value};{status};{remote_addr}\n"
Using random.choices with weights introduces variability in the frequency of each IP, creating a more realistic traffic pattern.
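Putting the pieces together, the whole day can be generated with a simple driver loop. The sketch below restates the IP helper so it runs standalone; the function name, output filename, and one-request-per-second rate are illustrative assumptions, not the repository's exact API:

```python
import random
from datetime import datetime, timedelta

def generate_random_ip() -> str:
    """Generate a random IPv4 address."""
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def generate_day_of_logs(path: str = "access.log", pool_size: int = 100,
                         seconds_per_day: int = 86_400) -> None:
    """Write one simulated day of traffic, one log line per second."""
    ip_pool = [generate_random_ip() for _ in range(pool_size)]
    weights = [random.expovariate(1.0) for _ in range(pool_size)]
    start = datetime(2025, 2, 15)
    with open(path, "w") as f:
        for s in range(seconds_per_day):
            timestamp = (start + timedelta(seconds=s)).isoformat()
            bytes_value = random.randint(100, 5000)
            status = "OK" if random.random() < 0.9 else random.choice(
                ["ERROR", "NOT_FOUND", "FAIL"])
            remote_addr = random.choices(ip_pool, weights=weights, k=1)[0]
            f.write(f"{timestamp};{bytes_value};{status};{remote_addr}\n")
```

Because the weights are drawn from an exponential distribution and normalized once up front, the same handful of "heavy" IPs keeps reappearing across the whole day, which is exactly the skew we want.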
Module 2: Log Processor
The log_processor.py module is responsible for parsing each log line. It splits the line by semicolons and converts numeric fields appropriately:
    def parse_log_line(line: str):
        """
        Parse a single log line.
        Expected format: TIMESTAMP;BYTES;STATUS;REMOTE_ADDR
        """
        parts = line.strip().split(';')
        if len(parts) != 4:
            return None
        timestamp, bytes_str, status, remote_addr = parts
        try:
            bytes_value = int(bytes_str)
        except ValueError:
            return None
        return {
            "timestamp": timestamp,
            "bytes": bytes_value,
            "status": status,
            "remote_addr": remote_addr
        }
This function ensures that only correctly formatted log entries are processed further.
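A couple of example calls make the contract concrete: a well-formed line yields a dict with `bytes` already converted to an integer, while a malformed line yields None instead of raising. The function is repeated here only so the snippet runs standalone:

```python
def parse_log_line(line: str):
    """Parse a TIMESTAMP;BYTES;STATUS;REMOTE_ADDR line, or return None."""
    parts = line.strip().split(';')
    if len(parts) != 4:
        return None
    timestamp, bytes_str, status, remote_addr = parts
    try:
        bytes_value = int(bytes_str)
    except ValueError:
        return None
    return {"timestamp": timestamp, "bytes": bytes_value,
            "status": status, "remote_addr": remote_addr}

entry = parse_log_line("2025-02-15T00:00:17;3708;OK;105.148.108.58")
# entry["bytes"] is the integer 3708, ready for arithmetic downstream
bad = parse_log_line("2025-02-15T00:00:17;not-a-number;OK;1.2.3.4")
# bad is None: malformed lines are silently dropped, not fatal
```

Returning None for bad input keeps the processing loop trivial: callers just skip falsy results rather than wrapping every line in try/except.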
Module 3: Report Generator
The report_generator.py module aggregates the processed log data per IP address. It calculates:
- Total number of requests per IP
- Total bytes sent per IP
- Percentages relative to the overall totals
Before writing the report, it also ensures the output directory exists:
    import os

    output_dir = os.path.dirname(output_file)
    if output_dir:
        # exist_ok=True makes a separate os.path.exists() check unnecessary
        os.makedirs(output_dir, exist_ok=True)
After aggregating the data, the report is generated in either CSV or JSON format, depending on user preference. Sorting is done in descending order based on the number of requests.
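The aggregation and output steps described above can be sketched as follows. The function name, field names, and two-decimal rounding are assumptions for illustration; the real module may differ in its exact shape:

```python
import csv
import json
from collections import defaultdict

def generate_report(entries, output_file: str, fmt: str = "csv") -> None:
    """Aggregate parsed entries per IP and write a CSV or JSON report."""
    # Tally request count and byte total per IP address.
    stats = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for e in entries:
        stats[e["remote_addr"]]["requests"] += 1
        stats[e["remote_addr"]]["bytes"] += e["bytes"]

    total_requests = sum(s["requests"] for s in stats.values())
    total_bytes = sum(s["bytes"] for s in stats.values())

    rows = [
        {
            "remote_addr": ip,
            "requests": s["requests"],
            "requests_pct": round(100 * s["requests"] / total_requests, 2),
            "bytes": s["bytes"],
            "bytes_pct": round(100 * s["bytes"] / total_bytes, 2),
        }
        for ip, s in stats.items()
    ]
    # Sort in descending order by number of requests, as the article specifies.
    rows.sort(key=lambda r: r["requests"], reverse=True)

    if fmt == "json":
        with open(output_file, "w") as f:
            json.dump(rows, f, indent=2)
    else:
        with open(output_file, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```

Computing the percentages at report time (rather than during the tally) means a single pass over the totals suffices, and the same rows dict serves both output formats.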
Module 4: Report Analyzer
Finally, the report_analyzer.py module reads the generated report and prints a top-N ranking of IP addresses. The output is formatted to display:
- Rank
- IP Address
- Number of Requests
- Percent of Total Requests
- Total Bytes
A snippet of the formatted output looks like this:

    Top 10 IP Addresses by Number of Requests
    --------------------------------------------------------------------------------
    Rank   IP Address           Requests   Percent Requests   Total Bytes
    --------------------------------------------------------------------------------
    1      191.216.109.190        712552             10.01%    1816954322
    ...
This module allows a quick visual inspection of which IPs are driving the majority of the traffic.
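The fixed-width columns above can be produced with Python's format specifiers. This is a minimal sketch assuming the report rows have already been loaded into dicts with the field names used earlier; the function name and column widths are illustrative:

```python
def print_top_n(rows, n: int = 10) -> None:
    """Print a top-N ranking of IPs by request count."""
    rows = sorted(rows, key=lambda r: r["requests"], reverse=True)[:n]
    print(f"Top {n} IP Addresses by Number of Requests")
    print("-" * 80)
    print(f"{'Rank':<7}{'IP Address':<20}{'Requests':>10}"
          f"{'Percent Requests':>19}{'Total Bytes':>14}")
    print("-" * 80)
    for rank, r in enumerate(rows, start=1):
        # <N left-aligns, >N right-aligns, .2f fixes two decimal places.
        print(f"{rank:<7}{r['remote_addr']:<20}{r['requests']:>10}"
              f"{r['requests_pct']:>18.2f}%{r['bytes']:>14}")
```

Sorting again inside the analyzer makes it robust to reports whose row order was changed by hand, at negligible cost for report-sized data.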
Design Considerations
- Modularity: The project is divided into clear modules (log generation, processing, report generation, and analysis). This not only simplifies testing but also makes future enhancements easier.
- Realistic Traffic Simulation: By introducing weighted random selection for IPs, we mimic a common phenomenon in web traffic where a small number of users account for a large fraction of requests.
- Extensibility: Each module is designed to be independent, enabling you to replace or extend functionality (e.g., integrating with real log data sources or adding more sophisticated analysis).
Conclusion
Traffalyzer provides a robust foundation for web traffic log analysis. In this guide, we've covered the step-by-step development of Traffalyzer, from generating realistic logs to processing and analyzing them. By carefully designing each component, we've built a tool that not only solves the problem at hand but also serves as a practical example of modular Python programming.
Complete source code is hosted in the Traffalyzer repository.
Happy coding and exploring your web traffic data with Traffalyzer!