Building Traffalyzer: A Step-by-Step Guide to Web Traffic Log Analysis

Introduction

Traffalyzer is a Python-based tool that processes web traffic logs, aggregates data per IP address, and generates detailed reports in CSV or JSON format. In this article, we'll walk through the creation of Traffalyzer, discussing why certain design choices were made and how each component works.

Project Overview

The project is divided into several key modules:

  • log_generator.py: Simulates web traffic logs by generating entries with random data.
  • log_processor.py: Parses each log line and extracts the necessary fields.
  • report_generator.py: Aggregates data per IP, calculates totals and percentages, and creates a report.
  • report_analyzer.py: Reads the report and prints a nicely formatted top-N ranking of IP addresses.

Each module plays a critical role in building a professional log analysis tool.

Module 1: Log Generator

The log_generator.py module simulates one day of web traffic. Each log line is formatted as:

TIMESTAMP;BYTES;STATUS;REMOTE_ADDR

For example:

2025-02-15T00:00:17;3708;OK;105.148.108.58

Generating IP Addresses

Initially, we generate random IP addresses:

import random

def generate_random_ip() -> str:
    """Generate a random IPv4 address."""
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

However, using a uniform random generator for every request leads to an overly homogeneous distribution. In real-world traffic, only a subset of IPs (users) generate most of the traffic. To simulate this, we create a pool of IPs with weights:

from typing import Tuple, List

def generate_ip_pool(pool_size: int = 100) -> Tuple[List[str], List[float]]:
    """
    Generate a pool of random IP addresses along with corresponding weights.
    Some IPs will be 'more popular' than others.
    """
    ip_pool = [generate_random_ip() for _ in range(pool_size)]
    weights = [random.expovariate(1.0) for _ in range(pool_size)]
    total = sum(weights)
    weights = [w / total for w in weights]
    return ip_pool, weights

Then, when generating each log line, we use these weights:

def generate_log_line(timestamp: str, ip_pool: List[str], weights: List[float]) -> str:
    """
    Generate a single log line using weighted random selection for the IP.
    
    Format: TIMESTAMP;BYTES;STATUS;REMOTE_ADDR
    """
    bytes_value = random.randint(100, 5000)
    status = "OK" if random.random() < 0.9 else random.choice(["ERROR", "NOT_FOUND", "FAIL"])
    remote_addr = random.choices(ip_pool, weights=weights, k=1)[0]
    return f"{timestamp};{bytes_value};{status};{remote_addr}\n"

Using random.choices with weights makes some IPs appear far more often than others, creating a more realistic traffic pattern in which a handful of addresses dominate.
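
To put these pieces together, a minimal driver might look like the sketch below. The output file name, the date, and the randomized gap between requests are illustrative assumptions; the actual log_generator.py may schedule entries differently.

import random
from datetime import datetime, timedelta

def generate_day_of_logs(output_file: str = "access.log") -> None:
    """Simulate one day of traffic and write it to output_file (illustrative driver)."""
    ip_pool, weights = generate_ip_pool()
    current = datetime(2025, 2, 15)
    end = current + timedelta(days=1)
    with open(output_file, "w") as f:
        while current < end:
            f.write(generate_log_line(current.isoformat(timespec="seconds"), ip_pool, weights))
            # Advance by a small random gap so the timestamps look organic.
            current += timedelta(seconds=random.randint(1, 30))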

Module 2: Log Processor

The log_processor.py module is responsible for parsing each log line. It splits the line by semicolons and converts numeric fields appropriately:

from typing import Dict, Optional

def parse_log_line(line: str) -> Optional[Dict]:
    """
    Parse a single log line.
    Expected format: TIMESTAMP;BYTES;STATUS;REMOTE_ADDR
    Returns None for malformed lines.
    """
    parts = line.strip().split(';')
    if len(parts) != 4:
        return None
    timestamp, bytes_str, status, remote_addr = parts
    try:
        bytes_value = int(bytes_str)
    except ValueError:
        return None
    return {
        "timestamp": timestamp,
        "bytes": bytes_value,
        "status": status,
        "remote_addr": remote_addr
    }

This function ensures that only correctly formatted log entries are processed further.
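
As a usage sketch, a small helper (hypothetical, not part of the module shown above) can stream parsed entries from a file while silently discarding malformed lines:

from typing import Dict, Iterator

def read_log_entries(log_file: str) -> Iterator[Dict]:
    """Yield parsed entries from log_file, skipping lines parse_log_line rejects."""
    with open(log_file) as f:
        for line in f:
            entry = parse_log_line(line)
            if entry is not None:
                yield entry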

Module 3: Report Generator

The report_generator.py module aggregates the processed log data per IP address. For each address it calculates (see the sketch after this list):

  • Total number of requests per IP
  • Total bytes sent per IP
  • Percentages relative to the overall totals
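
A sketch of this aggregation, assuming the entry dictionaries produced by Module 2 (the function name and field names here are illustrative; report_generator.py may differ):

from collections import defaultdict
from typing import Dict, Iterable

def aggregate_by_ip(entries: Iterable[Dict]) -> Dict[str, Dict]:
    """Accumulate per-IP request and byte totals, then attach percentages."""
    stats: Dict[str, Dict] = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for entry in entries:
        ip_stats = stats[entry["remote_addr"]]
        ip_stats["requests"] += 1
        ip_stats["bytes"] += entry["bytes"]
    # Guard against empty input so the percentage division is always safe.
    total_requests = sum(s["requests"] for s in stats.values()) or 1
    total_bytes = sum(s["bytes"] for s in stats.values()) or 1
    for s in stats.values():
        s["percent_requests"] = 100.0 * s["requests"] / total_requests
        s["percent_bytes"] = 100.0 * s["bytes"] / total_bytes
    return dict(stats)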

Before writing the report, it also ensures the output directory exists:

import os

output_dir = os.path.dirname(output_file)
if output_dir:
    # exist_ok=True already covers the case where the directory exists.
    os.makedirs(output_dir, exist_ok=True)

After aggregating the data, the report is generated in either CSV or JSON format, depending on user preference. Sorting is done in descending order based on the number of requests.
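
The writing step could then look like the following sketch, reusing the stats dictionary from the aggregation sketch above; the fmt parameter and column names are assumptions, not necessarily the module's actual interface:

import csv
import json
from typing import Dict

def write_report(stats: Dict[str, Dict], output_file: str, fmt: str = "csv") -> None:
    """Write per-IP stats sorted by request count, descending."""
    rows = sorted(stats.items(), key=lambda item: item[1]["requests"], reverse=True)
    if fmt == "json":
        with open(output_file, "w") as f:
            json.dump([{"remote_addr": ip, **s} for ip, s in rows], f, indent=2)
    else:
        with open(output_file, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["remote_addr", "requests", "percent_requests", "bytes", "percent_bytes"])
            for ip, s in rows:
                writer.writerow([ip, s["requests"], f"{s['percent_requests']:.2f}",
                                 s["bytes"], f"{s['percent_bytes']:.2f}"])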

Module 4: Report Analyzer

Finally, the report_analyzer.py module reads the generated report and prints a top-N ranking of IP addresses. The output is formatted to display:

  • Rank
  • IP Address
  • Number of Requests
  • Percent of Total Requests
  • Total Bytes

A snippet of the formatted output looks like this:

Top 10 IP Addresses by Number of Requests
--------------------------------------------------------------------------------
Rank IP Address        Requests   Percent Requests     Total Bytes
--------------------------------------------------------------------------------
1    191.216.109.190     712552             10.01%      1816954322
...

This module allows a quick visual inspection of which IPs are driving the majority of the traffic.
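
One way to produce such a table is f-string alignment. The column widths below are guesses that approximate the snippet above, and print_top_n is a hypothetical name:

from typing import Dict

def print_top_n(stats: Dict[str, Dict], n: int = 10) -> None:
    """Print the top-n IPs by request count in a fixed-width table."""
    rows = sorted(stats.items(), key=lambda item: item[1]["requests"], reverse=True)[:n]
    print(f"Top {n} IP Addresses by Number of Requests")
    print("-" * 80)
    print(f"{'Rank':<5}{'IP Address':<18}{'Requests':>10}{'Percent Requests':>20}{'Total Bytes':>16}")
    print("-" * 80)
    for rank, (ip, s) in enumerate(rows, start=1):
        percent = f"{s['percent_requests']:.2f}%"
        print(f"{rank:<5}{ip:<18}{s['requests']:>10}{percent:>20}{s['bytes']:>16}")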

Design Considerations

  • Modularity:
    The project is divided into clear modules (log generation, processing, report generation, and analysis). This not only simplifies testing but also makes future enhancements easier.

  • Realistic Traffic Simulation:
    By introducing weighted random selection for IPs, we mimic a common phenomenon in web traffic where a small number of users account for a large fraction of requests.

  • Extensibility:
    Each module is designed to be independent, enabling you to replace or extend functionality (e.g., integrating with real log data sources or adding more sophisticated analysis).

Conclusion

Traffalyzer provides a robust foundation for web traffic log analysis. In this guide, we've covered the step-by-step development of Traffalyzer, from generating realistic logs to processing and analyzing them. By carefully designing each component, we've built a tool that not only solves the problem at hand but also serves as a practical example of modular Python programming.

The complete source code is hosted in the Traffalyzer repository.

Happy coding and exploring your web traffic data with Traffalyzer!