Padova · IT
C · Shipped

HiveParser: a multi-threaded log parser in C

SOURCE:// hiveParser

A weekend log parser that taught me what 'mechanical sympathy' actually feels like — heap locks, false sharing, and the moment eight threads finally stopped tripping over each other.

I was working through OSTEP at the time, somewhere around the concurrency chapters, and the textbook examples had stopped scratching the itch. I wanted something where I could actually watch threads step on each other — a program big enough that the locks I was reading about would show up as real, measurable pain. So I wrote a syslog parser in C and called it HiveParser.

The job is small on paper. Read a syslog file, pull out the severity field on each line, fold the counts into a report. The whole thing fits in a few hundred lines. The interesting part lives in how those lines arrange themselves around eight cores.
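To give a flavor of the per-line work, here is a minimal sketch of severity extraction assuming an RFC 3164-style <PRI> prefix, where severity is the priority modulo 8. HiveParser's actual line format isn't shown in this post, so treat the details as illustrative:

/* Illustrative only: pull the severity out of an RFC 3164-style
   "<PRI>" prefix. HiveParser's real line format may differ. */
#include <stdlib.h>

int parse_severity(const char *line) {
    if (line[0] != '<')
        return -1;                /* not a priority-prefixed line */
    int pri = atoi(line + 1);     /* atoi stops at the closing '>' */
    return pri % 8;               /* 0 = emerg ... 7 = debug */
}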

Starting where every concurrency tutorial starts

I built the obvious version first, the one every tutorial draws on the whiteboard. One producer, a shared queue behind a mutex, a pool of workers, and one global report struct guarded by another mutex. It worked. It was also slower with eight threads than with one, which is the kind of result that makes you sit up straight.

Initial design — every worker funnels through two locks before doing real work.
[Sequence diagram: the producer enqueues lines into the mutex-guarded queue; Workers A and B block on the queue lock, dequeue, then block again on the report mutex before updating the shared report.]
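The queue itself was the textbook mutex-and-condvar design. A minimal sketch of that shape, with hypothetical names, a fixed capacity, and overflow handling omitted:

/* Sketch of the mutex-and-condvar queue in the diagram above.
   Names and the fixed capacity are illustrative, not HiveParser's
   actual code; backpressure when full is omitted for brevity. */
#include <pthread.h>

#define QCAP 1024

typedef struct {
    char           *items[QCAP];   /* ring buffer of line pointers */
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
} queue_t;

void enqueue(queue_t *q, char *line) {
    pthread_mutex_lock(&q->lock);
    q->items[q->tail] = line;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

char *dequeue(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    char *line = q->items[q->head];   /* NULL if the producer pushed
                                         a shutdown sentinel */
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return line;
}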

Watching it under perf made the story plain. The threads spent most of their lives waiting on each other. The queue lock was a turnstile; the report lock was a second turnstile right behind it. Adding workers only added more people to the queue.

That part I expected. OSTEP had already warned me. The part I did not expect was how much of the cost lived somewhere else entirely.

The allocator was the real bottleneck

The hot loop looked like this:

void *worker(void *arg) {
    while (1) {
        char *line = dequeue(queue);   /* blocks on the queue mutex */
        if (!line) break;              /* assuming dequeue returns NULL
                                          once the producer is done */

        /* one heap allocation per line -- the hidden cost */
        log_entry_t *entry = malloc(sizeof(log_entry_t));
        parse_line(line, entry);

        pthread_mutex_lock(&lock);     /* the second turnstile */
        update_global_report(entry);
        pthread_mutex_unlock(&lock);

        free(entry);
        free(line);                    /* the producer heap-allocated it */
    }
    return NULL;
}

I had assumed malloc was free, or close enough. It is not. glibc’s allocator holds its own internal locks to keep the heap consistent across threads, and when eight workers all call malloc and free on every single line, they end up serializing through the allocator the same way they serialized through my queue. I had built a second contention point and decorated it with my own mutex on top.

I found this the unglamorous way: I added timing around each segment of the loop, expected update_global_report to dominate, and watched malloc win by a wide margin. That was the moment the project stopped being about pthreads and started being about memory.
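The harness was nothing fancier than clock_gettime around each segment, with per-thread accumulators dumped at exit. A sketch of that shape; the bucket names are mine, not the project's:

/* Sketch of per-segment timing with CLOCK_MONOTONIC. Bucket names
   and accumulators are illustrative, not HiveParser's actual code. */
#include <time.h>

static _Thread_local double alloc_ns, parse_ns, report_ns;

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

/* inside the worker loop: */
struct timespec t0, t1, t2, t3;
clock_gettime(CLOCK_MONOTONIC, &t0);
log_entry_t *entry = malloc(sizeof(log_entry_t));
clock_gettime(CLOCK_MONOTONIC, &t1);
parse_line(line, entry);
clock_gettime(CLOCK_MONOTONIC, &t2);
pthread_mutex_lock(&lock);
update_global_report(entry);
pthread_mutex_unlock(&lock);
clock_gettime(CLOCK_MONOTONIC, &t3);

alloc_ns  += elapsed_ns(t0, t1);
parse_ns  += elapsed_ns(t1, t2);
report_ns += elapsed_ns(t2, t3);   /* dump per-thread totals at exit */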

Moving the entry to the stack

The fix is small once you see it. A log_entry_t lives for exactly one iteration. It never escapes the worker. There is no reason for it to ever touch the heap.

void *worker(void *arg) {
    report_t local_report = {0};   /* per-thread tally, stack-allocated */

    while (1) {
        char *line = dequeue(queue);
        if (!line) break;          /* NULL sentinel: producer is done */

        log_entry_t entry;         /* lives on the stack for one iteration */
        parse_line(line, &entry);

        update_report(&entry, &local_report);
        free(line);
    }

    return merge_into_global(&local_report);   /* shared state, touched once */
}

Two changes. The entry moved onto the stack, so allocation became a stack-pointer bump. The per-thread report_t moved with it, so each worker accumulated into its own private struct and touched shared memory exactly once, on its way out, when merge_into_global folded its tally into the master report.

Final design — one shared queue feeds N independent thread-local pipelines that fold into the master report on join.
[Architecture diagram: a shared mutex-guarded work_queue of line pointers fans out to threads 1…N, each with a stack-allocated log_entry_t and local_report; the locals are folded once into the shared master_report as the workers finish and the main thread calls pthread_join.]

This is the pattern OSTEP calls a sloppy counter: each worker keeps its own tally and folds it into the global exactly once, when it finishes. The global mutex disappeared from the hot path entirely.
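For completeness, a sketch of what the fold might look like. report_t's layout here, a counts array indexed by severity, is my assumption for illustration, not the project's actual struct:

/* Hypothetical sketch of the fold step. The mutex is taken once per
   worker lifetime -- N times total -- instead of once per parsed
   line, so it stays off the hot path. */
#include <pthread.h>

#define SEVERITY_LEVELS 8   /* syslog severities 0 (emerg) .. 7 (debug) */

typedef struct { long counts[SEVERITY_LEVELS]; } report_t;

static report_t        master_report;
static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;

void *merge_into_global(report_t *local) {
    pthread_mutex_lock(&master_lock);
    for (int i = 0; i < SEVERITY_LEVELS; i++)
        master_report.counts[i] += local->counts[i];
    pthread_mutex_unlock(&master_lock);
    return NULL;   /* becomes the worker's pthread return value */
}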

What the numbers ended up looking like

After the refactor, scaling started behaving the way the textbook promised. Doubling the threads roughly halved the wall-clock time, until the box ran out of physical cores.

Wall-clock time vs thread count · 2.1M-line syslog · 8-core box.

  Threads   Wall-clock
  1         4.2 s
  2         2.2 s
  4         1.1 s
  8         0.6 s

Eight threads finished in 0.6 seconds against the single-threaded 4.2. The remaining gap from perfect linear scaling is the queue itself, which I left on a single mutex because the producer is fast and the workers spend most of their time parsing rather than dequeuing. A lock-free MPMC queue is the obvious next step, and also the point where the project stops being a weekend exercise and starts being a real engineering job, so I stopped.

What I’d tell past me

The lesson I keep coming back to is that the locks you write are usually visible to you, and the locks you inherit usually are not. I spent the first evening optimizing my own mutex and the second evening discovering that malloc had been the problem the whole time. Profile before you theorize. The allocator counts as part of your concurrency model whether you like it or not.

The other thing, smaller but worth saying out loud: stack allocation is genuinely free in a way that’s easy to forget after years of writing in higher-level languages. C will let you put a struct in a register-adjacent slot and walk away. Use that.