Journal

/

Overengineered search

Developed a suffix-array-based search engine for the site today. While a simple regex search was enough, couldn’t resist the technical elegance of a proper index.

Indexer: Implemented the indexer in Perl to crawl the HTML, lowercase the text, and encode it into UTF-8 bytes. Used a null byte sentinel to mark document boundaries and stored the lexicographically sorted 32-bit unsigned integer offsets to sa.bin:

my @sa = 0 .. (length($corpus) - 1);
{
    use bytes; # Force compare 8-bit Unicode value comparisons
    @sa = sort { 
        # First 64 bytes check (fast path)
        (substr($corpus, $a, 64) cmp substr($corpus, $b, 64)) || 
        # Full string fallback (required for correctness)
        (substr($corpus, $a) cmp substr($corpus, $b))
    } @sa;
}

32-bit offsets provide a 4 GB ceiling—overkill for a personal site, but comforting to have.

It takes about 50ms to index my 12-entry website on a T490. As the site grows, the O(L⋅N log N) sort could become a bottleneck. So I introduced a fast path that caps L at 64 bytes—roughly the size of a cache line on common hardware.

Search: Implemented the search in a FastCGI script as a textbook range query with two binary searches. Leveraged the fixed-width offsets for fast random access to the index:

seek($fh_sa, $mid * 4, 0);
read($fh_sa, my $bin_off, 4);
my $off = unpack("L", $bin_off);
seek($fh_cp, $off, 0);
read($fh_cp, my $text, $query_len);

Chose seek + read over mmap because it outperformed mmap for <1k files. At 10k, mmap was occasionally faster (~200 µs), but used more memory—possibly due to OpenBSD’s VM security trade-offs. Results may vary by OS.

Benchmarked on T490 (i7-10510U, OpenBSD 7.8, article size: 16 KB) against linear regex search:

=============================================================
SEARCH BENCHMARK: Suffix array vs. Linear regex
ARTICLE SIZE: 16 KB
=============================================================

500 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0012s  |             0.0407s
Peak RAM        |              8828 KB |              9136 KB
Indexing time   |             0.1475s  |                 N/A
Index size      |            204.94 KB |                 N/A
-------------------------------------------------------------

1,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0019s  |             0.0795s
Peak RAM        |              8980 KB |              9460 KB
Indexing time   |             0.3101s  |                 N/A
Index size      |            410.51 KB |                 N/A
-------------------------------------------------------------

10,000 files:
-------------------------------------------------------------
METRIC          | SA                   | REGEX
----------------+----------------------+---------------------
Search time     |             0.0161s  |             0.9120s
Peak RAM        |             12504 KB |             12804 KB
Indexing time   |            10.9661s  |                 N/A
Index size      |           4163.44 KB |                 N/A
-------------------------------------------------------------

Security: httpd, slowcgi, Perl are in the OpenBSD base system. Used file system permissions to govern access. Hardened the system by running it in chroot.

Resource exhaustion and XSS attacks are inherent. Limited concurrent searches using lock-file semaphores, and capped the query length (64 B) and the result set (20). Mitigated XSS by HTML-escaping all output using HTML::Escape.

At my current pace of twelve articles a year, this should work for the next 833 years. Penciled in the next release for Anno Domini 2859.

Commit: 6da102d | Benchmarks: 8a4da68