I’ve been rewriting little pieces of a scanner for a C compiler. For recognizing keywords, I’ve chosen to use a perfect hash function.
A perfect hash function is a hash function which requires only one lookup and does not have any collisions. You could also think of this as a sort of “table”, but with just mappings from only the keys to the values. A traditional approach to this uses GNU gperf. It generates C or C++ code (there is a difference) based on some list of keywords.
Let’s give this a try to get a sense of how fun it is. The number one rule of software is that you have to have fun.
Here’s an example of looking for a series of keywords:
for
each
do
end
begin
I call this file “a.gperf”. The “a” is for “Aiden”.
I then used the gperf
on my machine. It was just at /usr/bin/gperf
!
/* C code produced by gperf version 3.0.3 */
/* Command-line: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/gperf a.gperf */
/* Computed positions: -k'1' */
#if !((' ' == 32) && ('!' == 33) && ('"' == 34) && ('#' == 35) \
&& ('%' == 37) && ('&' == 38) && ('\'' == 39) && ('(' == 40) \
&& (')' == 41) && ('*' == 42) && ('+' == 43) && (',' == 44) \
&& ('-' == 45) && ('.' == 46) && ('/' == 47) && ('0' == 48) \
&& ('1' == 49) && ('2' == 50) && ('3' == 51) && ('4' == 52) \
&& ('5' == 53) && ('6' == 54) && ('7' == 55) && ('8' == 56) \
&& ('9' == 57) && (':' == 58) && (';' == 59) && ('<' == 60) \
&& ('=' == 61) && ('>' == 62) && ('?' == 63) && ('A' == 65) \
&& ('B' == 66) && ('C' == 67) && ('D' == 68) && ('E' == 69) \
&& ('F' == 70) && ('G' == 71) && ('H' == 72) && ('I' == 73) \
&& ('J' == 74) && ('K' == 75) && ('L' == 76) && ('M' == 77) \
&& ('N' == 78) && ('O' == 79) && ('P' == 80) && ('Q' == 81) \
&& ('R' == 82) && ('S' == 83) && ('T' == 84) && ('U' == 85) \
&& ('V' == 86) && ('W' == 87) && ('X' == 88) && ('Y' == 89) \
&& ('Z' == 90) && ('[' == 91) && ('\\' == 92) && (']' == 93) \
&& ('^' == 94) && ('_' == 95) && ('a' == 97) && ('b' == 98) \
&& ('c' == 99) && ('d' == 100) && ('e' == 101) && ('f' == 102) \
&& ('g' == 103) && ('h' == 104) && ('i' == 105) && ('j' == 106) \
&& ('k' == 107) && ('l' == 108) && ('m' == 109) && ('n' == 110) \
&& ('o' == 111) && ('p' == 112) && ('q' == 113) && ('r' == 114) \
&& ('s' == 115) && ('t' == 116) && ('u' == 117) && ('v' == 118) \
&& ('w' == 119) && ('x' == 120) && ('y' == 121) && ('z' == 122) \
&& ('{' == 123) && ('|' == 124) && ('}' == 125) && ('~' == 126))
/* The character set is not based on ISO-646. */
error "gperf generated tables don't work with this execution character set. Please report a bug to <bug-gnu-gperf@gnu.org>."
#endif
#define TOTAL_KEYWORDS 5
#define MIN_WORD_LENGTH 2
#define MAX_WORD_LENGTH 5
#define MIN_HASH_VALUE 2
#define MAX_HASH_VALUE 9
/* maximum key range = 8, duplicates = 0 */
#ifdef __GNUC__
__inline
#else
#ifdef __cplusplus
inline
#endif
#endif
static unsigned int
hash (str, len)
register const char *str;
register unsigned int len;
{
static unsigned char asso_values[] =
{
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10, 0,10,
0, 5, 0,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10,10,10,10,10,
10,10,10,10,10,10
};
return len + asso_values[(unsigned char)str[0]];
}
const char *
in_word_set (str, len)
register const char *str;
register unsigned int len;
{
static const char * wordlist[] =
{
"", "",
"do",
"for",
"",
"begin",
"", "",
"end",
"each"
};
if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
{
unsigned int key = hash (str, len);
if (key <= MAX_HASH_VALUE)
{
register const char *s = wordlist[key];
if (*str == *s && !strcmp (str + 1, s + 1))
return s;
}
}
return 0;
}
Wow, I do love a good block of GNU-style C code.
The initial mapping of characters is a quick check to ensure we’re dealing with a character set that works with how C is defined. If it isn’t, then we just rage quit.
Let’s see how we can arrive at an output.
Assume we start with the string “begin”. We know this is going to be inside the set.
hash("begin", 5)
is a fun function call.
The length is 5 and the first char is “b”, which maps to 98. The asso_values
array at 98 value ends up being 0. We add this to 5 to get 5. When we lookup the value in the array and compare it to the original "begin"
, it is equal.
Consider now a word like "bagin"
. This would end up evaluating to the same index in the wordlist
array. We need to run a linear comparison on the strings to see where they differ. Essentially the worst case is one where we have all but the last character is the same as the string.
This is a pretty simple thing to understand, so let’s move on to what I’ve been doing in my compiler.
Let’s take a peek at some example code in Rust.
use phf::{phf_map, Map};
static MY_MAP: Map<&'static str, u32> = phf_map! {
"hello" => 1,
"world" => 2,
};
fn main() {
println!("{}", MY_MAP["hello"]);
}
Now we use cargo expand
to look more into depth.
#![feature(prelude_import)]
#[prelude_import]
use std::prelude::rust_2024::*;
#[macro_use]
extern crate std;
use phf::{phf_map, Map};
static MY_MAP: Map<&'static str, u32> = phf::Map {
key: 12913932095322966823u64,
disps: &[(0u32, 0u32)],
entries: &[("hello", 1), ("world", 2)],
};
fn main() {
{
::std::io::_print(format_args!("{0}\n", MY_MAP["hello"]));
};
}
We have some partially explained pieces here. Since println!
is a macro, we expand it. This seems to suggest we’re working with the phf::Map
.
The key looks pretty cool too.
Based on what I read on the docs page, the immutable map is written at compile time by the Rust compiler.
We can notice this by running hexfiend or some other binary inspector and then look for the words “hello” and “world”.
objdump -s target/release/a.out | grep -C 5 "world"
10003a608 48ffff17 46050000 50000018 45ffff17 H...F...P...E...
10003a618 55050000 50000018 42ffff17 68050000 U...P...B...h...
10003a628 50000018 3fffff17 77050000 50000018 P...?...w...P...
10003a638 3cffff17 87050000 <.......
Contents of section __TEXT,__const:
10003a640 68656c6c 6f776f72 6c640a00 00000000 helloworld......
10003a650 00000000 00000000 00040000 00000000 ................
10003a660 00000000 00000000 08000000 00000000 ................
10003a670 01000000 00000000 38000000 00000000 ........8.......
10003a680 00020000 00000000 00000000 00000000 ................
10003a690 00000000 00000000 04000000 00000000 ................