Most key-value databases, including KVdb, store keys in a lexicographic order. What does this mean?
Imagine you save some values to a bucket:
- Set
peach
- Set
apple
- Set
kiwi
What order will the keys be returned when enumerating the bucket?
apple
, kiwi
, peach
as expected. It’s basically dictionary order.
What about numbers?
Aha! Good question:
- Set
10
- Set
2
- Set
1
How are the keys returned?
1
, 10
, 2
Wait, what?
You’ve probably encountered this when using a sort function that uses lexicographic sort order:
$ node
> [10, 2, 1].sort()
[ 1, 10, 2 ]
It turns out that with numbers, this is usually never what you want. To understand this, we have to grok lexicographical sort order. In a dictionary, the word “apple” appears before “application” because the letter “e” comes before the letter “i” in the alphabet. Thus the 5th letter is the first significant difference in the two words.
If we apply this to the numbers “1” and “10”, both have “1” in common, but the first “1” is shorter, so it’s sorted first. Take “10” and “2”: “2” is shorter, but it’s lexicographic order is after the first digit of “10”. Yes, it means we are ordering numbers as if they were strings. How do we ensure that numbers are ordered naturally?
Let’s zero-pad them:
> [‘10’, ‘02’, ‘01’].sort()
[ ‘01’, ‘02’, ‘10’ ]
Naturally, this also works for date and time values that are encoded in a representation like ISO 8601, whose lexicographical and natural sort order are the same, due to the fixed-width format, assuming the year (the YYYY) doesn’t exceed the maximum 4-digit year 9999 😇
For negative numbers and decimal/floating point numbers, it’s better to encode them into a big endian binary format so they maintain the expected sort order.
Now that we understand key enumeration order, we can design our keys in such a way that makes data querying efficient (i.e., reduces the number of queries). Let’s imagine we’re building a leaderboard for a game and want to keep track of players’ scores. A first attempt could be:
player:1:score
=> 1000
player:2:score
=> 1350
But you’ll quickly realize if you have a million players, building a leaderboard requires iterating through every player’s score and then picking the top scores. If we apply what we just learned about lexicographic sorting, we can design our keys a bit differently, by setting an upper bound on scores. For the sake of this example, let’s make it 999,999,999 and encode scores using 9 digits:
score:000001000
=> {“player_id”: 1, score: 1000}
score:000001350
=> {“player_id”: 2, score: 1350}
To fetch the top 50 scores, we’ll enumerate keys in reverse order, so the highest lexicographically sorted key is returned first:
const leaders = await bucket.list({
prefix: ‘score:’,
reverse: true,
limit: 50,
values: true})
for(const [key, value] of leaders) {
console.log(`player: ${value.player_id} score: ${player.score}`)
}
OK, but what if two players are tied for the same score? Well, that presents a problem, since only one player can have a unique score at a time. Let’s suffix our keys with the player id to maintain uniqueness:
score:000001350:1
=> {“player_id”: 1, score: 1350}
score:000001350:2
=> {“player_id”: 2, score: 1350}
Now, you can have two players with the same score and show that in the leaderboard. Fair and square!
Thus you can see how taking advantage of lexicographical ordering of keys makes certain kinds of queries more natural. Typically, you might sort results in your application, but in many cases, you can push this task to the database and have it deal with ordering so your application just consumes the results.