November 1, 2019

Sorting Ordered Data in a Key-Value Store

We discuss key naming in the context of lexicographical sort order, which most key-value systems use, and how to design keys accordingly.

Most key-value databases, including KVdb, store keys in a lexicographic order. What does this mean?

Imagine you save some values to a bucket:

  1. Set peach
  2. Set apple
  3. Set kiwi

What order will the keys be returned when enumerating the bucket?

apple, kiwi, peach as expected. It’s basically dictionary order.

What about numbers?

Aha! Good question:

  1. Set 10
  2. Set 2
  3. Set 1

How are the keys returned?

1, 10, 2

Wait, what?

You’ve probably encountered this when using a sort function that uses lexicographic sort order:

$ node
> [10, 2, 1].sort()
[ 1, 10, 2 ]

It turns out that with numbers, this is usually never what you want. To understand this, we have to grok lexicographical sort order. In a dictionary, the word apple” appears before application” because the letter e” comes before the letter i” in the alphabet. Thus the 5th letter is the first significant difference in the two words.

If we apply this to the numbers 1” and 10”, both have 1” in common, but the first 1” is shorter, so it’s sorted first. Take 10” and 2”: 2” is shorter, but it’s lexicographic order is after the first digit of 10”. Yes, it means we are ordering numbers as if they were strings. How do we ensure that numbers are ordered naturally?

Let’s zero-pad them:

> [‘10’, ‘02’, ‘01’].sort()
[ ‘01’, ‘02’, ‘10’ ]

Naturally, this also works for date and time values that are encoded in a representation like ISO 8601, whose lexicographical and natural sort order are the same, due to the fixed-width format, assuming the year (the YYYY) doesn’t exceed the maximum 4-digit year 9999 😇

For negative numbers and decimal/floating point numbers, it’s better to encode them into a big endian binary format so they maintain the expected sort order.

Now that we understand key enumeration order, we can design our keys in such a way that makes data querying efficient (i.e., reduces the number of queries). Let’s imagine we’re building a leaderboard for a game and want to keep track of players’ scores. A first attempt could be:

player:1:score => 1000

player:2:score => 1350

But you’ll quickly realize if you have a million players, building a leaderboard requires iterating through every player’s score and then picking the top scores. If we apply what we just learned about lexicographic sorting, we can design our keys a bit differently, by setting an upper bound on scores. For the sake of this example, let’s make it 999,999,999 and encode scores using 9 digits:

score:000001000 => {“player_id”: 1, score: 1000}

score:000001350 => {“player_id”: 2, score: 1350}

To fetch the top 50 scores, we’ll enumerate keys in reverse order, so the highest lexicographically sorted key is returned first:

const leaders = await bucket.list({
  prefix: ‘score:’,
  reverse: true,
  limit: 50,
  values: true})

for(const [key, value] of leaders) {
  console.log(`player: ${value.player_id} score: ${player.score}`)
}

OK, but what if two players are tied for the same score? Well, that presents a problem, since only one player can have a unique score at a time. Let’s suffix our keys with the player id to maintain uniqueness:

score:000001350:1 => {“player_id”: 1, score: 1350}

score:000001350:2 => {“player_id”: 2, score: 1350}

Now, you can have two players with the same score and show that in the leaderboard. Fair and square!

Thus you can see how taking advantage of lexicographical ordering of keys makes certain kinds of queries more natural. Typically, you might sort results in your application, but in many cases, you can push this task to the database and have it deal with ordering so your application just consumes the results.


Tags: Key Value Design


Previous post
Sync localStorage into the Cloud We build a remote storage interface for the Web Storage API, so you can persist localStorage into a remote key-value store.

Like what you see? Give KVdb a spin.