# POSTS / A Charming Algorithm for Count-Distinct

The paper *Distinct Elements in Streams: An Algorithm for the (Text) Book* by Chakraborty, Vinodchandran, and Meel introduces an algorithm for estimating the number of distinct elements in a stream. Saying `stream` here, rather than `array` or `list`, means that the number of elements may be large and unknown in advance, so trying to store them all may be impractical.

Trivial approaches would sort the elements or insert them into a hash table to get the exact count. Here, however, we want to *estimate* the number of distinct elements, not count them precisely.

The algorithm's development starts from the trivial hash-table method.

```
function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    seen.add(value);
  }
  return seen.size;
}
```

This solution has to store every element we've seen. If we want to save memory, say by storing only about half of the distinct elements, we could insert each element with probability $\frac{1}{2}$ and double the final count. The idea is intuitive, but a simple analysis shows why it is wrong.

```
function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    if (Math.random() < 0.5) {
      seen.add(value);
    }
  }
  return seen.size * 2;
}
```

Here, every time we meet an element we insert it with probability $\frac{1}{2}$, so a distinct element that appears $n$ times in the stream ends up in the hash table with probability $1-\frac{1}{2^{n}}$, not $\frac{1}{2}$. That is why `seen.size * 2` does not estimate the true count. What's more, if each element shows up several times, we still end up storing most of them.
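To see the bias concretely, here is a small simulation (my own illustration, not from the paper): it estimates the probability that an element appearing $n$ times survives the naive filter.

```
function survivalRate(n, trials) {
  // Fraction of trials in which an element seen `n` times ends up stored.
  let kept = 0;
  for (let t = 0; t < trials; t++) {
    let inSet = false;
    for (let i = 0; i < n; i++) {
      // Naive rule: once added, the element is never removed.
      if (Math.random() < 0.5) inSet = true;
    }
    if (inSet) kept++;
  }
  return kept / trials;
}

console.log(survivalRate(1, 100000)); // ≈ 0.5, as intended
console.log(survivalRate(5, 100000)); // ≈ 1 - 1/2^5 = 0.96875, far above 0.5
```

An element that appears five times is almost certainly stored, which is exactly what breaks the `seen.size * 2` estimate.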

There is an easy fix: make only the *last* coin flip for each element count. On every occurrence we first remove the element and then re-insert it with probability $\frac{1}{2}$, so each distinct element ends up in the hash table with probability exactly $\frac{1}{2}$, no matter how many times it appears.

```
function countDistinct(list) {
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < 0.5) {
      seen.add(value);
    }
  }
  return seen.size * 2;
}
```
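We can check the fix the same way (again a sketch of my own, not from the paper): with the delete-then-reinsert rule, only the final coin flip matters, so the survival probability is $\frac{1}{2}$ however often the element appears.

```
function survivesFixed(n) {
  // Simulate one element's fate over `n` occurrences under the fixed rule.
  let inSet = false;
  for (let i = 0; i < n; i++) {
    inSet = false;                        // seen.delete(value)
    if (Math.random() < 0.5) inSet = true; // re-insert with probability 1/2
  }
  return inSet;
}

let kept = 0;
const trials = 100000;
for (let t = 0; t < trials; t++) {
  if (survivesFixed(5)) kept++;
}
console.log(kept / trials); // ≈ 0.5 even though the element appeared 5 times
```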

A generalized version takes the sampling probability as a parameter `p`: each element stays in the hash table with probability `p`, and the final size is scaled by $\frac{1}{p}$.

```
function countDistinct(list, p) {
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
  }
  return seen.size / p;
}
```

With correctness guaranteed, let's talk about managing memory. Although storing elements at a ratio of `p` reduces memory usage by a constant factor, the hash table can still grow very large. We need *a threshold* that caps the size of the hash table and defines how big is "too big."

The final trick of the algorithm is to pick $p$ dynamically. Whenever the hash table grows to the threshold, we switch to a smaller $p$ and re-filter the existing elements so that they, too, satisfy the new sampling probability.

```
function countDistinct(list, thresh) {
  let p = 1;
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
    if (seen.size === thresh) {
      // Elements now need to win an extra coin flip to be included in
      // the set. Every element in `seen` has already won the previous
      // coin flips, so it now has to win one more.
      seen = new Set([...seen].filter(() => Math.random() < 0.5));
      p *= 1 / 2;
    }
  }
  return seen.size / p;
}
```
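As a quick sanity check (the sample streams below are my own, not from the paper), we can run the final version. When the threshold exceeds the true distinct count, `p` never shrinks and the answer is exact; with a smaller threshold the answer becomes an approximation.

```
// countDistinct as defined above, repeated here so the example is self-contained.
function countDistinct(list, thresh) {
  let p = 1;
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
    if (seen.size === thresh) {
      seen = new Set([...seen].filter(() => Math.random() < 0.5));
      p *= 1 / 2;
    }
  }
  return seen.size / p;
}

// The threshold is larger than the distinct count, so p stays at 1.
const small = ["a", "b", "c", "a", "b", "a"];
console.log(countDistinct(small, 100)); // → 3, exactly

// 10,000 distinct values, each seen twice, with only 1,000 slots of memory.
const big = [];
for (let i = 0; i < 10000; i++) big.push(i, i);
console.log(countDistinct(big, 1000)); // typically within a few percent of 10000
```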

This post was written after reading the blog post *A Charming Algorithm for Count-Distinct*.