What is the distributed algorithm to determine the median of arrays of integers located on different computers?
In
a situation like there are a number of servers, each server is holding
a billion integers and medium of integers in all servers combined must
be found.
by Michael Harris
Suppose
you have a master node (or are able to use a consensus protocol to
elect a master from among your servers). The master first queries the
servers for the size of their sets of data, call this n, so that it
knows to look for the k = n/2 largest element.
The master then
selects a random server and queries it for a random element from the
elements on that server. The master broadcasts this element to each
server, and each server partitions its elements into those larger than
or equal to the broadcasted element and those smaller than the
broadcasted element.
Each server returns to the master the size
of the larger-than partition, call this m. If the sum of these sizes is
greater than k, the master indicates to each server to disregard the
less-than set for the remainder of the algorithm. If it is less than k,
then the master indicates to disregard the larger-than sets and updates
k = k - m. If it is exactly k, the algorithm terminates and the value
returned is the pivot selected at the beginning of the iteration.
If the algorithm does not terminate, recurse beginning with selecting a new random pivot from the remaining elements.
Analysis:
Let
n be the total number of elements and s be the number of servers.
Assume that the elements are roughly randomly and evenly distributed
among servers (each server has O(n/s) elements). In iteration i, we
expect to do about O(n/(s*2^i)) work on each server, as the size of each
servers element sets will be approximately cut in half (remember, we
assumed roughly random distribution of elements) and O(s) work on the
master (for broadcasting/receiving messages and adding the sizes
together). We expect O(log(n/s)) iterations. Adding these up over all
iterations gives an expected runtime of O(n/s + slog(n/s)), and assuming
s << sqrt(n) which is normally the case, this becomes simply (O(n/s)), which is the best you could possibly hope for.
Note also that this works not just for finding the median but also for finding the kth largest value for any value of k.
No comments:
Post a Comment