Tuesday, December 2, 2014

Hash table vs Binary search tree

from here

Hash table and binary search tree are two fundamental data structures in CS. When would you want to use one over another one? What is the difference? This is a common interview question that I had most frequently been asked.
Well, it is not easy to answer by one or two sentences. The main difference between hash table and trees is on two aspects: Implementation details and  behaviors and performance under different circumstance.
Let me start with implementation. Hash table uses hash function to assign index to input data and put data into array under corresponding index. If the size of hash table is 100, hash function can generate 0~99 to input data and store them into hash table. Theoretically,  time complexity of insertion and looking-up is constant time (O(1)).
Binary search tree is implemented as the rule that all left children’s values are less than root, while all right children’s value are greater than it. If the tree is balanced, it always takes O(log(n)) time to insert a new node or look up.O(log(n)) is not as fast as constant time but rather fast.  n is the total number in tree and log(n) is usually depth of tree. Notice that I mentioned if it is balanced tree, however there are sophisticated algorithm (e.g., RB tree) to maintain tree balanced.
It seems we can prefer hash table over tree, but it is not always the case. Hash table has significant drawbacks:
1. As more data input comes, there is huge probability that collision shows up (hash function maps different data to same index). There are two ways to handle collision. First is linear probing that implement hash table as array of linked list. In this case, worst time for insertion or retrieve or deletion is O(n) that all input data are mapped to same index. Besides, hash table need more space than number of input data.  Second way is open addressing. It would not consume more space than input data, but at worst case insertion and retrieve is still O(n), which is extremely slow than constant time.
2. You have to know approximate size of input data before initializing hash table. Otherwise you need to resize hash table which is a very time-consuming operation. For example, your hash table size is 100 and then you want to insert the 101st element. Not only the size of hash table is enlarged to 150, all element in hash table have to be rehashed. This insertion operation takes O(n).
3. The elements stored in hash table are unsorted. In certain circumstance, we want data to be stored with sorted order, like contacts in cell phone.
However, binary search tree performs well against hash table:
1. Binary search tree never meets collision, which means binary search tree can guarantee insertion, retrieve and deletion are implemented in O(log(n)), which is hugely fast than linear time. Besides, space needed by tree is exactly same as size of input data.
2. You do not need to know size of input in advance.
3. all elements in tree are sorted that in-order traverse takes O(n) time.
OK, let me make a summary.
If you know how many data to maintain, and have enough space to store hash table and do not need data to be sorted, hash table is always good choice. Because, hash table provides constant time operation for insertion, retrieve and deletion. On the other hand, if items will be consistently added, binary search tree’s O(log(n)) operation is acceptable, comparing with rehashing operation during running time.
Besides, If you actually do not know size of input items, but after inserting, most operations are item looking up, hash table is preferred due to constant retrieve time. However, if items are continuously added or removed, tree’s O(log(n)) insertion and deletion time are more suitable in this condition.
In a word, there is no one answer that hash table or tree is better. All we need to know is pros and cons of hash table and tree in different conditions. Best decision can be made with knowledge of benefits and trade-offs of these two structures.

No comments:

Post a Comment