Saturday, March 8, 2014

HashMap - Importance of Capacity and Load Factor



Java's HashMap is an implementation of the hash table data structure (an array into which data is inserted using a hash function* is called a hash table), and it provides constant-time O(1) performance for search (get) and insertion (put) operations. Its performance is affected by two parameters: capacity and load factor. In HashMap, the default capacity is 16 and the default load factor is 0.75. A hash table is highly recommended if you do not need to retrieve items in order and you can predict your data size, because its efficiency depends on the load factor and the initially declared capacity.
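As a quick, self-contained illustration (the keys and values here are made up), both put and get are expected constant-time operations regardless of how many entries the map holds:

import java.util.HashMap;
import java.util.Map;

public class HashMapBasics {
    public static void main(String[] args) {
        // Uses the default capacity (16) and default load factor (0.75)
        Map<String, Integer> ages = new HashMap<>();

        // put: expected O(1) insertion
        ages.put("alice", 30);
        ages.put("bob", 25);

        // get: expected O(1) lookup
        System.out.println(ages.get("alice")); // prints 30
        System.out.println(ages.get("carol")); // prints null (key not present)
    }
}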
If we look carefully at the HashMap constructors, we can see these parameters.


HashMap()
Constructs an empty HashMap with the default initial capacity (16) and the default load factor (0.75).

HashMap(int initialCapacity)
Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75).


HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
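For example (a small sketch; the capacity of 64 and the load factor of 0.9 are arbitrary values chosen for illustration):

import java.util.HashMap;
import java.util.Map;

public class HashMapConstructors {
    public static void main(String[] args) {
        // Default initial capacity (16) and default load factor (0.75)
        Map<String, String> a = new HashMap<>();

        // Initial capacity of 64 buckets, default load factor (0.75)
        Map<String, String> b = new HashMap<>(64);

        // Initial capacity of 64 buckets and a load factor of 0.9
        Map<String, String> c = new HashMap<>(64, 0.9f);
    }
}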

Let's see how capacity and load factors are defined.


1. Capacity C is the number of buckets in the hash table.
2. Load factor LF is a measure of how full the hash table is allowed to get before its capacity is automatically increased. LF is the number of items in the table divided by the table size. The load factor should not be too high or too low: too high a value increases collisions and slows down lookups, while too low a value wastes memory and causes the table to resize sooner; a moderate value is best.
 
L.F. =  N / Capacity.

3. Minimal Rehashing: We need to minimize rehashing, which directly depends on the load factor.


3.1 How rehashing occurs:
When the number of entries/elements in the hash table (N) exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets. 

                                 Rehashing occurs if N > L.F. * Capacity.

3.2 Minimizing rehashing:
Performance can be improved by minimizing rehash operations, because every rebuild of the internal data structure adds overhead. If the initial capacity is greater than the maximum expected number of key-value pairs divided by the load factor, no rehash operation will ever occur. So whenever we set the initial capacity, we should keep the load factor and the expected number of key-value pairs in mind, as the sketch below shows.
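Here is a minimal sketch of that sizing rule (the figure of 1,000 expected entries is just an assumption for illustration):

import java.util.HashMap;
import java.util.Map;

public class PresizedHashMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000;   // N: how many key-value pairs we plan to store
        float loadFactor = 0.75f;      // the default load factor

        // Rehashing occurs when N > L.F. * Capacity,
        // so choose an initial capacity of at least N / L.F.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);

        // All 1,000 insertions now fit without triggering a rehash.
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
    }
}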


* The purpose of a hash function is to take a range of key values and transform them
into index values in such a way that the key values are distributed randomly across
all the indices of the hash table.
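As a rough sketch of that idea (simplified; Java's real HashMap additionally mixes the high bits of the hash code and uses a power-of-two table size so it can mask instead of using the modulo operator):

public class BucketIndexSketch {
    // Simplified illustration: map a key's hash code to one of 'capacity' bucket indices.
    static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();
        // Force a non-negative value, then reduce it into the table's index range.
        return (h & 0x7fffffff) % capacity;
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("apple", 16));
        System.out.println(bucketIndex("banana", 16));
    }
}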

References:
http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html