Saturday, March 8, 2014

HashMap - Importance of Capacity and Load Factor



HashMap of Java is a implementation of HashTable (Array into which data is inserted using a hash function* is called HashTable) data structure which provides constant time O(1) performance for searching (get) and insertion (put) operations. It's performance is affected by two parameters such as Capacity and Load Factor.  In HashMap, default Capacity 16 ,  default Load Factor is 0.75. "High speed" Hash table is highly recommended  if you do not want to retrieve items in order and you can predict the your data size because its efficiency depends on load factor and data size (declared initially).
If we carefully look into the HashMap constructors, we can notice these params.


Constructs an empty HashMap with the default initial capacity (16) and the default load factor (0.75).

HashMap(int initialCapacity).
Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75).


HashMap(int initialCapacity, float loadFactor).
Constructs an empty HashMap with the specified initial capacity and load factor.

Let's see how capacity and load factors are defined.


1. Capacity C is the number of buckets in the hash table.
2. Load factor LF is a measure of how full the hash table is allowed to get before its capacity is automatically increased.  LF is the number of items in the table divided by table size. Load factor should not too high or too low. It should be moderate.                                     
 
L.F. =  N / Capacity.

3. Minimal Rehashing : We need to minimize rehashing which directly depends on Load Factor.


3.1. How rehashing occurs ? :
When the number of entries/elements in the hash table (N) exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets. 

                                 Rehashing occurs if N > L.F. * Capacity.

3.2 Minimize rehashing :
Performance can be increased if we minimize the rehash operation because every time rebuilt, a data structure will cause the performance overhead. So if our initial capacity is greater than the maximum number of element or key-value pair divided by load factor rehash operation can be minimized. whenever we set initial capacity always keep load factor and number of key value pair in mind.


* The purpose of a hash function is to take a range of key values and transform them
into index values in such a way that the key values are distributed randomly across
all the indices of the hash table.

References :
http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html


Sunday, May 26, 2013

Array List Vs Vector




Parameters
ArrayList
Vector
Synchronization

Not-synchronized. The list should be wrapped using the Collections.synchronizedList method

Synchronized
Data Growth Methods

 When we insert an element into an ArrayList or a Vector, the object will need to expand its internal array if it runs out of room. ArrayList increases its array size by 50 percent. 

newCapacity = (oldCapacity * 3)/2 + 1;
Vector defaults to doubling the size of its array.
newCapacity = (capacityIncrement > 0) ?
                (oldCapacity + capacityIncrement) : (oldCapacity * 2)
Default Size
Constructs an empty list with an initial capacity of 10.

Constructs an empty vector with internal data array size of 10 and its standard capacity increment is zero.

Traversal
Uses Iterator which are
 are fail-fast. i.e. when one thread changes the collection by add / remove operations , while another thread is traversing it through an Iterator using hasNext() or next() method, the iterator fails quickly by throwing ConcurrentModificationException . The fail-fast behavior of iterators can be used to detect bugs.
Uses Iterator/Enumerator. The Enumerations returned by the methods of classes like Hashtable, Vector are not fail-fast that is achieved by synchronizing the block of code inside the nextElement()method that locks the current Vector object which costs lots of time.





Tuesday, August 7, 2012

Comparison of Java Collections - Performance and other parameters


 I assume readers have basic knowledge about data general purpose data structures and Java Collections framework to under stand this article. 

Java Collections framework was designed and implemented using general purpose data structures to cater to different kinds of users. We do not need to write our own hash table or linked list to handle huge data items.


Basically, Java collections classes (Set/Map) falls into three core data structures such as Hash Table, Red black Tree and Linked List.



Why Hash Table ?.  
- It is based on arrays
- Uses hashing technique which offers very fast insertion and   searching.
      - If we don’t need to visit items in order, and we can predict in advance the size of our data, 
hash tables are unparalleled in speed and convenience.

Why Red Black Tree ?         

        - It is a balanced tree,  guarantees O(logN) performance whether the input data is ordered or not.
        - Advantage in a red-black tree is that sorted data doesn’t lead to slow O(N) performance.

Why LinkedList ?        

        - Used whenever the amount of data to be stored cannot be predicted in advance or when data will frequently be inserted and deleted.

1) Comparison of HashMap / HashSet / TreeSet / LinkedHashMap:


HashMap/HashSet : Underlying data structure is HashTable.  As the you see, name Hash* - indicates Hash Table data structure.
TreeMap/TreeSet : Underlying data structure is Red black Tree.
LinkedHashMap/LinkedHashSet : Underlying data structure is doubly Linked List and Hash Table.

HashMap/HashSet/LinkedHashMap/LinkedHashSet - all of them have constant time performance for retrieval,insertion and deletion operations. But LinkedHashMap/LinkedHashSet performance is bit lesser than HashTable since it needs to maintain additional linked list.

Except red-black tree (which is non-linear), all the data structures used here are linear type.


Operations
HashTable
Red black Tree
Linked List + HashTable
Retrieval (get)
O(1) - HashMap/HashSetO(log N) -TreeMap/TreeSetO(1) - LinkedHashMap.
Insertion (put)
O(1) - HashMap/HashSetO(log N) - TreeMap/TreeSetO(1) - LinkedHashMap
Deletion
O(1) - HashMap/HashSetO(log N) - TreeMap/TreeSetO(1) - LinkedHashMap
Parameters affecting performance
Initial Capacity and Load Factor
   
Initial Capacity and Load Factor
Iteration Order
Does not guarantee the iteration orderSorted according to the natural order
This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Insertion order is not affected if a key is re-inserted into the map.


O(1) - constant time : Better performance.
O(n) - linear time : Worst performance for the large data.
O(log N) - logarithmic time : Average Performance.


2) Comparison ArrayList and LinkedList 

Array list allows random access , so we can grab any element in constant time whereas linked list allows sequential access of elements. You can walk front/back but grabbing element in the middle is proportional to the size of the list.

ArrayList is useful when
- we need random access of elements.
- we don’t need to remove elements inbetween.
- adding elements at amortized time

LinkedList is useful when
- we prefer iteration over the the random access
 - we need efficient removal in between the elements or at start.
Operations
ArrayList
LinkedList
Base Data Structure
Dynamically resizing array
Doubly linked list
Retrieval (get)
O(1)
O(n)
Insertion (add)
O(1) - on average
O(n) - on worst case
O(1)
Deletion
O(n)
O(1) - Iterator.remove
Mostly useful for
Efficient Retrieval - (by random access)
Efficient Removal
Is NULL permitted
Yes
Yes
Iterators
Iterator and ListIterator which are fail-fast*
Iterator and ListIterator which are fail fast*

* Fail-fast :  If the list is structurally modified at any time after the iterator is created, in any way except through the Iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException.