Consistent Hashing – part 1

This post is divided into two parts. This is the Part 1 and Part 2 is coming soon!!!

Consistent Hashing:-

Consistent Hashing is a algorithm which can be used to minimise the operation cost of rehashing/hashing in a hash based implementation where scalability and availability is really important.

Consistent hashing is a special kind of hashing such that when a hash table is resized, only {K/n} keys need to be remapped on average, where { K} is the number of keys, and {n} is the number of slots/buckets. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots/bucket is defined by a modular operation.


In traditional hash tables, we find the position of object o based on the hash function mod current no. of buckets bucket_no = hash( o) mod n. This will work perfectly until we remove/add bucket. For more understanding lets take an example of Caching system.

Cache system Example:-

Suppose, initially we have three machines in our cache cluster. (A, B, C) and we have stored 6 values(Name as hash key) in our cache system. After applying the hash function here is the state of the system.

Node A: Key1,key4

Node B: Key2,key5

Node C: key3,key6

What if a node D is added now. After re-hashing state might look like:

Node A: Key2,key4

Node B: Key3

Node C: key1

Node D: key6

If you see closely most of the keys are moved to different node. So, Most of the keys will be cache miss if tried to get accessed. This should not happen only few of the keys should be moved in both the cases of addition/removal of Nodes. Consistent Hashing algorithm can be used to solve this problem.


Consistent Hashing is a algorithm in which same hash is generated for the keys and it is independent of the number of the nodes. So, keys hash will be constant and each node is assigned a group of hash values.


Let’s look at this in more detail. The hash function actually maps keys and caches to a number range. This should be familiar to every Java programmer – the hashCode method on Object returns an int, which lies in the range -231 to 231-1. Imagine mapping this range into a circle so the values wrap around. Here’s a picture of the circle with a number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the points that they hash to (based on a diagram from Web Caching with Consistent Hashing by David Karger et al):


To find which cache an object goes in, we move clockwise round the circle until we find a cache point. So in the diagram above, we see object 1 and 4 belong in cache A, object 2 belongs in cache B and object 3 belongs in cache C. Consider what happens if cache C is removed: object 3 now belongs in cache A, and all the other object mappings are unchanged. If then another cache D is added in the position marked it will take objects 3 and 4, leaving only object 1 belonging to A.


This works well, except the size of the intervals assigned to each cache is pretty hit and miss. Since it is essentially random it is possible to have a very non-uniform distribution of objects between caches.

Examples of use: 

  1. Couchbase
  2. Partitioning component of Amazon’s storage system Dynamo
  3. Openstack‘s Object Storage Service Swift
  4. Data partitioning in Apache Cassandra
  5. Data Partitioning in Voldemort
  6. Akka‘s consistent hashing router
  7. Riak, a distributed key-value database
  8. GlusterFS, a network-attached storage file system
  9. Skylable, an open-source distributed object-storage system
  10. Akamai Content Delivery Network
  11. Discord chat application
  12. Maglev: A Fast and Reliable Software Network Load Balancer



Database Partitioning vs Sharding


Partitioning is dividing your logical data into multiple entity to achieve more performance,Availability or maintainability. It has two types:


  1. Vertical Partitioning:-“Vertical partitioning” is the act of splitting up the data stored in one entity into multiple entities – again for space and performance reasons.  For example, a customer might only have one billing address, yet I might choose to put the billing address information into a separate table with a CustomerId reference so that I have the flexibility to move that information into a separate database, or different security context, etc.
  2. Horizontal Partitioning(Sharding):- When you shard a database, you create replica’s of the schema, and then divide what data is stored in each shard based on a shard key.  For example, I might shard my customer database using CustomerId as a shard key – I’d store ranges 0-10000 in one shard and 10001-20000 in a different shard.  When choosing a shard key, the DBA will typically look at data-access patterns and space issues to ensure that they are distributing load and space across shards evenly.This also improves the write speed because Sharding is like Shared nothing architecture.

Summary: Partitioning is generic term which is dividing the logical data into multiple entities. Either you can do it horizontally by dividing the data of an Entity by replicating the schema over multiple machine and dividing the data over multiple machine based on the shard key (Sharding). or we can divide the data into multiple entities on the same machine which is Vertical Scaling.


Finding Nth highest salary using SQL interview question

This is one of very common question to find the Nth highest salary using sql.

Create table Command:-


  1. Generic Solution:- This solution use correlated subquery. Below given query can be use to find 2nd, 3rd……nth highest salary in an employee table. This solution will work on any database but it can little slower due to subquery.

  2. Mysql Specific Solution:- In Mysql we can use limit to solve this problem and it is fast solution but it is vendor specific.


    Please add comment if there is any better solution. Thanks in advance.

Parameters are Passed by value or reference in Java

This is sometime confusing whether in java it is “passed by Value or Reference”. Here is the answer to this confusing question.

In Java parameters are “passed by value”.

  1. In case of Primitive Data Types, the data value of actual parameter is passed.
  2. If actual parameter is a reference to an Object (Class,Enum or Array object instance), then the reference value is passed, not the object itself. Means changes made to formal parameters will be reflected in the actual parameters.

ClassNotFoundException vs. NoClassDefFoundError java

This is very common interview question in core java. Here is the difference between the two. Both of them occur at run-time in application but they occur in different scenario.

  1. ClassNotFoundException:- ClassNotFoundException is a runtime exception which occurs when we try to load a class at runtime using Class.forName() method but dependent jar is not present in the classPath. Popular example for this could be when we connect to Database using JDBC we use Class.forName() to load the Driver Class to get the connection.

    if mysql jar is not present in the classPath then we get ClassNotFoundException.
  2.  NoClassDefFoundError:- NoClassDefFoundError is an error which occurs at runtime when class is available at compile time and but at runtime class is not available.

    Now if we compile the above given program then we get two class files A.class and B.class but if delete A.class then we will get NoClassDefFoundError.

Instance Member variable are overwritten not overridden java

Member variable are not overridden, they are overwritten.

  1. In subclass member variable to be available they need to be either protected/public.
  2. Instance member variable of subclass hide the variable of super class. e.g. parent.variable will call variable of parent and child.variable will call variable of child. See example given below.

    Example 2:


Serialization in java(interview questions)

Question 1. What is Serialization in java?

Object Serialization in Java is a process used to convert Object into a binary format which can be persisted into disk or sent over network to any other running Java virtual machine; the reverse process of creating object from binary stream is called deserialization in Java. Java provides Serialization API for serializing and deserializing object which includes,, ObjectInputStream and ObjectOutputStream etc. Java programmers are free to use default Serialization mechanism which Java uses based upon structure of class but they are also free to use there own custom binary format, which is often advised as Serialization best practice, Because serialized binary format becomes part of Class’s exported API and it can potentially break Encapsulation in Java provided by private and package-private fields. This pretty much answer the question What is Serialization in Java.

Question 2: How to make a Java class Serializable?

Making a class Serializable in Java is very easy, Your Java class just needs to implements interface and JVM will take care of serializing object in default format. Decision to making a Class Serializable should be taken concisely because though near term cost of making a Class Serializable is low, long term cost is substantial and it can potentially limit your ability to further modify and change its implementation because like any public API, serialized form of an object becomes part of public API and when you change structure of your class by implementing addition interface, adding or removing any field can potentially break default serialization, this can be minimized by using a custom binary format but still requires lot of effort to ensure backward compatibility. One example of How Serialization can put constraints on your ability to change class is SerialVersionUID. If you don’t explicitly declare SerialVersionUID then JVM generates its based upon structure of class which depends upon interfaces a class implements and several other factors which is subject to change. Suppose you implement another interface than JVM will generate a different SerialVersionUID for new version of class files and when you try to load old object object serialized by old version of your program you will get InvalidClassException.


Question 3: What is the difference between Serializable and Externalizable interface in Java?

Answer:  This is most frequently asked question in Java serialization interview. Here is my version Externalizable provides us writeExternal() and readExternal() method which gives us flexibility to control java serialization mechanism instead of relying on Java’s default serialization. Correct implementation of Externalizable interface can improve performance of application drastically.

It is a marker interface it doesn’t have any method.
It’s not a marker interface.
It has method’s called writeExternal() and readExternal()
Default Serialization process
YES, Serializable provides its own default serialization process, we just need to implement Serializable interface.
NO, we need to override writeExternal() and readExternal() for serialization process to happen.
Customize serialization process
We can customize default serialization process by defining following methods in our class >readObject() and writeObject()
Note: We are not overriding these methods, we are defining them in our class.
Serialization process is completely customized
We need to override Externalizable interface’s writeExternal() and readExternal() methods.
Control over Serialization
It provides less control over Serialization as it’s not mandatory to define readObject() and writeObject() methods.
Externalizable provides you great control over serialization process as it is important to override  writeExternal() and readExternal() methods.
Constructor call during deSerialization
Constructor is not called during deSerialization.
Constructor is called during deSerialization.

Question 4: How many methods Serializable has? If no method then what is the purpose of Serializable interface?

Answer: Serializable interface exists in package and forms core of java serialization mechanism. It doesn’t have any method and also called Marker Interface in Java. When your class implements interface it becomes Serializable in Java and gives compiler an indication that use Java Serialization mechanism to serialize this object.

Question 5: What is serialVersionUID? What will happen if i do not define it in class?

One of my favorite question interview question on Java serialization. SerialVersionUID is an ID which is stamped on object when it get serialized usually hashcode of object, you can use tool serialver to see serialVersionUID of a serialized object . SerialVersionUID is used for version control of object. you can specify serialVersionUID in your class file also. Consequence of not specifying serialVersionUID is that when you add or modify any field in class then already serialized class will not be able to recover because serialVersionUID generated for new class and for old serialized object will be different. Java serialization process relies on correct serialVersionUID for recovering state of serialized object and throws in case of serialVersionUID mismatch, to learn more about serialversionuid.

Question 6: While serializing you want some of the members not to serialize? How do you achieve it?

Another frequently asked Serialization interview question. This is sometime also asked as what is the use of transient variable, does transient and static variable gets serialized or not etc. so if you don’t want any field to be part of object’s state then declare it either static or transient based on your need and it will not be included during Java serialization process.

Question 7: What will happen if one of the members in the class doesn’t implement Serializable interface?

Answer: One of the easy question about Serialization process in Java. If you try to serialize an object of a class which implements Serializable, but the object includes a reference to an non- Serializable class then a ‘NotSerializableException’ will be thrown at runtime.

Question 8: If a class is Serializable but its super class in not, what will be the state of the instance variables inherited from super class after deserialization?

Answer: When we deserialize the object.

If superclass has implemented Serializable – constructor is not called during DeSerialization process.

If superclass has not implemented Serializable – constructor is called during DeSerialization process.

Java serialization process only continues in object hierarchy till the class is Serializable i.e. implements Serializable interface in Java and values of the instance variables inherited from super class will be initialized by calling constructor of Non-Serializable Super class during deserialization process. Once the constructor chaining will started it wouldn’t be possible to stop that , hence even if classes higher in hierarchy implements Serializable interface , there constructor will be executed. As you see from the statement this Serialization interview question looks very tricky and tough but if you are familiar with key concepts its not that difficult.

You can try writing a program for both the cases which has super class as serializable and not serializable.

Question 9: Can you Customize Serialization process or can you override default Serialization process in Java?
Answer: The answer is yes you can. We all know that for serializing an object ObjectOutputStream.writeObject (saveThisobject) is invoked and for reading object ObjectInputStream.readObject() is invoked but there is one more thing which Java Virtual Machine provides you is to define these two method in your class. If you define these two methods in your class then JVM will invoke these two methods instead of applying default serialization mechanism. You can customize behavior of object serialization and deserialization here by doing any kind of pre or post processing task. Important point to note is making these methods private to avoid being inherited, overridden or overloaded. Since only Java Virtual Machine can call private method integrity of your class will remain and Java Serialization will work as normal. In my opinion this is one of the best question one can ask in any Java Serialization interview, a good follow-up question is why should you provide custom serialized form for your object?

Question 10:Suppose super class of a new class implement Serializable interface, how can you avoid new class to being serialized?
Answer: Using the custom serialization you can provide definition of writeObject method where you can throw NotSerializableException.

Question 11: Which methods are used during Serialization and DeSerialization process in Java?
Answer: Java Serialization is done by class. That class is a filter stream which is wrapped around a lower-level byte stream to handle the serialization mechanism. To store any object via serialization mechanism we call ObjectOutputStream.writeObject(saveThisobject) and to deserialize that object we call ObjectInputStream.readObject() method. Call to writeObject() method trigger serialization process in java. one important thing to note about readObject() method is that it is used to read bytes from the persistence and to create object from those bytes and its return an Object which needs to be type cast to correct type.

Question 12: Suppose you have a class which you serialized it and stored in persistence and later modified that class to add a new field. What will happen if you deserialize the object already serialized?
Answer: This will depend upon if you have defined the static final serialVersionUID. If it is not defined then for each object a serialVersionUID is generated based on the hashCode of this object.In that case if you add new fields and then you try to deserialize the object then there will be InvalidClassException and if we have defined the serialVersionUID then there will be no issues.

Question 13: Why static member variables are not part of java serialization process (Important)?

Answer: Serialization is applicable on the instance variable which are either objects or primitives. As static variable are class level variable they doesn’t exists at instance level so, they are not part of serialized object.

Question 14: What will happen if one the member of class does not implement Serializable interface (Important)?

Answer: NotSerializableException will be thrown.

Question 15: What will happen if we have used List, Set and Map as member of class?

Answer: These collection classes implements serializable so it will work fine.

Question 16:Is constructor of class called during DeSerialization process?

Answer: It depends on whether our object has implemented Serializable or Externalizable.

If Serializable has been implemented – constructor is not called during DeSerialization process.But, if Externalizable has been implemented – constructor is called during DeSerialization process.

Question 17: Is constructor of super class called during DeSerialization process of subclass (Important)?

Answer: It is depends on whether our superclass has implemented Serializable or not.

If superclass has implemented Serializable – constructor is not called during DeSerialization process.
If superclass has not implemented Serializable – constructor is called during DeSerialization process.
Question 18: How you can avoid Deserialization process creating another instance of Singleton class (Important)?


We can simply use readResove() method to return same instance of class, rather than creating a new one.

Defining readResolve() method ensures that we don’t break singleton pattern during DeSerialization process.
 private Object readResolve() throws ObjectStreamException {
      return instance;

Also define readObject() method:
 private void readObject(ObjectInputStream ois) throws IOException,ClassNotFoundException{
       synchronized (SingletonClass.class) {
        if (instance == null) {
              instance = this;

Sort By vs Order By vs Cluster By vs Distribute by in hive

In Hive we have different command contructs like sort by, order by, cluster by and distribute by they can be confusion to differentiate among them. Below given are the differentiation among them along with examples:

Sort By:-

Hive uses sort by to sort the data based on the data type of the column to be used for sorting per reducer i.e. overall sorting of output is not maintained. e.g. if column is of numeric type the data will be sorted per reducer in numeric order.

For Example:

Here, key and value both are numeric. Suppose we have two reducer have the output as given below.

Reducer 1

Reducer 2

Here, we can clearly see overall output is not in sorted order.


Very similar to ORDER BY of SQL. The overall sorting is maintained in case of order by and output is produced in single reducer. Hence, we need to use limit clause so that reducer is not overloaded.

For Example:



Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However,Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducer:






NOTE:  partition by does not guarantee the ordering per reducer. Also each reducer will contain non-overlapping output ranges.


Cluster By is a short-cut for both Distribute By and Sort By.

Ordering : Global ordering between multiple reducers.

Outcome : N or more sorted files with non-overlapping ranges.

For Example:

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different.





Accessing file using hdfs url

Recently i came across a scenario where i need to access the HDFS file system using hdfs url. So, following is the command to access any file path.

Here, master is the hostname of namenode and 54310 is the port.

NOTE: Here, i was using apache Hadoop plane vanila distribution. You need to use the port as per the distribution like cloudera/Hortonwork.


UDF vs UDTF vs UDAF in hive


Hive have so many in-built functions like: But, if we want to extend the functionality of hive we can use UDF, UDTF and UDAF.

UDF:- Please use the given below link to see how UDF works in hive.
UDTF:- Please use the given below link to see how UDTF works in hive.
UDAF:- Please use the given below link to see how UDAF works in hive.

Thats it!!!

Page 1 Page 2 Page 3