Serialization in Java (Interview Questions)

Question 1. What is Serialization in java?

Object serialization in Java is the process of converting an object into a binary format that can be persisted to disk or sent over the network to another running Java Virtual Machine; the reverse process of recreating the object from that binary stream is called deserialization. Java provides a Serialization API for this purpose, which includes classes such as ObjectInputStream and ObjectOutputStream. Java programmers are free to use the default serialization mechanism, which Java derives from the structure of the class, but they may also define their own custom binary format. The latter is often advised as a serialization best practice, because the serialized binary format becomes part of the class's exported API and can potentially break the encapsulation provided by private and package-private fields.
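The round trip can be sketched with a small self-contained program (the Employee class and its fields are illustrative, not part of any real API):

```java
import java.io.*;

public class SerializationDemo {
    // A simple serializable class; the name and fields are illustrative.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int id;
        Employee(String name, int id) { this.name = name; this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        Employee original = new Employee("Alice", 42);

        // Serialize the object into an in-memory byte stream
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }

        // Deserialize it back from the bytes
        Employee copy;
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (Employee) ois.readObject();
        }

        System.out.println(copy.name + ":" + copy.id); // Alice:42
    }
}
```

The same pattern works with FileOutputStream/FileInputStream when the object is persisted to disk instead of an in-memory buffer.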

Question 2: How to make a Java class Serializable?

Making a class serializable in Java is very easy: your class just needs to implement the java.io.Serializable interface, and the JVM will take care of serializing its objects in the default format. However, the decision to make a class serializable should be taken deliberately, because although the near-term cost is low, the long-term cost is substantial. Like any public API, the serialized form of an object becomes part of the public API, and it can limit your ability to further modify and change the implementation: changing the structure of the class by implementing an additional interface, or by adding or removing a field, can break default serialization. This risk can be minimized with a custom binary format, but ensuring backward compatibility still requires a lot of effort. One example of how serialization constrains your ability to change a class is serialVersionUID. If you don't explicitly declare a serialVersionUID, the JVM generates one based on the structure of the class, which depends on the interfaces the class implements and several other factors that are subject to change. Suppose you implement another interface: the JVM will then generate a different serialVersionUID for the new version of the class file, and when you try to load an object serialized by the old version of your program you will get an InvalidClassException.


Question 3: What is the difference between Serializable and Externalizable interface in Java?

Answer: This is the most frequently asked question in Java serialization interviews. Externalizable provides the writeExternal() and readExternal() methods, which give us the flexibility to control the serialization mechanism ourselves instead of relying on Java's default serialization. A correct implementation of the Externalizable interface can improve the performance of an application drastically.

Serializable vs Externalizable:

Marker interface
- Serializable: Yes, it is a marker interface; it doesn't have any methods.
- Externalizable: No, it is not a marker interface; it has two methods, writeExternal() and readExternal().

Default serialization process
- Serializable: Yes, Serializable provides a default serialization process; we just need to implement the interface.
- Externalizable: No, we need to override writeExternal() and readExternal() for the serialization process to happen.

Customizing the serialization process
- Serializable: We can customize the default process by defining readObject() and writeObject() methods in our class. Note: we are not overriding these methods, we are defining them.
- Externalizable: The serialization process is completely customized; we need to override the writeExternal() and readExternal() methods.

Control over serialization
- Serializable: Less control, as it is not mandatory to define readObject() and writeObject().
- Externalizable: Great control over the serialization process, as it is mandatory to override writeExternal() and readExternal().

Constructor call during deserialization
- Serializable: The constructor is not called during deserialization.
- Externalizable: The public no-arg constructor is called during deserialization.
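The Externalizable side of the comparison can be sketched with a small class (Point and its fields are illustrative):

```java
import java.io.*;

public class ExternalizableDemo {
    // Externalizable requires a public no-arg constructor, which IS called
    // during deserialization (unlike Serializable).
    public static class Point implements Externalizable {
        int x, y;
        public Point() { }                        // mandatory no-arg constructor
        public Point(int x, int y) { this.x = x; this.y = y; }

        @Override public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(x);                      // we decide exactly what is written
            out.writeInt(y);
        }
        @Override public void readExternal(ObjectInput in) throws IOException {
            x = in.readInt();                     // and in which order it is read back
            y = in.readInt();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Point(3, 4));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Point p = (Point) ois.readObject();
            System.out.println(p.x + "," + p.y); // 3,4
        }
    }
}
```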

Question 4: How many methods Serializable has? If no method then what is the purpose of Serializable interface?

Answer: The Serializable interface is in the java.io package and forms the core of the Java serialization mechanism. It doesn't have any methods and is therefore called a marker interface in Java. When your class implements java.io.Serializable it becomes serializable, which signals to the JVM's serialization mechanism that objects of this class may be serialized.

Question 5: What is serialVersionUID? What will happen if i do not define it in class?

One of my favorite interview questions on Java serialization. serialVersionUID is a version number stamped on an object when it gets serialized; by default it is computed from the structure of the class (not from the object's hashCode), and you can use the serialver tool to see the serialVersionUID of a class. serialVersionUID is used for version control of serialized objects, and you can declare it explicitly in your class file. The consequence of not specifying serialVersionUID is that when you add or modify any field in the class, already-serialized objects can no longer be recovered, because the serialVersionUID generated for the new class and for the old serialized object will differ. The Java serialization process relies on a matching serialVersionUID for recovering the state of a serialized object and throws InvalidClassException in case of a serialVersionUID mismatch.
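Declaring the field explicitly is a one-liner; the ObjectStreamClass API can be used to confirm which UID the serialization machinery will actually use (the Invoice class is illustrative):

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class SerialVersionUIDDemo {
    static class Invoice implements Serializable {
        // Explicit serialVersionUID: the JVM uses this value instead of
        // computing one from the class structure, so adding a field later
        // will not break deserialization of previously stored objects.
        private static final long serialVersionUID = 7L;
        double amount;
    }

    public static void main(String[] args) {
        // ObjectStreamClass reports the UID the serialization machinery will use
        long uid = ObjectStreamClass.lookup(Invoice.class).getSerialVersionUID();
        System.out.println(uid); // 7
    }
}
```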

Question 6: While serializing, you want some of the members not to be serialized. How do you achieve that?

Another frequently asked serialization interview question. This is sometimes also asked as "what is the use of a transient variable?" or "do transient and static variables get serialized or not?". If you don't want a field to be part of an object's state, declare it either static or transient (based on your need) and it will not be included during the Java serialization process.
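The effect of transient can be sketched as follows (User and its fields are illustrative):

```java
import java.io.*;

public class TransientDemo {
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        transient String password;   // excluded from the serialized form
        User(String name, String password) { this.name = name; this.password = password; }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new User("bob", "secret"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            User u = (User) ois.readObject();
            // transient fields come back with their default value (null here)
            System.out.println(u.name + "," + u.password); // bob,null
        }
    }
}
```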

Question 7: What will happen if one of the members in the class doesn’t implement Serializable interface?

Answer: One of the easier questions about the serialization process in Java. If you try to serialize an object of a class which implements Serializable, but the object includes a reference to a non-Serializable class, then a NotSerializableException will be thrown at runtime.

Question 8: If a class is Serializable but its super class is not, what will be the state of the instance variables inherited from the super class after deserialization?

Answer: When we deserialize the object:

If the superclass has implemented Serializable, its constructor is not called during deserialization.

If the superclass has not implemented Serializable, its constructor is called during deserialization.

The Java serialization process only continues down the object hierarchy while the class is serializable, i.e. implements the Serializable interface; the values of the instance variables inherited from a non-serializable super class are initialized by calling the no-argument constructor of that class during deserialization. Once that constructor chaining starts it cannot be stopped, so the constructors of all classes higher in the hierarchy are executed, even ones that implement the Serializable interface. As you can see, this serialization interview question looks very tricky, but if you are familiar with the key concepts it is not that difficult.

You can try writing a program for both cases, with the super class serializable and with it not serializable.
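One of the two cases can be sketched directly (Animal/Dog are illustrative; note the non-serializable superclass must have an accessible no-arg constructor):

```java
import java.io.*;

public class SuperClassDemo {
    // The superclass does NOT implement Serializable, so its no-arg
    // constructor runs during deserialization and re-initializes its fields.
    static class Animal {
        String kind = "unknown";          // set again when the constructor runs
        Animal() { }                      // must be accessible to the subclass
    }
    static class Dog extends Animal implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        Dog(String name, String kind) { this.name = name; this.kind = kind; }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Dog("rex", "dog"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Dog d = (Dog) ois.readObject();
            // name was serialized; kind was reset by Animal's constructor
            System.out.println(d.name + "," + d.kind); // rex,unknown
        }
    }
}
```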

Question 9: Can you Customize Serialization process or can you override default Serialization process in Java?
Answer: Yes, you can. We all know that ObjectOutputStream.writeObject(saveThisObject) is invoked to serialize an object and ObjectInputStream.readObject() to read one back, but the Java Virtual Machine also lets you define two methods in your class: private void writeObject(ObjectOutputStream) and private void readObject(ObjectInputStream). If you define these two methods, the JVM invokes them instead of applying the default serialization mechanism, so you can customize the behavior of serialization and deserialization and perform any kind of pre- or post-processing. An important point is to make these methods private, to avoid their being inherited, overridden, or overloaded; since only the Java Virtual Machine can call a private method, the integrity of your class is preserved and Java serialization works as normal. In my opinion this is one of the best questions to ask in a Java serialization interview; a good follow-up is: why should you provide a custom serialized form for your object?
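A minimal sketch of the hook methods (the Account class and its "checksum" post-processing are illustrative assumptions):

```java
import java.io.*;

public class CustomSerializationDemo {
    static class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        String owner;
        Account(String owner) { this.owner = owner; }

        // Private, so the JVM (not subclasses) invokes them; the signatures
        // must match exactly for the serialization machinery to pick them up.
        private void writeObject(ObjectOutputStream oos) throws IOException {
            oos.defaultWriteObject();                // pre-processing could go here
            oos.writeUTF("checksum:" + owner.length());
        }
        private void readObject(ObjectInputStream ois)
                throws IOException, ClassNotFoundException {
            ois.defaultReadObject();
            String extra = ois.readUTF();            // post-processing / validation
            if (!extra.equals("checksum:" + owner.length()))
                throw new InvalidObjectException("corrupt stream");
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Account("alice"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            System.out.println(((Account) ois.readObject()).owner); // alice
        }
    }
}
```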

Question 10: Suppose the super class of a new class implements the Serializable interface; how can you prevent the new class from being serialized?
Answer: Using custom serialization, you can define a writeObject() method in the new class that throws NotSerializableException (and a readObject() that does the same, to block deserialization as well).
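A sketch of this trick (Parent/Child are illustrative):

```java
import java.io.*;

public class BlockSerializationDemo {
    static class Parent implements Serializable {
        private static final long serialVersionUID = 1L;
    }
    // Child inherits Serializable from Parent but refuses to be serialized.
    static class Child extends Parent {
        private static final long serialVersionUID = 1L;
        private void writeObject(ObjectOutputStream oos) throws IOException {
            throw new NotSerializableException("Child must not be serialized");
        }
        private void readObject(ObjectInputStream ois) throws IOException {
            throw new NotSerializableException("Child must not be deserialized");
        }
    }

    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream oos =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(new Child());   // triggers the custom writeObject
        } catch (NotSerializableException e) {
            System.out.println("blocked: " + e.getMessage());
        }
    }
}
```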

Question 11: Which methods are used during Serialization and DeSerialization process in Java?
Answer: Java serialization is done by the java.io.ObjectOutputStream class. That class is a filter stream which is wrapped around a lower-level byte stream to handle the serialization mechanism. To store any object via the serialization mechanism we call ObjectOutputStream.writeObject(saveThisObject), and to deserialize that object we call ObjectInputStream.readObject(). The call to writeObject() triggers the serialization process. One important thing to note about readObject() is that it reads bytes from the persisted stream, creates an object from those bytes, and returns an Object which needs to be type cast to the correct type.

Question 12: Suppose you have a class which you serialized it and stored in persistence and later modified that class to add a new field. What will happen if you deserialize the object already serialized?
Answer: This will depend on whether you have defined a static final long serialVersionUID. If it is not defined, a serialVersionUID is generated for the class based on its structure. In that case, if you add new fields and then try to deserialize the old object, you will get an InvalidClassException; if you have defined the serialVersionUID explicitly, there will be no issue (the new field simply takes its default value).

Question 13: Why static member variables are not part of java serialization process (Important)?

Answer: Serialization applies to instance variables, whether they are objects or primitives. Since static variables are class-level variables, they do not exist at the instance level, so they are not part of the serialized object.

Question 14: What will happen if one of the members of a class does not implement the Serializable interface (Important)?

Answer: NotSerializableException will be thrown.

Question 15: What will happen if we have used List, Set and Map as member of class?

Answer: These collection classes implement Serializable, so they will work fine, provided the elements they hold are serializable too.

Question 16: Is the constructor of a class called during the deserialization process?

Answer: It depends on whether our object has implemented Serializable or Externalizable.

If Serializable has been implemented – the constructor is not called during deserialization. But if Externalizable has been implemented – the public no-arg constructor is called during deserialization.

Question 17: Is constructor of super class called during DeSerialization process of subclass (Important)?

Answer: It depends on whether our superclass has implemented Serializable or not.

If superclass has implemented Serializable – constructor is not called during DeSerialization process.
If superclass has not implemented Serializable – constructor is called during DeSerialization process.
Question 18: How can you avoid the deserialization process creating another instance of a Singleton class (Important)?


We can simply use the readResolve() method to return the same instance of the class, rather than creating a new one.

Defining readResolve() method ensures that we don’t break singleton pattern during DeSerialization process.
 private Object readResolve() throws ObjectStreamException {
      return instance;
 }

Also define the readObject() method:
 private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
      ois.defaultReadObject();
      synchronized (SingletonClass.class) {
          if (instance == null) {
              instance = this;
          }
      }
 }
Sort By vs Order By vs Cluster By vs Distribute by in hive

In Hive we have different command constructs like sort by, order by, cluster by and distribute by, and it can be confusing to differentiate among them. Below is how they differ, along with examples:

Sort By:-

Hive uses sort by to sort the data per reducer, based on the data type of the sort column; overall sorting of the output is not maintained. E.g. if the column is of numeric type, the data within each reducer is sorted in numeric order.

For Example:

Here, key and value are both numeric. Suppose we have two reducers; each reducer's output is sorted on its own, but we can clearly see that the overall output is not in sorted order.
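A sketch with assumed table and column names (emp with id and salary columns; the per-reducer outputs shown are illustrative):

```sql
SET mapreduce.job.reduces = 2;

-- Each reducer sorts its own slice of the data; the concatenated
-- output is NOT globally sorted.
SELECT id, salary FROM emp SORT BY salary;

-- Reducer 1 might emit: (3,1000) (1,4000)
-- Reducer 2 might emit: (5,2000) (2,3000)
```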


Order By:-

Very similar to ORDER BY of SQL. Overall sorting is maintained with order by, and the output is produced by a single reducer. Hence we need to use a limit clause so that the reducer is not overloaded.

For Example:
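A sketch, using the same assumed emp(id, salary) table:

```sql
-- ORDER BY forces a single reducer, so the output is globally sorted;
-- LIMIT keeps that lone reducer from being overloaded.
SELECT id, salary FROM emp ORDER BY salary DESC LIMIT 100;
```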



Distribute By:-

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By column values go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, suppose we are distributing by x on the following 5 rows to 2 reducers:
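A sketch with an assumed single-column table t(x):

```sql
SET mapreduce.job.reduces = 2;

-- All rows with the same value of x land on the same reducer,
-- but rows are not sorted within a reducer.
SELECT x FROM t DISTRIBUTE BY x;
```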






NOTE: Distribute By alone does not guarantee ordering within a reducer; it is Cluster By that additionally sorts, so that each reducer produces a sorted output over a non-overlapping range.


Cluster By:-

Cluster By is a shortcut for both Distribute By and Sort By on the same columns.

Ordering: each reducer's output is sorted and the reducers cover non-overlapping key ranges, so the outputs, taken in range order, form a global ordering across reducers.

Outcome : N or more sorted files with non-overlapping ranges.

For Example:

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different.
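A sketch, again using an assumed table t(x, y):

```sql
-- CLUSTER BY x  is equivalent to  DISTRIBUTE BY x SORT BY x
SELECT x FROM t CLUSTER BY x;

-- The two clauses can also be split so that the distribution key
-- and the sort key differ:
SELECT x, y FROM t DISTRIBUTE BY x SORT BY y;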





Accessing file using hdfs url

Recently I came across a scenario where I needed to access the HDFS file system using an hdfs:// URL. The following command accesses a file this way.
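A sketch of the command (the file path is illustrative):

```shell
hdfs dfs -cat hdfs://master:54310/user/anuj/sample.txt
```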

Here, master is the hostname of namenode and 54310 is the port.

NOTE: Here I was using the plain vanilla Apache Hadoop distribution. You need to use the port appropriate to your distribution, e.g. Cloudera/Hortonworks.


UDF vs UDTF vs UDAF in hive


Hive has many in-built functions, but if we want to extend the functionality of hive we can use UDFs, UDTFs and UDAFs.

UDF:- See the "UDF in hive" section below for how a UDF works.
UDTF:- See the "UDTF in hive" section below for how a UDTF works.
UDAF:- See the "UDAF in hive" section below for how a UDAF works.

That's it!!!

UDAF in hive

UDAF(User Defined Aggregation Function):-

Aggregate functions perform a calculation on a set of values and return a single value. Values are aggregated in chunks (potentially across many tasks), so the implementation has to be capable of combining partial aggregations into a final result.

E.g. To find the maximum number in a table.


  1. First create a Max class which extends UDAF, and inside it create a static inner class MaxIntUDAFEvaluator implementing the UDAFEvaluator interface.

  2. Create table as given below:
  3. Insert some data using a .txt file containing numbers.
  4. Add jar file to hive with full path on hive CLI/beeline CLI or add this jar to .bashrc.
  5. Create temporary function as shown below:
  6. Use the select statement to find max
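Steps 2-6 can be sketched in HiveQL as follows (the jar path, fully qualified class name, table and function names are illustrative assumptions):

```sql
-- 2. A one-column table of numbers
CREATE TABLE numbers (num INT);

-- 3. Load data from a local text file, one number per line
LOAD DATA LOCAL INPATH '/home/anuj/numbers.txt' INTO TABLE numbers;

-- 4. Register the jar containing the UDAF
ADD JAR /home/anuj/HIVE/HIVE-UDAF-max.jar;

-- 5. Create a temporary function backed by the Max class
CREATE TEMPORARY FUNCTION my_max AS 'com.example.hive.Max';

-- 6. Use it like any aggregate function
SELECT my_max(num) FROM numbers;
```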


That's it!!!

UDTF in hive

UDTF (User Defined Tabular Function):-

User defined tabular function works on one row as input and returns multiple rows as output. So here the relation is one to many.

e.g. Hive's built-in EXPLODE() function. Take an array column USER_IDS containing [10,12,5,45]; then SELECT EXPLODE(USER_IDS) AS ID FROM T_USER will give 10, 12, 5 and 45 as four different rows in the output. A UDTF can also be used to split a column into multiple columns, which we will look at in the example below. Here the alias ("AS" clause) is mandatory.

Problem Statement:– Expand the name column from the emp table into two separate columns, First_name and Last_name.


Here are the steps that need to be followed in order to solve this problem.

  1. We have to extend a base Class GenericUDTF to write our business logic in Java.
  2. We need to override 3 methods namely initialize(), process() and close() in our class ExpandNameDetails.class.
  3. Add the jar to the classpath: add the exported JAR file to the hive classpath using the following command from the hive terminal: ADD JAR /home/anuj/HIVE/HIVE-UDTF-split.jar. Alternatively, you can add exported JAR files in your bashrc file ("nano ~/.bashrc") via HIVE_AUX_JARS_PATH=/home/anuj/HIVE/HIVE-UDTF-split.jar. This avoids adding the jar to the classpath each time you log in to a hive session, as it is loaded by the framework itself when the cluster starts.
  4. create temporary function:
  5. Try executing the function on emp table name field.
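Steps 4 and 5 can be sketched in HiveQL as follows (the function name and fully qualified class name are illustrative assumptions):

```sql
-- Register the jar and bind a function to the UDTF class
ADD JAR /home/anuj/HIVE/HIVE-UDTF-split.jar;
CREATE TEMPORARY FUNCTION expand_name AS 'com.example.hive.ExpandNameDetails';

-- Split emp.name into two columns; the AS clause is mandatory for UDTFs
SELECT expand_name(name) AS (first_name, last_name) FROM emp;
```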

That's it!!!


UDF in hive

Regular UDF:-  Hive provides us some built-in functions, but if we want to extend the functionality of hive we can write a UDF (User Defined Function). These functions are written as a Java program, from which a jar is created. Let us discuss this using an example. The following steps need to be followed to create a UDF in hive.


  1. Create a class extending Hive's UDF base class.
  2. We need to export the jar from above given java code to the system directory.
  3. Add the jar file in the hive CLI or beeline terminal: add the exported JAR file to the hive classpath using the following command from the hive terminal: ADD JAR /home/anuj/HIVE/HIVE-UDF-trim.jar. Alternatively, you can add exported JAR files in your bashrc file ("nano ~/.bashrc") via HIVE_AUX_JARS_PATH=/home/anuj/HIVE/HIVE-UDF-trim.jar. This avoids adding the jar to the classpath each time you log in to a hive session, as it is loaded by the framework itself when the cluster starts.
  4. add temporary function:  Run the following command to add function to hive.
  5. Use the function in hive commands.
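Steps 3-5 can be sketched in HiveQL as follows (the function name and fully qualified class name are illustrative assumptions):

```sql
ADD JAR /home/anuj/HIVE/HIVE-UDF-trim.jar;
CREATE TEMPORARY FUNCTION my_trim AS 'com.example.hive.Trim';
SELECT my_trim(name) FROM emp;
```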


This UDF handles primitive type arguments only, e.g. Text, IntWritable, LongWritable etc. For complex types like struct, array etc., a different kind of UDF (a GenericUDF) needs to be written.


Hive performance improvements

Following are the most commonly used ways to improve hive performance:

  1. Execution Engine
  2. Using Custom file formats
  3. Use Vectorization
  4. Bucketing & Partitioning
  5. Tweaking no of mappers and their memory
  6. Parallel execution


  1. Execution Engine:- 


    Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. I won't go into details about the many benefits of using Tez; enable it by setting the following at the beginning of your Hive query:

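The setting is a one-liner:

```sql
SET hive.execution.engine=tez;
```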

  2.  Using Custom file formats:- Use the ORCFile format for faster query performance in hive; it has a really fast response time. As an example, consider two large tables emp1 and emp2 (stored as text files, with some columns not all specified here), and a simple join query between them.


    This query may take a long time to execute since tables emp1 and emp2 are both stored as TEXT. Converting these tables to ORCFile format will usually reduce query time significantly:


    ORC supports compressed storage (with ZLIB or as shown above with SNAPPY) but also uncompressed storage.

    Converting base tables to ORC is often the responsibility of your ingest team, and it may take them some time to change the complete ingestion process due to other priorities. The benefits of ORCFile are so tangible that I often recommend a do-it-yourself approach as demonstrated above – convert emp1 into emp1_ORC and emp2 into emp2_ORC and do the join that way, so that you benefit from faster queries immediately, with no dependencies on other teams.
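The do-it-yourself conversion can be sketched as follows (column names are illustrative):

```sql
-- Create ORC copies of the text tables (SNAPPY compression shown)
CREATE TABLE emp1_orc STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM emp1;

CREATE TABLE emp2_orc STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT * FROM emp2;

-- Run the join against the ORC copies
SELECT e1.id, e2.salary
FROM emp1_orc e1 JOIN emp2_orc e2 ON (e1.id = e2.id);
```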


  3. Use Vectorization:- Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of a single row at a time,

    and is easily enabled with two parameter settings:
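The two settings are:

```sql
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
```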

  4. Bucketing & Partitioning:- Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location, and greatly helps queries that filter on the partition key(s). The selection of the partition key is always a sensitive decision: it should be a low-cardinality attribute. E.g. if your data is associated with the time dimension, date is a good partition key; similarly, if the data is associated with location, like a country or state, it's a good idea to have hierarchical partitions like country/state.

    Bucketing improves join performance if the bucket key and join keys are the same. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key, and reduces the I/O scans during a join if the join happens on those same keys. Additionally, it's important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to a bucketed table. To leverage bucketing in a join operation we should SET hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a bucket-level join during the map-stage join; it also reduces the scan cycles needed to find a particular key, because bucketing ensures that the key is present in a certain bucket.
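A sketch of a partitioned, bucketed table (the table and its columns are illustrative; the SET commands are the flags mentioned above):

```sql
-- Partition by low-cardinality keys, bucket by the join key
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING, state STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.enforce.bucketing = true;       -- before writing to a bucketed table
SET hive.optimize.bucketmapjoin = true;  -- bucket-level join in the map stage
```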

  5. Tweaking the number of mappers and their memory:-

    The default hive.input.format is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. This configuration can give fewer mappers than the number of splits (i.e., blocks in HDFS) of the input table.

    Try setting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat instead, which gives one mapper per split.

    Note: Apache Tez uses HiveInputFormat by default.

    You can then control the maximum number of mappers via the split-size settings:
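A sketch of the settings (the split size value is illustrative; on Tez, tez.grouping.max-size plays a similar role):

```sql
-- Use one mapper per split instead of combining splits
SET hive.input.format = org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Raise the split size (bytes) to cap the number of mappers
SET mapreduce.input.fileinputformat.split.maxsize = 268435456;
```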

Joins in hive

Syntax of joins in hive:-

Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job. More than two tables can be joined in Hive.

For this blog we will use two tables as follows to discuss joins in hive:

Join Example:

Multiple tables can be joined in the same query:
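Both cases can be sketched as follows (the table names emp1/emp2/emp3 and their columns are assumptions chosen to match the e1/e2/e3 aliases used below):

```sql
-- Simple equality join
SELECT e1.name, e2.salary
FROM emp1 e1 JOIN emp2 e2 ON (e1.id = e2.id);

-- More than two tables in one query
SELECT e1.name, e2.salary, e3.city
FROM emp1 e1
JOIN emp2 e2 ON (e1.id = e2.id)
JOIN emp3 e3 ON (e3.id = e2.id);
```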

Join implementation with Map Reduce

 Hive converts joins over multiple tables into a single map/reduce job if every table uses the same column in its join clauses. The query below is converted into a single map/reduce job, as only the id column of e2 is involved in both joins.

It is very interesting to note that any number of tables can be joined in a single map/reduce job as long as they fit the above criterion.

However, if the join columns are not the same for all tables, the join is converted into multiple map/reduce jobs.

In this case the first map/reduce job joins e1 with e2 and the results are then joined with e3 in the second map/reduce job.
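The two cases can be sketched as follows (table and column names are assumptions matching the aliases above):

```sql
-- One map/reduce job: e2 joins on the same column (id) in both clauses
SELECT e1.name, e2.salary, e3.city
FROM emp1 e1 JOIN emp2 e2 ON (e1.id = e2.id)
             JOIN emp3 e3 ON (e3.id = e2.id);

-- Two map/reduce jobs: e2 is joined on different columns (id, then dept)
SELECT e1.name, e2.salary, e3.city
FROM emp1 e1 JOIN emp2 e2 ON (e1.id = e2.id)
             JOIN emp3 e3 ON (e3.dept = e2.dept);
```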

Largest Table LAST

In the MapReduce job for regular inner joins, mappers run on both tables, emitting the records that need to be joined, evaluating any UDFs in the query and filtering out records based on the where clause. Then the shuffle phase runs, which "shuffles" the records based on the join key (id in the above example). Subsequently, in the reduce phase, essentially a cross-product takes place between the records from each table that have the same join key.
                   In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest table appears last in the sequence. For example, in the following query

all three tables are joined in a single map/reduce job, and the values for a particular value of id for tables e1 and e2 are buffered in memory in the reducers. Then, for each row retrieved from e3, the join is computed with the buffered rows.

For the query:

there are two map/reduce jobs involved in computing the join. The first of these joins e1 with e2, buffering the values of e1 while streaming the values of e2 through the reducers. The second job buffers the results of the first join while streaming the values of e3 through the reducers.

Streamtable hint

 You can also specify which table should be streamed, usually the largest one, via the STREAMTABLE hint.
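A sketch of the hint (tables as assumed above):

```sql
-- Stream e1 through the reducers instead of the default last table
SELECT /*+ STREAMTABLE(e1) */ e1.name, e2.salary, e3.city
FROM emp1 e1 JOIN emp2 e2 ON (e1.id = e2.id)
             JOIN emp3 e3 ON (e3.id = e2.id);
```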

Outer Joins

e.g. Example for LEFT OUTER JOIN. Similarly for RIGHT and FULL.

NOTE: These joins are not commutative; they are left-associative, regardless of whether they are LEFT or RIGHT OUTER joins.


This means the join starts from the left: e1 and e2 are joined first, and the result is then joined with e3.
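A sketch (tables as assumed above):

```sql
-- All rows of e1 are kept; unmatched e2 columns come back as NULL
SELECT e1.name, e2.salary
FROM emp1 e1 LEFT OUTER JOIN emp2 e2 ON (e1.id = e2.id);

-- Left-associative: e1 and e2 are joined first, then the result with e3
SELECT e1.name, e2.salary, e3.city
FROM emp1 e1 LEFT OUTER JOIN emp2 e2 ON (e1.id = e2.id)
             LEFT OUTER JOIN emp3 e3 ON (e2.id = e3.id);
```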

Left Semi Join

 LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restriction of using LEFT SEMI JOIN is that the right-hand-side table may only be referenced in the join condition (ON clause), not in the WHERE or SELECT clauses.

This type of query

Can be written as:
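The rewrite can be sketched as follows (tables as assumed above):

```sql
-- The IN-subquery form (unsupported here):
-- SELECT e1.name FROM emp1 e1
-- WHERE e1.id IN (SELECT e2.id FROM emp2 e2);

-- Equivalent LEFT SEMI JOIN; e2 may appear only in the ON clause
SELECT e1.name
FROM emp1 e1 LEFT SEMI JOIN emp2 e2 ON (e1.id = e2.id);
```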

Map Side Join

If all but one of the tables being joined are small, the join can be performed as a map-only job; the query does not need a reducer. Every mapper reads the small table completely into memory. A restriction is that a FULL or RIGHT OUTER JOIN (e.g. e1 FULL OUTER JOIN e2) cannot be performed this way.
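A sketch of the hint (tables as assumed above, with e2 as the small table):

```sql
-- Load the small table (e2) into memory in every mapper; no reducer runs
SELECT /*+ MAPJOIN(e2) */ e1.name, e2.salary
FROM emp1 e1 JOIN emp2 e2 ON (e1.id = e2.id);
```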

Bucketed Map Join

 If the tables being joined are bucketed on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other, the buckets can be joined with each other directly.
In conf/hive-site.xml (or per session) you need to set the following parameters:
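The relevant settings, shown here as session-level SET commands:

```sql
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;  -- if the buckets are also sorted
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```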

That’s it!!!


Hiveserver1 vs Hiveserver2


HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift; therefore it is sometimes called the Thrift server, although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift. Since the introduction of HiveServer2, HiveServer has also been called HiveServer1.


 Limitations of hiveserver1:-
  • Supports remote client connections, but only one client can connect at a time.
  • No session management support.
  • No concurrency control, due to the Thrift API.
  • No authentication support.









Hiveserver2 is an improved version which solves the problems of hiveserver1: concurrency, authentication, authorization, etc.

Hiveserver2 Architecture:-

HiveServer2 implements a new Thrift-based RPC interface that can handle concurrent clients. The current release supports Kerberos, LDAP, and custom pluggable authentication. The new RPC interface also has better options for JDBC and ODBC clients, especially for metadata access.
















Like the original HiveServer1, HiveServer2 is a container for the Hive execution engine. For each client connection, it creates a new execution context that serves Hive SQL requests from the client. The new RPC interface enables the server to associate this Hive execution context with the thread serving the client’s request.

Clients for HiveServer2:-

  1. JDBC:-
  2. Beeline CLI:- Beeline is a JDBC application based on the SQLLine CLI that supports embedded and remote-client modes. The embedded mode is where the Hive runtime is part of the client process itself; there’s no server involved.
  3. ODBC


The Hive metastore service runs in its own JVM process. Clients other than Hive, like Apache Pig, connect to this service via HCatalog for metadata access. HiveServer2 supports local as well as remote metastore modes – which is useful when you have more than one service (Pig, Cloudera Impala, and so on) that needs access to metadata. This is the recommended deployment mode with HiveServer2:











Authentication support is another major feature of HiveServer2. With the original HiveServer, if you could access the host/port over the network, you could access the data – there was no authentication support to restrict access.

In contrast, HiveServer2 supports Kerberos, pass-through LDAP, and pluggable custom authentication. All client types – JDBC, ODBC, as well as the Beeline CLI – support these authentication modes. This enables a Hive deployment to integrate easily with existing authentication services.

Gateway to Secure Hadoop

Today, the Hadoop ecosystem only supports Kerberos for authentication. That means for accessing secure Hadoop, one needs to get a Kerberos ticket. However, enabling Kerberos on every client box can be a very challenging task and thus can restrict access to Hive and Hadoop.

To address that issue, HiveServer2 can authenticate clients over non-Kerberos connections (eg. LDAP) and run queries against Kerberos-secured Hadoop data. This approach allows users to securely access Hive without complex security infrastructure or limitations.

