Instance member variables are overwritten (hidden), not overridden, in Java

Member variables are not overridden; they are overwritten, i.e. the subclass field hides the superclass field of the same name.

  1. For a superclass member variable to be accessible in a subclass, it needs to be declared protected or public (or package-private if both classes are in the same package).
  2. An instance member variable of a subclass hides the variable of the same name in the superclass, e.g. parent.variable resolves to the parent's field and child.variable resolves to the child's field, because fields are bound by the reference type. See the example given below.

    Example:
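The original example is not reproduced here, so the following is a minimal sketch (the Parent/Child class and field names are made up) showing that the field picked is decided by the reference type, not the runtime type:

class Parent {
    protected String name = "parent";
}

class Child extends Parent {
    protected String name = "child";   // hides Parent.name, does not override it
}

public class FieldHidingDemo {
    public static void main(String[] args) {
        Child child = new Child();
        Parent asParent = child;              // same object, viewed through a Parent reference

        System.out.println(child.name);       // prints "child"  -> resolved against Child
        System.out.println(asParent.name);    // prints "parent" -> fields are bound by reference type
    }
}

Methods, by contrast, are overridden and dispatched on the runtime type of the object.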

     

Serialization in Java (interview questions)

Question 1: What is Serialization in Java?

Object serialization in Java is the process of converting an object into a binary format that can be persisted to disk or sent over the network to another running Java Virtual Machine; the reverse process of re-creating the object from the binary stream is called deserialization. Java provides a serialization API for serializing and deserializing objects, which includes java.io.Serializable, java.io.Externalizable, ObjectInputStream and ObjectOutputStream. Java programmers are free to use the default serialization mechanism, which Java derives from the structure of the class, but they are also free to use their own custom binary format. Using a custom format is often advised as a serialization best practice, because the serialized binary format becomes part of the class's exported API and the default format can potentially break the encapsulation provided by private and package-private fields. This pretty much answers the question of what serialization in Java is.
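As a minimal illustration of the API mentioned above (the Employee class and the employee.ser file name are made up for this sketch):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class used only to demonstrate default serialization.
class Employee implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;
    private int age;

    Employee(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

public class SerializeDemo {
    public static void main(String[] args) throws IOException {
        Employee emp = new Employee("John", 30);
        // writeObject() converts the object graph into a binary stream and writes it to disk.
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("employee.ser"))) {
            oos.writeObject(emp);
        }
    }
}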

Question 2: How to make a Java class Serializable?

Making a class Serializable in Java is very easy: your class just needs to implement the java.io.Serializable interface and the JVM will take care of serializing its objects in the default format. The decision to make a class Serializable should be taken consciously, because although the near-term cost is low, the long-term cost is substantial and it can potentially limit your ability to modify and change the implementation later. Like any public API, the serialized form of an object becomes part of the public API, and changing the structure of the class, by implementing an additional interface or by adding or removing a field, can potentially break the default serialization. This risk can be reduced by using a custom binary format, but ensuring backward compatibility still requires a lot of effort. One example of how serialization constrains your ability to change a class is serialVersionUID. If you don't explicitly declare serialVersionUID, the JVM generates it based on the structure of the class, which depends on the interfaces the class implements and several other factors that are subject to change. Suppose you implement another interface: the JVM will then generate a different serialVersionUID for the new version of the class file, and when you try to load an object serialized by the old version of your program you will get an InvalidClassException.

 

Question 3: What is the difference between the Serializable and Externalizable interfaces in Java?

Answer: This is the most frequently asked question in Java serialization interviews. Here is my version: the Externalizable interface provides the writeExternal() and readExternal() methods, which give us the flexibility to control the serialization mechanism ourselves instead of relying on Java's default serialization. A correct implementation of Externalizable can improve application performance drastically.

SERIALIZABLE vs EXTERNALIZABLE

Methods
  Serializable: It is a marker interface; it does not have any methods.
  Externalizable: It is not a marker interface; it has the methods writeExternal() and readExternal().

Default serialization process
  Serializable: Yes, Serializable gives you a default serialization process; we just need to implement the Serializable interface.
  Externalizable: No, we need to implement writeExternal() and readExternal() for the serialization process to happen.

Customizing the serialization process
  Serializable: We can customize the default serialization process by defining readObject() and writeObject() methods in our class. (Note: we are not overriding these methods, we are defining them in our class.)
  Externalizable: The serialization process is completely customized; we need to implement the Externalizable interface's writeExternal() and readExternal() methods.

Control over serialization
  Serializable: It provides less control over serialization, as it is not mandatory to define the readObject() and writeObject() methods.
  Externalizable: It provides great control over the serialization process, as it is mandatory to implement the writeExternal() and readExternal() methods.

Constructor call during deserialization
  Serializable: The constructor is not called during deserialization.
  Externalizable: The public no-arg constructor is called during deserialization.
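For comparison, here is a minimal sketch of a class implementing Externalizable (the Point class is hypothetical); note the mandatory public no-arg constructor:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical class showing the Externalizable contract.
public class Point implements Externalizable {
    private int x;
    private int y;

    // Externalizable requires a public no-arg constructor;
    // it is invoked during deserialization before readExternal() is called.
    public Point() {
    }

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // We decide exactly what goes into the stream.
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        // Fields must be read back in the same order they were written.
        x = in.readInt();
        y = in.readInt();
    }
}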

Question 4: How many methods does Serializable have? If there are no methods, then what is the purpose of the Serializable interface?

Answer: The Serializable interface exists in the java.io package and forms the core of the Java serialization mechanism. It does not have any methods, so it is also called a marker interface in Java. When your class implements java.io.Serializable it becomes serializable, which signals to the JVM's serialization machinery that the Java serialization mechanism may be used for objects of this class.

Question 5: What is serialVersionUID? What will happen if I do not define it in my class?

One of my favorite interview questions on Java serialization. serialVersionUID is a version identifier that is written into the serialized stream; if you do not declare it, the JVM computes one from the structure of the class (you can use the JDK's serialver tool to see the value for a class). serialVersionUID is used for version control of serialized objects, and you can specify it explicitly in your class. The consequence of not specifying serialVersionUID is that when you add or modify any field in the class, objects that were already serialized can no longer be recovered, because the serialVersionUID generated for the new class and the one stored with the old serialized object will differ. The Java serialization process relies on a matching serialVersionUID to recover the state of a serialized object and throws java.io.InvalidClassException in case of a serialVersionUID mismatch.
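A minimal sketch of declaring the field explicitly (the Invoice class is hypothetical):

import java.io.Serializable;

class Invoice implements Serializable {
    // Explicitly declared, so the JVM does not derive it from the class structure.
    // Old serialized Invoice objects stay readable even after fields are added,
    // as long as this value is kept the same.
    private static final long serialVersionUID = 1L;

    private String number;
    private double amount;
}

The JDK's serialver tool can print the value the JVM would otherwise generate for a class (serialver Invoice), which is useful when you need to declare the UID for a class whose instances have already been serialized.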

Question 6: While serializing, you want some of the members not to be serialized. How do you achieve it?

Another frequently asked serialization interview question. This is sometimes also asked as "what is the use of a transient variable?" or "do transient and static variables get serialized?". If you don't want any field to be part of the object's state, declare it either static or transient based on your need, and it will not be included in the Java serialization process.
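A minimal sketch (the Session class is hypothetical); after deserialization authToken will be null and activeSessionCount simply keeps whatever value the class currently holds:

import java.io.Serializable;

// Only 'username' is part of the serialized state.
class Session implements Serializable {
    private static final long serialVersionUID = 1L;

    private String username;                  // serialized
    private transient String authToken;       // skipped: marked transient
    private static int activeSessionCount;    // skipped: static, belongs to the class, not the instance
}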

Question 7: What will happen if one of the members in the class doesn’t implement Serializable interface?

Answer: One of the easier questions about the serialization process in Java. If you try to serialize an object of a class which implements Serializable, but the object holds a reference to a non-Serializable class, then a NotSerializableException will be thrown at runtime.

Question 8: If a class is Serializable but its superclass is not, what will be the state of the instance variables inherited from the superclass after deserialization?

Answer: When we deserialize the object:

If the superclass has implemented Serializable – its constructor is not called during the deserialization process.

If the superclass has not implemented Serializable – its no-arg constructor is called during the deserialization process.

The Java serialization process only walks up the object hierarchy as long as the classes implement the Serializable interface; the values of the instance variables inherited from a non-serializable superclass are initialized by calling the no-arg constructor of that superclass during deserialization, not restored from the stream. Once that constructor chaining has started it cannot be stopped, so even if classes higher up in the hierarchy implement the Serializable interface, their constructors will be executed. As you can see, this serialization interview question looks very tricky and tough, but if you are familiar with the key concepts it is not that difficult.

You can try writing a program for both cases, one where the superclass is serializable and one where it is not; a sketch of the second case is given below.
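A small sketch of the non-serializable-superclass case (class names are made up); running it prints "Vehicle() called" twice, once at construction and once during deserialization, and the inherited field comes back with the constructor's value rather than the serialized one:

import java.io.*;

class Vehicle {                                   // does NOT implement Serializable
    int wheels;
    Vehicle() {
        wheels = 4;
        System.out.println("Vehicle() called");   // runs again during deserialization
    }
}

class Car extends Vehicle implements Serializable {
    private static final long serialVersionUID = 1L;
    String model;
    Car(String model) {
        this.model = model;
        this.wheels = 6;                           // this value is NOT restored; Vehicle() resets it to 4
    }
}

public class SuperClassDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(new Car("sedan"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Car car = (Car) ois.readObject();
            System.out.println(car.model + " " + car.wheels);   // prints "sedan 4"
        }
    }
}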

Question 9: Can you customize the serialization process, i.e. can you override the default serialization process in Java?
Answer: Yes, you can. We all know that to serialize an object ObjectOutputStream.writeObject(saveThisObject) is invoked and to read an object back ObjectInputStream.readObject() is invoked, but there is one more thing the Java Virtual Machine provides: you can define these two methods in your class. If you define these two methods, the JVM will invoke them instead of applying the default serialization mechanism, and you can customize the behavior of serialization and deserialization there by doing any kind of pre- or post-processing. An important point is to make these methods private, to avoid them being inherited, overridden or overloaded. Since only the Java Virtual Machine calls these private methods, the integrity of your class is preserved and serialization works as normal. In my opinion this is one of the best questions one can ask in a Java serialization interview; a good follow-up question is: why should you provide a custom serialized form for your object?
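A minimal sketch of the two private hook methods (the Customer class and the extra timestamp are made up to show pre/post-processing):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class demonstrating the private hook methods the JVM looks for.
class Customer implements Serializable {
    private static final long serialVersionUID = 1L;
    private String name;

    // Called by ObjectOutputStream instead of the default mechanism.
    private void writeObject(ObjectOutputStream oos) throws IOException {
        oos.defaultWriteObject();                    // still write the normal fields
        oos.writeLong(System.currentTimeMillis());   // extra custom data (pre-processing example)
    }

    // Called by ObjectInputStream instead of the default mechanism.
    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        ois.defaultReadObject();                     // restore the normal fields
        long serializedAt = ois.readLong();          // read the extra data back in the same order
        System.out.println("object was serialized at " + serializedAt);
    }
}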

Question 10: Suppose the superclass of a new class implements the Serializable interface; how can you prevent the new class from being serialized?
Answer: Using custom serialization you can define a writeObject() method in the new class and throw a NotSerializableException from it.
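A minimal sketch (Report and ReportView are hypothetical class names):

import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical serializable parent.
class Report implements Serializable {
    private static final long serialVersionUID = 1L;
    String title;
}

// The subclass inherits Serializable but explicitly blocks both directions.
class ReportView extends Report {
    private void writeObject(ObjectOutputStream oos) throws IOException {
        throw new NotSerializableException("ReportView must not be serialized");
    }

    private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        throw new NotSerializableException("ReportView must not be deserialized");
    }
}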

Question 11: Which methods are used during Serialization and DeSerialization process in Java?
Answer: Java serialization is done by the java.io.ObjectOutputStream class. That class is a filter stream which is wrapped around a lower-level byte stream to handle the serialization mechanism. To store any object via the serialization mechanism we call ObjectOutputStream.writeObject(saveThisObject), and to deserialize that object we call ObjectInputStream.readObject(). The call to writeObject() triggers the serialization process. One important thing to note about readObject() is that it reads bytes from the persisted stream, creates an object from those bytes, and returns it typed as Object, which needs to be cast to the correct type.
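Continuing the hypothetical Employee/employee.ser example from Question 1, deserialization and the required cast look like this:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class DeserializeDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Reads the bytes written earlier (file name taken from the earlier sketch).
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream("employee.ser"))) {
            // readObject() returns Object, so a cast to the expected type is required.
            Employee emp = (Employee) ois.readObject();
            System.out.println("Deserialized employee object: " + emp);
        }
    }
}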

Question 12: Suppose you have a class which you serialized and stored in persistent storage, and you later modified that class to add a new field. What will happen if you deserialize the object that was already serialized?
Answer: It depends on whether you have defined a static final serialVersionUID in the class. If it is not defined, the JVM generates one from the structure of the class, so after adding the new field the generated value changes and deserializing the old object will throw an InvalidClassException. If serialVersionUID is defined explicitly (and left unchanged), there will be no issue: the new field simply gets its default value after deserialization.

Question 13: Why are static member variables not part of the Java serialization process (Important)?

Answer: Serialization applies to instance variables, whether they hold objects or primitives. Since static variables are class-level variables, they do not exist at the instance level and are therefore not part of the serialized state of an object.

Question 14: What will happen if one of the members of the class does not implement the Serializable interface (Important)?

Answer: A NotSerializableException will be thrown when serialization is attempted.

Question 15: What will happen if we have used List, Set and Map as members of the class?

Answer: The standard collection implementations (e.g. ArrayList, HashSet, HashMap) implement Serializable, so serialization will work fine, provided the elements stored in them are serializable as well.

Question 16: Is the constructor of the class called during the deserialization process?

Answer: It depends on whether our class implements Serializable or Externalizable.

If Serializable has been implemented – the constructor is not called during deserialization. But if Externalizable has been implemented – the public no-arg constructor is called during deserialization.

Question 17: Is the constructor of the superclass called during the deserialization process of a subclass (Important)?

Answer: It depends on whether the superclass has implemented Serializable or not.

If the superclass has implemented Serializable – its constructor is not called during the deserialization process.
If the superclass has not implemented Serializable – its no-arg constructor is called during the deserialization process.

Question 18: How can you avoid the deserialization process creating another instance of a singleton class (Important)?

Answer:

We can simply define a readResolve() method to return the existing instance of the class, rather than creating a new one.


Defining the readResolve() method ensures that we don't break the singleton pattern during the deserialization process.
 
 private Object readResolve() throws ObjectStreamException {
     // Return the existing singleton so the freshly deserialized copy is discarded.
     return instance;
 }


We can also define a readObject() method:

 private void readObject(ObjectInputStream ois) throws IOException, ClassNotFoundException {
     ois.defaultReadObject();
     // Ensure the deserialized copy never replaces an already created instance.
     synchronized (SingletonClass.class) {
         if (instance == null) {
             instance = this;
         }
     }
 }
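Putting the pieces together, a minimal sketch of a serializable singleton might look like this (the class name SingletonClass matches the snippets above; everything else is assumed):

import java.io.ObjectStreamException;
import java.io.Serializable;

public class SingletonClass implements Serializable {

    private static final long serialVersionUID = 1L;

    // Eagerly created single instance.
    private static SingletonClass instance = new SingletonClass();

    private SingletonClass() {
    }

    public static SingletonClass getInstance() {
        return instance;
    }

    // Called by the serialization machinery after the object has been read;
    // the returned object replaces the freshly deserialized copy.
    private Object readResolve() throws ObjectStreamException {
        return instance;
    }
}

With readResolve() in place, deserializing this class always yields the same instance returned by getInstance().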

Sort By vs Order By vs Cluster By vs Distribute By in Hive

In Hive we have different clauses like SORT BY, ORDER BY, CLUSTER BY and DISTRIBUTE BY, and it can be confusing to differentiate among them. Below are the differences among them, along with examples:

Sort By:-

Hive uses SORT BY to sort the data per reducer, based on the data type of the column used for sorting; the overall ordering of the output is therefore not maintained. e.g. if the column is of a numeric type, the data will be sorted in numeric order within each reducer.

For Example:

Here, key and value are both numeric. Suppose we have two reducers with the output given below.

Reducer 1

Reducer 2

Here, we can clearly see that the overall output is not in sorted order.

ORDER BY:-

Very similar to ORDER BY in SQL: the overall ordering is maintained, and the output is produced by a single reducer. Hence, it is advisable to use a LIMIT clause so that the single reducer is not overloaded.

For Example:

Output:

DISTRIBUTE BY:-

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we distribute by x the following 5 rows to 2 reducers:

Input:

Output:

Reducer1

 

Reducer2:

NOTE: Distribute By does not guarantee any ordering within a reducer; rows are assigned to reducers by hashing the Distribute By columns, which only guarantees that rows with the same key end up in the same reducer.

CLUSTER BY:-

Cluster By is a short-cut for both Distribute By and Sort By.

Ordering: rows are sorted within each reducer; since rows are distributed by the hash of the clustering columns, a global ordering across reducers is not guaranteed.

Outcome: N or more sorted files.

For Example:

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different.

Example:

 


 

 

Accessing a file using an HDFS URL

Recently I came across a scenario where I needed to access the HDFS file system using an HDFS URL, i.e. the full hdfs://<namenode-host>:<port>/<path> form. The following is the command to access any file path.

Here, master is the hostname of the namenode and 54310 is its port.

NOTE: I was using the plain vanilla Apache Hadoop distribution here. You need to use the port appropriate to your distribution, e.g. Cloudera or Hortonworks.
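The exact CLI command is not reproduced above; as an equivalent illustration, here is a minimal sketch that reads a file through the same hdfs://master:54310 URL using the Hadoop FileSystem Java API (the file path is made up):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUrlRead {
    public static void main(String[] args) throws Exception {
        // Full HDFS URL: scheme, namenode host, namenode port, then the file path.
        String url = "hdfs://master:54310/user/anuj/sample.txt";   // path is hypothetical

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create(url), conf);
             FSDataInputStream in = fs.open(new Path(url))) {
            IOUtils.copyBytes(in, System.out, 4096, false);        // print the file contents to stdout
        }
    }
}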

 

UDF vs UDTF vs UDAF in Hive

UDF vs UDTF vs UDAF

Hive has many built-in functions, but if we want to extend the functionality of Hive we can write a UDF, UDTF or UDAF.

UDF:- Please use the link given below to see how a UDF works in Hive.
UDF
UDTF:- Please use the link given below to see how a UDTF works in Hive.
UDTF
UDAF:- Please use the link given below to see how a UDAF works in Hive.
UDAF

Thats it!!!

UDAF in Hive

UDAF(User Defined Aggregation Function):-

Aggregate functions perform a calculation on a set of values and return a single value. Values are aggregated in chunks (potentially across many tasks), so the implementation has to be capable of combining partial aggregations into a final result.

E.g. To find the maximum number in a table.

Steps:

  1. First create a Max class that extends UDAF, and inside it create a static inner class MaxIntUDAFEvaluator that implements the UDAFEvaluator interface (a sketch of this class is given after these steps).

     
  2. Create the table as given below:
  3. Insert some data using a .txt file containing numbers.
  4. Add the JAR file to Hive with its full path from the Hive CLI/Beeline, or add it to .bashrc.
  5. Create a temporary function as shown below:
  6. Use a SELECT statement with the function to find the maximum.
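A minimal sketch of the Max class from step 1, written against the classic UDAF/UDAFEvaluator API (the exact method bodies are assumptions, not the original code):

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

// Hypothetical UDAF that returns the maximum of an int column.
public class Max extends UDAF {

    public static class MaxIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable result;

        // Called by Hive to reset the evaluator before a new aggregation.
        public void init() {
            result = null;
        }

        // Called once per input row with the column value.
        public boolean iterate(IntWritable value) {
            if (value == null) {
                return true;
            }
            if (result == null) {
                result = new IntWritable(value.get());
            } else {
                result.set(Math.max(result.get(), value.get()));
            }
            return true;
        }

        // Partial result sent from map-side aggregation to the reducers.
        public IntWritable terminatePartial() {
            return result;
        }

        // Combines a partial aggregation with the current state.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Final result of the aggregation.
        public IntWritable terminate() {
            return result;
        }
    }
}

After adding the JAR and registering a temporary function for this class, it can be called from a SELECT like any built-in aggregate.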

     

Thats it!!!

UDTF in Hive

UDTF (User Defined Tabular Function):-

A user defined tabular function takes one row as input and can return multiple rows as output, so the relation here is one-to-many.

e.g. Hive's built-in EXPLODE() function. If an array column USER_IDS contains the values [10, 12, 5, 45], then SELECT EXPLODE(USER_IDS) AS ID FROM T_USER will return 10, 12, 5 and 45 as four different rows in the output. A UDTF can also be used to split one column into multiple columns, which is what we do in the example below. Note that the alias ("AS" clause) is mandatory here.

Problem Statement:- Expand the name column of the emp table and return it as two separate columns, first_name and last_name.

Solution:-

Here are the steps that need to be followed in order to solve this problem.

  1. Extend the base class GenericUDTF and write the business logic in Java.
  2. Override the 3 methods initialize(), process() and close() in our class ExpandNameDetails (a sketch is given after these steps).
  3. Add the JAR to the classpath: add the exported JAR file to the Hive classpath using the following command from the Hive terminal: ADD JAR /home/anuj/HIVE/HIVE-UDTF-split.jar. Alternatively, you can add the exported JAR in your .bashrc (edit it with "nano ~/.bashrc") via HIVE_AUX_JARS_PATH=/home/anuj/HIVE/HIVE-UDTF-split.jar, which avoids adding the JAR to the classpath each time you start a Hive session, as it will be loaded by the framework itself.
  4. Create a temporary function.
  5. Try executing the function on the emp table's name field.
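A minimal sketch of the ExpandNameDetails class from step 2 (the exact splitting logic and the column names first_name/last_name are assumptions, not the original code):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

// Hypothetical UDTF that splits a full name into first_name and last_name.
public class ExpandNameDetails extends GenericUDTF {

    private StringObjectInspector nameOI;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("ExpandNameDetails takes exactly one string argument");
        }
        nameOI = (StringObjectInspector) args[0];

        // Declare the two output columns and their types.
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("first_name");
        fieldNames.add("last_name");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] record) throws HiveException {
        String fullName = nameOI.getPrimitiveJavaObject(record[0]);
        if (fullName == null) {
            return;
        }
        String[] parts = fullName.trim().split("\\s+", 2);
        // forward() emits one output row; it may be called several times per input row.
        forward(new Object[] { parts[0], parts.length > 1 ? parts[1] : null });
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up in this sketch
    }
}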

    Thats it!!!

 

UDF in Hive

Regular UDF:- Hive provides a number of built-in functions, but if we want to extend the functionality of Hive we can write a UDF (User Defined Function). These functions are written as a Java program, and a JAR needs to be created from it. Let us discuss this using an example. The following steps need to be followed to create a UDF in Hive.

Steps:

  1. Create a class extending the UDF class (a sketch is given at the end of this section).
  2. Export a JAR from the Java code to a directory on the system.
  3. Add the JAR file in the Hive CLI or Beeline terminal: add the exported JAR file to the Hive classpath using the following command from the Hive terminal: ADD JAR /home/anuj/HIVE/HIVE-UDF-trim.jar. Alternatively, you can add the exported JAR in your .bashrc (edit it with "nano ~/.bashrc") via HIVE_AUX_JARS_PATH=/home/anuj/HIVE/HIVE-UDF-trim.jar, which avoids adding the JAR to the classpath each time you start a Hive session, as it will be loaded by the framework itself.
  4. Add a temporary function: run the CREATE TEMPORARY FUNCTION command to register the function with Hive.
  5. Use the function in Hive queries.


     

This style of UDF is for primitive-type arguments only, e.g. Text, IntWritable, LongWritable, etc. For complex types like struct, array and map, a different kind of UDF (a GenericUDF) needs to be written.
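As an illustration of such a primitive-type UDF, here is a minimal sketch of a trim function like the one the HIVE-UDF-trim.jar above would contain (the class name and exact behaviour are assumptions):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical trim UDF written against the classic simple UDF API.
public class TrimUDF extends UDF {

    // Hive finds evaluate() by reflection; it is called once per row.
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim());
    }
}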

 

Hive performance improvements

The following are the most commonly used ways to improve Hive performance:

  1. Execution engine
  2. Using custom file formats
  3. Using vectorization
  4. Bucketing & partitioning
  5. Tweaking the number of mappers and their memory
  6. Parallel execution

 

  1. Execution Engine:- 

    USE TEZ

    Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. I won't go into details about the many benefits of using Tez; enable it by setting hive.execution.engine to tez at the beginning of your Hive session:

     

  2. Using custom file formats:- Use the ORC file format for faster query performance in Hive; it has a really fast response time. As an example, consider two large tables emp1 and emp2 (stored as text files, with some columns not all specified here), and a simple join query like:

     

    This query may take a long time to execute since tables emp1 and emp2 are both stored as TEXT. Converting these tables to ORCFile format will usually reduce query time significantly:

     

    ORC supports compressed storage (with ZLIB or as shown above with SNAPPY) but also uncompressed storage.

    Converting base tables to ORC is often the responsibility of your ingest team, and it may take them some time to change the complete ingestion process due to other priorities. The benefits of ORCFile are so tangible that I often recommend a do-it-yourself approach as demonstrated above – convert emp1 into emp1_ORC and emp2 into emp2_ORC and do the join that way, so that you benefit from faster queries immediately, with no dependencies on other teams.

  3. USE VECTORIZATION:-

    Vectorized query execution improves performance of operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.

    and is easily enabled with two parameter settings, hive.vectorized.execution.enabled and hive.vectorized.execution.reduce.enabled (both set to true):

  4. Bucketing & Partitioning:-

    Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location, which greatly helps queries that filter on the partition key(s). Although the selection of the partition key is always a sensitive decision, it should always be a low-cardinality attribute, e.g. if your data is associated with a time dimension, then date could be a good partition key. Similarly, if the data is associated with location, like a country or state, it is a good idea to have hierarchical partitions like country/state.

    Bucketing improves join performance if the bucket key and join keys are common. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key, and it also reduces the I/O scans during the join process if the join happens on the same keys. Additionally, it is important to ensure the bucketing flag is set (SET hive.enforce.bucketing=true;) every time before writing data to a bucketed table. To leverage bucketing in the join operation we should SET hive.optimize.bucketmapjoin=true; this hints to Hive to do a bucket-level join during the map-stage join, and it also reduces the scan cycles needed to find a particular key because bucketing ensures that the key is present in a known bucket.

  5. Tweaking the number of mappers and their memory:-

    The default hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. This configuration can produce fewer mappers than the number of input splits (i.e. the number of HDFS blocks) of the input table.

    Try setting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat.

    Note that Apache Tez uses org.apache.hadoop.hive.ql.io.HiveInputFormat by default.

    You can then control the maximum number of mappers by setting:

Joins in Hive

Syntax of joins in Hive:-

Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive.

For this blog we will use the following two tables to discuss joins in Hive:

Join Example:

Multiple tables can be joined in the same query:

Join implementation with Map Reduce

Hive converts joins over multiple tables into a single map/reduce job if, for every table, the same column is used in the join clauses. The query below is converted into a single map/reduce job because only the id column of e2 is involved in both join clauses.


It is very interesting to note that any number of tables can be joined in single map/reduce process as long as they fit the above criteria.

However, if the join columns are not the same for all tables, the query is converted into multiple map/reduce jobs:


In this case the first map/reduce job joins e1 with e2 and the results are then joined with e3 in the second map/reduce job.

Largest Table LAST

In the MapReduce job for regular inner joins, mappers run on both tables, emitting the records that need to be joined after evaluating any UDFs in the query and filtering out records based on the WHERE clause. Then the shuffle phase "shuffles" the records based on the join key (id in the above example). Subsequently, in the reduce phase, essentially a cross-product takes place between the records from each table that share the same join key.

In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers, whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducers for buffering the rows for a particular value of the join key by organizing the tables such that the largest table appears last in the sequence. e.g. in

all three tables are joined in a single map/reduce job, and the values for a particular value of id from tables e1 and e2 are buffered in memory in the reducers. Then, for each row retrieved from e3, the join is computed with the buffered rows.

For the query:

there are two map/reduce jobs involved in computing the join. The first of these joins e1 with e2, buffering the values of e1 while streaming the values of e2 through the reducers. The second job buffers the results of the first join while streaming the values of e3 through the reducers.

Streamtable hint

You can also specify which table should be streamed; usually it is the larger table.
e.g.

Outer Joins

LEFT
RIGHT
FULL
e.g. an example for LEFT OUTER JOIN; RIGHT and FULL work similarly.

NOTE: Joins are not commutative; they are left-associative, regardless of whether they are LEFT or RIGHT OUTER joins.

e.g.

This means the join proceeds from the left: e1 is joined with e2 first, and that result is then joined with e3.

Left Semi Join

LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since older versions of Hive do not support IN/EXISTS subqueries, you can rewrite such queries using LEFT SEMI JOIN. The restriction on using LEFT SEMI JOIN is that the right-hand-side table may only be referenced in the join condition (ON clause), not in the WHERE or SELECT clauses.

This type of query


Can be written as:

Map Side Join

If all but one of the tables being joined are small, the join can be performed as a map-only job and the query does not need a reducer. For every mapper of the large table, the small table is read completely into memory. A restriction is that a FULL or RIGHT OUTER JOIN (e.g. e1 FULL/RIGHT OUTER JOIN e2) cannot be performed as a map join.

Bucketed Map Join

If the tables being joined are bucketed on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other, the buckets can be joined with each other.
In conf/hive-site.xml you need to set the following parameters:

That’s it!!!

Thanks!!!
