Top Hadoop Interview Questions You Should Know

Big data analytics burst onto the scene and its impact is here to stay; Hadoop followed soon after. In this article, I will share the questions, or rather the concepts, along with sample answers that you should know to ace the interview process. Once you have a head start on what is being asked, I would advise you to dig deeper into the related concepts as well.

We all know Hadoop is massively popular, but it helps to know who is actually using it today. Let me walk you through it.

With boundless data available, managing it is a major concern, and Hadoop is a great platform for old legacy data as well as new unstructured data. When it comes to industry-wise adoption, IT services are led only narrowly by the quite obvious front-runner, the computer software industry.

Big players like JP Morgan, Goldman Sachs, Facebook, Google, and Twitter use the Hadoop framework effectively. Other adopters include banks, reservation portals like Yatra and MakeMyTrip, insurance firms, and trading companies, to name a few.

The Big Data industry is expected to reach a whopping $48.6 billion by 2019, so without wasting any more time, let us begin.

  1. What is the difference between Namenode and Datanode in Hadoop?

Namenode (HDFS Master)

The Namenode regulates file access for clients. It maintains and manages the slave nodes and assigns tasks to them. The Namenode executes file system namespace operations such as opening, closing, and renaming files and directories. It should be deployed on reliable hardware.

Datanode (HDFS Slave)

There are many slaves, or Datanodes, in HDFS, and they manage the actual storage of data. These slave nodes are the worker nodes that perform the tasks and serve read and write requests from clients. Slave nodes also carry out block creation, deletion, and replication as instructed by the Namenode. Once a block is written on a Datanode, it is replicated to other Datanodes until the configured number of replicas is created. Datanodes can be deployed on commodity hardware.
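As a rough illustration of this split in responsibilities (the file path below is an assumption, not something from the article), an HDFS Java client first asks the Namenode for a file's block metadata and is then pointed at the Datanodes that actually hold each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        // Metadata (file length, block list) comes from the Namenode.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // illustrative path

        // Each BlockLocation names the Datanodes that store that block;
        // the actual bytes are read from those Datanodes, not the Namenode.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```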


2. What do you mean by metadata in HDFS? Where is it stored in Hadoop?

The HDFS Namenode stores the metadata, i.e., the number of data blocks, their replicas, and other details. This metadata is kept in memory on the master so that data can be retrieved swiftly.

It is stored in the Namenode (the HDFS master).
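A minimal sketch, assuming an existing file at an illustrative path: every field printed here is answered straight from the Namenode's in-memory metadata, without contacting any Datanode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // All of these values come from the Namenode's metadata (FSImage + edits),
        // which it keeps in memory for fast lookups.
        FileStatus st = fs.getFileStatus(new Path("/data/sample.txt")); // illustrative path
        System.out.println("Length      : " + st.getLen());
        System.out.println("Block size  : " + st.getBlockSize());
        System.out.println("Replication : " + st.getReplication());
        System.out.println("Modified at : " + st.getModificationTime());

        fs.close();
    }
}
```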

3. In what modes can Hadoop be run?

  • Standalone Mode: This is the default mode of Hadoop. It uses the local file system for input and output operations, is mainly used for debugging, and does not use HDFS. No custom configuration is required in the mapred-site.xml, core-site.xml, or hdfs-site.xml files, and it is much faster than the other modes.
  • Pseudo-Distributed Mode (Single-Node Cluster): Here you have to configure all three of the aforementioned files. All daemons run on a single node, so the master and the slave node are one and the same.
  • Fully Distributed Mode (Multi-Node Cluster): This is the mode in which Hadoop runs in production. Data is distributed across a number of nodes in the cluster, and, unlike pseudo-distributed mode, separate nodes are allotted for the master and slave daemons. (A quick way to check which mode a client is configured for is sketched after this list.)
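The sketch below assumes nothing beyond a Configuration object and whatever core-site.xml happens to be on the classpath: fs.defaultFS stays at file:/// in standalone mode and points at an hdfs:// URI in the pseudo- and fully-distributed modes (the localhost address in the comment is only an example).

```java
import org.apache.hadoop.conf.Configuration;

public class ModeCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads core-site.xml if present
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        if (defaultFs.startsWith("file:")) {
            System.out.println("Standalone mode: I/O goes to the local file system");
        } else {
            // e.g. hdfs://localhost:9000 for a pseudo-distributed setup
            System.out.println("HDFS mode (pseudo- or fully-distributed): " + defaultFs);
        }
    }
}
```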

4. What are the Blocks in HDFS Architecture? What if the file size is less than block size?

HDFS in Apache Hadoop splits huge files into small chunks known as blocks. These are the smallest units of data in the file system. We do not have any control over block placement, such as a block's location; the Namenode is the deciding authority. The default block size is 128 MB, and it can be configured as per need. All blocks of a file are of the same size except the last block, which can be the same size or smaller.

If the data size is less than the block size, then the block will only be as large as the data. Consider an example: a 129 MB file creates two blocks, one of the default 128 MB and another of just 1 MB. Note that the second block is not padded out to 128 MB, as that would be a sheer waste of space; Hadoop is smart enough to allocate only 1 MB for the remaining 1 MB of data. Sizing the final block to fit the data avoids wasting space and saves disk seek time.
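A small sketch under assumed paths and sizes: the block size is requested per file at creation time, and writing just 1 MB still produces a single partial block rather than a padded 128 MB one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/one-mb-file.bin");   // illustrative path

        // Ask for a 128 MB block size for this file; the single 1 MB write below
        // still consumes only ~1 MB of Datanode storage (one partial block).
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(p, true, 4096, (short) 1, blockSize)) {
            out.write(new byte[1024 * 1024]);        // 1 MB of data -> one (partial) block
        }

        System.out.println("Blocks used: "
                + fs.getFileBlockLocations(fs.getFileStatus(p), 0, fs.getFileStatus(p).getLen()).length);
        fs.close();
    }
}
```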

5. What is Secondary Namenode?

In HDFS, when a Namenode starts, it first reads the HDFS state from an image file, the FSImage, and then applies the edits from the edits log file. The Namenode then writes the new HDFS state to the FSImage and starts normal operation with an empty edits file. Because the FSImage and the edits are merged only at start-up, the edits file can grow very large over time, and a restart of the Namenode with such large edits takes much longer.

Secondary Namenode to the rescue.

The Secondary Namenode periodically downloads the FSImage and the edits files from the Namenode, merges them into a new FSImage, and sends the checkpoint back. This keeps the edits file small, so the next restart of the Namenode is much faster.
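For reference, a tiny sketch printing the two hdfs-site.xml settings that control how often that checkpoint happens; the default values shown are assumptions about a stock Hadoop 2.x install rather than something stated in this article.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint roughly every N seconds, or sooner once N uncheckpointed
        // transactions have accumulated in the edits log.
        System.out.println("dfs.namenode.checkpoint.period = "
                + conf.get("dfs.namenode.checkpoint.period", "3600"));
        System.out.println("dfs.namenode.checkpoint.txns   = "
                + conf.get("dfs.namenode.checkpoint.txns", "1000000"));
    }
}
```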

6. What is a heartbeat in HDFS?

A heartbeat is a signal sent periodically from a Datanode to the Namenode, and from a TaskTracker to the JobTracker. If the Namenode or the JobTracker stops receiving the signal, it concludes that there is some issue with the Datanode or the TaskTracker.
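As a rough sketch of the timing side of this, the snippet below reads the two relevant hdfs-site.xml properties; the dead-node formula in the comment is the commonly cited one and is stated here as an assumption rather than something from this article.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatTiming {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3);                        // seconds
        long recheckMillis = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000); // milliseconds

        // A Datanode that stops heartbeating is usually declared dead after roughly:
        // 2 * recheck-interval + 10 * heartbeat-interval  (~10.5 minutes with defaults)
        long deadAfterMillis = 2 * recheckMillis + 10 * heartbeatSecs * 1000;
        System.out.println("Heartbeat every " + heartbeatSecs + " s; node considered dead after ~"
                + (deadAfterMillis / 1000 / 60.0) + " minutes of silence");
    }
}
```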

7. What if a data node fails?

When a data node fails:

  • The JobTracker and the Namenode detect the failure.
  • All tasks that were running on the failed node are re-scheduled on other nodes.
  • The Namenode replicates the user's data to another node so the replication factor is restored.

8. What are active and passive Namenodes?

Hadoop 2.x has two Namenodes, namely the active Namenode and the passive Namenode.

The active Namenode works and runs in the cluster, serving all client requests.

The passive Namenode is a standby Namenode whose data is kept similar to that of the active Namenode. When the active Namenode fails, the passive Namenode takes its place in the cluster. This ensures that the cluster is never without a Namenode and, as a result, it never fails.

9. What happens if 2 clients are trying to access the same file on the HDFS?

HDFS supports exclusive writes only.

When the first client contacts the Namenode to open the file for writing, the Namenode grants a lease to that client for the creation of the file. When the second client tries to open the same file for writing, the Namenode notices that the lease is already granted to another client and rejects the second client's open request.
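A minimal sketch of that behaviour, with an assumed path: the first client obtains the lease, and the second client's attempt to create the same file while it is still open fails with an IOException (in practice, typically an AlreadyBeingCreatedException from the Namenode).

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LeaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/lease-demo.txt");            // illustrative path

        FileSystem writer1 = FileSystem.newInstance(conf);      // first client: gets the lease
        FSDataOutputStream out1 = writer1.create(file, true);

        FileSystem writer2 = FileSystem.newInstance(conf);      // second client: same file, still open
        try {
            writer2.create(file, true);
        } catch (IOException expected) {
            System.out.println("Second writer rejected: " + expected.getMessage());
        }

        out1.close();
        writer1.close();
        writer2.close();
    }
}
```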

10. State the reason why we can’t perform “aggregation” (addition) in a mapper. Why do we need the reducer for this?

“Aggregation” cannot be performed in a mapper because sorting and grouping by key do not occur in the “mapper”; they occur only on the reducer side. Moreover, a new “mapper” instance is initialized for each input split, so each mapper sees only its own portion of the data and cannot keep track of the values emitted for the same key by other mappers; any aggregate kept in a previous instance would be lost. The reducer, which receives all the values of a key grouped together after the shuffle and sort, is therefore the right place to aggregate. The word-count sketch below makes this concrete.
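Here is a standard word-count sketch (class and variable names are illustrative): the mapper only emits (word, 1) because it cannot see what other mappers emit, and the summation happens in the reducer after the shuffle and sort have grouped all values of a key together.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountPieces {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);   // no counting here: another mapper may see the same word
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();             // aggregation happens here, after shuffle and sort
            }
            ctx.write(word, new IntWritable(sum));
        }
    }
}
```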

11. What does a “MapReduce Partitioner” do?

A “MapReduce Partitioner” ensures that all the values of a single key go to the same “reducer”, which enables an even distribution of the map output over the “reducers”. It routes the “mapper” output to the “reducer” by determining which “reducer” is responsible for a particular key. (A small custom partitioner is sketched below.)
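A small illustrative partitioner, assuming Text keys and IntWritable values: because getPartition is a pure function of the key, every record with a given key ends up on the same reducer. Hadoop's default partitioner applies essentially the same hash-and-modulo idea.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In a driver, such a class would be wired in with job.setPartitionerClass(KeyHashPartitioner.class); the class name here is purely illustrative.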

You might face a few trick questions. Here are a few key points to tackle some of them.

Note: A Namenode without any data does not exist in Hadoop. If there is a Namenode, it will contain some data; otherwise, it won’t exist at all.

Note: The MapReduce programming model does not allow reducers to communicate with each other; “reducers” run in isolation.


12. What are the core methods of a Reducer?

There are 3 core methods of a reducer, namely (a skeleton showing all three follows this list):

  1. setup(): used to configure various parameters, such as the size of the input data, heap size, etc.
  2. reduce(): called once per key with all of its values; this is the heart of the reducer.
  3. cleanup(): as the name suggests, it is called only once, at the end, to clear all temporary files and resources.
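A bare-bones reducer skeleton, assuming Text keys and IntWritable values, showing where the three methods sit in the task lifecycle:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context ctx) {
        // Runs once per reduce task, before any reduce() call:
        // read configuration, open side files, size buffers, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {      // called once per key with all of its values
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context ctx) {
        // Runs once per reduce task, after the last reduce() call:
        // close resources and delete temporary files.
    }
}
```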

And now, time for a quick quiz.

Put your thinking caps on and let us begin.

Hadoop Quiz.

 

  1. Sources of structured data are typically from:
      A. RDBMS
      B. Server logs
      C. Social media website generated data
      D. Machine generated data

Answer: Option A.

  2. The explosion of DATA (big data) is mainly due to:
      A. Unstructured and semi-structured data
      B. Structured data
      C. Semi-structured data
      D. Unstructured data

Answer: Option A.

  3. What are the challenges of vertical scaling?
      A. Hardware needs to be re-designed each time
      B. Cost is high
      C. Vertical scaling easily reaches an upper limit
      D. All of the above

Answer: Option C.

  4. Hadoop is based on which of these processing models?
      A. Supercomputers
      B. MPP
      C. MapReduce
      D. All of the above

Answer: Option C.

  5. What kind of problems can best be solved by a parallel processing system?
      A. Searching type problems
      B. Sorting type problems
      C. Divide and conquer type problems
      D. All kinds of problems can be parallelized

Answer: Option C.

  6. Which of these are the essential components of a Java MapReduce program?
      A. Mapper
      B. Driver
      C. Reducer
      D. All of the above

Answer: Option D.

  7. Which of the following components of a MapReduce program can be programmed?
      A. Mapper
      B. Reducer
      C. Sorter
      D. All of the above

Answer: Option D.

  8. The sort module in MapReduce runs:
      A. After the mapper completes
      B. At the beginning of the reduce stage
      C. After the map stage completes
      D. At the end of the reduce stage

Answer: Option A.

  9. The partitioner code in MapReduce, by default:
      A. Does not run
      B. Needs to be re-programmed by the programmer
      C. Runs the hash partition algorithm
      D. Runs a key by-pass logic at the start of the reduce stage

Answer: Option C.

  10. How many reducers does a MapReduce program have?
      A. The number of reducers depends on the problem type
      B. No reducer
      C. 1 reducer
      D. 2 reducers

Answer: Option A.

  11. The value received as output from the mapper in the word count frequency problem is:
      A. The actual word
      B. 0
      C. 1
      D. Key

Answer: Option C.

  12. The value received as output from the reducer in the word count frequency problem is:
      A. The actual word
      B. The list of words
      C. The sum of values for each key

Answer: Option C.

  13. To get the correct output in the MapReduce program for the word count frequency problem, the number of reducers must be:
      A. 2
      B. 0
      C. 1
      D. 3

Answer: Option C.

  14. The sort module in MapReduce sorts data based on:
      A. Values by default
      B. Keys by default
      C. Keys and values by default
      D. The user has to specify

Answer: Option B.

  15. The number of mappers running in a MapReduce program is:
      A. Decided by the programmer
      B. Automatically decided by the MapReduce framework
      C. Equal to the number of file input splits
      D. None

Answer: Option C.

 
