Prepare for your Hadoop interview with these 30 essential questions covering basic, intermediate, and advanced topics. This guide is designed for freshers, candidates with 1-3 years of experience, and professionals with 3-6 years in Hadoop ecosystems. Each question includes clear, practical answers to help you succeed.
Basic Hadoop Interview Questions (1-10)
1. What is Hadoop?
Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It handles big data by providing high availability, scalability, and fault tolerance through its core components.[3][5]
2. What are the core components of Hadoop?
The two core components are HDFS (Hadoop Distributed File System) for storage and MapReduce for processing; since Hadoop 2.x, YARN serves as a third core component for resource management. HDFS manages data distribution while MapReduce handles parallel computation.[3][5]
3. What is HDFS?
HDFS is a distributed file system that stores data across multiple machines. It provides high-throughput access and fault tolerance through data replication.[2][5]
4. What is the default block size in HDFS?
The default block size in HDFS is 128MB in Hadoop 2.x, doubled from 64MB in Hadoop 1.x to handle larger datasets efficiently.[3]
5. What is the default replication factor in HDFS?
The default replication factor is 3, meaning each block is stored on three different nodes for fault tolerance.[3]
6. How does Hadoop ensure fault tolerance?
Hadoop ensures fault tolerance through data replication in HDFS and automatic re-execution of failed tasks in MapReduce: if a node dies, its tasks are rescheduled on healthy nodes, and under-replicated blocks are re-replicated from surviving copies.[5][6]
7. What are the different modes in which Hadoop can run?
Hadoop can run in three modes: Standalone (single JVM), Pseudo-distributed (multiple processes on one machine), and Fully-distributed (across multiple machines).[5]
8. How do you copy an HDFS file to a local directory?
Use either hadoop fs -get or hadoop fs -copyToLocal to transfer files from HDFS to the local filesystem.[1]
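For example (the paths here are illustrative):

```shell
# Copy a file from HDFS to the local filesystem
hadoop fs -get /user/data/report.csv /tmp/report.csv

# Equivalent command
hadoop fs -copyToLocal /user/data/report.csv /tmp/report.csv
```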
9. What is a NameNode in HDFS?
NameNode is the master daemon that manages the HDFS namespace, tracks block locations, and directs clients to DataNodes for data access.[2][7]
10. What is a DataNode?
DataNode is the slave daemon that stores actual data blocks and handles read/write requests from clients under NameNode coordination.[2]
Intermediate Hadoop Interview Questions (11-20)
11. Explain the data replication strategy in HDFS.
By default, the first replica is written to the client's own node (or a random node if the client is outside the cluster), the second to a node on a different rack, and the third to a different node on that same second rack, balancing fault tolerance against cross-rack network traffic.[3]
12. What is MapReduce?
MapReduce is a programming model for processing large datasets in parallel. Input is divided into splits processed by Mappers, whose intermediate key-value pairs are shuffled and sorted before being aggregated by Reducers.[2][5]
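The model can be illustrated with a small single-process sketch in Python (this is not Hadoop code; it only mimics the map, shuffle/sort, and reduce stages on a word-count example):

```python
from collections import defaultdict

def mapper(line):
    # Map: emit a (word, 1) pair for every word in an input record
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce: aggregate the grouped values for one key
    return (key, sum(values))

lines = ["big data big insight", "big data"]
mapped = [kv for line in lines for kv in mapper(line)]
reduced = [reducer(k, vs) for k, vs in shuffle(mapped)]
print(reduced)  # [('big', 3), ('data', 2), ('insight', 1)]
```

In a real job, each stage runs on different machines and the shuffle moves data over the network; the logic per stage is the same.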
13. What is the role of a Combiner in MapReduce?
Combiner is a mini-reducer that runs on mapper output to reduce data transfer to reducers, improving efficiency by aggregating local key-value pairs.[6]
14. What are the differences between Hadoop 1.x and 2.x?
Hadoop 2.x introduces YARN for resource management (replacing JobTracker/TaskTracker), supports federation with multiple NameNodes, and scales to 10,000 nodes.[3]
|Criteria| Hadoop 1.x | Hadoop 2.x |
|--------|------------|------------|
|NameNode | Single | Federation |
|Block Size| 64MB | 128MB |
|Scalability| 4,000 nodes| 10,000 nodes |
15. How does a client write a file to HDFS?
Client contacts NameNode for DataNode locations, then pipelines the block directly to the first DataNode, which replicates to two others.[1]
16. What is YARN in Hadoop?
YARN (Yet Another Resource Negotiator) separates resource management from job scheduling/monitoring, using ResourceManager and NodeManager for better scalability.[3]
17. What is a Rack in Hadoop?
A rack is a group of nodes connected via a single network switch. Hadoop’s rack awareness optimizes data placement for fault tolerance and locality.[6]
18. How do you calculate Hadoop cluster size?
Estimate raw storage as (total data size × replication factor) / (1 − overhead fraction), adjusting for expected compression, then divide by the usable storage per node to get the number of DataNodes required.[1]
19. What is the role of Secondary NameNode?
Secondary NameNode periodically downloads FsImage and edits from NameNode, merges them into a new FsImage, and uploads it back for checkpointing.[7]
20. At Paytm, how would you estimate HDFS storage needs for 10TB raw data with 2x compression and replication factor 3?
Compressed data = 10TB / 2 = 5TB; with replication factor 3, raw HDFS storage = 5TB × 3 = 15TB, plus headroom for metadata and intermediate data.[1]
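The arithmetic generalizes; here is a quick sketch in plain Python (the 20% non-data overhead figure is an illustrative assumption, not a fixed rule):

```python
def hdfs_raw_storage_tb(raw_data_tb, compression_ratio, replication, overhead=0.2):
    # On-disk size after compression
    compressed = raw_data_tb / compression_ratio
    # Each block is stored `replication` times
    replicated = compressed * replication
    # Reserve headroom for temp files, logs, and OS (assumed 20%)
    return replicated / (1 - overhead)

print(round(hdfs_raw_storage_tb(10, 2, 3), 2))  # 18.75 TB with headroom; 15 TB before overhead
```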
Advanced Hadoop Interview Questions (21-30)
21. What is speculative execution in Hadoop?
Speculative execution launches duplicate (backup) copies of slow-running tasks on other nodes; whichever copy finishes first is used and the others are killed, so stragglers don't delay the job.[5]
22. Explain the shuffle and sort phase in MapReduce.
After mapping, intermediate key-value pairs are partitioned, sorted locally, shuffled across network to reducers, merged, and grouped by key.[2]
23. How many copy/write operations occur in shuffle/sort with m mappers and r reducers?
Each mapper partitions its output into r partition files (m × r writes), and each reducer fetches its partition from all m mappers (m × r copies), for roughly 2mr operations in total.[1]
24. What are Hadoop configuration files and their purposes?
core-site.xml sets cluster-wide defaults such as fs.defaultFS. hdfs-site.xml configures HDFS daemons, block size, and replication. mapred-site.xml configures the MapReduce framework. yarn-site.xml sets ResourceManager and NodeManager properties.[4]
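As an illustration, a minimal hdfs-site.xml overriding replication and block size might look like this (the property names are the standard ones; the values are examples):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <!-- 128 MB in bytes -->
    <value>134217728</value>
  </property>
</configuration>
```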
25. How is security achieved in Hadoop?
Kerberos provides authentication via TGT tickets. Hadoop uses ACLs and POSIX permissions for authorization on files and directories.[2][4]
26. For a Zoho scenario, how do you set NodeManager memory on a 64GB node?
Reserve memory for the OS and Hadoop daemons first, then allocate roughly 80-90% of the remainder to YARN containers via yarn.nodemanager.resource.memory-mb.[3]
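A back-of-the-envelope sketch in plain Python (the reservation figures are illustrative assumptions, not a vendor recommendation):

```python
def yarn_container_memory_gb(node_gb, os_reserved_gb=8, daemon_gb=2, yarn_fraction=0.9):
    # Subtract memory reserved for the OS and the Hadoop daemons,
    # then hand most of the remainder to YARN containers
    usable = node_gb - os_reserved_gb - daemon_gb
    return usable * yarn_fraction

# Candidate value for yarn.nodemanager.resource.memory-mb on a 64GB node
print(round(yarn_container_memory_gb(64), 1))  # 48.6 GB
```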
27. What is HBase’s relation to CAP theorem in Hadoop ecosystem?
In CAP terms, HBase is a CP system: it prioritizes Consistency and Partition tolerance, briefly sacrificing availability during region failover. Its strongly consistent row-level reads and writes suit real-time workloads on HDFS.[2]
28. How do you access properties file copied to distributed cache in a MapReduce job?
Add the file to the cache with Job.addCacheFile() (the older DistributedCache API is deprecated), then open it by its local path in the mapper's or reducer's setup() method.[1]
29. In an Atlassian-like setup, how to create Hive table over HBase without data movement?
Create an external Hive table with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' and an hbase.columns.mapping; Hive then queries the HBase table in place, with no data movement.[1]
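A sketch of the DDL (the table and column names are illustrative; the storage handler class is the one shipped with Hive):

```sql
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING,
  email  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,info:name,info:email"
)
TBLPROPERTIES ("hbase.table.name" = "users");
```

The hbase.columns.mapping clause binds each Hive column to the HBase row key or a column-family:qualifier pair, so Hive reads rows directly from HBase at query time.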
30. For Adobe-scale data, how does NameNode determine DataNode for writes?
NameNode uses rack awareness, load balancing, and metadata (blocks, replicas, locations) stored in memory to assign optimal DataNodes.[7]