Prepare for your Spark interview with this comprehensive guide featuring 30 essential questions and answers. Covering basic, intermediate, and advanced topics, these questions help freshers, candidates with 1-3 years of experience, and professionals with 3-6 years of experience master Spark concepts, practical applications, and real-world scenarios.
Basic Spark Interview Questions
1. What is Apache Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in multiple languages and an optimized engine that supports general computation graphs.[1][3]
2. What are the key features of Spark?
Key features include high processing speed through in-memory computation, fault tolerance via RDD lineage, lazy evaluation, support for multiple languages, and integration with various data sources.[1][3]
3. Which programming languages does Spark support?
Spark supports Scala, Java, Python, and R for developing applications.[1][3]
4. What is a Resilient Distributed Dataset (RDD) in Spark?
RDD is the fundamental data structure in Spark representing an immutable, distributed collection of objects that can be processed in parallel. RDDs track lineage for fault tolerance.[2][4]
5. What are the two main types of operations on RDDs?
RDD operations are divided into transformations (like map, filter) that create new RDDs and actions (like collect, count) that return results to the driver.[2]
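A minimal sketch (assuming an existing SparkContext `sc`) showing that transformations only build a plan, while actions trigger execution:

```scala
// Transformations are lazy: nothing runs on the cluster when these lines execute.
val nums    = sc.parallelize(1 to 10)
val evens   = nums.filter(_ % 2 == 0)   // transformation
val doubled = evens.map(_ * 2)          // transformation

// Actions trigger the actual computation and return results to the driver.
val total = doubled.reduce(_ + _)       // action: 4+8+12+16+20 = 60
val first = doubled.take(2)             // action: Array(4, 8)
```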
6. What is Spark Core?
Spark Core is the underlying general execution engine for Spark’s architecture. It provides in-memory computing capabilities and RDDs.[2][3]
7. Explain lazy evaluation in Spark.
Lazy evaluation means Spark transformations are not executed immediately but only when an action is called. This optimizes the execution plan.[2][3]
8. What is a Spark Driver?
The Spark Driver is the process that creates the SparkContext, defines the application logic, and coordinates with the cluster manager.[1][2]
9. How do you create an RDD in Spark?
RDDs can be created in two ways: parallelizing a collection in memory or loading data from external storage like HDFS.[2][4]
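Both creation paths in a short sketch, again assuming a SparkContext `sc`; the HDFS path is a placeholder:

```scala
// 1) Parallelize an in-memory collection (handy for tests and small data).
val fromCollection = sc.parallelize(Seq("a", "b", "c"), numSlices = 2)

// 2) Load from external storage such as HDFS; the path below is illustrative.
val fromFile = sc.textFile("hdfs:///data/input.txt")
```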
10. What is the difference between persist() and cache()?
Both store RDD in memory, but cache() uses the default storage level (MEMORY_ONLY), while persist() allows specifying different storage levels.[2][5]
Intermediate Spark Interview Questions
11. What are common transformations in Spark?
Common transformations include map(), filter(), flatMap(), union(), intersection(), and groupByKey(). These are lazy operations.[2]
12. What are common actions in Spark?
Common actions include reduce(), collect(), count(), first(), take(), and saveAsTextFile(). These trigger execution.[2]
13. Explain what a shuffle operation is in Spark.
Shuffle occurs when data needs to be redistributed across partitions, typically during operations like groupByKey() or join(). It is expensive due to disk I/O, network transfer, and serialization.[2]

14. What are accumulators in Spark?
Accumulators are variables that can only be added to, used for aggregating information across executors, such as counters. Tasks running on executors can only add to an accumulator; only the driver program can read its value.[2]
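A short sketch using a long accumulator (assuming a SparkContext `sc`) to count malformed records; the data is illustrative:

```scala
// A named long accumulator registered with the SparkContext.
val badRecords = sc.longAccumulator("badRecords")

sc.parallelize(Seq("1", "2", "x", "4")).foreach { s =>
  if (!s.forall(_.isDigit)) badRecords.add(1)  // tasks can only add
}
println(badRecords.value)  // only the driver reads the value: 1 here
```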
15. What storage levels are available for persisting RDDs?
Storage levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their serialized and replicated variants.[5]
import org.apache.spark.storage.StorageLevel
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
rdd.persist(StorageLevel.MEMORY_AND_DISK)
16. How does Spark achieve fault tolerance?
Spark achieves fault tolerance through RDD lineage, recomputing lost partitions by re-executing transformations from the original data.[3][4]
17. What is SparkContext?
SparkContext is the entry point for Spark functionality, representing the connection to a Spark cluster and used to create RDDs and broadcast variables.[2]
18. In a YARN cluster, do you need Spark on all nodes?
No. When Spark runs on top of YARN, YARN handles resource management and distributes the Spark archive to its containers, so Spark does not need to be pre-installed on every node.[1]
19. What is the role of Executor Memory in Spark?
Executor Memory is the amount of memory allocated per executor for storing and processing data. Proper tuning prevents out-of-memory errors.[1]
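Memory settings can be supplied through SparkConf (or equivalently as spark-submit flags); the values below are purely illustrative and should be tuned to the cluster and workload:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune to your cluster and workload.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")           // heap per executor
  .set("spark.executor.memoryOverhead", "512m") // off-heap overhead per executor
  .set("spark.driver.memory", "2g")
```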
20. How can you minimize data transfers in Spark?
Use broadcast variables for lookup tables, prefer reduceByKey() over groupByKey(), and control partitioning to reduce shuffling.[2]
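A hedged sketch combining both techniques (assuming a SparkContext `sc`; the lookup table and sales data are made up):

```scala
// Broadcast a small lookup table once per executor instead of once per task.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val sales = sc.parallelize(Seq(("IN", 100), ("US", 200), ("IN", 50)))

// reduceByKey combines values map-side before the shuffle,
// unlike groupByKey, which ships every record across the network.
val totals = sales.reduceByKey(_ + _)
  .map { case (code, amount) => (countryNames.value(code), amount) }
// totals contains ("India", 150) and ("United States", 200)
```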
Advanced Spark Interview Questions
21. Explain RDD lineage.
RDD lineage is the logical plan of transformations applied to create an RDD. It enables recomputation for fault tolerance.[2][4]
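The lineage of any RDD can be inspected with toDebugString; a small sketch (SparkContext `sc` assumed, file path is a placeholder):

```scala
val errors = sc.textFile("hdfs:///logs/app.log")  // placeholder path
  .filter(_.contains("ERROR"))
  .map(_.length)

// Prints the chain of parent RDDs Spark would replay
// to recompute any lost partition.
println(errors.toDebugString)
```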
22. Scenario: At Zoho, you need to process log files from multiple sources. How would you handle skewed data in Spark?
Handle skewed data by salting keys (adding random prefixes), increasing partitions, or using custom partitioners to balance load.[2]
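A sketch of the salting technique on a hypothetical skewed pair RDD: prefix keys with a random salt, aggregate, then strip the salt and aggregate again.

```scala
import scala.util.Random

// Hypothetical skewed pair RDD; in practice a few keys dominate the data.
val skewed = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))

val saltBuckets = 10
// Random prefix spreads one hot key across several partitions.
val salted = skewed.map { case (k, v) =>
  (s"${Random.nextInt(saltBuckets)}_$k", v)
}

// Stage 1: aggregate on salted keys; Stage 2: strip salt, aggregate again.
val result = salted.reduceByKey(_ + _)
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
  .reduceByKey(_ + _)
```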
23. What is checkpointing in Spark?
Checkpointing truncates RDD lineage by saving intermediate RDDs to reliable storage, reducing recovery time for long lineages.[2]
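A minimal sketch (SparkContext `sc` assumed; the checkpoint directory is a placeholder and must live on reliable storage such as HDFS):

```scala
// Checkpoint directory must be on reliable storage; path is illustrative.
sc.setCheckpointDir("hdfs:///spark/checkpoints")

val iterated = sc.parallelize(1 to 100).map(_ + 1)  // imagine a long lineage here
iterated.checkpoint()  // marks the RDD; data is written on the next action
iterated.count()       // triggers the job and materializes the checkpoint
```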
24. How do you perform memory tuning in Spark?
Tune by adjusting executor memory, cores per executor, driver memory, and using off-heap memory. Monitor garbage collection.[2]
25. Scenario: In a Paytm payment processing pipeline, data arrives continuously. How would you implement near real-time processing?
Use Spark Streaming with micro-batches or Structured Streaming for continuous processing with fault tolerance.[3]
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
words.foreachRDD(rdd => { /* process each micro-batch */ })
ssc.start(); ssc.awaitTermination()
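With Structured Streaming, a similar word count can be written declaratively (a sketch assuming a SparkSession `spark`; the socket source and console sink are illustrative choices):

```scala
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()

// Split each line into words and maintain a running count per word.
val counts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```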
26. What is the default parallelism in Spark?
The default parallelism (spark.default.parallelism) depends on the deployment: in local mode it is the number of cores on the machine, while in standalone and YARN modes it is the total number of cores across all executors, with a minimum of 2.[2]
27. Scenario: At Atlassian, large joins cause out-of-memory (OOM) errors. What optimization techniques would you apply?
Broadcast the smaller table, use bucketed joins, handle skewed keys (for example by salting), and increase executor memory together with proper partitioning.[2]
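The broadcast-join technique in DataFrame terms, as a sketch assuming a SparkSession `spark` (the tables are made up):

```scala
import org.apache.spark.sql.functions.broadcast

// A large fact table and a small dimension table, both illustrative.
val facts = spark.range(1000000).withColumnRenamed("id", "key")
val dims  = spark.createDataFrame(Seq((1L, "a"), (2L, "b"))).toDF("key", "name")

// broadcast() hints the optimizer to ship the small table to every executor,
// turning a shuffle join into a map-side broadcast hash join.
val joined = facts.join(broadcast(dims), "key")
```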
28. How does Spark handle monitoring and logging?
Spark provides a web UI for monitoring jobs, stages, and storage. Logs are available in executor and driver logs.[2]
29. What are the core components of a Spark application?
Core components include Driver Program, Cluster Manager, Executors, and Tasks distributed across worker nodes.[2]
30. Scenario: For Adobe’s analytics platform, you need to achieve high availability. How would you configure Spark?
Configure high availability using ZooKeeper for master failover in standalone mode or leverage YARN/Mesos for resource management redundancy.[2]
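A hedged sketch of the standalone-mode ZooKeeper recovery settings (typically passed to the master daemons, for example via SPARK_DAEMON_JAVA_OPTS in spark-env.sh); the ZooKeeper addresses and directory are placeholders:

```
# Illustrative standalone-mode HA settings; addresses are placeholders
spark.deploy.recoveryMode     ZOOKEEPER
spark.deploy.zookeeper.url    zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir    /spark-ha
```

With these settings, multiple masters register with ZooKeeper and a standby is elected leader if the active master fails.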