Spark Fundamentals II

with James Priebe, Henry Quach

Audience:
Data scientists, engineers, or anyone who is interested in learning about Spark.

Time to complete:
4 hours

Available in:
English

This is the second course in the Big Data University Spark curriculum and expands on the concepts introduced in Spark Fundamentals. It covers Spark’s architecture and goes in depth on how data is distributed and tasks are parallelized. Students will gain a better understanding of how to optimize their data for joins, use Spark’s memory caching, and apply the more advanced operations available in the API.

This course was developed with the support of:

MetiStream, Inc. (metistream.com), experts in Apache Spark implementations and training
IBM Analytics (ibm.com/analytics), which helps you make better decisions by gleaning new insights from the volume and variety of big data.


Course Syllabus

Lesson 1   -- Introduction to web-based notebooks and the lab environment [15 mins]
Lab 1   -- Setting up and verifying the lab environment [15 mins]
Lesson 2   -- Architecture [30 mins]
Lab 2   -- Working with RDD operations (input, partitioning, parallelization, and repartitioning) [30 mins]
Lesson 3   -- Optimizing Transformations and Actions [30 mins]
Lab 3   -- Using RDD operations for optimal performance and scheduling [30 mins]
Lesson 4   -- Caching and Serialization [15 mins]
Lab 4   -- Using RDD persistence and memory tuning [15 mins]
Lesson 5   -- Development and Debugging [30 mins]
Lab 5   -- Using sbt, Eclipse, or IntelliJ for unit testing and debugging [30 mins]

General Information

After completing this course, you should be able to:

  • Describe the Apache Spark architecture
  • Explain input, partitioning, and parallelization
  • Apply optimizations for efficiently operating on and joining multiple datasets
  • Explain how Spark instructions are translated into jobs and what causes multiple stages within a job
  • Use Spark’s memory caching efficiently for iterative processing
  • Develop, test, and debug Spark applications using sbt, Eclipse, and IntelliJ
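To give a flavor of the topics above, here is a minimal sketch combining repartitioning and memory caching. It is illustrative only, not course material: it assumes a running SparkContext `sc` from a standard Spark setup, and the input path and key format are hypothetical.

```scala
// Sketch only -- assumes a live SparkContext `sc`; the path and
// comma-separated key format are hypothetical examples.
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/events.txt")  // one partition per HDFS block
val pairs = lines.map(line => (line.split(",")(0), 1))

// Choosing the shuffle's partition count controls parallelism;
// persisting avoids recomputation across the iterative actions below.
val counts = pairs
  .reduceByKey(_ + _, numPartitions = 8)            // shuffle into 8 partitions
  .persist(StorageLevel.MEMORY_ONLY_SER)            // serialized in-memory cache

counts.count()   // first action materializes and caches the RDD
counts.take(5)   // later actions reuse the cached partitions
```

Lessons 2 through 4 cover exactly these choices — how partition counts drive task parallelism, and when serialized caching (`MEMORY_ONLY_SER`) trades CPU for memory footprint.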

Prerequisites

None

Recommended skills prior to taking this course

Before taking this course, you should have the following background:

  • Completion of the Hadoop Fundamentals course on Big Data University
  • Completion of the Spark Fundamentals course on Big Data University
  • A basic understanding of Apache Hadoop and Big Data
  • Basic Linux operating system knowledge
  • A basic understanding of the Scala, Python, or Java programming languages

Grading Scheme

The minimum passing mark for the course is 60%, and the final test is worth 100% of the course mark. You have three attempts at the test.