with James Priebe, Henry Quach
Data scientists, engineers, and anyone interested in learning about Spark.
Time to complete: about 4 hours (per the lesson and lab timings below).
This is the second course in the Big Data University Spark curriculum, and it expands on concepts discussed in Spark Fundamentals. This course covers Spark’s architecture and goes in-depth on how data is distributed and tasks are parallelized. Students will gain a better understanding of how to optimize their data for joins, make use of Spark’s memory caching, and apply the more advanced operations available in the API.
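As a small taste of the optimization topics covered in Lessons 2–4, the sketch below illustrates co-partitioning and caching two pair RDDs before a join. It is a hypothetical example, not taken from the course labs; the toy data and names are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object JoinCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[*]"))

    // Two pair RDDs keyed by user ID (invented toy data).
    val profiles = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val events   = sc.parallelize(Seq((1, "login"), (2, "click"), (1, "logout")))

    // Partitioning both RDDs with the same partitioner means the join
    // can run without re-shuffling either side.
    val partitioner = new HashPartitioner(4)
    val profilesByKey = profiles.partitionBy(partitioner).cache() // kept in memory for reuse
    val eventsByKey   = events.partitionBy(partitioner)

    val joined = profilesByKey.join(eventsByKey)
    joined.collect().foreach(println)

    sc.stop()
  }
}
```

The course goes into when such repartitioning pays off and how caching and serialization choices affect memory use.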
This course was developed with the support of:
|MetiStream, Inc. (metistream.com), experts in Apache Spark implementations and training.|
|IBM Analytics (ibm.com/analytics), which helps you make better decisions by gleaning new insights from the volume and variety of big data.|
|Lesson 1||--||Introduction: web-based notebooks and the lab environment [15 mins]|
|Lab 1||--||Setting up and verifying the lab environment [15 mins]|
|Lesson 2||--||Architecture [30 mins]|
|Lab 2||--||Working with RDD operations (input, partitioning, parallelization, and repartitioning) [30 mins]|
|Lesson 3||--||Optimizing Transformations and Actions [30 mins]|
|Lab 3||--||Using RDD operations for optimal performance and scheduling [30 mins]|
|Lesson 4||--||Caching and Serialization [15 mins]|
|Lab 4||--||Using RDD persistence and memory tuning [15 mins]|
|Lesson 5||--||Development and Debugging [30 mins]|
|Lab 5||--||Using sbt, Eclipse, or IntelliJ for unit testing and debugging [30 mins]|
After completing this course, you should be able to:
- Describe Spark’s architecture and how data is distributed and tasks are parallelized
- Optimize transformations and actions for performance and scheduling
- Use RDD persistence and tune memory with Spark’s caching and serialization options
- Develop, unit test, and debug Spark applications using sbt, Eclipse, or IntelliJ
Before taking this course, you should have the following background:
- Completion of Spark Fundamentals, or equivalent familiarity with basic Spark concepts
The minimum passing mark for the course is 60%. The final test is worth 100% of the course mark, and you have 3 attempts to take it.