Introduction
This post covers setting up Spark to run in a Jupyter Notebook on a computer running Windows 10.
To follow along with this post, you should already have Java (JDK 8), Spark, Python, and Jupyter Notebook installed on your machine.
If you need help with those things, here are some helpful links:
How to Install Java (JDK 8)
How to Install Spark
How to Install Anaconda (Python & Jupyter Notebook)
What is Jupyter Notebook?
An open-source web app that can run code. Code is written in cells that are individually executed. Developers can execute a specific block of code without having to execute code from the start of the script.
PySpark
Open a Jupyter Notebook and select the Python kernel
cmd> jupyter notebook
Execute Spark code!
import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path

import pyspark

# Create a SparkContext, distribute a small dataset, and collect it back
sc = pyspark.SparkContext(appName="myApp")
a = sc.parallelize(range(10))
a.collect()
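To confirm that transformations work as well, you can chain a map and a filter on the same SparkContext. This is just a minimal sketch; the variable names are my own, not part of any setup step.

# Square each number, then keep only the even squares
squares = sc.parallelize(range(10)).map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
even_squares.collect()  # returns [0, 4, 16, 36, 64]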
Note - I tried doing this using Spark 2.4 and encountered the following error: Python worker failed to connect back. I switched to Spark 2.3.2 and was able to execute Spark code with no errors.
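If you would rather stay on Spark 2.4, a workaround that is commonly suggested for this error is to point PySpark at the notebook's own Python interpreter before creating the SparkContext. I have not verified this on every setup, so treat it as a sketch:

import os
import sys

# Make the Spark workers use the same Python interpreter as the notebook
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable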
Spark - Scala
This step is only needed if you want to develop in Spark using Scala.
Install spylon-kernel using pip
Type the following in a command prompt:
cmd> pip install spylon-kernel
Create a kernel spec
This allows us to select a Scala kernel in a Jupyter Notebook.
cmd> python -m spylon_kernel install
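If the install worked, spylon_kernel should now show up when you list the installed kernel specs:
cmd> jupyter kernelspec list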
Open a Jupyter Notebook and select spylon-kernel.
Execute Spark code!
// Distribute the numbers 1 through 9 and collect them back to the driver
val a = sc.parallelize(1 to 9)
a.collect()
Note - I get a warning: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. This would be an issue if you were connecting to a Hadoop cluster with Kerberos authentication, but for our purposes we can ignore it.
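If you want to silence the warning anyway, one common approach (assuming your Spark install still uses the default log4j.properties under %SPARK_HOME%\conf) is to raise the log level for the class that emits it:

# In %SPARK_HOME%\conf\log4j.properties
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR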
Happy developing!