
Building Hadoop, Spark & Jupyter on macOS

So, my goal is simply this: run a Jupyter notebook with a pyspark kernel on macOS Monterey (on Apple Silicon) and read in/process Zstandard-compressed JSON files in Spark.

Now, Spark apparently doesn’t support reading Zstandard-encoded files itself (except, apparently, for its own internal compression), so one needs to rely on Hadoop’s native libraries to do this.

I’ll attempt to summarize the steps I took to do this. First things first, I searched the interwebs to try to find someone who had done this already. Here are a couple of links that got me primed –

Setup

The first article, although a bit outdated, seemed the most straightforward, so I started with it. It relies heavily on Homebrew. First, I installed the dependencies:

brew install wget gcc autoconf automake libtool cmake snappy gzip bzip2 zlib openssl

Then I built protobuf as it suggested. I didn’t want to install into /usr/local, but rather into /opt/local:

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0

./configure --prefix=/opt/local
make
make check
sudo make install

/opt/local/bin/protoc --version

Configuring

Now, on to Hadoop itself. First, I checked out the branch I wanted:

git clone https://github.com/apache/hadoop.git hadoop-2.9.1
cd hadoop-2.9.1
git checkout -b branch-2.9.1 origin/branch-2.9.1

Before I could build and compile correctly, I had to fix one file: hadoop-common-project/hadoop-common/src/CMakeLists.txt. At around line 25, I added the following:

# add this line right after cmake_minimum_required(VERSION 3.1 FATAL_ERROR):
cmake_policy(SET CMP0074 NEW)

Building and installing

And now, build everything. This is the magic that worked for me

export OPENSSL_ROOT="/opt/homebrew/Cellar/openssl@1.0/1.0.2u"
env OPENSSL_ROOT_DIR="${OPENSSL_ROOT}/" ZLIB_ROOT="/opt/homebrew/Cellar/zlib/1.2.11" HADOOP_PROTOC_PATH="/opt/local/bin/protoc" mvn package -Pdist,native -DskipTests -Dtar -Drequire.openssl -Drequire.snappy -Drequire.zstd -Dopenssl.prefix="${OPENSSL_ROOT}"

And then just extract the tarball somewhere:

# SOURCE_DIR is the Hadoop source checkout built above
cd /app
tar xzf ${SOURCE_DIR}/hadoop-dist/target/hadoop-2.9.1.tar.gz
cd hadoop-2.9.1

Wrangling macOS

I ran into some problems running the typical hadoop checknative command. I’ll spare you the gory details of figuring this out, but it has to do with macOS “System Integrity Protection” preventing dynamic native library lookups. This is a good article about it

The TL;DR is that as soon as macOS executes one of its trusted executables, like /bin/sh or /usr/bin/env, it strips out anything you might have done like setting DYLD_LIBRARY_PATH to point at your dynamic library folders, which results in a failure to load them.
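
You can see this from any terminal (a minimal illustration, assuming SIP is enabled; the library path is just an example):

export DYLD_LIBRARY_PATH=/opt/homebrew/lib

# the variable exists inside the current shell...
echo "current shell: [$DYLD_LIBRARY_PATH]"

# ...but a trip through /usr/bin/env (a protected platform binary) purges
# all DYLD_* variables, so the child process never sees it
/usr/bin/env bash -c 'echo "via /usr/bin/env: [$DYLD_LIBRARY_PATH]"'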

The quick fix for a test is to use Homebrew’s version of bash

brew install bash
export PATH=/opt/homebrew/bin:$PATH

That should be enough to run hadoop checknative

export DYLD_LIBRARY_PATH=/work/app/hadoop-2.9.1/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/
bash bin/hadoop checknative

On to Spark & Jupyter

OK, now the basic test works. With our newfound knowledge about macOS and SIP, what needs to be done to make Spark and Jupyter load the native libraries? Let’s start with Spark.

Spark

No need to actually build Spark; just download a dist built without Hadoop:

curl -L -O https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-without-hadoop.tgz
tar xzf spark-3.1.3-bin-without-hadoop.tgz

Now, if you look at scripts like bin/spark-shell, you’ll notice they all start with

#!/usr/bin/env bash

Now we know that with SIP enabled, macOS is going to strip DYLD_LIBRARY_PATH when /usr/bin/env (and subsequently /bin/bash) is run. So the strategy is to prevent either of these from running. We already have a solution for bash, but there is no Homebrew version of /usr/bin/env, so we’ll have to build our own.

curl -O https://raw.githubusercontent.com/coreutils/coreutils/master/src/env.c
# note: building coreutils' env.c standalone may need extra gnulib headers;
# any minimal env replacement that lives outside /usr/bin will do the job
gcc -o env env.c
mkdir /app/bin
cp env /app/bin

And now, the unfortunate part: I decided to tweak the Spark scripts to run my version of env. It’s OK for now, but I don’t really like doing this sort of thing because inevitably I’ll forget about it when I upgrade a component.

cd spark-3.1.3-bin-without-hadoop/bin
sed -i .bak -e 's=/usr/bin/env=/app/bin/env=' $(grep -l /usr/bin/env *)
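
A quick sanity check that the rewrite took (the shebang should now point at the replacement env):

head -1 spark-shell
# should print: #!/app/bin/env bash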

Moment of truth: let’s see if we can start a spark-shell and have it load a Zstandard-compressed file.

export PATH=/opt/homebrew/bin:$PATH

export HADOOP_HOME=/app/hadoop-2.9.1/
export SPARK_HOME=/app/spark-3.1.3-bin-without-hadoop/
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)

export DYLD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/

$SPARK_HOME/bin/spark-shell

Now, let’s try:

scala> val df = spark.read.json("/tmp/important-json-log.zst")
df: org.apache.spark.sql.DataFrame = [aid: bigint, app: struct<buid: string> ... 16 more fields]

scala> df.count()
res1: Long = 7023

Success! So, how about Jupyter?

Jupyter

Well, it turns out that, with all of what we’ve done so far, we are in good shape to use Homebrew’s version of Jupyter. The startup script itself doesn’t use the env trickery:

#!/opt/homebrew/opt/python@3.9/bin/python3.9

So, using the same environment settings as for Spark should be enough:

jupyter notebook --ip=localhost --notebook-dir=jupyter
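
In a notebook cell, the same kind of check as the spark-shell test should work. Here’s a minimal sketch, assuming the kernel can import pyspark (for example because it was installed with pip) and reusing the example file from above:

from pyspark.sql import SparkSession

# a pyspark kernel may already provide `spark`; otherwise build a session
spark = SparkSession.builder.appName("zstd-test").getOrCreate()

df = spark.read.json("/tmp/important-json-log.zst")
df.count()  # should match the spark-shell result (7023)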

Taadah!

