So, my goal is simply this: run a Jupyter notebook with a PySpark kernel on macOS Monterey (on Apple Silicon) and read in/process Zstandard-compressed JSON files in Spark.
Now, Spark apparently doesn’t support reading Zstandard-encoded files itself (except, apparently, internally for things like shuffle compression), so one needs to rely on the Hadoop native libraries to do this.
I’ll attempt to summarize the steps I took to do this. First things first, I searched the interwebs to try to find someone who had done this already. Here are a couple of links that got me primed –
- https://dev.to/zejnilovic/building-hadoop-native-libraries-on-mac-in-2019-1iee
- https://javamana.com/2021/07/20210702111416442o.html
Setup
The first article, although a bit outdated, seemed the most straightforward, so I started with it. It relies heavily on Homebrew. First, I installed the dependencies
brew install wget gcc autoconf automake libtool cmake snappy gzip bzip2 zlib openssl
Then I built protobuf as it suggested (Hadoop 2.x requires protobuf 2.5.0 specifically). I didn’t want to install into /usr/local, but rather /opt/local instead
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/opt/local
make
make check
sudo make install
/opt/local/bin/protoc --version
Configuring
Now, on to Hadoop itself. First, I checked out the branch I wanted
git clone https://github.com/apache/hadoop.git hadoop-2.9.1
cd hadoop-2.9.1
git checkout -b branch-2.9.1 origin/branch-2.9.1
Before I could build and compile correctly, I had to fix one file: hadoop-common-project/hadoop-common/src/CMakeLists.txt. At around line 25, right after the cmake_minimum_required(VERSION 3.1 FATAL_ERROR) line, I added the following (it makes newer CMake honor the *_ROOT variables, like ZLIB_ROOT, that the build below relies on)
cmake_policy(SET CMP0074 NEW)
Building and installing
And now, build everything. This is the magic that worked for me
export OPENSSL_ROOT="/opt/homebrew/Cellar/openssl@1.0/1.0.2u"
env OPENSSL_ROOT_DIR="${OPENSSL_ROOT}/" ZLIB_ROOT="/opt/homebrew/Cellar/zlib/1.2.11" HADOOP_PROTOC_PATH="/opt/local/bin/protoc" mvn package -Pdist,native -DskipTests -Dtar -Drequire.openssl -Drequire.snappy -Drequire.zstd -Dopenssl.prefix="${OPENSSL_ROOT}"
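Once the build finishes, it’s worth a quick sanity check that the native libraries actually made it into the dist and were built for Apple Silicon (a check I’d suggest; the library name assumes a standard macOS build):

ls hadoop-dist/target/hadoop-2.9.1/lib/native/
file hadoop-dist/target/hadoop-2.9.1/lib/native/libhadoop.dylib   # should report an arm64 Mach-O shared library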
And then just extract the tarball somewhere
cd /app
tar xzf ${SOURCE_DIR}/hadoop-dist/target/hadoop-2.9.1.tar.gz
cd hadoop-2.9.1
Wrangling macOS
I ran into some problems running the typical hadoop checknative command. I’ll spare you the gory details of figuring this out, but it has to do with macOS “System Integrity Protection” (SIP) preventing dynamic native library lookups. There are good write-ups about it out there.
The TL;DR is that as soon as macOS executes one of its trusted executables, like /bin/sh or /usr/bin/env, it cripples anything you might have done like setting DYLD_LIBRARY_PATH to your dynamic library folders, which results in a failure to load them.
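You can actually watch this happen with a quick experiment (the value here is purely illustrative):

export DYLD_LIBRARY_PATH=/tmp/anything
/usr/bin/env | grep DYLD   # prints nothing; SIP purged the variable at exec time

Non-Apple binaries, on the other hand, inherit the variable just fine, which is exactly what the fix below exploits.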
The quick fix for a test is to use Homebrew’s version of bash
brew install bash
export PATH=/opt/homebrew/bin:$PATH
That should be enough to run hadoop checknative
export DYLD_LIBRARY_PATH=/work/app/hadoop-2.9.1/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/
bash bin/hadoop checknative
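As an aside, if any codec still comes back false, checknative’s -a flag makes it exit non-zero, which is handy when scripting this check:

bash bin/hadoop checknative -a || echo "native libraries still not loading"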
On to Spark & Jupyter
Ok, now the basic test works. And with the newfound knowledge around macOS and SIP, what needs to be done to make Spark and Jupyter load the native libraries? Let’s start with Spark.
Spark
No need to really build Spark; just download a dist without Hadoop.
curl -L -O https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-without-hadoop.tgz
tar xzf spark-3.1.3-bin-without-hadoop.tgz
Now, if you look at scripts like bin/spark-shell, you’ll notice they all start with
#!/usr/bin/env bash
Now we know that, with SIP enabled, macOS is gonna disable DYLD_LIBRARY_PATH when /usr/bin/env (and subsequently /bin/bash) are run. The strategy is going to be to prevent either of these from running. We’ve already got a solution for bash, but there is no Homebrew version of /usr/bin/env, so we’ll have to build our own.
curl -O https://raw.githubusercontent.com/coreutils/coreutils/master/src/env.c
gcc -o env env.c
mkdir /app/bin
cp env /app/bin
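Assuming the compile went through, the same experiment from earlier shows our unprotected env keeps the variable where the system one drops it:

export DYLD_LIBRARY_PATH=/tmp/anything
/app/bin/env | grep DYLD   # this time DYLD_LIBRARY_PATH shows up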
And now, the unfortunate part: I decided to tweak the Spark scripts to run my version of env. Ok for now, but I don’t really like to do this sort of thing ’cause inevitably I’ll forget about it when I upgrade a component.
cd spark-3.1.3-bin-without-hadoop/bin
sed -i .bak -e 's=/usr/bin/env=/app/bin/env=' $(grep -l /usr/bin/env *)
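A quick peek confirms the rewrite took (sed kept the originals around as .bak files):

head -1 spark-shell   # should now print: #!/app/bin/env bash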
Moment of truth: let’s see if we can start a spark-shell and have it load a Zstandard-compressed file
export PATH=/opt/homebrew/bin:$PATH
export HADOOP_HOME=/app/hadoop-2.9.1/
export SPARK_HOME=/app/spark-3.1.3-bin-without-hadoop/
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export DYLD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/
$SPARK_HOME/bin/spark-shell
Now, let’s try
scala> val df = spark.read.json("/tmp/important-json-log.zst")
df: org.apache.spark.sql.DataFrame = [aid: bigint, app: struct<buid: string> ... 16 more fields]
scala> df.count()
res1: Long = 7023
Success! So, how about Jupyter?
Jupyter
Well, it turns out, with all of what we’ve done so far, we are in good shape to use Homebrew’s version of jupyter. The startup script itself isn’t using the env trickery
#!/opt/homebrew/opt/python@3.9/bin/python3.9
So, using the same environment settings as for Spark should be enough
jupyter notebook --ip=localhost --notebook-dir=jupyter
Taadah!
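And for future me, who will have forgotten all of this by the next upgrade, here’s every setting from this post collected into one launcher script – nothing new, just the pieces from above in one place:

#!/opt/homebrew/bin/bash
# launch-jupyter.sh – all the environment plumbing from this post in one place
export PATH=/opt/homebrew/bin:$PATH
export HADOOP_HOME=/app/hadoop-2.9.1
export SPARK_HOME=/app/spark-3.1.3-bin-without-hadoop
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export DYLD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib:/opt/homebrew/Cellar/zstd/1.5.2/lib:/opt/homebrew/Cellar/openssl@1.0/1.0.2u/lib
exec jupyter notebook --ip=localhost --notebook-dir=jupyter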