Then, move into the path where the docker scripts are located. After Spark starts as a background process, the Livy server is also executed in the background, and then the Jupyter Lab service starts. (3) The REST communication port of the Livy server is 8998, which is also the port of its web UI. (1) Jupyter Lab is likewise executed when the container is created, so no extra command is needed to run it.
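To confirm that Livy is really listening on port 8998, a single REST call is enough. Below is a minimal check in Python, assuming the default mapping to localhost described above and that the requests package is available; GET /sessions is part of Livy's standard REST API.

```python
import requests

# Minimal health check against the Livy REST API (port 8998 is assumed
# to be mapped to the host as described above).
resp = requests.get("http://localhost:8998/sessions")
print(resp.status_code)   # 200 means the Livy server is up and reachable
print(resp.json())        # lists the currently registered Livy sessions
```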

(2) If no errors are found in the process above, you can access the Spark web UI. (3) Copy the URL printed in the console and access the Jupyter Lab service. (1) Using the sample code in this zip file, you can test data pre-processing and clustering with Spark (an illustrative sketch follows below). (2) If, however, the result of the docker exec -it master bash -c livy-server status command is "livy server is not running", restart Livy with the ./livy-start.sh command. (1) When the containers are created, the CLI you used is occupied by the foreground Hadoop process, so open a new CLI (shell) on the host to start Jupyter Lab separately. (5) There are volumes mounted on host paths in each container for logs or data backup. The Spark binaries are installed in the spark container, which shares the jupyter-lab data through a volume mount. These shell files take some parameters from the user, then construct the cluster and execute the files needed. You can download the images with the docker pull command (https://hub.docker.com/search?q=hjben&type=image). Livy 0.8.0 (alpha version, built manually) is used. The parameters are:
- spark_version: version of Spark (3.1.1 is available now)
- (the # of) workers: the number of Spark workers (integer between 1 and 5)
- (CPU) core: the number of CPU cores of each Spark worker (integer 1 or above)
- mem (GiB): the amount of memory of each Spark worker (integer 1 or above; the unit is GiB)
- jupyter_workspace_path: host path for saving Jupyter Lab workspace data
- (the # of) slaves: the number of Hadoop slaves (integer between 1 and 5)
- spark_log_path: host path for saving Spark logs
- hdfs_path: host path for saving HDFS data
- hadoop_log_path: host path for saving Hadoop logs
- hadoop_version: version of Hadoop (3.3.0 is available now)
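As a taste of what the sample code in the zip does, here is a hedged, self-contained sketch of pre-processing and clustering with pyspark; the tiny inline DataFrame and the choice of k-means with k=2 are purely illustrative and are not the actual sample data or code from the zip file.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("clustering-sketch").getOrCreate()

# Illustrative data only; the real sample data comes from the attached zip file.
df = spark.createDataFrame(
    [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 11.0)], ["x", "y"]
)

# Pre-processing: drop rows with missing values and assemble a feature vector.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features") \
    .transform(df.dropna())

# Clustering: fit a k-means model and attach cluster predictions.
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()
```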

(7) If you have any problems, questions, or bugs when using the codes, just contact me. Host paths must be set to fit your test environment, through some parameters of the ./compose-up.sh command. (1) Livy is executed when the container is created, so no extra command is needed to run it; it sits in the spark master container. (1) Download the shell scripts in the spark/docker-script/hadoop folder on the github and move them to a path where docker commands are available. (2) The first time Hadoop runs, the HDFS volume is formatted for use. (4) Because no resource limit is applied when creating containers, there is no problem even if the sum of the workers' cores and memory exceeds the host machine's. The host paths you'll use must be prepared by yourself, since they may not be generated automatically on the host. (3) JDK 1.8.0 is used because of compatibility with Livy; the latest versions are used everywhere else. (2) Import SparkSession and build a Spark session. The token changes its value every time you run. The new cluster is a Hadoop cluster with a master and n slaves, with one Spark master and a jupyter-lab container added on. https://github.com/hjben/docker (related sub-folders: jupyter-lab, spark). Set the Spark master to yarn. The solution is to change the access host from master to spark in the Livy sample code. (1) You're recommended to use a machine with 16GB of memory or above. If using the yarn scheduler, the Hadoop cluster is necessary.
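The yarn-backed session described above can be created in the notebook roughly as follows; this is a minimal sketch assuming the settings shipped in the jupyter-lab container already point Spark at the Hadoop cluster, and the appName is arbitrary.

```python
from pyspark.sql import SparkSession

# Build a Spark session scheduled by yarn (this requires the Hadoop cluster).
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("pyspark-yarn-session")
    .getOrCreate()
)

print(spark.version)  # the new session should also show up in the Hadoop job web UI (localhost:8088)
```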

(3) When created, the Spark session can be seen and managed in the Spark web UI. JDK 1.8.0 is used because of Livy. (4) The Livy service is also available, but the Livy server has moved to the spark container. (3) When created, the Spark session can be seen and managed in the Hadoop job web UI (localhost:8088). (4) All files related to this practice are in my github. The image version may be bumped as the open-source versions are updated. (1) Download the shell scripts in the spark/docker-script folder on the github and move them to a path where docker commands are available. Then the Hadoop service is started in the containers. (7) Running the ./compose-up.sh command again without the compose-down command may corrupt the generated docker-compose.yml file while constructing another set of containers. The command will not be executed if it detects missing parameters or a wrong input type. (2) Copy the URL printed in the console and access the Jupyter Lab service. If you're used to Dockerfiles, you can revise the image with the Dockerfile in the github. (6) The docker images are on my Docker Hub. The token changes its value every time you run. Parameters must be entered after the command, each separated by one blank (space-bar) and arranged in the order below. Then a new cluster containing the Hadoop service is needed.
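Since the Livy server now sits in the spark container, REST calls have to target that host. The sketch below uses Livy's standard REST endpoints (POST /sessions and POST /sessions/{id}/statements); the host name spark applies inside the Docker network, localhost:8998 works from the host machine, and the submitted statement is just a placeholder.

```python
import json
import time
import requests

livy = "http://spark:8998"   # use http://localhost:8998 from the host machine
headers = {"Content-Type": "application/json"}

# Create an interactive pyspark session through Livy.
session = requests.post(livy + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()
session_url = f"{livy}/sessions/{session['id']}"

# Wait until the session is idle, then submit a trivial statement.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(1)

statement = requests.post(session_url + "/statements",
                          data=json.dumps({"code": "print(1 + 1)"}),
                          headers=headers).json()
print(statement)
```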

When using the spark-submit command, you can work with the spark container. You're going to create a Spark session with yarn when Spark needs data in HDFS. The open-source versions are as below. The address of the web UI is localhost:8080. (2) The Docker images are based on RedHat Linux (CentOS 8) with a bash shell, which is similar to real-world servers. These volumes are used to keep the data even when the docker containers are broken. Livy is a REST API server made to ease Spark-job control from outside the cluster. (5) This practice uses the shell script files in the folder named docker-script, which is a sub-folder of each folder in the github. The image version may be bumped as the open-source versions are updated. Construct a Spark cluster composed of 1 master, n slaves, and jupyter-lab using docker-compose. (1) Create a new notebook in the Jupyter Lab. So there might be some trouble with uploading the sample data to HDFS. (1) Using the sample code in this zip file, you can communicate with the Livy server through its REST API. You can create a Spark session by using the yarn scheduler of Hadoop. Copy and paste the address starting with 127.0.0.1 into your web browser. A Docker image is added for the Hadoop service. So if you want to change the number of Hadoop slaves, you're recommended to run ./compose-down.sh first. If using the yarn scheduler, the Spark session resolves the HDFS path when loading data. The address of the Spark master is spark://master:7077, and it is also set with the number of CPU cores and the amount of memory space (see the sketch below). (I worked on a macOS system.) From now on, all code running in the notebook is Python code. (2) Import SparkSession and build a Spark session. (1) The Spark master and workers run when the containers are created, so you save the time of entering Spark execution commands. (1) After the containers are built, execute the command ./hadoop-start.sh start. (3) If you create the Spark containers, initialization and execution of Spark is done automatically. Each container uses host resources flexibly. (2) The sample data and code can be used after unzipping the attached file and uploading it to Jupyter Lab. (3) The pyspark package for using Spark with Python, and some settings for linking with Hadoop or Livy, are added in the jupyter-lab container of the Spark cluster. Then the address of the Livy web UI is localhost:8998. Copy and paste the address starting with 127.0.0.1 into your web browser. The docker-compose.yml file is also generated automatically in the same path, and it is deleted when the docker-compose is brought down. Get experience with a pyspark session and the spark-livy service. (2) With the ./compose-up.sh command, the docker network and containers are generated. (2) Enter the ./jupyter-start.sh command to run Jupyter Lab. Also, you need a Linux shell environment with docker and docker-compose installed. (6) Execute the ./compose-down.sh command if you want to destroy all containers.
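For the standalone cluster, a session pointed at spark://master:7077 can be built as below; the executor core and memory values are examples only and should mirror whatever you passed to ./compose-up.sh, and the HDFS path in the comment is hypothetical.

```python
from pyspark.sql import SparkSession

# Standalone-mode session against the Spark master container.
# The core/memory values are examples; match them to your ./compose-up.sh input.
spark = (
    SparkSession.builder
    .master("spark://master:7077")
    .appName("pyspark-standalone-session")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# With the yarn scheduler instead, the session resolves paths against HDFS,
# e.g. spark.read.csv("hdfs:///data/sample.csv")  (hypothetical path).
```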