Thursday, April 30, 2020

Installing Apache SPARK, KAFKA & IDE for Scala

Installing Apache SPARK and an IDE for Scala on Windows OS

1. Install a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.html

Keep track of where you installed the JDK; you’ll need that later.
DO NOT INSTALL THE LATEST RELEASE – INSTALL JAVA 8.
Spark 2.4.4 is not compatible with Java 9 or newer; Spark 3.0.0 also runs on Java 11, but Java 8 works with both. And BE SURE TO INSTALL JAVA TO A PATH WITH NO SPACES IN IT. Don’t accept the default path that goes into “Program Files” on Windows, as that has a space.

2. Download a pre-built version of Apache Spark (3.0.0 or 2.4.4, depending on the version you want to use) from https://spark.apache.org/downloads.html

3. If necessary, download and install WinRAR so you can extract the .tgz file you downloaded. http://www.rarlab.com/download.htm

4. Extract the Spark archive, and copy its contents into C:\spark after creating that directory. You should end up with directories like c:\spark\bin, c:\spark\conf, etc.

5. Download winutils.exe from https://sundog-s3.amazonaws.com/winutils.exe and move it into a C:\winutils\bin folder that you’ve created. (Note: this is a 64-bit application. If you are on a 32-bit version of Windows, you’ll need to search for a 32-bit build of winutils.exe for Hadoop.)

6. To make Spark on Windows behave as if it were running on Linux with Hadoop, create a c:\tmp\hive directory, cd into c:\winutils\bin, and run winutils.exe chmod 777 c:\tmp\hive

7. Open the c:\spark\conf folder, and make sure “File name extensions” is checked in the “View” tab of Windows Explorer. Rename the log4j.properties.template file to log4j.properties. Edit this file (using WordPad or something similar) and change the log level from INFO to ERROR for log4j.rootCategory.

8. Right-click your Windows menu, select Control Panel, System and Security, and then System. Click on “Advanced System Settings” and then the “Environment Variables” button.

9. Add the following new USER variables:
i. SPARK_HOME c:\spark
ii. JAVA_HOME (the path you installed the JDK to in step 1, for example C:\JDK)
iii. HADOOP_HOME c:\winutils

Then add the following paths to your PATH user variable:
%SPARK_HOME%\bin
%JAVA_HOME%\bin

10. Close the environment variable screen and the control panels.

11. Install the latest Scala IDE from http://scala-ide.org/download/sdk.html

12. Test it out!

1. Open up a Windows command prompt in administrator mode.
2. Enter cd c:\spark and then dir to get a directory listing.
3. Look for a text file we can play with, like README.md or CHANGES.txt
4. Enter spark-shell
5. At this point you should have a scala> prompt. If not, double check the steps above.
6. Enter val rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
7. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
8. Hit Ctrl-D to exit the Spark shell, and close the console window. You’ve got everything set up!
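
If spark-shell does not start, the usual culprits are the environment variables from step 9 and the Java version from step 1. The following is a minimal Scala sketch of a sanity check for those settings. The object name SetupCheck and its helper methods are made up for illustration and are not part of Spark. It assumes a Java 8 version string starts with "1.8", as it does on Oracle and OpenJDK builds.

```scala
object SetupCheck {
  /** Environment variables that step 9 asks you to define. */
  val requiredVars: Seq[String] = Seq("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")

  /** True for Java 8 version strings such as "1.8.0_241". */
  def isJava8(version: String): Boolean = version.startsWith("1.8")

  /** Returns one readable message per failed check; an empty result means no problem was found. */
  def problems(env: Map[String, String], javaVersion: String): Seq[String] = {
    // Every required variable must be defined.
    val missing = requiredVars.filterNot(env.contains).map(name => s"$name is not set")
    // Step 1 warns against installing the JDK into a path that contains spaces.
    val spaced = env.get("JAVA_HOME").filter(_.contains(" "))
      .map(path => s"JAVA_HOME contains a space: $path").toSeq
    // Step 1 asks for Java 8 specifically.
    val java =
      if (isJava8(javaVersion)) Seq.empty[String]
      else Seq(s"Java version is $javaVersion, expected 1.8.x")
    missing ++ spaced ++ java
  }

  def main(args: Array[String]): Unit = {
    val found = problems(sys.env, System.getProperty("java.version"))
    if (found.isEmpty) println("Setup looks consistent with the steps above.")
    else found.foreach(p => println(s"Problem: $p"))
  }
}
```

You can compile and run this with any Scala installation. If the shell does start, you can also paste it there using :paste and then call SetupCheck.main(Array.empty).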

Installing Apache KAFKA on Windows OS
  1. Download and set up the Java 8 JDK
  2. Download the Kafka binaries from https://kafka.apache.org/downloads
As of Saturday, 11-April-2020, the latest available Kafka version was 2.4.1, which is the one I installed.
  3. Extract Kafka at the root of C:\
  4. Add the Kafka bin directory to the Path entry in the Environment Variables section
Note that all Kafka Windows .bat files are in the directory c:\kafka_2.12-2.4.1\bin\windows
  5. Try Kafka commands using kafka-topics.bat (for example)
  6. Edit the ZooKeeper & Kafka configs using Notepad++ https://notepad-plus-plus.org/download/
    1. zookeeper.properties: dataDir=C:/kafka_2.12-2.4.1/data/zookeeper (yes, these are forward slashes, not backslashes)
    2. server.properties: log.dirs=C:/kafka_2.12-2.4.1/data/kafka (yes, these are forward slashes, not backslashes)
  7. Start ZooKeeper in one command line, from the Kafka directory: zookeeper-server-start.bat config\zookeeper.properties
  8. Start Kafka in another command line, also from the Kafka directory: kafka-server-start.bat config\server.properties
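
The forward slashes in dataDir and log.dirs matter because a backslash is an escape character in Java properties files. Here is a minimal Scala sketch of that behaviour using only java.util.Properties. The object name PropertiesPathDemo and the example path are illustrative, and the sketch assumes that ZooKeeper and Kafka read these files through the standard Java properties parser.

```scala
import java.io.StringReader
import java.util.Properties

object PropertiesPathDemo {
  /** Parses `key=value` text the way java.util.Properties.load does. */
  def parse(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  /** Rewrites a Windows path with forward slashes so it survives parsing unchanged. */
  def toForwardSlashes(windowsPath: String): String =
    windowsPath.replace('\\', '/')

  def main(args: Array[String]): Unit = {
    // Illustrative path only; use the directory you actually extracted Kafka to.
    val windowsPath = "C:\\kafka_2.12-2.4.1\\data\\zookeeper"

    // With backslashes, the parser treats each one as an escape and drops it.
    val withBackslashes = parse(s"dataDir=$windowsPath").getProperty("dataDir")

    // With forward slashes, the value is read back exactly as written.
    val withForwardSlashes =
      parse(s"dataDir=${toForwardSlashes(windowsPath)}").getProperty("dataDir")

    println(s"backslashes:     $withBackslashes")
    println(s"forward slashes: $withForwardSlashes")
  }
}
```

Running it shows the backslash version losing its separators, which is why a path pasted straight from Windows Explorer tends to end up pointing somewhere unexpected.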
If you want to do a mini project on data streaming using Apache Kafka and Spark, contact me at samidataengineer@gmail.com