# Get Started
## What is Apache Spark

Apache Spark is a distributed processing engine that is actively developed in the open and used in thousands of enterprises. Spark exposes powerful APIs in multiple languages - Scala, Java, Python, SQL and R - for processing large volumes of data. It can handle petabytes of data, including streaming data.
If you're getting impatient, let's jump straight into installing Spark and writing our first "Hello Spark" program.
## Install Gigahex Data Platform

For this guide, we will be using Gigahex Data Platform to install Apache Spark. Gigahex is a free-to-use data platform that provides a simple interface to install and manage different open source distributed systems, such as Apache Spark and Kafka.
## Install on MacOS

You can run Gigahex Data Platform on MacOS using the following instructions:
- Install the dependencies for the server - postgresql and Java 11.
- Install the Gigahex platform, as sketched below.
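A minimal sketch of both steps, assuming Homebrew is installed. The versioned postgresql formula and the Gigahex install URL are assumptions - check the Homebrew formula name and the official Gigahex docs for the exact commands.

```bash
# Install the dependencies: PostgreSQL and Java 11 via Homebrew
# (formula names such as postgresql@14 may vary with your Homebrew version)
brew install postgresql@14 openjdk@11

# Install the Gigahex platform - placeholder URL, consult the official
# Gigahex docs for the actual install script
curl -s https://packages.gigahex.com/install.sh | bash
```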
## Install on Windows

You can install Gigahex on Windows after enabling WSL2. Follow the instructions below to set up WSL2 with an Ubuntu distro and install Gigahex.
- Install WSL 2. After installation, you will need to reboot your system to continue with the installation.
- On reboot, log in to the Ubuntu shell, set up your password, and run the command below to install the dependencies - postgresql and JRE 11.
- With the dependencies installed, set up the Gigahex Data Platform, as sketched after this list.
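A sketch of these steps. `wsl --install` is the standard command on recent Windows 10/11 builds; the final Gigahex install line is a placeholder, so consult the Gigahex documentation for the exact installer command.

```bash
# 1. In an elevated PowerShell prompt: install WSL 2 with the default
#    Ubuntu distro, then reboot
wsl --install

# 2. After reboot, inside the Ubuntu shell: install the dependencies
sudo apt update
sudo apt install -y postgresql openjdk-11-jre

# 3. Set up the Gigahex Data Platform - placeholder command, check the
#    Gigahex docs for the actual install script
curl -s https://packages.gigahex.com/install.sh | bash
```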
## Install on Ubuntu

You can install Gigahex on Ubuntu as per the instructions below.
- Run the command below to install the dependencies - postgresql and Java Runtime 11.
- With the dependencies installed, set up the Gigahex Data Platform, as sketched below.
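A sketch of both steps; the Gigahex install line is again a placeholder for the command published in the Gigahex docs.

```bash
# Install the dependencies: PostgreSQL and the Java 11 runtime
sudo apt update
sudo apt install -y postgresql openjdk-11-jre

# Set up the Gigahex Data Platform - placeholder command, see the
# Gigahex docs for the actual install script
curl -s https://packages.gigahex.com/install.sh | bash
```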
## Start Gigahex Data Platform

Once the setup is ready, run the following command to start the services, and use the generated admin credentials to log in to the platform.
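A sketch of the start step; `gxc start` below is a hypothetical CLI invocation - the actual binary name and subcommand are shown in the installer output, so use whatever that tells you.

```bash
# Start the Gigahex services - hypothetical command name; the installer
# output shows the exact CLI and prints the generated admin credentials
gxc start
```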
Open the browser and log in with the credentials provided above.
- Once you've logged in, you will be asked to create a workspace for saving all the clusters that you create.
With Gigahex up and running, let's install Apache Spark and run our first program. Click the Create Cluster button to launch the cluster wizard and choose the latest Spark version that you want to install.
Click the Save button to proceed to the next screen, from where you can start the cluster.
On clicking the Start button (with the play icon), the platform will download the Spark package and set it up in your local environment, ready for you to experiment with.
## Verify the Spark installation

An interactive shell, or REPL (read-evaluate-print loop), allows us to quickly test our programs without the need for an IDE.
You can verify the installation from either shell, as sketched below:

- Scala: start `spark-shell` to execute Scala programs against the standalone cluster.
- Python: start `pyspark` to execute Python programs against the standalone cluster.
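As a quick smoke test, assuming `spark-shell` and `pyspark` are on your PATH once the cluster is running (the exact launch options, such as a master URL, depend on how Gigahex configures the cluster):

```bash
# Scala REPL: count 100 rows to confirm the cluster responds
$ spark-shell
scala> spark.range(100).count()
res0: Long = 100

# Python REPL: the same check from pyspark
$ pyspark
>>> spark.range(100).count()
100
```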