Install Hadoop 3.X : Setting up a Single Node Hadoop Cluster

Updated: Aug 24, 2019

A Hadoop cluster is a collection of independent commodity machines connected through a dedicated network (LAN) that work together as a single, centralized data-processing resource. You can configure a Hadoop cluster in two modes: pseudo-distributed mode and fully-distributed mode.


Pseudo-distributed mode is also known as a single-node cluster, where the NameNode and DataNode run on the same machine. HDFS is used for storage, and all the Hadoop daemons are configured on a single node. Fully-distributed mode is the production setup of Hadoop, where the NameNode and DataNodes are configured on different machines and data is distributed across the DataNodes.


In this article, we’ll walk through step-by-step instructions to install Hadoop in pseudo-distributed mode on CentOS 7.

Step 1 : Create Hadoop User

Create a new user with root privileges; this user will perform the administrative tasks of Hadoop.

Start by logging in to your CentOS server as the root user.


Use the adduser command to add a new user to your system.

$ adduser hduser

Use the passwd command to update the new user's password.

$ passwd hduser

By default, on CentOS, members of the wheel group have sudo privileges. Add the new user to the wheel group:

$ usermod -aG wheel hduser
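You can verify the new account by switching to it and checking sudo access (sudo whoami should print root after you enter the hduser password):

$ su - hduser
$ sudo whoami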

Step 2 : Installation of Java


Download and install the Oracle Java 8 JDK using the commands below (note that Oracle download URLs include a time-limited AuthParam token, so the link may need to be refreshed from the Oracle site):

$ curl -L -b "oraclelicense=a" -O https://download.oracle.com/otn/java/jdk/8u212-b10/59066701cf1a433da9770636fbc4c9aa/jdk-8u212-linux-x64.rpm?AuthParam=1556006078_87220ee9f4a8e59beeeb3ff97c646447
$ sudo yum localinstall jdk-8u212-linux-x64.rpm

(Or)

Download the Java SE Development Kit 8u212 file (jdk-8u212-linux-x64.rpm) from the Oracle website:

https://www.oracle.com/technetwork/java/javaee/downloads/jdk8-downloads-2133151.html
$ sudo yum localinstall jdk-8u212-linux-x64.rpm
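Whichever method you use, verify the installation afterwards; java -version should report 1.8.0_212:

$ java -version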

Setting up JAVA Environment Variables

Execute the command below to edit the ~/.bashrc file:

$ gedit ~/.bashrc

Add the variables below to the file and save it:
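The exact JAVA_HOME path depends on where the RPM placed the JDK; assuming the default location /usr/java/jdk1.8.0_212-amd64 (you can confirm it with alternatives --display java), the entries look like this:

export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64
export PATH=$PATH:$JAVA_HOME/bin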


Then reload it so the variables take effect in the current session:

$ source ~/.bashrc

If you have multiple Java versions installed on the server, you can change the default version using the alternatives system utility:

$ sudo alternatives --config java

To change the default Java version, enter the number corresponding to jdk1.8.0_212 when prompted and hit Enter.


Step 3 : Setup SSH


Install OpenSSH Server:

Hadoop requires SSH access to all the nodes configured in the cluster. For single-node setup of Hadoop, you need to configure SSH access to the localhost.

To install the server and client on CentOS, type:

$ yum -y install openssh-server openssh-clients

On Ubuntu/Debian, the equivalent command is:

$ sudo apt-get install openssh-client openssh-server

Start the service:

$ systemctl enable sshd.service
$ systemctl start sshd.service

Make sure port 22 is open (on CentOS, netstat is provided by the net-tools package):

$ yum -y install net-tools
$ netstat -tulpn | grep :22

OpenSSH Server Configuration:

Edit /etc/ssh/sshd_config

$ gedit /etc/ssh/sshd_config

To enable root logins over SSH, edit or add the following line:

PermitRootLogin yes

Save and close the file. Restart sshd:

$ systemctl restart sshd.service 

Set up password-less SSH:

Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password:

$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
$ chmod 0600 ~/.ssh/authorized_keys

Now check that you can SSH to localhost without a passphrase using the command below:

$ ssh localhost

Step 4 : Download and Configure Hadoop


Download the Hadoop 3.2.0 tar file from the Apache website and move it into /usr/local/hadoop (the commands below assume you are running as root from /root; otherwise prefix them with sudo and adjust the paths):

$ wget -c -O hadoop.tar.gz http://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
$ mkdir /usr/local/hadoop
$ chmod -R 755 /usr/local/hadoop
$ tar -xzvf /root/hadoop.tar.gz
$ mv /root/hadoop-3.2.0/* /usr/local/hadoop


Step 5 : Configure XML & Environment files


Add a HADOOP_HOME environment variable pointing to your Hadoop installation and add its bin and sbin directories to PATH. That will let you run Hadoop commands from anywhere.

Edit $HOME/.bashrc file by adding the Hadoop environment variables.


$ gedit ~/.bashrc
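A typical set of entries, assuming Hadoop is installed under /usr/local/hadoop as in Step 4:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin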

Reload the file to apply the changes:

$ source ~/.bashrc

In order to run Hadoop, it needs to know the location of the Java installation on your system. Add the Java and Hadoop environment variables to the hadoop-env.sh file:

$ cd $HADOOP_HOME/etc/hadoop
$ gedit hadoop-env.sh
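At a minimum, set JAVA_HOME here; the path below assumes the same JDK location as in Step 2, so adjust it to match your system:

export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64
export HADOOP_HOME=/usr/local/hadoop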

You need to edit the XML files located in the etc/hadoop directory within your Hadoop installation folder ($HADOOP_HOME/etc/hadoop). The files to change and the required changes are listed below.

Create directories for the NameNode and DataNode:

$ mkdir -p /usr/local/hadoop/hadoop_store/tmp
$ chmod -R 755 /usr/local/hadoop/hadoop_store/tmp
$ mkdir -p /usr/local/hadoop/hadoop_store/namenode
$ mkdir -p /usr/local/hadoop/hadoop_store/datanode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/namenode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/datanode

You can override the default settings used to start Hadoop by changing these files.

$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml

1. core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
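A minimal configuration for pseudo-distributed mode, assuming the default HDFS port 9000 and the tmp directory created above:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadoop_store/tmp</value>
  </property>
</configuration>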


$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

2. yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
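A minimal example that enables the MapReduce shuffle service on the NodeManager:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>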

$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

3. hdfs-site.xml: Configuration settings for the HDFS daemons: the NameNode, the secondary NameNode and the DataNodes.
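For a single-node cluster, set the replication factor to 1 and point the NameNode and DataNode directories at the folders created earlier:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_store/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_store/datanode</value>
  </property>
</configuration>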


 $ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml 

4. mapred-site.xml: Configuration settings for MapReduce applications.
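At a minimum, tell MapReduce to run on YARN; on Hadoop 3.x you may also need to set the MapReduce classpath so that YARN containers can find the MapReduce jars:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>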


Configure HDFS - workers

Edit the workers file so that localhost is listed as a data node as well.

$ gedit $HADOOP_HOME/etc/hadoop/workers
localhost

Step 6 : Start Hadoop Daemons


Change the directory to /usr/local/hadoop/sbin

$ cd /usr/local/hadoop/sbin

Format the NameNode (in Hadoop 3.x the hdfs command is preferred; hadoop namenode -format still works but is deprecated):

$ hdfs namenode -format

Start NameNode daemon and DataNode daemon.

 $ start-dfs.sh

Start yarn daemons.

$ start-yarn.sh

(Or)

$ start-all.sh

Start History server

$ mapred --daemon start historyserver

Use the jps command to verify that all the daemons are running.

$ jps

6168 ResourceManager
6648 Jps
5997 SecondaryNameNode
5758 DataNode
5631 NameNode
6294 NodeManager



Your single-node Hadoop cluster is ready!

NameNode – http://localhost:9870/
ResourceManager – http://localhost:8088/
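As a quick smoke test, create a directory in HDFS and list the root of the filesystem (the /user/hduser path here is just an example):

$ hdfs dfs -mkdir -p /user/hduser
$ hdfs dfs -ls /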

About Data Science Authority

Data Science Authority is a company engaged in training, product development and consulting in the field of Data Science and Artificial Intelligence. It is built and run by highly qualified professionals with more than 10 years of working experience in Data Science. DSA’s vision is to inculcate data thinking into individuals irrespective of domain, sector or profession, and to drive innovation using Artificial Intelligence.

