Creating a virtual Hadoop cluster with VirtualBox, PfSense, Cloudera Manager and CentOS 6

Setting up a virtual Hadoop cluster can quickly become a deployment and maintenance nightmare; Hadoop is particularly picky when it comes to network configuration. To cope with this problem, this article describes how the two concerns (networking and the actual processing) can be separated in a simple manner: we will use a separate guest OS for network configuration, attach it to the host OS via bridged networking, and run the Hadoop cluster on an internal VirtualBox network.

This way we get our own network infrastructure for Hadoop which, depending on the requirements, can be shielded from outside access. The use case I had in mind when writing this up was the ability to run a virtualized Hadoop cluster while on the road. In an ideal setup, you don't want to depend on an external network infrastructure, yet that is exactly what happens if you use bridged networking for all your Hadoop nodes. That very simple set-up would come at too high a price when mobility is involved.

The described solution is quite elegant and can be used with Cloudera Manager (which was also a requirement). The naive solution would be to deploy the ready-made CDH3 VM's, but as it turned out, those had some issues that needed to be tackled first. Starting from scratch gave the clearest picture of what went wrong. At the same time, it gave the opportunity to trim the nodes' memory and disk space requirements (also a good thing if you have only 8 GB of RAM and a 512 GB SSD to work with).

The following sections describe how to set up each of the components used.

Setting up VirtualBox VM's

When setting up VirtualBox VM's, it is best to make a plan first and log all the changes you make to your template VM's. This is especially important if you will be cloning VM's: all settings will be copied, and for a Hadoop cluster to work, every host needs its own network settings (hostname, MAC address). So you will need to keep track of the configuration changes in your template VM's. This also matters if you use the [Reset all mac addresses on all network cards] option when cloning a VM: in the best case, this option overwrites your network settings (a backup is saved on the guest system); in the worst case, it disables the network card on already-installed VM's.
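If you prefer the command line, cloning can also be done with VBoxManage. A minimal sketch with placeholder VM names: it keeps the template's MAC addresses during the clone and regenerates them in an explicit second step, so you control exactly when they change (and can update the guest's configuration accordingly):

    # Clone the template; --options keepallmacs prevents VirtualBox from
    # silently resetting the MAC addresses during the clone.
    VBoxManage clonevm hadoop-template --name hadoop-node-01 --register --options keepallmacs

    # Give the clone fresh MAC addresses in a controlled step.
    VBoxManage modifyvm hadoop-node-01 --macaddress1 auto
    VBoxManage modifyvm hadoop-node-01 --macaddress2 auto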

When setting up VirtualBox VM's, you may want to consider the following options (a command-line sketch follows the list):

  • Storage: if you have an SSD disk, you can signal this to VirtualBox
  • System: look up the minimal system requirements for the PfSense system (there is no use in running a 512 MB RAM, 8 GB disk system if we can do with 128 MB RAM and 1 GB of disk space)
  • Network: use a bridged adapter for eth0 and an internal network with the name vbox.macbook (or something else, but avoid .local, since that suffix is claimed by multicast DNS and may interfere with name resolution) for eth1
  • NAT: we will configure NAT on the PfSense system, so there is no need for NAT here
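These settings can also be applied with VBoxManage. A sketch under stated assumptions: the VM is called pfsense, the host's bridged interface is en0, and the disk hangs off a controller named SATA; adjust all three to your situation:

    # Modest resources are enough for the firewall VM.
    VBoxManage modifyvm pfsense --memory 128

    # NIC 1: bridged to the host's physical interface (the WAN side).
    VBoxManage modifyvm pfsense --nic1 bridged --bridgeadapter1 en0

    # NIC 2: VirtualBox internal network (the LAN side of the cluster).
    VBoxManage modifyvm pfsense --nic2 intnet --intnet2 vbox.macbook

    # Signal the SSD to VirtualBox by marking the disk as non-rotational.
    VBoxManage storageattach pfsense --storagectl SATA --port 0 --device 0 --type hdd --medium pfsense.vdi --nonrotational on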

Setting up PfSense

PfSense is a firewall and networking solution that allows you to delegate all network- and security-related issues to a dedicated host and focus on the cluster setup itself. You can download it at www.pfsense.org.

To keep things simple, we start from the PfSense factory settings (DHCP and DNS forwarding enabled) and make some minor changes to set up our network; a quick way to verify the result is sketched after the list.

  • [System] => [General Setup]: select a hostname and use the domain of the internal network you configured in VirtualBox, for example:

    • [Hostname]: pfsense

    • [Domain]: vbox.macbook

    • Do not change the DNS servers; we expect to receive them from a well-configured DHCP server on the WAN interface. Should you require a DNS server, you can use the OpenDNS servers at 208.67.222.222 or 208.67.220.220 (there is a third one which I always fail to remember)

  • [Interfaces] => [LAN]:

    • select a static [IP address] for your LAN interface, for example 192.168.1.1. Make sure that you do not use VirtualBox's Host-Only network setting: the idea is to use PfSense as a DHCP server and a DNS forwarder, and if you use Host-Only instead of an internal network, the DHCP server will fail to work. For some reason, even with DHCP disabled in the Host-Only network settings, DHCP leases are not distributed the way they should be.

  • [Services] => [DHCP server]:

    • select a [Range], for example 192.168.1.100 to 192.168.1.254

    • use the [Domain name] from your VirtualBox internal network

  • [Services] => [DNS forwarder]

    • Check [Enable DNS forwarder]

    • Check [Register DHCP leases in DNS forwarder]

    • Check [Register DHCP static mappings in DNS forwarder]
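Once the first CentOS node (next section) has booted and obtained a lease, you can verify from that node that the DHCP range and the DNS registration behave as configured; the names and addresses below follow the examples above:

    # The lease should fall inside the configured range (192.168.1.100-254).
    ip addr show eth0 | grep 'inet '

    # The DNS forwarder on PfSense should resolve registered lease names.
    nslookup hadoop-node-00.vbox.macbook 192.168.1.1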

Setting up CentOS hosts

  1. In the setup wizard, select a hostname, for example hadoop-node-00.vbox.macbook

  2. Click [Configure Network] and edit the wired interface [System eth0]

  3. Check [Connect automatically]

  4. In the [IPv4 Settings] tab, fill out your host's name without the domain part, for example hadoop-node-00, in the [DHCP client ID] text box
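For reference, these wizard choices end up in the node's interface configuration file. A minimal sketch of what /etc/sysconfig/network-scripts/ifcfg-eth0 could look like on the first node; the HWADDR value is a placeholder and must match the MAC address VirtualBox assigned to the VM:

    DEVICE=eth0
    HWADDR=08:00:27:00:00:01       # placeholder; must match the VM's MAC address
    ONBOOT=yes                     # [Connect automatically]
    BOOTPROTO=dhcp                 # lease and DNS come from PfSense
    DHCP_HOSTNAME=hadoop-node-00   # hostname sent to the DHCP server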

 

Adaptations for Cloudera Manager

A few changes are needed on every node before Cloudera Manager will install cleanly; a script collecting them is sketched after the list.

  • Change the file /etc/redhat-release and make sure the word 'Linux' appears in the existing string. If it does not, Cloudera Manager will fail to install because it does not recognize the distribution

  • Make sure that SELinux is disabled in /etc/selinux/config: SELINUX=disabled

  • Make sure that the DHCP_HOSTNAME variable is set in /etc/sysconfig/network, for example DHCP_HOSTNAME=hadoop-node-00 for the first node

  • Check that your /etc/resolv.conf contains the following lines:

    • search vbox.macbook

    • nameserver 192.168.1.1

    • That is, use VirtualBox's internal network domain name and the static IP address of your PfSense host

  • Make sure that the firewall is shut down on the node systems. This is most easily achieved by running system-config-firewall-tui and disabling the firewall. You can quickly check firewall activity with telnet: ping (ICMP) may pass a firewall that still blocks TCP, so a successful ping proves little, while a refused telnet connection points at the firewall. Alternatively, iptables -L shows the active iptables rules.

  • On minimal CentOS installations, you can simply run chkconfig iptables off
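The adaptations above lend themselves to a small per-node script. A sketch, assuming the node name is passed as the first argument; review it before running, since it rewrites system files:

    #!/bin/sh
    # Usage (as root): ./prepare-node.sh hadoop-node-00
    NODE_NAME="$1"

    # Cloudera Manager wants 'Linux' in the release string; warn if it is absent.
    grep -q Linux /etc/redhat-release || echo "WARNING: add 'Linux' to /etc/redhat-release"

    # Disable SELinux permanently (takes effect after a reboot).
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

    # Register the node name as DHCP_HOSTNAME if it is not set yet.
    grep -q '^DHCP_HOSTNAME=' /etc/sysconfig/network || echo "DHCP_HOSTNAME=$NODE_NAME" >> /etc/sysconfig/network

    # Stop the firewall now and keep it off across reboots.
    service iptables stop
    chkconfig iptables off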

Allowing PfSense to be managed from your Host OS

Normally this is not encouraged (see http://doc.pfsense.org/index.php/How_can_I_access_the_webGUI_from_the_WAN%3F), but since the idea is to economize on resources (leaving more for the Hadoop nodes), we do not want to run a desktop instance inside the VirtualBox network, so we require access from the WAN interface.

  • Add a NAT rule in PfSense with the following properties:
    • [Interface]: WAN
    • [Protocol]: TCP
    • [Destination]=>[Type]: WAN
    • [Destination port range]: from 443 to 443
    • [Redirect target IP]: 192.168.1.1
    • [Redirect target port]: 443
  • Selecting [Create new associated filter rule] in the [Filter rule association] option will automatically generate a firewall rule to allow the NAT traffic.

The result: HTTPS requests that arrive on the WAN address of the PfSense VM are now forwarded to the web GUI.
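You can verify the rule from the host OS. The WAN address in the sketch is a placeholder for whatever your bridged network handed to the PfSense VM; -k is needed because of the self-signed certificate:

    # From the host OS: the PfSense login page should come back over the WAN.
    curl -k https://10.0.0.50/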

Allowing your Hadoop cluster to be managed from your host OS

Apart from the reduced memory footprint, there are other advantages to having your cluster accessible from outside the firewalled network.

  • The first one is, of course, that you are able to use the setup in a live environment: remote access from your host OS is conceptually no different from remotely accessing your cluster from any other place.
  • The second one is that you do not need to connect your hard drives to the VM's: with remote access you can upload your files through the Hue interface. From an architectural point of view, you decouple storage from the VM's, which is always a nice thing.

To gain access to your Hadoop services, adding NAT rules will unfortunately not suffice. You may gain access to your server management site, but fine-grained control and reporting for each separate node will not be possible because of name resolution issues (links keep pointing to resources behind the firewall). A reverse proxy is required to rewrite all links contained in the management interface. In the following section we describe how to set up squid's reverse proxy to resolve this issue.
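As a rough preview of that follow-up, and purely as an assumption-laden sketch (the node address and the Cloudera Manager port 7180 are illustrative, and the real setup needs more than this fragment), a squid accelerator configuration could look as follows:

    # /etc/squid/squid.conf (fragment): reverse proxy ("accelerator") in front
    # of the Cloudera Manager host on the internal network.
    http_port 7180 accel vhost
    cache_peer 192.168.1.100 parent 7180 0 no-query originserver name=cm
    acl cluster dstdomain .vbox.macbook
    http_access allow cluster
    cache_peer_access cm allow cluster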

(to be continued)