Creating a virtual Hadoop cluster with VirtualBox, PfSense, Cloudera Manager and CentOS 6
Setting up a virtual Hadoop cluster can quickly become a deployment and maintenance nightmare. Hadoop is especially picky when it comes to network configuration. To cope with this problem, this article describes how the two concerns (networking and actual processing) can be separated in a simple manner. We will use a separate guest OS for network configuration, attach it to the host OS via bridged networking, and run the Hadoop cluster on an internal VirtualBox network.
This gives Hadoop its own network infrastructure. Depending on the requirements, this network can be protected from outside access. The use case I had in mind when writing this up was the ability to run a virtualized Hadoop cluster while on the road. In an ideal setup, you don't want to depend on an external network infrastructure, and that is exactly what happens if you use bridged networking for all your Hadoop nodes. That setup is very simple, but its price is too high when mobility is involved.
The described solution is quite elegant and can be used with Cloudera Manager (which was also a requirement). The naive solution would be to deploy CDH3 VM's, but as it turned out, there were some issues that needed to be tackled first. Starting from scratch gave the clearest picture of what went wrong. At the same time, it gave the opportunity to trim the nodes' memory and disk space requirements (which is a good thing if you have only 8 GB of RAM and a 512 GB SSD to work with).
The following sections describe how to set up each of the components used.
Setting up VirtualBox VM's
When setting up VirtualBox VM's, it is best to make a plan first and log all the changes you make to your template VM's. This is especially important if you will be cloning VM's - all settings will be copied, and for a Hadoop cluster to work, every host needs its own network settings (hostname, MAC address). So you will need to keep track of the configuration changes in your template VM's. This is also important if you use the [Reset all mac addresses on all network cards] option when cloning a VM. In the best case, this option will overwrite your network settings (a backup is saved on the guest system). In the worst case, it will disable the network card on already-installed VM's.
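As a way to keep that log reproducible, the clone plan can be generated rather than typed by hand. The following is a minimal sketch, not the procedure used for this article: the template name, node names and internal network name are hypothetical, and the commands are only printed, not executed. It assumes the standard `VBoxManage clonevm` and `VBoxManage modifyvm` subcommands and assigns every clone an explicit, unique MAC address with the 08:00:27 prefix that VirtualBox uses for its own generated addresses.

```python
import random

VBOX_OUI = "080027"  # vendor prefix VirtualBox uses for generated MAC addresses


def make_mac(rng):
    """Return a 12-hex-digit MAC without separators, as VBoxManage expects."""
    return VBOX_OUI + "".join(rng.choice("0123456789ABCDEF") for _ in range(6))


def clone_plan(template, hostnames, intnet, seed=None):
    """Build one plan entry (hostname, MAC, commands) per cluster node.

    All names passed in are placeholders chosen by the caller.
    """
    rng = random.Random(seed)
    used = set()
    plan = []
    for host in hostnames:
        mac = make_mac(rng)
        while mac in used:  # guarantee that no two clones share a MAC
            mac = make_mac(rng)
        used.add(mac)
        commands = [
            ["VBoxManage", "clonevm", template, "--name", host, "--register"],
            ["VBoxManage", "modifyvm", host,
             "--nic1", "intnet", "--intnet1", intnet,
             "--macaddress1", mac],
        ]
        plan.append({"hostname": host, "mac": mac, "commands": commands})
    return plan


if __name__ == "__main__":
    # Hypothetical names; replace them with the ones from your own planning.
    nodes = ["node1", "node2", "node3"]
    for node in clone_plan("centos6-template", nodes, "hadoopnet", seed=1):
        print(node["hostname"], node["mac"])
        for cmd in node["commands"]:
            print("   ", " ".join(cmd))
```

The printed plan doubles as the change log for the clones. Note that this only covers the VirtualBox side: the hostname inside each guest still has to be changed on the guest itself.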
When setting up VirtualBox VM's you may consider the following options:
Setting up PfSense
PfSense is a firewall and networking solution which allows you to delegate all network- and security-related issues to a dedicated host and focus on the cluster setup itself. You can download it at www.pfsense.org.
To keep things simple, we start with the PfSense factory settings (DHCP and DNS forwarding enabled) and make some minor changes to set up our network.
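The PfSense guest needs two network adapters to play this role: one bridged to the host's network (the WAN side) and one attached to the internal VirtualBox network that the Hadoop nodes use (the LAN side). A minimal sketch of that wiring follows, assuming the standard `VBoxManage modifyvm` options; the VM name, host adapter name and internal network name are hypothetical placeholders, and the commands are only printed.

```python
def pfsense_nic_commands(vm, host_adapter, intnet):
    """Return VBoxManage commands that give the firewall VM a WAN and a LAN adapter.

    vm, host_adapter and intnet are placeholders supplied by the caller.
    """
    return [
        # Adapter 1 (WAN): bridged onto a physical adapter of the host.
        ["VBoxManage", "modifyvm", vm,
         "--nic1", "bridged", "--bridgeadapter1", host_adapter],
        # Adapter 2 (LAN): the internal network shared with the Hadoop nodes.
        ["VBoxManage", "modifyvm", vm,
         "--nic2", "intnet", "--intnet2", intnet],
    ]


if __name__ == "__main__":
    for cmd in pfsense_nic_commands("pfsense", "eth0", "hadoopnet"):
        print(" ".join(cmd))
```

The internal network name must match the one used for the cluster nodes, otherwise the nodes will not see the DHCP and DNS services that PfSense offers on its LAN interface.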
Setting up CentOS hosts
Adaptations for Cloudera Manager
Allowing PfSense to be managed from your Host OS
Normally this is not encouraged: http://doc.pfsense.org/index.php/How_can_I_access_the_webGUI_from_the_WAN%3F but since the idea is to economize on resources (so that more of them are left for the Hadoop nodes), we do not want to run a desktop instance inside the VirtualBox network, and therefore need access to the webGUI through the WAN interface.
This is the result:
Allowing your Hadoop cluster to be managed from your host OS
Apart from the memory footprint, there are other advantages to having your cluster accessible from outside the firewalled network.
To gain access to your Hadoop services, adding NAT rules will unfortunately not suffice. You may gain access to your server management site, but fine-grained control and reporting for each separate node will not be possible because of name resolution issues (links keep pointing to resources behind the firewall). A reverse proxy is required to rewrite all links contained in the management interface. In the following section we describe how to set up squid's reverse proxy to resolve this issue.
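To make the link problem concrete, here is a small illustration of what "rewriting links" means. It is not squid configuration and not the setup used in this article, only a sketch of the idea: links in a page that point at hosts behind the firewall are replaced with addresses the host OS can reach. The hostnames, ports and addresses are made up for the example.

```python
import re

# Matches the scheme-plus-host part of href/src attributes, e.g. href="http://host:port
LINK_RE = re.compile(r'((?:href|src)="https?://)([^/"]+)')


def rewrite_links(html, host_map):
    """Replace internal host[:port] parts of links using host_map.

    host_map maps an internal "host:port" to an externally reachable one;
    links to hosts that are not in the map are left untouched.
    """
    def swap(match):
        prefix, netloc = match.group(1), match.group(2)
        return prefix + host_map.get(netloc, netloc)

    return LINK_RE.sub(swap, html)


if __name__ == "__main__":
    # Hypothetical node name and forwarded address, for illustration only.
    page = '<a href="http://node1.hadoop.local:50075/logs">node1 logs</a>'
    mapping = {"node1.hadoop.local:50075": "192.168.0.10:8075"}
    print(rewrite_links(page, mapping))
```

With plain NAT rules, the link above would still point at node1.hadoop.local, a name the host OS cannot resolve. That is why the rewriting step is needed in addition to port forwarding.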
(to be continued)