Apache Drill with ZooKeeper - Install on Ubuntu 16.04

Apache Drill

Here is the description from Apache Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores.
For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.

Picture source: Drilling into Data with Apache Drill

What is Drillbit?

Drillbit is Apache Drill's daemon, which runs on each node of the cluster.
It uses ZooKeeper to maintain cluster membership and health-check information.
It is responsible for accepting requests from the client, processing the queries, and returning results to the client.
The Drillbit that receives a request from the client becomes the Foreman for that query.

Prerequisite

Java 7 or 8 should be installed before we use Drill. Let's check:

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

If not installed, we need to install Java:

$ sudo apt-get update

$ sudo apt-get install default-jdk
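
The version check above can also be scripted. The following is a minimal sketch, not part of the Drill distribution: the helper names (java_major, installed_java_version, java_ok_for_drill) are hypothetical, and it assumes the quoted version string on the first line of the java -version output looks like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: extract the Java major version from a version string
# such as "1.8.0_111" (Java 8) or "1.7.0_80" (Java 7).
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_111 -> 8
        *)   echo "$1" | cut -d. -f1 ;;   # newer scheme: 11.0.2 -> 11
    esac
}

# Read the quoted version from the first line of `java -version`
# (the command prints to stderr, hence the redirect).
installed_java_version() {
    java -version 2>&1 | head -n 1 | sed 's/.*"\(.*\)".*/\1/'
}

# Drill 1.8 expects Java 7 or 8, per the prerequisite above.
java_ok_for_drill() {
    major=$(java_major "$1")
    [ "$major" = "7" ] || [ "$major" = "8" ]
}

if command -v java >/dev/null 2>&1; then
    v=$(installed_java_version)
    if java_ok_for_drill "$v"; then
        echo "Java $v is suitable for Drill 1.8"
    else
        echo "Java $v is outside the 7-8 range"
    fi
else
    echo "java not found; install it with apt-get as shown above"
fi
```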

Install Drill on Ubuntu 16.04

Download Drill from the Apache Drill site and extract it to /usr/local/:

$ wget http://www-us.apache.org/dist/drill/drill-1.8.0/apache-drill-1.8.0.tar.gz

$ sudo tar xvzf apache-drill-1.8.0.tar.gz -C /usr/local

Now, we can start Drill.

Start Drill in embedded mode

To start the Drill shell in embedded mode, we use the drill-embedded command. Internally, the command uses a JDBC connection string and identifies the local node as the ZooKeeper node.

Before we start Drill, let's navigate to the Drill installation directory (/usr/local/apache-drill-1.8.0) and issue the following command to start the Drill shell:

root@laptop:/usr/local/apache-drill-1.8.0# bin/drill-embedded
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Nov 10, 2016 10:27:25 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.8.0 
"a little sql for your nosql"
0: jdbc:drill:zk=local>

Embedded mode requires less configuration and is preferred for testing.

As the output shows, we get the 0: jdbc:drill:zk=local> prompt, which confirms that the drill-embedded command uses a JDBC connection string and identifies the local node as the ZooKeeper node.

Here is what each part of the prompt means:

0 is the number of connections to Drill, which can be only one in embedded mode.
jdbc is the connection type.
zk=local means the local node substitutes for the ZooKeeper node. In other words, Drill in embedded mode does not require installation of ZooKeeper.

Note that drillbit (Drill daemon) starts automatically in embedded mode.

Now the Drill shell is running, and the Web Console port (8047) is listening:

$ netstat -nlpt|grep 8047
tcp6   0   0 :::8047   :::*  LISTEN      -

The Drill Web Console is now available at localhost:8047.

To exit the Drill shell, issue the !quit command:

0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
root@laptop:/usr/local/apache-drill-1.8.0/bin#

Install ZooKeeper

To install Drill on nodes in the cluster, we need to configure a cluster ID and add ZooKeeper information.

So we need to install Apache ZooKeeper, which coordinates and synchronizes configuration information across the nodes of a distributed system.

Since the ZooKeeper package is available in Ubuntu's default repositories, we can install it using apt-get:

$ sudo apt-get install zookeeperd
...
update-alternatives: using /etc/zookeeper/conf_example to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode
Setting up zookeeperd (3.4.8-1) ...
...

Once installed, ZooKeeper will be started as a daemon automatically. By default, it listens on port 2181:

$ sudo netstat -nlpt | grep ':2181'
tcp6   0    0 :::2181   :::*   LISTEN    13827/java

To check that ZooKeeper is healthy, connect to it with telnet, type in ruok and press ENTER. If everything's fine, ZooKeeper will say imok and end the Telnet session:

$ telnet localhost 2181
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
ruok
imokConnection closed by foreign host.
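
The same health check can be done without an interactive session. The sketch below assumes that nc (netcat) is installed and that the server answers the ruok four-letter command as in the telnet session above. The function names zk_four_letter and zk_reply_is_ok are hypothetical helpers, not ZooKeeper tools.

```shell
#!/bin/sh
# Send a ZooKeeper four-letter command and print the reply.
# Assumes `nc` is available; host and port default to localhost:2181.
zk_four_letter() {
    cmd=$1
    host=${2:-localhost}
    port=${3:-2181}
    printf '%s' "$cmd" | nc "$host" "$port"
}

# Interpret a reply to ruok: only the exact string "imok" counts as healthy.
zk_reply_is_ok() {
    [ "$1" = "imok" ]
}

# Usage (requires a reachable ZooKeeper):
#   reply=$(zk_four_letter ruok localhost 2181)
#   if zk_reply_is_ok "$reply"; then echo "ZooKeeper is healthy"; fi
```

Keeping the network call and the reply check in separate functions makes the check easy to reuse in a monitoring script.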

Or we can check the status:

# /usr/share/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Mode: standalone

We may want to add a user (here, ubuntu) to the zookeeper group so that the user can write zookeeper_server.pid under the /var/lib/zookeeper directory (called dataDir). That file is needed when we try to stop the ZooKeeper process via its pid:

$ sudo usermod -a -G zookeeper ubuntu

$ groups ubuntu
ubuntu : ubuntu adm dialout cdrom floppy sudo audio dip video plugdev netdev lxd zookeeper

Running ZooKeeper in standalone mode is convenient for evaluation, development, and testing. In production, however, we should run ZooKeeper in replicated mode. A replicated group of servers in the same application is called a quorum, and in replicated mode, all servers in the quorum have copies of the same configuration file.

[Figure: directory structure of the ZooKeeper installation]

Stop the ZooKeeper instance that was started automatically at install time, then start it again:

# /usr/share/zookeeper/bin/zkServer.sh stop

# /usr/share/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

Note: the ZooKeeper configuration file is /etc/zookeeper/conf/zoo.cfg.

Configure Drill in distributed mode

[Figure: directory structure of the installed Drill]

In conf/drill-override.conf, we need to specify the Drill cluster ID (cluster-id), and provide ZooKeeper host names and port numbers in zk.connect to configure a connection to our ZooKeeper quorum.

If we install Drill on multiple nodes, every node must use the same cluster ID. The default ZooKeeper port on the open source version of Apache Drill is 2181:

 drill.exec:{
  cluster-id: "<mydrillcluster>",
  zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
 }

Our configuration:

drill.exec: {
  cluster-id: "drillbits1",
  #zk.connect: "localhost:2181"
  zk.connect: "54.183.167.85:2181"
}
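
When the list of ZooKeeper hosts changes often, the file can be generated instead of edited by hand. This is a minimal sketch under the assumption that every ZooKeeper host uses the same client port. The functions zk_connect_string and render_drill_override are hypothetical helpers, and the output still has to be copied to conf/drill-override.conf manually.

```shell
#!/bin/sh
# Join host names into a zk.connect value, e.g. "h1:2181,h2:2181".
# First argument is the port, the remaining arguments are hosts.
zk_connect_string() {
    port=$1
    shift
    out=""
    for host in "$@"; do
        out="${out:+$out,}$host:$port"
    done
    echo "$out"
}

# Print a drill-override.conf body for the given cluster ID and zk.connect value.
render_drill_override() {
    cluster_id=$1
    zk=$2
    cat <<EOF
drill.exec: {
  cluster-id: "$cluster_id",
  zk.connect: "$zk"
}
EOF
}

# Reproduce the single-host configuration shown above.
zk=$(zk_connect_string 2181 54.183.167.85)
render_drill_override drillbits1 "$zk"
```

Passing more hosts to zk_connect_string yields the comma-separated form that zk.connect expects for a multi-node quorum.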

Start Drill in distributed mode

In distributed mode, Drill runs on one or more nodes in a clustered environment. Running a ZooKeeper quorum is required.

To use Drill in distributed mode, we first need to start a Drill daemon (Drillbit) on each node in the cluster. So, before attempting to connect a client, we should start the Drillbit using the drillbit.sh command.

Note that we can use the drillbit.sh command to perform other tasks as well:

Check the status of the Drillbit.
Stop or restart a Drillbit.
Configure a Drillbit to restart by default.

The syntax of drillbit.sh command looks like this:

drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
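
To illustrate the syntax, the sketch below builds a drillbit.sh command line and rejects any action outside the list above. The drillbit_command function is a hypothetical wrapper, and DRILL_HOME is assumed to point at the install directory used in this tutorial.

```shell
#!/bin/sh
# Assumed install location from this tutorial; override by exporting DRILL_HOME.
DRILL_HOME=${DRILL_HOME:-/usr/local/apache-drill-1.8.0}

# Print the drillbit.sh command line for a valid action, or fail for an unknown one.
drillbit_command() {
    case "$1" in
        start|stop|status|restart|autorestart)
            echo "$DRILL_HOME/bin/drillbit.sh $1"
            ;;
        *)
            echo "unknown drillbit.sh action: $1" >&2
            return 1
            ;;
    esac
}

# Show the commands for a typical status check and restart.
drillbit_command status
drillbit_command restart
```

The function only prints the command; run the printed line (with sudo if needed) to act on the Drillbit.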

Let's start it.

We have two distributed nodes:

server.1: 54.183.167.85 (private: 172.31.2.106)
server.2: 54.67.74.118 (private: 172.31.13.111)

We start a Drillbit on each server:

ubuntu@ip-172-31-2-106:/usr/local/apache-drill-1.8.0$ sudo su
root@ip-172-31-2-106:/usr/local/apache-drill-1.8.0#

root@ip-172-31-2-106:/usr/local/apache-drill-1.8.0# ./bin/drillbit.sh start
Starting drillbit, logging to /usr/local/apache-drill-1.8.0/log/drillbit.out

Same on the 2nd server:

ubuntu@ip-172-31-13-111:/usr/local/apache-drill-1.8.0$ sudo ./bin/drillbit.sh start

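To start ZooKeeper, we use the same command on each machine. ZooKeeper should be running before the Drillbits start, since each Drillbit registers itself there: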

ubuntu@ip-172-31-2-106:~$ sudo /usr/share/zookeeper/bin/zkServer.sh start

ubuntu@ip-172-31-13-111:~$ sudo /usr/share/zookeeper/bin/zkServer.sh start

The configuration (/usr/local/apache-drill-1.8.0/conf/drill-override.conf) is the same on all Drill nodes:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "54.183.167.85:2181,54.67.74.118:2181"
}
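
With the Drillbits registered in ZooKeeper, a client can connect through the quorum. The sketch below builds the JDBC URL for that, assuming the ZooKeeper root directory is left at its default name drill. The drill_jdbc_url function is a hypothetical helper, and the sqlline command is only printed here, not executed.

```shell
#!/bin/sh
# Build a Drill JDBC URL of the form
#   jdbc:drill:zk=<zk hosts>/<zk root>/<cluster-id>
# The zk root defaults to "drill" (assumed default; adjust if it was changed).
drill_jdbc_url() {
    zk=$1
    cluster_id=$2
    zk_root=${3:-drill}
    echo "jdbc:drill:zk=$zk/$zk_root/$cluster_id"
}

# Values taken from the drill-override.conf above.
url=$(drill_jdbc_url "54.183.167.85:2181,54.67.74.118:2181" drillbits1)

# Print the command to run from the Drill installation directory.
echo "bin/sqlline -u $url"
```

The cluster ID in the URL must match the cluster-id value in drill-override.conf, otherwise the client will not find the Drillbits.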

The ZooKeeper configuration (/etc/zookeeper/conf/zoo.cfg) is also the same on servers 1 and 2:

# the port at which the clients will connect
clientPort=2181

# specify all zookeeper servers
server.1=54.183.167.85:2888:3888
server.2=54.67.74.118:2888:3888

The server ID of each ZooKeeper instance is set in /etc/zookeeper/conf/myid and is just a unique number: 1 for server #1 and 2 for server #2.
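
The server.N lines and the matching myid values follow a simple pattern, so they can be derived from an ordered host list. This is a minimal sketch assuming the peer and election ports 2888 and 3888 used above. The zoo_server_lines function is a hypothetical helper, and the numbering is simply the position of the host in the list.

```shell
#!/bin/sh
# Print one "server.N=host:2888:3888" line per host, numbering from 1.
zoo_server_lines() {
    id=1
    for host in "$@"; do
        echo "server.$id=$host:2888:3888"
        id=$((id + 1))
    done
}

# Reproduce the two entries from zoo.cfg above.
zoo_server_lines 54.183.167.85 54.67.74.118

# On each machine, myid must hold that machine's own number, for example:
#   echo 1 | sudo tee /etc/zookeeper/conf/myid   # on server #1
#   echo 2 | sudo tee /etc/zookeeper/conf/myid   # on server #2
```

Generating the list from one ordered set of hosts helps keep zoo.cfg identical on all servers, which replicated mode expects.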