Presto And Hive Tutorial
Presto is an interesting technology. It isn’t really a database; it’s more of a query engine.
Presto connects to multiple data sources and allows you to query them at the same time. This is a concept known as “federated queries”, and it makes transferring data from, say, a production PostgreSQL database to a Hive data warehouse very easy! It has tight integration with Hive, which makes it a great alternative to traditional data warehouse products like Vertica or Redshift.
Best of all, it is open source, and free!
In this tutorial, we will use AWS services to create a single-node Presto cluster and connect it to AWS Glue, a managed AWS service that will act as our Hive metastore.
Step 1 - Create an EC2 instance
We will be creating a Presto node from scratch, using an EC2 instance running Ubuntu. There are easier ways to do this, but due to fragmentation in the Presto ecosystem, there isn’t a single official source of AMIs or Docker images. This makes it hard to recommend anything other than building it yourself.
Anyway, let’s sign in to the EC2 console and launch a new instance:
- Ubuntu X86
- r5.large, for 16 GB of RAM
- Presto does most of its processing in memory, so it’s important to have a decent amount of RAM.
- Create a new IAM role
  - Create a new role, with EC2 as the service that will use it
  - Attach the following policies:
    - AWS Glue Service Role
    - S3 full access
  - Name it whatever you want. I chose presto_role
- Go back to the configure instance screen
  - Refresh the IAM role list
  - Use the role you just created
- Give it 50 GB of root storage just so we definitely won’t run out of space
- No tags
- No networking changes
- Create a key pair and download it. Name it test_pair_3, or whatever else you want, just remember the name.
Step 2 - SSH into the EC2 instance
- Get the SSH key you downloaded into the ~/.ssh directory
  - If you’re wondering what .ssh is, it’s a hidden directory in your home directory that holds SSH keys.
  - SSH keys are files on your computer that prove you are who you say you are, and they are the most common way to access computers in the cloud. Basically, unless we configure an SSH key, we won’t be able to access the server we just spun up.
  - SSH keys are only supposed to be readable by you, not by other users on the same computer. If you get an error that the file isn’t secure enough, we can fix that by going to the ~/.ssh directory and running the following command:
chmod 400 test_pair_3.pem
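If you’re not sure how to get the key into ~/.ssh in the first place, something like this works, assuming your browser saved it to ~/Downloads (adjust the path to wherever yours actually landed):
mv ~/Downloads/test_pair_3.pem ~/.ssh/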
- SSH in
- Get the public IP address from the EC2 dashboard
ssh -i ~/.ssh/test_pair_3.pem ubuntu@IP_ADDRESS
Step 3 - Install Presto dependencies
Presto needs Python and Java to function. We can install those with Ubuntu’s built-in package manager, apt.
sudo apt update
sudo apt -y install python
sudo apt -y install default-jre
Step 4 - Install Presto
Now it’s time to download and install Presto onto the server we just started. Start by getting the download link from the Presto download page.
- You can use the command below to download Presto, or swap out the link with whatever you got from the download page to make sure you’re using the latest version of Presto.
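For reference, here is what that looks like for the 0.244.1 release used in the rest of this tutorial, pulled from Maven Central (confirm the link against the download page before using it):
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.244.1/presto-server-0.244.1.tar.gz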
- Unzip the file you downloaded
tar -xvzf presto-server-0.244.1.tar.gz
Configure Presto
There are 3 files we need to alter to get Presto working in general:
- config.properties - This file configures options for the Presto cluster. We only have a single node at this point, but these options are very relevant when you start to scale to multiple nodes.
- jvm.config - This file alters the behavior of the JVM (Presto runs on Java). A common mistake is to not configure a key RAM variable, -Xmx.
- node.properties - This file sets a few variables specific to the node that you are currently working on.
Also, we need to configure a file, hive.properties, to allow Presto to connect to Hive, which is the data source we will be working with. Remember, Presto isn’t a database itself!
Steps
- Go into the directory you just unzipped with
cd presto-server-0.244.1
- Create a new directory called etc, which will be used for configuration purposes, with
mkdir etc
- Go into the directory
cd etc
- Create a file called config.properties, and open it with the Vim text editor with the command
vi config.properties
- Press i to go into edit mode, and insert the following code. You can learn more about what’s going on here in the Presto documentation.
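A typical single-node configuration, adapted from the Presto deployment documentation, looks like this (a sketch; tune the memory limits to your instance size):
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=4GB
query.max-total-memory-per-node=6GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080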
Press Esc, then type :wq to write the file and quit Vim.
- Go through the same process to create a file called jvm.config with the following contents:
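For example, based on the Presto docs and assuming our 16 GB r5.large, something like the following works; -Xmx is the important line and should leave some headroom for the operating system:
-server
-Xmx12G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError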
- Once again for the node.properties file:
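A minimal sketch; node.id just needs to be a string unique to this node, and node.data-dir is where Presto will keep its data and logs (we look at those logs later):
node.environment=production
node.id=presto-node-1
node.data-dir=/var/presto/data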
- Now create a directory called catalog and go inside of it with mkdir catalog, then cd catalog
- And once again use Vim to insert the following contents into hive.properties:
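A minimal version looks like this (the commented-out region property is an optional assumption; set it only if the Glue region isn’t detected from the instance):
connector.name=hive-hadoop2
hive.metastore=glue
# hive.metastore.glue.region=us-east-1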
Note the line hive.metastore=glue. That line makes Presto use AWS’s Glue service as its Hive metastore.
Normally we would have to spin up a Hadoop and Hive cluster to manage the data Presto is querying, but instead we will let AWS do all that for us in an S3 bucket. If you want to learn more about what’s going on here, check out the Presto Hive Connector documentation.
Run Presto
Presto is now all configured. Let’s use screen to open up a few tabs so we can run Presto, view its logs, and access it with the CLI.
Start a screen
screen
Start the Presto program
cd ../../bin
sudo ./launcher start
Open logs to ensure Presto is running
- Create new tab
ctrl-a c
cd /var/presto/data/var/log
tail -n 1000 -f server.log
Install Presto CLI
Let’s install the Presto CLI to run SQL commands! We can install it anywhere, so let’s just put it in our home directory.
- Create new tab
ctrl-a c
- Go home by running
cd
- Get the Presto CLI download link from the Presto download page
- Download the CLI to our home directory with
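For example, for 0.244.1 the executable CLI jar on Maven Central can be saved straight to a file named presto (again, swap in the link you got from the download page):
wget -O presto https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.244.1/presto-cli-0.244.1-executable.jar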
- Give ourselves the ability to execute the file with
chmod +x presto
Run the Presto CLI
The CLI is basically just a binary that we can run. Do that with
./presto --server localhost:8080
Create S3 bucket
OK, we now have the Presto server running, as well as the CLI. The last thing we need to do before we start running commands is give the Hive data warehouse a place to store data, namely, an S3 bucket!
- Go to AWS S3 console
- Create a bucket; name it whatever you want, but remember the name.
Create a schema and a table
Create schema
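For example (hive is the catalog we defined in hive.properties; the schema name is just for illustration, and the bucket name is a placeholder, so use the bucket you created above):
CREATE SCHEMA hive.tutorial WITH (location = 's3://your-bucket-name/');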
Create table
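A simple example table, stored as ORC (the table and column names here are placeholders):
CREATE TABLE hive.tutorial.people (
  id bigint,
  name varchar,
  age int
)
WITH (format = 'ORC');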
Insert some data
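Continuing with the example table from above:
INSERT INTO hive.tutorial.people VALUES (1, 'Alice', 34), (2, 'Bob', 27);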
Query the data
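For example:
SELECT * FROM hive.tutorial.people;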
Last, check out the S3 console!
If everything went as planned, Presto, through Hive (which is managed by AWS Glue), will have dynamically created, then inserted data into an S3 bucket in the ORC format. The ORC format is a column-oriented data storage format, part of what makes Presto so fast!
From here, you can start to experiment with Presto’s other connectors to move real, useful data into an S3 bucket, then connect a BI application and run big data analytics.
I hope this was useful to you. Reach out if you find any errors!