Presto And Hive Tutorial

This video goes through each step and explains a few things along the way.

Presto is an interesting technology. It isn’t really a database - its more of a query engine.

Presto connects to multiple data sources and allows you to query them at the same time. This is a concept known as “federated queries”, and it makes transferring data from, say, a production PostgreSQL database, to a Hive data warehouse, very easy! It has tight integration with Hive, which makes it a great alternative to traditional Data warehouse products like Vertica or Redshift.

Best of all, it is open source, and free!

In this tutorial, we will use AWS services to create a single node Presto cluster, and connect it to a managed Hive data warehouse service from AWS called AWS Glue.

Step 1 - Create an EC2 instance

We will be creating a Presto node from scratch, using an EC2 instance running Ubuntu. There are easier ways to do this, but due to fragmentation in the Presto ecosystem, there isn’t a single official source of AMIs or Docker images. This makes it hard to recommend anything other than building it yourself.

Anyways, lets sign into the EC2 console, and launch a new instance:

Ubuntu X86
R5 Large for 16 GB RAM
- How Presto works is it loads all the data you need into RAM before processing. Because of this, its important to have a decent amount of ram.
Create new IAM role
- Create a new role
  - Attach policies
    - EC2
      - AWS Glue Service Role
      - S3 full access
    - Name it whatever you want. I choose presto_role
- Go back to configure instance
- Refresh
- Use that role
Give it 50 gig root storage just so we definitely won’t run out of space
No tags
No networking changes
Create a key pair and download. Name it test_pair, or whatever else you want, just remember the name. I am calling it test pair 3

Step 2 - SSH into the EC2 instance

Get the SSH key you download into the ~/.ssh directory
- If you’re wondering what .ssh is, its a hidden directory, in your home directory, that holds SSH keys.
  - SSH keys are a file on your computer that proves you are who you say you are, and is the most common way to access computers in the cloud. Basically, unless we configure an SSH key, we won’t be able to access the server we just spun up.
- SSH keys are only supposed to be read by you, not other users on the same computer. If you get error that the file isn’t secure enough - we can fix that by going to the ~/.ssh directory and running the following command: chmod 400 test_pair_3.pem
SSH in
- Get public IP address from EC2 dashboard
- ssh -i ssh key, ubuntu@IP_ADDRESS

Step 3 - Install Presto dependencies

Presto needs Python and Java to function. We can install those with Ubuntu’s built in package manager, apt.

sudo apt update
sudo apt -y install python
sudo apt -y install default-jre

Step 4 - Install Presto

Now it’s time to download and install Presto onto the server we just started. Start by getting the download link from the Presto download page.

You can use the command below to download Presto, or swap out the link with whatever you got from the download page to make sure you’re using the latest version of Presto.

curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.244.1/presto-server-0.244.1.tar.gz

Unzip the file you download tar -xvzf presto-server-0.244.1.tar.gz

Cofigure Presto

There are 3 files we need to alter to get Presto working in general:

config.properties - This file configures options for the Presto Cluster. We only have a single node at this point, but these options are very relevant when you start to scale to multiple nodes.
jvm.config - This file alters the behavior of the JVM (Presto runs on Java). A common mistake is to not configure a key RAM variable, -Xmx.
node.properties - This file sets a few variables specific to the node that you are currently working on.

Also, we need to configure a file hive.properties to allow Presto to connect to Hive, which is the data source we will be working with. Remember, Presto isn’t a database itself!

Steps

Go into the directory you just unzipped with cd presto-server-0.244.1
Create a new directory called etc, which will be used for configuration purposes with mkdir etc
Go into the directory cd etc
Create a file called config.properties, and open it with the Vim text editor with the command vi config.properties
Press i to go into edit mode, and insert the following code. You can learn more about what’s going on here at the Presto documentation.

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=5GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Press :wq to write the file and quit Vim.

Go through the same process to create a file called jvm.config with the following contents:

-server
-Djdk.attach.allowAttachSelf=true
-Xmx14G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Once again for the node.properties file

node.environment=testing
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

Now create a directory called catalog and go inside of it with mkdir catalog, then cd catalog

And once again use Vim to insert the following contents into hive.properties:

connector.name=hive-hadoop2
hive.metastore=glue
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.s3-file-system-type=presto
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=1000
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m

Note the line hive.metastore=glue. That line makes Presto use AWS’s glue service as its Hive metastore.

Normally we would have to spin up a Hadoop and Hive cluster to manage the data Presto is querying, but instead we will let AWS do all that for us in an S3 bucket. If you want to learn more about what’s going on here, check out the Presto Hive Connector documentation.

Run Presto

Presto is now all configured. Let’s open up a few tabs with screen to run Presto, as well as view its logs, and access it with the CLI.

Start a screen

screen

Start the Presto program

cd ../../bin
sudo ./launcher start

Open logs to ensure Presto is running

Create new tab ctrl-c c
cd /var/presto/data/var/log
tail 1000 -f server.log

Install Presto CLI

Let’s install the Presto CLI to run SQL commands! We can install it anywhere, so let’s just install it on our home directory.

Create new tab ctrl-c c
Go home by running cd
Get the Presto CLI download link from the Presto download page
Download the CLI to our home directory with

curl https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.244.1/presto-cli-0.244.1-executable.jar -o presto 

Give ourselves the ability to execute the file with chmod +x presto

Run the Presto CLI

The CLI is basically just a binary that we can run. Do that with ./presto --server localhost:8080

Create S3 bucket

OK, we now have the Presto server running, as well as the CLI. The last thing we need to do before we start running commands is give the Hive data warehouse a place to store data, namely, an S3 bucket!

Go to AWS S3 console
Create bucket, name it whatever you want, but remember the name.

Create a schema and a table

Create schema

CREATE SCHEMA hive.my_schema WITH (location = 's3://YOUR_BUCKET_NAME/');

Create table

CREATE TABLE hive.my_schema.my_table (
  id   int,
  name varchar
);

Insert some data

insert into hive.my_schema.my_table values (1, 'cal');

Query the data

select * from hive.my_schema.my_table

Last, check out the S3 console!

If everything went as planned, Presto, through Hive (which is managed by AWS Glue), will have dynamically created, then inserted data into an S3 bucket in the ORC format. The ORC format is a column-oriented data storage format, part of what makes Presto so fast!

From here, you can start to experiment with Presto’s other connectors to drop real amounts of useful data into an S3 bucket to connect a BI application to run big data analytics.

I hope this was useful to you. Reach out if you find any errors!