Presto And Hive Tutorial

There is an accompanying video that goes through each step and explains a few things along the way.

Presto is an interesting technology. It isn’t really a database; it’s more of a query engine.

Presto connects to multiple data sources and lets you query them at the same time. This concept is known as “federated queries”, and it makes transferring data from, say, a production PostgreSQL database to a Hive data warehouse very easy! It also has tight integration with Hive, which makes it a great alternative to traditional data warehouse products like Vertica or Redshift.

Best of all, it is open source, and free!

In this tutorial, we will use AWS services to create a single-node Presto cluster and connect it to AWS Glue, a managed AWS service whose Data Catalog acts as a Hive metastore.

Step 1 - Create an EC2 instance

We will be creating a Presto node from scratch, using an EC2 instance running Ubuntu. There are easier ways to do this, but due to fragmentation in the Presto ecosystem, there isn’t a single official source of AMIs or Docker images. This makes it hard to recommend anything other than building it yourself.

Anyway, let’s sign in to the EC2 console and launch a new instance:
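If you prefer the command line to the console, something like the following AWS CLI call can launch a comparable instance. The AMI ID, key pair name, and security group are placeholders you would swap for your own, and the instance type is just a suggestion that leaves headroom for the 14 GB JVM heap we configure later:

# All IDs/names below are placeholders; pick an Ubuntu AMI for your region
aws ec2 run-instances \
  --image-id ami-XXXXXXXXXXXXXXXXX \
  --instance-type r5.xlarge \
  --key-name your-key-pair \
  --security-group-ids sg-XXXXXXXXXXXXXXXXX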

Step 2 - SSH into the EC2 instance
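Once the instance is up, connect to it over SSH. The key path and hostname below are placeholders; the default user on Ubuntu AMIs is ubuntu:

# Substitute your own key pair file and the instance's public DNS name or IP
ssh -i ~/.ssh/your-key.pem ubuntu@YOUR_INSTANCE_PUBLIC_DNS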

Step 3 - Install Presto dependencies

Presto needs Python and Java to function. We can install both with Ubuntu’s built-in package manager, apt.
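A minimal install might look like this; exact package names vary a bit between Ubuntu releases, and Presto 0.244.x expects Java 8 while the launcher script needs a Python interpreter on the PATH:

# Assumes a recent Ubuntu release; adjust package names if apt can't find them
sudo apt update
sudo apt install -y openjdk-8-jdk-headless python-is-python3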

Step 4 - Install Presto

Now it’s time to download and install Presto onto the server we just started. Start by getting the download link from the Presto download page.

curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.244.1/presto-server-0.244.1.tar.gz
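After the download finishes, unpack the tarball and move into the directory it creates:

# Extract the server and change into the new directory
tar -xzf presto-server-0.244.1.tar.gz
cd presto-server-0.244.1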

Configure Presto

There are three files we need to create to get Presto working in general: etc/config.properties, etc/jvm.config, and etc/node.properties (all relative to the Presto installation directory).

We also need to create a fourth file, hive.properties, to allow Presto to connect to Hive, which is the data source we will be working with. Remember, Presto isn’t a database itself!

Steps

First, create an etc directory inside the Presto installation folder and move into it: mkdir etc, then cd etc. Open config.properties in Vim with vim config.properties and give it the following contents:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=5GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Type :wq and press Enter to write the file and quit Vim.

Repeat the same process for jvm.config:

-server
-Djdk.attach.allowAttachSelf=true
-Xmx14G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

And for node.properties:

node.environment=testing
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

Now, still inside etc, create a directory called catalog and go inside of it: mkdir catalog, then cd catalog.

Create hive.properties there with the following contents:

connector.name=hive-hadoop2
hive.metastore=glue
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.s3-file-system-type=presto
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=1000
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m

Note the line hive.metastore=glue. That line makes Presto use AWS’s Glue service as its Hive metastore.

Normally we would have to spin up a Hadoop and Hive cluster to manage the data Presto is querying, but instead we will let AWS do all that for us in an S3 bucket. If you want to learn more about what’s going on here, check out the Presto Hive Connector documentation.

Run Presto

Presto is now all configured. Let’s open up a few tabs with screen to run Presto, as well as view its logs, and access it with the CLI.

Start a screen
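One way to do this is to start a named session; Ctrl-a c then opens additional windows inside it, and Ctrl-a n cycles between them:

# Start a named screen session called "presto"
screen -S presto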

Start the Presto program
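From inside the presto-server-0.244.1 directory, the bundled launcher script starts the server; run keeps it in the foreground, while start would daemonize it instead:

# Run Presto in the foreground so output stays visible in this screen window
./bin/launcher run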

Open logs to ensure Presto is running
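With node.data-dir set to /var/presto/data as above, the server log should end up under that directory, so something like this in another screen window lets you watch it:

# Follow the server log; the path follows from node.data-dir in node.properties
tail -f /var/presto/data/var/log/server.log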

Install Presto CLI

Let’s install the Presto CLI to run SQL commands! We can install it anywhere, so let’s just put it in our home directory.

curl https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.244.1/presto-cli-0.244.1-executable.jar -o presto
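The downloaded file is a self-executing JAR, so it needs the executable bit set before you can run it:

# Make the CLI executable
chmod +x presto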

Run the Presto CLI

The CLI is basically just a binary that we can run. Do that with ./presto --server localhost:8080
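If you’d rather not type fully qualified names in every query, the CLI also accepts a default catalog (and optionally a schema), for example:

# Start the CLI with hive as the default catalog
./presto --server localhost:8080 --catalog hive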

Create S3 bucket

OK, we now have the Presto server running, as well as the CLI. The last thing we need to do before we start running commands is give the Hive data warehouse a place to store data, namely, an S3 bucket!
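You can create the bucket in the S3 console, or, if you have the AWS CLI configured with credentials, from the terminal; the bucket name is a placeholder and has to be globally unique:

# Create the bucket that Hive/Glue will store table data in
aws s3 mb s3://YOUR_BUCKET_NAME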

Create a schema and a table

Create schema

CREATE SCHEMA hive.my_schema WITH (location = 's3://YOUR_BUCKET_NAME/');

Create table

CREATE TABLE hive.my_schema.my_table ( id int, name varchar );

Insert some data

INSERT INTO hive.my_schema.my_table VALUES (1, 'cal');

Query the data

SELECT * FROM hive.my_schema.my_table;

Last, check out the S3 console!
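If you’d rather check from the terminal, listing the bucket with the AWS CLI works too; YOUR_BUCKET_NAME is the same placeholder as before:

# List everything Presto/Hive wrote into the bucket
aws s3 ls s3://YOUR_BUCKET_NAME --recursive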

If everything went as planned, Presto, through Hive (which is managed by AWS Glue), will have dynamically created, then inserted data into an S3 bucket in the ORC format. The ORC format is a column-oriented data storage format, part of what makes Presto so fast!

From here, you can start to experiment with Presto’s other connectors, load real data into an S3 bucket, and connect a BI application to run big data analytics on top of it.

I hope this was useful to you. Reach out if you find any errors!