How to become a database developer, database engineer, or DBA.

Intro

In this article, we will cover the specific skills you need to become a database developer, where to learn those skills, and how to showcase those skills to potential employers. Other common job titles with similar responsibilities include database engineer, or database administrator (DBA).

In my opinion, to be a database developer, you should:

Why should you trust me?

I am currently employed as a database engineer, and have interviewed candidates for database engineering positions.

Write complex SQL queries

To contribute to a team of software engineers as a database specialist, you need to be able to write advanced SQL queries. At a minimum, you should be able to understand what every line in the following query is doing without needing to refer to Google.

select date_trunc('month', sales.sale_date) as month, sales_people.salesperson_name as salesperson, count(*) as num_sales from sales left join sales_people on sales.salesperson_id = sales_people.salesperson_id where sales_people.city in ('Boston', 'New York') group by 1, 2 having count(*) > 100 order by 1, 2;

SQL is not a very large programming language. In my opinion, a smart, motivated person can learn how to write the above query in a couple months by doing SQL problems on HackerRank. In fact, if you commit yourself to doing SQL problems all day every day, I believe you can become somewhat proficient at writing select statements in just a few weeks.

To learn and showcase your ability to write complex SQL queries, I recommend solving database problems on HackerRank to earn a Gold SQL Badge… No cheating, though!

Deploy a database to the cloud

You ability to work with databases depends on your ability to provision computers and deploy database software onto those computers. While there are many options to deploy “managed” databases (such as AWS RDS or GCP Cloud SQL), you also need to be able to deploy a database onto your own computer for testing purposes.

Once you’ve learned how to query data with SQL, a great next step is to install an open source database such as Postgres on your own computer, insert sample data, and run a few queries. It is also a good idea to understand authentication and authorization, so try your hand at creating users, passwords, and granting control of certain databases and tables to certain users.

Setting up a database can be a pain point for beginners, here is a tutorial on setting a local PostgreSQL database with Docker

I recommend deploying the official Postgres Docker image to your local laptop, and completing a few of the database exercises from PostgreSQL Exercises. Once you’re comfortable, try doing deploying Postgres on the cloud!.

Have basic programming skills

Unless you are working with data warehouses to do big data analysis / business intelligence, you are probably going to be supporting the development of an application. Generally speaking, most modern applications have the following components:

Generally speaking, the front end talks to the backend, and the backend talks to the database. If you want to contribute to a project with other developers on it (most projects in the real world involve multiple team members), you should be able to discuss basic programming concepts with those developers. If you already have a programming language in mind, because your organization uses it, you should go with that programming language. If not, it’s hard to go wrong with Python or Javascript.

To learn and showcase your ability to write Python, Javascript, or some other programming language, I recommend solving those problems on HackerRank to earn at least a Silver Badge… No cheating, though!

Understand data modeling concepts

There are two popular types of data modeling used in production at most companies.

Relational data modeling

Relational data modeling is the most common type of data modeling. It was invented (or at least popularized) by Edgar Codd in the late 1960’s. This paradigm has stood the test of time as being flexible, useful, and easy to work with - you NEED to understand how to model data relationally to work with databases.

The following video jokes about people who are obsessed with MongoDB and other non-relational databases.

A relational data model essentially looks like a bunch of spreadsheets (called tables) that are connected together with shared columns known as “foreign keys.” These shared columns join the tables together in various ways, allowing you to connect credit card payments to customers, product sales to salespeople, professors to classrooms, etc.

I recommend reading this article on relational data modeling as an introduction to relational data modeling, and then completing this course on relational databases on Data Camp.

Non-relational data modeling

Relational databases are extremely important, but they aren’t the only solution in a good database engineer’s toolkit. The two most common reasons for using a non-relational database (commonly known as NoSQL database) are:

Scaling issues (many NoSQL databases are built to run on multiple computers at the same time - a concept known as horizontal scaling). The data isn’t a natural fit for a relational model (logging HTTP requests with JSON in them is a great use case for key value pair databases, for example). The dominant NoSQL data model is key value pairs, such as MongoDB or AWS DynamoDB. Another common data model is the graph model, of which Neo4j is a common choice. We will only go over key value pairs in this article, as graph databases are a fairly niche choice at this point in time.

Key value pairs

If you know what a Python dictionary, JSON object, or Java hashmap looks like, you know what a key value pair looks like. If not, here is an example:

{ 'name': 'Cal Mitchell', 'occupation': 'Database Engineer', 'hobbies': [ 'Surfing', 'Salsa dancing', 'Disc Golf' ], 'education': { 'Undergrad': 'La Salle University', 'Masters': 'Boston University', 'Databases graduate certificate': 'Stanford University' } }

The top level keys in this key value map are: name, occupation, hobbies, and education.

As you can see, there are more keys in education: Undergrad, Masters, and Databases graduate certificate - this is known as a nested key value pair. Making a nested key value pair creates additional complexity, but it also increases the ability of the data structure to hold more information. Making tradeoffs between complexity and completeness of information is a key part of creating good key value pair data structures.

I highly recommend taking this course on MongoDB at Data Camp to learn about key value pair data modeling, and completing a few exercises in Leetcode’s Hash Table problem set.

Improve database performance with indexes

Knowing when to add an index to a table is an important part of working with databases. An index can be thought of as a table of contents for a database - it allows you to find certain rows within a table without scanning the entire table. A fairly common index type that is fairly easy to understand is a “block range index”, or BRIN for short.

To understand what a BRIN does, you need to have a basic understanding of how a database actually stores data. Most relational databases store data using small (maybe 8 KBs) files in designated folders on your computer. For example, let’s say you are querying a table called customers, which has a column named customer_ id. If you try to look up a customer by their ID while that table doesn’t have any indexes, the database may have to scan every single file in the database for the customer that you’re looking for. There may be thousands, millions, or in some cases even billions of files to scan! Wouldn’t it be nice if the computer just knew which file to look at?

Enter the BRIN. It is a small data structure built to make queries on tables with incrementing primary keys faster! An incrementing key can be something like customer ID, where the first customer in the database has an ID of 1, the second customer has an ID of 2, etc. The BRIN stores the minimum and maximum value of a given column, such as customer ID, for each file. This makes it so the database engine can more quickly find the file with the right customer, instead of having to look through every one.

Putting indexes on a table can speed up a relational database by orders of magnitude. Read operations can become thousands of times faster! That being said, adding an index to a table creates a small performance overhead on write / update operations, but the cost is usually worth it.

I recommend reading Geek For Geek’s article on Database Indexing to learn the basics of indexing, and then reading PostgreSQL’s index documentation to learn about the various types of indexes available in modern databases.

Final project - stand up business intelligence software

Redash is a great, free option for charting software.

Your final project is spinning up a database (I recommend Postgres), uploading data into it, then setting up charting software (Redash is a great, free option) to create business intelligence reports. For example:

Spin up an instance of Postgres in Amazon RDS.

Conclusion / checklist

Depending on how much time you put into it, you should be able to get through the above material in as little as a two months. Along the way, you’ve learned valuable skills, picked up badges from HackerRank and certificates from Data Camp, and created visualizations which you can share in blog posts. All of these things will look great on a resume when applying for entry level database engineering jobs, and will give you something constructive to talk about during job interviews.

Here is a checklist containing all of the information from this article. Go through it one by one!

I hope this article has helped you understand the work required to become a database developer. Keep hacking!