How to become a database developer, database engineer, or DBA.
In this article, we will cover the specific skills you need to become a database developer, where to learn those skills, and how to showcase those skills to potential employers. Other common job titles with similar responsibilities include database engineer and database administrator (DBA).
In my opinion, to be a database developer, you should:
- Be able to write complex SQL queries.
- Be able to deploy a database to the cloud.
- Have basic programming skills with an object-oriented programming language like Python, Java, Ruby, etc.
- Understand relational and non-relational data modeling concepts.
- Be able to improve database performance by understanding query plans and creating indexes.
- Be able to stand up business intelligence software to create visualizations from SQL queries.
Why should you trust me?
I am currently employed as a database engineer, and have interviewed candidates for database engineering positions.
Write complex SQL queries
To contribute to a team of software engineers as a database specialist, you need to be able to write advanced SQL queries. At a minimum, you should be able to understand what every line in the following query is doing without needing to refer to Google.
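As a concrete stand-in, here is the kind of query I mean. The customers/orders schema and all of the data are invented for this illustration, but the moving parts - a CTE, a join, an aggregate with HAVING, and a window function - are representative. It is wrapped in Python's built-in sqlite3 module so you can run it end to end:

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    amount      REAL,
    ordered_at  TEXT
);
INSERT INTO customers VALUES (1, 'Ada', 'west'), (2, 'Grace', 'east'),
                             (3, 'Edgar', 'west'), (4, 'Barbara', 'east');
INSERT INTO orders VALUES (1, 1, 50.0, '2023-01-05'), (2, 1, 75.0, '2023-02-10'),
                          (3, 2, 20.0, '2023-01-20'), (4, 3, 200.0, '2023-03-01'),
                          (5, 4, 30.0, '2023-01-11');
""")

# A CTE computes per-customer totals, HAVING filters out small spenders,
# and a window function ranks the remaining customers within their region.
query = """
WITH customer_totals AS (
    SELECT c.customer_id,
           c.name,
           c.region,
           SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, c.region
    HAVING SUM(o.amount) > 25
)
SELECT name,
       region,
       total_spent,
       RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS rank_in_region
FROM customer_totals
ORDER BY region, rank_in_region;
"""
for row in cur.execute(query):
    print(row)
```

If you can explain out loud what the CTE, the join, the HAVING clause, and the window function are each doing, you are in good shape for this requirement.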
SQL is not a very large programming language. In my opinion, a smart, motivated person can learn how to write the above query in a couple of months by doing SQL problems on HackerRank. In fact, if you commit yourself to doing SQL problems all day every day, I believe you can become somewhat proficient at writing select statements in just a few weeks.
To learn and showcase your ability to write complex SQL queries, I recommend solving database problems on HackerRank to earn a Gold SQL Badge… No cheating, though!
Deploy a database to the cloud
Your ability to work with databases depends on your ability to provision computers and deploy database software onto those computers. While there are many options for deploying “managed” databases (such as AWS RDS or GCP Cloud SQL), you also need to be able to deploy a database onto your own computer for testing purposes.
Once you’ve learned how to query data with SQL, a great next step is to install an open source database such as Postgres on your own computer, insert sample data, and run a few queries. It is also a good idea to understand authentication and authorization, so try your hand at creating users, passwords, and granting control of certain databases and tables to certain users.
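For example, in Postgres the statements for a restricted, read-only user might look like the following. The role name, password, and database name are made-up placeholders; adapt them to your own setup:

```sql
-- Hypothetical names throughout; adapt to your own setup.
CREATE ROLE report_reader WITH LOGIN PASSWORD 'change-me';

-- Let the role connect to one database and read (but not modify) its tables.
GRANT CONNECT ON DATABASE sales TO report_reader;
GRANT USAGE ON SCHEMA public TO report_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO report_reader;
```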
Setting up a database can be a pain point for beginners; here is a tutorial on setting up a local PostgreSQL database with Docker.
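The core of that local Docker setup is just two commands, assuming Docker is installed. The container name and password below are arbitrary placeholders:

```shell
# Start the official Postgres image in the background, exposing the default port.
docker run --name local-postgres \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d postgres

# Open a psql shell inside the container to start creating tables and running queries.
docker exec -it local-postgres psql -U postgres
```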
I recommend deploying the official Postgres Docker image to your local laptop, and completing a few of the database exercises from PostgreSQL Exercises. Once you’re comfortable, try deploying Postgres in the cloud!
Have basic programming skills
Unless you are working with data warehouses to do big data analysis / business intelligence, you are probably going to be supporting the development of an application. Generally speaking, most modern applications have the following components:
- A frontend, the part of the program the user sees and interacts with.
- A backend, the server-side code that implements the application’s logic and talks to the database. This is where general-purpose languages like Python or Java come in, which is why basic programming skills matter for a database developer.
- A database, the part of the program where data is stored. The most common databases are Postgres, MySQL, Oracle, and Microsoft SQL Server.
Understand data modeling concepts
There are two popular types of data modeling used in production at most companies.
Relational data modeling
Relational data modeling is the most common type of data modeling. It was invented (or at least popularized) by Edgar Codd in the late 1960s. This paradigm has stood the test of time as being flexible, useful, and easy to work with - you NEED to understand how to model data relationally to work with databases.
There is a classic video joking about people who are obsessed with MongoDB and other non-relational databases.
A relational data model essentially looks like a bunch of spreadsheets (called tables) that are connected together with shared columns known as “foreign keys.” These shared columns join the tables together in various ways, allowing you to connect credit card payments to customers, product sales to salespeople, professors to classrooms, etc.
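As a tiny runnable illustration - the tables and data below are hypothetical, using Python's built-in sqlite3 - two tables share a salesperson_id foreign key, which both joins them together and protects their integrity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this enabled explicitly
conn.executescript("""
CREATE TABLE salespeople (
    salesperson_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE product_sales (
    sale_id INTEGER PRIMARY KEY,
    salesperson_id INTEGER NOT NULL REFERENCES salespeople (salesperson_id),
    product TEXT,
    amount REAL
);
""")
conn.execute("INSERT INTO salespeople VALUES (1, 'Jo')")
conn.execute("INSERT INTO product_sales VALUES (1, 1, 'widget', 9.99)")

# The shared salesperson_id column lets us join the two tables back together:
row = conn.execute("""
    SELECT s.name, p.product, p.amount
    FROM product_sales p
    JOIN salespeople s ON s.salesperson_id = p.salesperson_id
""").fetchone()
print(row)  # ('Jo', 'widget', 9.99)

# The foreign key also protects integrity: a sale belonging to a
# salesperson who doesn't exist is rejected by the database.
try:
    conn.execute("INSERT INTO product_sales VALUES (2, 999, 'gadget', 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```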
Non-relational data modeling
Relational databases are extremely important, but they aren’t the only solution in a good database engineer’s toolkit. The two most common reasons for using a non-relational database (commonly known as a NoSQL database) are:
- Scaling issues (many NoSQL databases are built to run on multiple computers at the same time - a concept known as horizontal scaling).
- The data isn’t a natural fit for a relational model (logging HTTP requests with JSON in them is a great use case for key value pair databases, for example).

The dominant NoSQL data model is key value pairs, found in databases such as MongoDB or AWS DynamoDB. Another common data model is the graph model, of which Neo4j is a common choice. We will only go over key value pairs in this article, as graph databases are a fairly niche choice at this point in time.
Key value pairs
If you know what a Python dictionary, JSON object, or Java hashmap looks like, you know what a key value pair looks like. If not, here is an example:
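Below is a hypothetical student record (every name and value is invented for illustration), written as a Python dictionary:

```python
# A hypothetical key value record; all names and values are invented.
student = {
    "name": "Alex Smith",
    "email": "alex@example.com",
    "Databases graduate certificate": {
        "institution": "Example University",
        "completed": True,
        "year": 2021,
    },
}

print(list(student.keys()))  # ['name', 'email', 'Databases graduate certificate']
print(student["Databases graduate certificate"]["year"])  # 2021
```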
Notice that the Databases graduate certificate key holds further key value pairs of its own - this is known as a nested key value pair. Nesting creates additional complexity, but it also increases the ability of the data structure to hold more information. Making tradeoffs between complexity and completeness of information is a key part of creating good key value pair data structures.
Improve database performance with indexes
Knowing when to add an index to a table is an important part of working with databases. An index can be thought of as a table of contents for a database - it allows you to find certain rows within a table without scanning the entire table. A common index type that is fairly easy to understand is a “block range index”, or BRIN for short.
To understand what a BRIN does, you need to have a basic understanding of how a database actually stores data. Most relational databases store data using small (maybe 8 KB) files in designated folders on your computer. For example, let’s say you are querying a table called customers, which has a column named customer_id. If you try to look up a customer by their ID while that table doesn’t have any indexes, the database may have to scan every single file in the database for the customer that you’re looking for. There may be thousands, millions, or in some cases even billions of files to scan! Wouldn’t it be nice if the computer just knew which file to look at?
Enter the BRIN. It is a small data structure built to make queries on tables with incrementing primary keys faster! An incrementing key can be something like customer ID, where the first customer in the database has an ID of 1, the second customer has an ID of 2, etc. The BRIN stores the minimum and maximum value of a given column, such as customer ID, for each file. This makes it so the database engine can more quickly find the file with the right customer, instead of having to look through every one.
Putting indexes on a table can speed up a relational database by orders of magnitude. Read operations can become thousands of times faster! That being said, adding an index to a table creates a small performance overhead on write / update operations, but the cost is usually worth it.
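You can watch an index change a query plan using Python's built-in sqlite3. The table below is hypothetical, and SQLite builds an ordinary b-tree index rather than a BRIN, but the before-and-after effect is the same idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    ((i, f"customer {i}") for i in range(10_000)),
)

def query_plan(sql):
    # EXPLAIN QUERY PLAN describes how SQLite will execute a statement.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT name FROM customers WHERE customer_id = 4242"

before = query_plan(lookup)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_customers_id ON customers (customer_id)")
after = query_plan(lookup)   # with the index: a targeted search

print(before)  # e.g. "SCAN customers"
print(after)   # e.g. "SEARCH customers USING INDEX idx_customers_id (customer_id=?)"

# In Postgres, the BRIN variant of the same index would be:
#   CREATE INDEX idx_customers_id ON customers USING brin (customer_id);
```

Before the index, SQLite reports a full scan of the table; afterwards it reports a search using idx_customers_id.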
I recommend reading GeeksforGeeks’ article on Database Indexing to learn the basics of indexing, and then reading PostgreSQL’s index documentation to learn about the various types of indexes available in modern databases.
Final project - stand up business intelligence software
Your final project is spinning up a database (I recommend Postgres), uploading data into it, then setting up charting software (Redash is a great, free option) to create business intelligence reports. For example:
- Spin up an instance of Postgres in Amazon RDS.
- Upload a free public dataset that you find interesting into the database. [Here is a great list of free public datasets](https://github.com/awesomedata/awesome-public-datasets)!
- Spin up an instance of Redash with an AWS AMI.
- Write a few queries, create a few visualizations, then write a blog post about it! Be sure to link to your blog post in your resume.
Completing a project like this will show potential employers that you can set up a database, write SQL, and manage cloud infrastructure. Maybe you’ll even learn a few interesting things by querying the dataset in the process!
Conclusion / checklist
Depending on how much time you put into it, you should be able to get through the above material in as little as two months. Along the way, you’ll have learned valuable skills, picked up badges from HackerRank and certificates from Data Camp, and created visualizations which you can share in blog posts. All of these things will look great on a resume when applying for entry level database engineering jobs, and will give you something constructive to talk about during job interviews.
Here is a checklist containing all of the information from this article. Go through it one by one!
- Learn SQL.
- How to learn: Complete HackerRank SQL problems
- How to show employers: Earn HackerRank’s gold badge in SQL.
- Estimated time to complete: Three weeks to three months.
- Deploy a database to your computer, and the cloud.
- How to learn: The documentation on Docker Hub is a great place to start. Here is a tutorial on spinning up a Postgres Docker container. Once it is up and running, I recommend doing exercises from pgexercises.com. Once you’ve done that, deploy a database to the cloud.
- How to show employers: The best way to do this is to complete the business intelligence final project.
- Estimated time to complete: One day to one week.
- Learn basic programming skills.
- How to show employers: Earn HackerRank’s silver badge in the language of your choice.
- Estimated time to complete: Three weeks to three months
- Understand relational and non-relational data modeling concepts.
- How to learn: Complete Data Camp’s courses on relational databases and MongoDB (a popular non-relational database).
- How to show employers: Data Camp offers a Statement of accomplishment when you finish any of their courses.
- Estimated time to complete: Two days to two weeks.
- Improve database performance with indexes.
- How to learn: Read about indexes in the GeeksforGeeks article and Postgres’ documentation.
- How to show employers: Pick a fairly large dataset for your final business intelligence project, so it can have its query performance improved by indexes. Talk about how you improved certain queries with indexes!
- Estimated time to complete: One day to one week.
- Business intelligence final project.
- Steps to complete:
- Set up database.
- Provision a managed database server on a cloud platform (I recommend Postgres on RDS).
- Connect to database via command line.
- Upload data (pick something that interests you).
- Run a few queries with the command line.
- Spin up a server running charting software. I recommend Redash.
- Connect charting software to your database.
- Run interesting queries, create visualizations.
- Share via a blog post, be ready to talk about the queries and data on interviews!
- Start applying for jobs. You’re a great entry level candidate at this point!
I hope this article has helped you understand the work required to become a database developer. Keep hacking!