You already use database normalization concepts to avoid food poisoning.

Dr Frank Canovatchel better be so proud of me.*

This article has been updated. Please see the BCNF section for the content update.

A few of my cohort and I from Flatiron School still bump our group chat every so often. The topic this fall was someone’s move to a position which requires normalizing a database and doing some frontend work to merge a few customer support and relationship management tools together. We got into a discussion of what data normalization actually entails and I figure that plenty of junior developers out there could benefit from my three years of database architecture and administration being condensed down into a single article. 🙂

In my professional experience, there’s only a couple of scenarios where the exact specifications of data normalization matter and most of them have to do with describing a statement of quality to a customer or asking if a potential hire has at least heard of it. Just like syntax can be referenced later, I just reference the exact specifications of normalization when I need it and focus on understanding the problems that normalization aims to prevent. That way I can build a sense of what “looks right” and what I would “want to investigate with a business user to ensure this represents reality”, which is how I’ve approached this topic in interviews (where I’ve landed the job, even!).

So to summarize normalization as a whole, understand that the goal is the storing of data in an efficient way without duplicating it or making it difficult to later reference for analysis.

A relational database describes relations between tables as constraints so that insertions, updates, and deletes affect all related records. Failing to satisfy a constraint (for example that a record table B should be deleted if its related record in table A is deleted) should cause a transaction in a properly normalized database to fail, and be rolled back. The query can be rewritten to satisfy the constraint (delete record in table A and all references in dependent tables) and it should then succeed.

I used our work copy of GPT-4 to summarize normalization because database architecture was a long time ago. I’ll denote sections I obtained from work copy of GPT-4 with preformatted text.

Enter the Pizza Kitchen

Eventually, in this conversation, I landed on an example that should make sense to most people. Concepts that this analogy rely on are an understanding of how flatbread pizzas are assembled and cooked, and kitchen food safety practices that stress limiting crossing different foods to avoid contaminating them.

 You are in a pizza kitchen.

Before you are the bins of doughs and topping ingredients.

You have an order for a mushroom pizza. You open the mushroom bin and spoon out some mushrooms to put on the pizza. In addition to the mushrooms, you’ve obtained olives. Because the bins were cross contaminated, these ingredients can be considered examples of data that was not properly normalized. Some sort of insertion anomaly caused olives to be entered into the mushroom bin, and this damaged the integrity of the mushroom contents. Now, olives and mushrooms will both be returned for selections from the mushroom bin. This will continue to be true until the bin is emptied and cleaned.

A database can (and should) be configured to reject changes if they don’t meet certain criteria, and that would be like if the mushroom bin could have spoken to the prep cook and said “Um, hi, I see you’re filling the olive bin but my lid is loose and you need to shut that before you get olives in me because I’m busy being the mushroom bin right now and you shouldn’t put olives in me. Yeah, thanks. Cheers. Catch you when you have a mushroom order. :)”

Databases tend to say something like “Transaction failed, foreign key constraint” and I get that it’s not personal but I would love to see the pendulum swing from errors that don’t take feelings into account and have lots of information, away from errors that only take feelings into account and don’t have any information, to a polite and friendly error that also includes lots of information. Maybe I’m just a squishy person but it’s always more encouraging when an error message frames itself with remediation advice and encourages the user to try again once the issue is resolved.

First Normal Form 1NF

Anyway, normalization. The mushroom bin with olives in it has failed the basic requirement for normalization, (First Normal Form, also written as 1NF), which is that one column has one type of data in every row that references this column. We expected mushrooms, which we can do certain operations with. Olives have different cooking roles, they taste different, and customers with allergies to either now must forgo mushrooms and olives at this establishment unless there are some mushrooms and olives in two other separate bins that haven’t touched each other.

Data should be stored in a type that matches how it will be used – dates should be stored as a date or datetime so that time calculations should be applied, not as a string or an integer. When you have to update multiple tables as part of a record insertion, ensuring that all the records related to this insertion are updated correctly and accurately as part of the transaction is a normalization concern. If a transaction were to partially succeed, you would then have an inaccurate record. Or perhaps a bin that is 2/3 mushrooms and 1/3 olives.

Allowing data of the wrong type to be entered into a column instead of checking on insert means that olives or peppers could end up in the bin that is meant for mushrooms. 1NF, in that way, protects against insertion anomalies which introduce data of the wrong type. Accidentally offsetting the column intended to be entered to or mismatch in typing between the form and the database would be ways this could happen accidentally, and a database resists the problems that come with a column of mixed types when it is in 1NF.

Second Normal Form 2NF

2NF has to do with understanding what a relational database considers to be an entity and what attributes describe that entity. A whole table basically describes all entities of type – so a table is kind of like a class entity, and each record or row describes a specific instance of that entity. All the columns are attributes of that entity. A database that is properly normalized to 2NF resists data duplication, because it’s understood which tables are going to contain the data columns that describe whatever you’re storing. In a pizza kitchen, this would be like having one shelf for the resting dough, one shelf for prepped veggies, and one shelf for prepped meat toppings. Each are in their own bins (records) but you don’t put them on the same shelves, because in a pizza kitchen that would risk cross-contamination and in both a database and a pizza kitchen it makes logical sense to for example place vegetables together so that vegetables can be selected together.

Third Normal Form 3NF

3NF has to do with describing relationships between tables, which in a relational database tends to be achieved by having a column that references another table, typically by a numeric identifier that tracks a record in several related tables. This in the kitchen would be analogous to like ensuring that pizzas are not prepped and cooked without a customer to order them, and that orders for a customer all go together. 3NF is where you start hearing about transactions and that’s also why I’m bringing up customers with multiple orders. Ensuring that records in all related tables all get updated, all get inserted, all get deleted or the whole transaction fails means that you can be sure that your whole database represents reality. In a pizza kitchen, if you have a customer who ordered three pies, two salads, a garlic knots and four sodas, you wouldn’t give them two pies and three sodas. You would wait until their order was fully prepped and ready to go, and then give it to the customer. The only acceptable reason in a pizza kitchen to partially fulfill an order (transaction) is being out of an ingredient, which is where the analogy gets weaker, but a database that resists these partial transactions protects the integrity and accuracy of all the data stored there. And likewise if you’re constantly out of some ingredient customers are going to leave reviews to that effect and you won’t look like a good pizza kitchen.

Boyce-Codd Normal Form BCNF and Fourth Normal Form 4NF

BCNF and 4NF cares about what is used as the identifier – there’s a reason why a sequentially increasing integer numeric identifier was The Preferred Identifier and that’s because it’s space efficient and unlikely to be changed by whatever is being described.

Assigning a number to your customers protects against having to update your customer database when your customer changes their phone number and ohhhhh noooo someone else had that phone number before and they passed away or something and now you’ve got a weird merge of these two different customer accounts. That’s a problem that should never happen and picking the correct identifier prevents it.

Update: The way this article was originally written reflected some outdated knowledge I had. With thanks to Izzy Fuller : The current preferred identifier is a Universally Unique Identifier (UUID). This prevents malicious attackers from gaining insight into how the data is organized, or from misusing an authenticated session to view the account details of another user by changing parameters to increase or decrease the user ID provided. These identifiers are a standard bit length, so they do take more space on disk than a sequential integer, but the security benefits are worth the extra small amount of space.

“There is a story about BofA who had a huge security hole in their website that someone discovered. After logging into their account, they went into the URL and change the Account Number portion of the URL to the next higher number. Because their token was already authenticated and the authorization logic had a vulnerability, this use was just suddenly in someone else’s online banking dashboard.”

Izzy Fuller

A note on multi-value keys: because it’s technically possible to use multiple columns in aggregate as a primary key, that kind of decision should be made only by like a tier 3 database admin/architect who has exhausted all other options and hates the business analyst who is requiring them to do this.


Fifth Normal Form 5NF


There’s a stylistic argument to be made, in my opinion, about whether 5NF is necessary. Having bigger tables used to be a bigger issue when individual hard disks were lower capacity and slower. Smaller tables were easier to stripe across disks. But they require more joins. Bigger tables that aggregate more about an entity can still have performance hits to read but because the amount and complexity of joins is fewer some people prefer not going that far.
 Also I’m working in a database at work right now that isn’t normalized and has a ton of tiny tables with a very “you can’t get there from here” kind of approach to joining so I’m starting to see the appeal of the “structure on read” crowd.

Vocabulary And Casual Definitions

  • Primary Key
    • A unique identifier of each record in a relational database
  • Foreign Key
    • A unique identifier of each record in one table that indicates which record in another table these records are related to.
  • Relationship
    • A formal relationship between tables, typically denoted by a column in one of the databases which stores the primary unique identifier of a different, related table. Uses Foreign Keys in this way.
  • Insertion anomaly
    • Data that has erroneously and successfully been inserted into a database. May be inaccurate, meant to be in another table, improperly imported, or have other integrity issues.
  • Transaction
    • Related actions on a database that affect updating, inserting, or deleting records in multiple tables.
  • Normalization
    • It's a process used in database design to organize data to reduce redundancy and improve data integrity. It involves dividing a database into tables and defining relationships between them according to rules known as normal forms.
  • Normal Form
    • A set of requirements that a database meets to ensure that data is organized properly and not doing things like storing the same exact thing in multiple tables, thus making it difficult to keep the data updated accurately.
    • Every normal form is cumulative and includes the requirements of the ones that came before.
    • Basically, if a database has a quirk of how the data is stored that makes it annoying to query because you have to figure out which table has the correct version of column_in_multiple_tables or any other kind of fist through monitor syndrome, it’s probably a symptom of not being normalized.
  • First Normal Form | 1NF
    • Ensures that each column contains atomic values and each entry in a column is of the same data type.
  • Second Normal Form | 2NF
    • Achieved when a database is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
  • Third Normal Form | 3NF
    • Achieved when a database is in 2NF and all the attributes are only dependent on the primary key, removing transitive dependency.
  • Boyce-Codd Normal Form | BCNF
    • An advanced version of the Third Normal Form (3NF).
    • For every functional dependency (X → Y), X should be a superkey.
    • In BCNF, every determinant must be a candidate key. BCNF addresses certain anomalies that 3NF does not handle, particularly when there are overlapping candidate keys.
  • Fourth Normal Form | 4NF
    • Deals with multi-valued dependencies.
    • Achieved when a database is in Boyce-Codd Normal Form (BCNF) and has no multi-valued dependencies, except for those that are trivial.
    • A multi-valued dependency occurs when one attribute in a table uniquely determines another attribute, independent of other attributes. 4NF ensures that these dependencies are properly separated into different tables to eliminate redundancy and maintain data integrity..
  • Fifth Normal Form | 5NF , also known as Project-Join Normal Form |PJNF
    • Addresses issues related to join dependencies.
    • Achieved when a database is in Fourth Normal Form (4NF) and its tables cannot be decomposed into smaller tables without losing information, except by trivial join dependencies.
    • 5NF ensures that data is structured in such a way that it can be reconstructed from smaller tables without any loss of information, eliminating redundancy and maintaining data integrity.

Updates:

  • Moved vocabulary to end of article
  • Updated identifier section to describe the move to UUIDs
*He can have 100% of the ad revenue on this page, as the proverbial kickback promised by all students who graduated under his expert instruction. :)
(This is a joke, I don’t run ads on my website)

Thank you for reading. Please let me know if there are any clarifications I can make or further questions I can answer either down in the comments, on LinkedIn, or hit me up on Mastodon.

Related Posts