Skip to content

Databases

A database is an organized collection of data stored allowing for search, retrieval and modification of the said data. The DBMS is the system that serves as an interface between the database and its end users or programs. It manages data storage, the database engine and the database schema to ensure data security, integrity and concurrency control. Based on architectural design there are two types of database:

  • Relational (SQL): These are schema driven and divide data into format tables consisting of rows(tuples) and columns(attributes).
  • Non-Relational (NoSQl): These are schema agnostic databases which store data in non-tabular data models to store unstructured or semi-structured data, optimizing for high throughput, low latency, horizontal scalability, etc.

Relational Databases

Relational databases are those types of databases that are used to store and retrieve data that is related to each other. In it data is structured into formal mathematical relations commonly materialized as tables:

  • Tables (Relations): A table contains data about a specific entity type such as Users.
  • Rows (Tuples): A single, implicit data item in a table. Each row represents a discrete instance of the entity.
  • Columns (Attributes): Properties or fields defining the data type and constraints for the data stored within that specific vertical slice of the table.

Integrity and Keys

To maintain data accuracy, consistency and valid relationships, an RDBMS enforces strict constraint via keys:

  • Primary Key: A column that is filled with keys to uniquely identify each row. These cannot be null and must be immutable.
  • Foriegn Key: A column in a table that references the primary key in another table. These are used to signify relation between two tables and enforce referential integrity.
  • Unique Constraint: Ensures all values in a column are distinct, though unlike a PK it can accept null.
  • Check: used to evaluate boolean expression before doing data modification.

Data Normalization

Normalization is the systematic process of structuring a database to reduce redundancy and improve data integrity by eliminating anomalies. It is divided into progressive stages called Normal Forms:

  • First Normal Form: For a table to be in the first normal form, it must meet the following criteria:
    • a single cell must not hold more than one value (atomicity)
    • there must be a primary key for identification
    • no duplicated rows or columns
    • each column must have only one value for each row in the table
  • Second Normal Form: The 1NF only eliminates repeating groups, not redundancy. That’s why there is 2NF. A table is said to be in 2NF if it meets the following criteria:
    • it’s already in 1NF
    • has no partial dependency. That is, all non-key attributes are fully dependent on a primary key.
  • Third Normal Form: When a table is in 2NF, it eliminates repeating groups and redundancy, but it does not eliminate transitive partial dependency. This means a non-prime attribute (an attribute that is not part of the candidate’s key) is dependent on another non-prime attribute. This is what the third normal form (3NF) eliminates. So, for a table to be in 3NF, it must:
    • be in 2NF
    • have no transitive partial dependency.

The ACID Transactional Model

An RDBMS relies on transactions—logical units of work consisting of one or more SQL statements. To ensure reliability despite system crashes or concurrent execution, transactions must adhere to the ACID properties:

  • Atomicity: The "all-or-nothing" property. If any single statement within a transaction fails, the entire transaction is aborted, and the database is rolled back to its pre-transaction state.
  • Consistency: A transaction can only transition the database from one valid, consistent state to another, maintaining all schema rules, constraints, and triggers.
  • Isolation: Ensures that concurrently executing transactions do not interfere with each other. The execution of one transaction must be invisible to others until it is fully committed. This is managed via isolation levels (e.g., Read Committed, Repeatable Read, Serializable) using locking mechanisms or MVCC (Multi-Version Concurrency Control).
  • Durability: Guarantees that once a transaction commits, its changes survive permanently in non-volatile storage, even in the event of a total system or power failure (typically achieved via write-ahead logging).

Storage Mechanism

To avoid scanning every row sequentially (a Full Table Scan), an RDBMS builds secondary data structures called indexes. The most common structure is the B+ Tree which is a self-balancing N-ary search trees. Internal nodes store routing keys to guide search operations, while all actual data pointers or row IDs are stored exclusively in the leaf nodes. Leaf nodes are also linked sequentially, allowing for highly efficient O(log n) point lookups and fast range scans

When a SQL query is submitted, the RDBMS compiler passes it to a cost-based Query Optimizer. The optimizer evaluates statistics stored by the database (like data distribution and index cardinality) to determine the most computationally efficient execution plan such as deciding whether to use index seeks, index scans, or table scans, and selecting join algorithms like Hash Joins, Merge Joins, or Nested Loops.

Scaling

Paritioning

Partitioning is the process of splitting a single large table into smaller, more manageable pieces called partitions, while keeping them inside the same database instance. The database server manages the splitting automatically; to the application layer, the table still appears as a single entity. It is typically implemented horizontally (splitting rows) based on partition key.

  • Range Partitioning: Rows are assigned to partitions based on a range of values (e.g., OrderDate by month or year).
  • List Partitioning: Rows are assigned based on a explicit list of values (e.g., Region column matching 'US', 'EU').
  • Hash Partitioning: A hashing algorithm is applied to the partition key to distribute rows evenly across a fixed number of partitions.

Sharding

Sharding takes the concept of horizontal partitioning and extends it across multiple independent physical database servers (nodes). Each node contains a subset of the data, and no single node holds the entire database.

  • A Shard Key determines which physical node stores a specific row.
  • An application router or middleware uses the shard key to route read and write requests to the correct database instance.
  • Shards do not communicate with each other natively; they are completely isolated, shared-nothing architectures.

Federation

Federation scales a system by splitting databases up by functional or business domains across distinct database servers.Instead of splitting a single table across nodes (like Sharding), Federation isolates entirely separate tables or microservices into their own dedicated database instances.Eg: a monolithic database is broken into three distinct federated databases: a User_DB, an Orders_DB, and a Products_DB.

Replication

Replication is the process of copying data from one database to another. Replication is used to increase availability and scalability of databases. Replication topologies dictate how data writes are propagated through the system

  • Master - Slave: All write operations (Insert, Update, Delete) must be sent to a single designated node called the leader. The leader records the changes locally and broadcasts them to all followers (replicas). Followers only process read queries.
  • Master - Master: Multiple nodes act as leaders, accepting write operations simultaneously. Each leader acts as a follower to the other leaders, syncing changes asynchronously. Can handle high write volumes but are really complex and each node increases conflict resolution and the time it takes.
  • Decentralized: There is no central leader. The application writes directly to multiple peer nodes concurrently. It relies on a quorum system to determine if a write or read is successful. Highly resilient to node failures; optimal for high write availability but high risk of reading stale data.

The timing of how data is copied from a leader to a follower changes the balance between data safety and performance:

  • Synchronous Replication: The leader writes data locally and waits for a confirmation from all nodes. It maintains strict consistency at the cost of latency.
  • Asynchronous Replication: The leader writes data locally and immediately returns a "success" status to the client application. The data replication payload is sent to the followers in the background. This prioritizes low latency and high throughput over strict consistency of data.

Non-Relational Databases

It is designed to store, retrieve, and manage non-tabular, distributed data. Unlike relational databases (RDBMS), NoSQL systems do not rely on a fixed, predefined tabular schema and are architected to scale horizontally across commodity hardware

The BASE Consistency Model

NoSQL databases generally trade immediate consistency for high availability and performance, adhering to the BASE model:

  • Basically Available: The system remains operational and responsive to requests, even during partial node failures or network partitions.
  • Soft State: The state of the data can change over time without explicit user interaction due to background replica synchronization.
  • Eventual Consistency: The data across all distributed nodes will converge and become identical eventually, provided no new updates are made to that specific data point.

Core NoSQL Data Models

NoSQL databases are categorized into four primary architectural types based on their underlying storage mechanism:

Document Stores

Data is encapsulated in semi-structured, self-describing formats called documents (typically JSON, BSON, or XML). Documents eliminate the need for strict schemas, allowing different records to contain completely different fields or nested hierarchies

  • Documents are indexed by a unique key. The database engine can parse and query internal attributes directly.
  • There are good for content management systems, e-commerce product catalogs, etc.
  • Eg: Mongo, Couchbase.

Key-Value Stores

The simplest data model where data is stored as an associative array of unique identifiers (keys) paired with opaque data blobs (values). The database has no awareness of the internal structure of the value.

  • Uses a distributed hash table for $O(1)$ read and write performance
  • Good for session caching, real-time leaderboards, shopping carts.
  • Eg: Redis, Amazon DynamoDB, Memcached

Wide-Column (Column-Family) Stores

Data is stored in rows, but instead of a fixed set of columns, each row can have a dynamic number of columns. Related columns are grouped into "column families" and stored together on disk, optimizing data retrieval by column rather than by row.

  • A sparse, multidimensional distributed map indexed by a row key, column key, and timestamp
  • Good for high-throughput time-series logging, heavy write loads, massive analytics architectures.
  • Eg: Apache Cassandra, Apache HBase, Google Cloud Bigtable

Graph Databases

Focuses on the relationships between data entities. It uses graph structures consisting of nodes (entities), edges (relationships), and properties (key-value pairs attached to nodes or edges).

  • Implements index-free adjacency, allowing pointer-chasing traversal of millions of relationships in constant time, bypassing expensive relational join operations.
  • Good for social networks, fraud detection networks, recommendation engines, knowledge graphs.
  • Eg: Neo4j, Amazon Neptune

Scaling

NoSQL scales out horizontally by distributing data across a cluster of independent nodes via two core mechanisms:

  • Sharding (Horizontal Partitioning): The dataset is split into smaller, manageable pieces called shards based on a partition key. A hashing algorithm determines which shard stores a specific record, distributing the read/write load uniformly across the cluster.
  • Replication: Each shard is copied across multiple physical nodes. This ensures high availability and fault tolerance; if a master node fails, a replica node is promoted to prevent data loss.
Relational (SQL)Non-Relational (NoSQL)
Tabular Data Model: Stores data in structured tables consisting of rows and columns.Diverse Data Models: Stores data as documents, key-value pairs, wide-columns, or graphs.
Static Schema: Requires a predefined, rigid schema before data can be inserted.Dynamic Schema: Schema-agnostic or schema-on-read, allowing flexible, unstructured data.
High Normalization: Minimizes data redundancy through strict relational mapping (1NF, 2NF, 3NF).Denormalized Structure: Duplicates or nests data within records to optimize read performance.
SQL Interface: Uses standardized Structured Query Language (SQL) for data manipulation.Custom APIs: Uses object-oriented APIs, key-value lookups, or specialized query languages.
Vertical Scalability: Scales by upgrading CPU, RAM, or storage capacities on a single server node.Horizontal Scalability: Scales by partitioning and distributing data across a cluster of servers.
ACID Compliance: Guarantees strict reliability (Atomicity, Consistency, Isolation, Durability).BASE Properties: Prioritizes availability and performance using Eventual Consistency.
Native Joins: Executes highly efficient table joins using foreign key constraints.Application-Layer Joins: Handles relationships via nesting or references resolved by application code.
Transactional Focus: Optimized for OLTP, financial processing, and predictable data structures.Operational Focus: Optimized for big data analytics, real-time logging, IoT, and rapid scaling.

references: