Databases

Mark all completed

Relational Databases (SQL)

What is the difference between SQL and NoSQL databases?

SQL (Relational): Structured data in tables with a fixed schema. Uses SQL for queries. Strong ACID guarantees. Examples: PostgreSQL, MySQL, SQLite.
NoSQL: Flexible/dynamic schemas. Designed for scale and specific data models. Examples: MongoDB (document), Redis (key-value), Cassandra (wide-column), Neo4j (graph).

Choose SQL when: data is relational, consistency is critical, complex queries are needed.

Choose NoSQL when: schema is dynamic, you need horizontal scale, or data fits a non-tabular model.

What problems does this solve?

Frames the entire DB landscape — almost every DB interview starts here.

What are ACID properties?

Atomicity: A transaction is all-or-nothing. If any part fails, the whole transaction is rolled back.
Consistency: A transaction brings the database from one valid state to another. Constraints are never violated.
Isolation: Concurrent transactions execute as if they were serial. Changes are not visible to others until committed.
Durability: Once committed, the transaction persists even after a crash (written to disk / WAL).

What problems does this solve?

ACID is the foundation of reliable relational databases — understanding it demonstrates you know when to trust a DB and when not to.

What is database normalisation and what are the normal forms?

Normalisation: Organising columns and tables to reduce redundancy and improve integrity.
1NF: Each column holds atomic values; no repeating groups.
2NF: 1NF + every non-key column is fully dependent on the entire primary key.
3NF: 2NF + no transitive dependencies (non-key columns not dependent on other non-key columns).
BCNF: Stronger form of 3NF; every determinant is a candidate key.

Denormalisation (trading redundancy for read speed) is often used in read-heavy or analytics workloads.

What problems does this solve?

Demonstrates schema design fundamentals and awareness of trade-offs between write efficiency and read performance.

What is a database index and how does it work?

An index is a data structure (usually a B-tree or hash) that speeds up data retrieval without scanning every row.
B-tree index: Default in most RDBMS. Good for range queries and equality.
Hash index: O(1) lookup for exact equality; does not support range queries.
Composite index: Index over multiple columns. The order of columns matters (leftmost prefix rule).
Covering index: Index contains all columns needed by a query, avoiding a table lookup.

Trade-off: Indexes speed up reads but slow down writes (INSERT/UPDATE/DELETE must update the index).

What problems does this solve?

Indexing is the single most impactful optimisation in most production systems — interviewers test this heavily.

Explain the different types of SQL JOINs.

INNER JOIN: Returns rows where there is a match in both tables.
LEFT JOIN: All rows from the left table; NULLs for non-matching right rows.
RIGHT JOIN: All rows from the right table; NULLs for non-matching left rows.
FULL OUTER JOIN: All rows from both tables; NULLs where there's no match.
CROSS JOIN: Cartesian product — every row from left × every row from right.
SELF JOIN: A table joined with itself (e.g., employee → manager hierarchy).

What problems does this solve?

JOINs are core to relational querying — misunderstanding them causes incorrect results or severe performance issues.

What is a database transaction and how do you use one?

A transaction is a unit of work that is executed atomically.
Controlled with BEGIN, COMMIT, and ROLLBACK.
Isolation levels (from lowest to strictest): Read Uncommitted → Read Committed → Repeatable Read → Serializable.
Higher isolation prevents anomalies (dirty reads, non-repeatable reads, phantom reads) at the cost of concurrency.

What problems does this solve?

Shows understanding of concurrency control and data safety in multi-user systems.

What is a database View?

A view is a virtual table based on a stored SQL query. It does not store data itself.
Benefits: simplifies complex queries, provides a security layer (hide columns), encapsulates business logic.
Materialised view: A cached snapshot of a query result stored on disk. Must be refreshed. Used for expensive aggregations.

What problems does this solve?

Views show you can abstract complexity and think about security and reusability in schema design.

What are Stored Procedures and when would you use them?

Precompiled SQL routines stored in the database, executed with a single call.
Pros: Reduced network round-trips, reusable logic, can enforce business rules at DB level.
Cons: Harder to version-control, tightly coupled to DB vendor, can obscure business logic from app code.
Modern preference often favours keeping logic in the application layer (ORM/query builders) for better testability.

What problems does this solve?

Knowing both sides of the stored procedure debate shows architectural maturity.

What is the N+1 query problem?

Occurs when you fetch a list of N items, then issue 1 additional query per item to fetch related data — resulting in N+1 total queries.
Example: Load 10 posts, then load comments for each post = 11 queries instead of 2.
Fix: Use JOINs or eager loading (include/with in ORMs).
Detectable via query logging (e.g., Django Debug Toolbar, Bullet gem, Prisma query logs).

What problems does this solve?

One of the most common ORM-related performance bugs — essential knowledge for any backend/full-stack role.

How do you optimise a slow SQL query?

Use EXPLAIN / EXPLAIN ANALYZE to inspect the query execution plan.
Add indexes on columns used in WHERE, JOIN, and ORDER BY clauses.
Avoid SELECT * — fetch only needed columns.
Avoid functions on indexed columns in WHERE (defeats the index).
Use pagination (LIMIT / OFFSET) or keyset pagination for large result sets.
Consider caching (Redis/Memcached) for frequently read, rarely changed data.
Denormalise or use materialised views for read-heavy analytics queries.

What problems does this solve?

Query optimisation is a core DBA/backend skill — shows you can diagnose and fix real-world performance issues.

Distributed Databases & NoSQL

What is the CAP Theorem?

States that a distributed system can guarantee at most 2 of 3:
Consistency (C): Every read receives the most recent write.
Availability (A): Every request gets a response (not necessarily the latest data).
Partition Tolerance (P): The system continues to operate despite network partitions.

Since network partitions are unavoidable, the real choice is between CP (e.g., HBase, MongoDB) and AP (e.g., Cassandra, CouchDB).

What problems does this solve?

CAP is the foundational theorem for reasoning about distributed database trade-offs — a staple of system design interviews.

What is database sharding?

Sharding splits a dataset across multiple database instances (shards), each holding a subset of the data.
Range sharding: Rows split by value range (e.g., users A–M on shard 1).
Hash sharding: Row assigned to shard via a hash function. Distributes evenly but makes range queries harder.
Directory-based: A lookup table maps records to shards. Flexible but adds overhead.
Challenges: Cross-shard queries, re-sharding, hot spots, no JOIN across shards.

What problems does this solve?

Sharding is the primary technique for horizontal scaling — understanding its trade-offs is critical for large-scale system design.

What is database replication?

Replication copies data from a primary node to one or more replicas.
Primary-Replica (Master-Slave): Writes go to primary; reads can be distributed to replicas. Replicas may lag.
Multi-Primary (Multi-Master): Writes accepted on multiple nodes. Increases write availability but requires conflict resolution.
Synchronous replication: Primary waits for replica to confirm write — strong consistency, higher latency.
Asynchronous replication: Primary doesn't wait — lower latency, potential data loss on failure.

What problems does this solve?

Replication is fundamental to high availability and read scaling — often the first step before sharding.

What is a Document Database and when would you use it?

Stores data as self-contained documents (JSON/BSON). Schema-flexible — each document can have different fields.
Examples: MongoDB, CouchDB, Firestore.
Best for: Content management, catalogues, user profiles, event data — where data shapes vary per record.
Avoid when: You need complex multi-document transactions or heavily relational data.

What problems does this solve?

MongoDB is one of the most popular NoSQL databases — understanding its model is expected for full-stack and backend roles.

What is a Key-Value Store?

Simplest NoSQL model — stores data as key → value pairs.
Examples: Redis, Memcached, DynamoDB (also supports document model).
Best for: Caching, sessions, rate limiting, leaderboards, pub/sub messaging.
Redis supports richer data structures: strings, lists, sets, sorted sets, hashes, streams.

What problems does this solve?

Redis is ubiquitous in production stacks — knowing its model and use cases is expected.

What is a Wide-Column Store?

Stores data in tables with rows and dynamic columns, optimised for writes and range scans by row key.
Examples: Apache Cassandra, HBase, Google Bigtable.
Best for: Time-series data, analytics, IoT, event logs — massive write throughput at scale.
Data is modelled around query patterns, not entities (denormalise for each query).

What problems does this solve?

Cassandra is widely used at scale — shows awareness of write-optimised distributed storage.

What is a Graph Database?

Stores data as nodes (entities) and edges (relationships), both of which can have properties.
Examples: Neo4j, Amazon Neptune, ArangoDB.
Best for: Social networks, recommendation engines, fraud detection, knowledge graphs — where relationships between entities are the primary concern.
Graph queries (e.g., Cypher, Gremlin) are far more expressive than SQL JOINs for deeply connected data.

What problems does this solve?

Demonstrates knowledge of specialised data models beyond the document/relational dichotomy.

What is a Time-Series Database?

Optimised for storing and querying data points indexed by time.
Examples: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus.
Features: automatic data retention policies, downsampling, efficient compression, fast time-range queries.
Best for: Metrics, monitoring, IoT sensor data, financial tick data.

What problems does this solve?

Time-series DBs are common in observability/monitoring stacks — shows broader database ecosystem knowledge.

How does full-text search work in databases?

Full-text search uses an inverted index — maps each word to the documents containing it.
PostgreSQL: Built-in full-text search via tsvector / tsquery.
Elasticsearch / OpenSearch: Dedicated search engines built on Lucene. Better for complex relevance scoring, facets, and large scale.
Features: stemming, stop words, fuzzy matching, relevance scoring (TF-IDF, BM25).
Syncing data from a primary DB to Elasticsearch is a common architecture (CDC or dual writes).

What problems does this solve?

Full-text search is a frequent requirement — understanding when to use native DB search vs dedicated engines is important.

What are database caching strategies?

Cache-aside (lazy loading): App checks cache first; on miss, reads from DB and populates cache.
Write-through: Writes go to cache and DB simultaneously. Cache always up to date. Slower writes.
Write-behind (write-back): Write to cache immediately, flush to DB asynchronously. Fast writes, risk of data loss.
Read-through: Cache sits in front of DB; cache handles DB reads on miss.
TTL (Time-to-live): Entries expire automatically to prevent stale data.

What problems does this solve?

Caching is the most impactful way to reduce DB load — every production system uses it.

Advanced & Production Concerns

What is connection pooling?

Maintains a pool of pre-established DB connections that are reused rather than opened/closed per request.
Opening a DB connection is expensive (handshake, auth, memory). Pooling amortises this cost.
Tools: PgBouncer (PostgreSQL), HikariCP (Java), Sequelize/Prisma built-in pooling.
Key settings: min/max pool size, connection timeout, idle timeout.
Without pooling, high-traffic apps can exhaust DB connection limits.

What problems does this solve?

Connection pooling is essential for any production web app — shows operational awareness beyond just writing queries.

What are database migrations and how do you manage them?

Migrations are versioned scripts that apply incremental schema changes to a database.
Managed by tools like Flyway, Liquibase, Prisma Migrate, Alembic, or framework-specific tools (Rails, Django).
Best practices: Never edit a committed migration; always write rollback migrations; test in staging before production.
Zero-downtime migrations: Add column as nullable first → backfill → add constraint → remove old column.

What problems does this solve?

Schema management is a critical production concern — breaking migrations can take down entire services.

What is an ORM and what are its trade-offs?

An ORM (Object-Relational Mapper) maps database tables to objects in code, abstracting raw SQL.
Examples: Prisma, Sequelize, TypeORM (JS/TS), SQLAlchemy (Python), Hibernate (Java).
Pros: Reduces boilerplate, type safety, automatic migrations, portability across DBs.
Cons: Can generate inefficient queries, hides complexity, harder to optimise, "magic" behaviour.
For complex queries, mixing raw SQL via the ORM's escape hatch is common.

What problems does this solve?

ORMs are nearly universal in modern stacks — showing awareness of their trade-offs is a sign of experience.

What are key database security practices?

SQL Injection prevention: Always use parameterised queries / prepared statements. Never string-interpolate user input into SQL.
Principle of least privilege: Each app user should only have the permissions it needs (e.g., read-only for reporting).
Encryption: Encrypt data at rest (TDE) and in transit (TLS/SSL).
Audit logging: Log all access and DDL changes.
Secrets management: Never hardcode credentials; use environment variables or secret managers (Vault, AWS Secrets Manager).

What problems does this solve?

SQL injection is one of the most common and devastating attack vectors — every developer must understand prevention.

What backup and recovery strategies should you know?

Full backup: Complete snapshot of the database. Slowest to produce, simplest to restore.
Incremental backup: Only changes since the last backup. Faster, less storage, more complex to restore.
WAL archiving (PostgreSQL): Archive Write-Ahead Log files for point-in-time recovery (PITR).
RTO (Recovery Time Objective): How long until the system is back online.
RPO (Recovery Point Objective): How much data loss is acceptable.
Test your backups — a backup never tested is not a backup.

What problems does this solve?

Understanding backup and recovery shows operational maturity — data loss is catastrophic for any business.

What is eventual consistency?

A consistency model where replicas are not guaranteed to be identical at any given moment, but will converge to the same state given enough time with no new updates.
Common in AP systems (e.g., Cassandra, DynamoDB, CouchDB).
Use cases: DNS, social media likes/counts, shopping carts — where temporary inconsistency is acceptable.
Strong consistency: All reads reflect the latest write. Harder to achieve in distributed systems.

What problems does this solve?

Understanding when eventual consistency is acceptable vs when strong consistency is required is key to distributed system design.

What is optimistic vs pessimistic locking?

Optimistic locking: Assumes conflicts are rare. Read data with a version number; on update, check the version hasn't changed. Retry if it has. No locks held.
Pessimistic locking: Assumes conflicts are likely. Acquires a lock before reading (SELECT FOR UPDATE). Blocks other writers. Prevents conflicts but reduces throughput.
Optimistic is better for low-contention scenarios; pessimistic for high-contention (e.g., inventory reservations).

What problems does this solve?

Concurrency control is one of the trickiest aspects of database programming — understanding both strategies is essential.

What is database partitioning?

Partitioning divides a single large table into smaller, more manageable pieces, usually within the same DB instance (unlike sharding, which is across instances).
Range partitioning: By date range (e.g., one partition per month for logs).
List partitioning: By discrete values (e.g., by country code).
Hash partitioning: By hash of a column for even distribution.
Benefits: faster queries (partition pruning), easier archiving, smaller index sizes.

What problems does this solve?

Partitioning is common for large tables (billions of rows) and shows knowledge of advanced schema design.

What are managed Database-as-a-Service (DBaaS) options?

Cloud-managed databases remove the operational burden of provisioning, patching, backups, and failover.
SQL: Amazon RDS/Aurora, Google Cloud SQL, Azure Database, Supabase (PostgreSQL).
NoSQL/Document: MongoDB Atlas, Firebase Firestore, Amazon DynamoDB.
Cache: Amazon ElastiCache, Redis Cloud, Upstash.
Trade-off: Convenience and scalability vs vendor lock-in and cost at scale.

What problems does this solve?

Most modern apps use managed databases — knowing the ecosystem shows real-world operational awareness.

How do you choose the right database for a use case?

Data model: Relational? Document? Graph? Time-series? Key-value?
Consistency requirements: Strong ACID? Eventual consistency acceptable?
Scale: Read-heavy? Write-heavy? Need horizontal sharding?
Query patterns: Complex joins? Aggregations? Full-text search? Simple lookups?
Operational maturity: Team experience, managed service availability, cost.
Often the right answer is multiple databases (polyglot persistence): e.g., PostgreSQL for primary data + Redis for caching + Elasticsearch for search.

What problems does this solve?

This synthesis question tests whether you can apply all the knowledge above to real architectural decisions.