.. _data-structures:

Data structures
===============

This page covers fundamental data structure design principles commonly used in scientific computing, systems programming, and algorithm implementation. This section is co-authored with Claude.
Fundamental concepts
--------------------
What is a data structure?
A data structure is a way of organizing and storing data to enable efficient access and modification. Different data structures are optimized for different operations (insertion, deletion, searching, sorting).
What is time complexity?
Time complexity describes how the runtime of an algorithm grows with input size, typically expressed using Big O notation:
- O(1): Constant time, independent of input size
- O(log n): Logarithmic time, grows slowly (e.g., binary search)
- O(n): Linear time, proportional to input size
- O(n log n): Linearithmic time (efficient sorting)
- O(n²): Quadratic time, grows rapidly with input size
What is space complexity?
Space complexity describes how much memory an algorithm or data structure requires relative to the input size. Trade-offs between time and space complexity are common in algorithm design.
Summary table
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 15 35

   * - Data Structure
     - Access
     - Insert
     - Delete
     - Use Case
   * - Array/Vector
     - O(1)
     - O(n)
     - O(n)
     - Random access, cache-friendly
   * - Linked List
     - O(n)
     - O(1)
     - O(1)
     - Frequent insertions, sparse data
   * - Stack
     - O(n)
     - O(1)
     - O(1)
     - LIFO, function calls, backtracking
   * - Queue
     - O(n)
     - O(1)
     - O(1)
     - FIFO, BFS, task scheduling
   * - Heap
     - O(n)
     - O(log n)
     - O(log n)
     - Priority queue, k-largest/smallest
   * - Hash Table
     - O(1) avg
     - O(1) avg
     - O(1) avg
     - Fast lookups, caching
   * - BST (balanced)
     - O(log n)
     - O(log n)
     - O(log n)
     - Sorted data, range queries
   * - Graph (adj list)
     - O(degree)
     - O(1)
     - O(degree)
     - Networks, relationships
Array-based structures
----------------------
Arrays
^^^^^^
What is an array?
A contiguous block of memory storing elements of the same type. Arrays provide O(1) random access by index but fixed size.
**Memory layout visualization**::

   Index:      0      1      2      3      4
           ┌──────┬──────┬──────┬──────┬──────┐
   data[]  │ 3.14 │ 2.71 │ 1.41 │ 5.00 │ 7.89 │
           └──────┴──────┴──────┴──────┴──────┘
   Address: 0x100  0x108  0x110  0x118  0x120
            (each double = 8 bytes)

   Access data[2]: O(1)
   Address calculation: base_address + (index × element_size)
                        0x100 + (2 × 8) = 0x110

.. code-block:: cpp

   // C++ fixed-size array
   double data[5] = {3.14, 2.71, 1.41, 5.00, 7.89};
   double x = data[2];  // O(1) access: 1.41

   // Dynamic array (n elements chosen at runtime)
   double* buffer = new double[n];  // Heap allocation
   delete[] buffer;                 // Manual deallocation required
What are the advantages of arrays?
- O(1) random access by index
- Cache-friendly (contiguous memory)
- Simple and efficient for fixed-size data
- Predictable memory layout
What are the disadvantages of arrays?
- Fixed size (in C-style arrays)
- Expensive insertion/deletion in middle (O(n))
- No bounds checking (in C/C++)
Dynamic arrays (vectors)
^^^^^^^^^^^^^^^^^^^^^^^^^
What is a dynamic array?
A resizable array that grows automatically when capacity is exceeded. In C++, this is ``std::vector``. It provides array-like performance with dynamic sizing.
**Vector growth visualization**::

   Initial state (capacity=2):
   ┌─────┬─────┐
   │ 3.14│ 2.71│
   └─────┴─────┘
   size=2, capacity=2

   After push_back(1.41) - capacity exceeded, reallocate:

   Step 1: Allocate new array (capacity = 2 × 2 = 4)
   ┌─────┬─────┬─────┬─────┐
   │     │     │     │     │
   └─────┴─────┴─────┴─────┘

   Step 2: Copy existing elements
   ┌─────┬─────┬─────┬─────┐
   │ 3.14│ 2.71│     │     │
   └─────┴─────┴─────┴─────┘

   Step 3: Add new element
   ┌─────┬─────┬─────┬─────┐
   │ 3.14│ 2.71│ 1.41│     │
   └─────┴─────┴─────┴─────┘
   size=3, capacity=4
   Old array is deallocated

.. code-block:: cpp

   #include <vector>

   std::vector<double> data;
   data.push_back(3.14);  // O(1) amortized
   data.push_back(2.71);
   double x = data[0];    // O(1) access
   data.resize(1000);     // Change size
How does a vector handle growth?
When capacity is exceeded, the vector allocates a new larger array (typically 1.5x or 2x the current size), copies all elements, and deallocates the old array. This makes ``push_back`` O(1) amortized time.
.. code-block:: cpp

   std::vector<double> data;
   data.reserve(1000);  // Pre-allocate to avoid reallocation
When should you use vectors vs arrays?
- Use ``std::vector`` for dynamic sizing, automatic memory management
- Use fixed arrays for small, compile-time known sizes
- Use ``std::array`` for type-safe fixed-size containers
Linked structures
-----------------
Linked lists
^^^^^^^^^^^^
What is a linked list?
A sequence of nodes where each node contains data and a pointer to the next node. Unlike arrays, linked lists store elements non-contiguously in memory.
**Singly linked list visualization**::

   head
    ↓
   ┌────┬────┐     ┌────┬────┐     ┌────┬────┐
   │3.14│ ●──┼───→ │2.71│ ●──┼───→ │1.41│ ●──┼───→ NULL
   └────┴────┘     └────┴────┘     └────┴────┘
     Node 1          Node 2          Node 3
    (0x200)         (0x350)         (0x1A0)

   Each node: [data | next pointer]
   Memory locations are scattered (not contiguous)

**Doubly linked list visualization**::

   head
    ↓
   ┌────┬────┬────┐     ┌────┬────┬────┐     ┌────┬────┬────┐
   │NULL│3.14│ ●──┼───→ │ ●←─┤2.71│ ●──┼───→ │ ●←─┤1.41│NULL│
   └────┴────┴────┘     └────┴────┴────┘     └────┴────┴────┘
        ←──── prev pointers allow backward traversal ────

   Each node: [prev | data | next]

.. code-block:: cpp

   struct Node {
       double data;
       Node* next;
   };

   // Creating a linked list
   Node* head = new Node{3.14, nullptr};
   head->next = new Node{2.71, nullptr};
   head->next->next = new Node{1.41, nullptr};
**Insertion at front (O(1))**::

   Before inserting 5.00:

   head
    ↓
   ┌────┬────┐     ┌────┬────┐
   │3.14│ ●──┼───→ │2.71│NULL│
   └────┴────┘     └────┴────┘

   After inserting 5.00 at front:

   head
    ↓
   ┌────┬────┐     ┌────┬────┐     ┌────┬────┐
   │5.00│ ●──┼───→ │3.14│ ●──┼───→ │2.71│NULL│
   └────┴────┘     └────┴────┘     └────┴────┘
    New node         Old head

**Accessing element at index i (O(n))**::

   To access index 2, must traverse:
   head → Node 0 → Node 1 → Node 2
   (No direct address calculation like arrays)

What are the advantages of linked lists?
- O(1) insertion/deletion at front
- No reallocation needed when growing
- Can grow to any size (limited by available memory)
- Efficient for frequent insertions/deletions
What are the disadvantages of linked lists?
- O(n) access by index (must traverse)
- Extra memory for pointers (overhead)
- Poor cache performance (non-contiguous)
- More complex memory management
When are linked lists used in scientific computing?
Linked lists are rarely used in numerical computing due to poor cache performance, but they appear in:
- Sparse matrix storage (linked list of non-zero entries)
- Memory allocation algorithms
- Graph representations (adjacency lists)
- Undo/redo stacks in applications
Stack (LIFO)
------------
What is a stack?
A Last-In-First-Out (LIFO) data structure. Elements are added and removed from the same end (the "top"). Think of a stack of plates—you add and remove from the top.
**Stack operations visualization**::

   Initial state:    After push(3.14):    After push(2.71):
   ┌─────────┐       ┌─────────┐          ┌─────────┐
   │  Empty  │       │  3.14   │ ← top    │  2.71   │ ← top
   └─────────┘       └─────────┘          ├─────────┤
                                          │  3.14   │
                                          └─────────┘

   After push(1.41):   After pop():        After pop():
   ┌─────────┐         ┌─────────┐         ┌─────────┐
   │  1.41   │ ← top   │  2.71   │ ← top   │  3.14   │ ← top
   ├─────────┤         ├─────────┤         └─────────┘
   │  2.71   │         │  3.14   │
   ├─────────┤         └─────────┘
   │  3.14   │
   └─────────┘

**Function call stack example**::

   main() called:    foo(5) called:    bar(15) called:

   ┌──────────┐      ┌──────────┐      ┌──────────┐
   │  main    │      │  a=5     │      │  x=15    │
   │  locals  │      │  b=10    │      │  c=16    │
   │  return  │      │  return  │      │  return  │
   └──────────┘      ├──────────┤      ├──────────┤
                     │  main    │      │  a=5     │
                     │  locals  │      │  b=10    │
                     │  return  │      │  return  │
                     └──────────┘      ├──────────┤
                                       │  main    │
                                       │  locals  │
                                       │  return  │
                                       └──────────┘

   Each call pushes a new frame onto the stack

**Key operations:**

- ``push(x)``: Add element to top, O(1)
- ``pop()``: Remove and return top element, O(1)
- ``peek()/top()``: View top element without removing, O(1)
How is a stack implemented?
Stacks can be implemented using arrays or linked lists:

.. code-block:: cpp

   #include <stack>
   #include <vector>

   std::stack<double> s;
   s.push(3.14);          // Add to top
   s.push(2.71);
   double top = s.top();  // View top: 2.71
   s.pop();               // Remove top

   // Custom implementation with vector
   std::vector<double> stack;
   stack.push_back(3.14);    // push
   double x = stack.back();  // peek
   stack.pop_back();         // pop
Where are stacks used?
- Function call management (call stack)
- Expression evaluation (postfix notation)
- Backtracking algorithms (DFS, maze solving)
- Undo mechanisms in editors
- Recursive algorithm simulation
What is the function call stack?
When functions are called, the CPU uses a stack to store:
- Return addresses
- Function parameters
- Local variables
This enables function calls and returns in LIFO order. See the memory layout section in the C++ documentation for visualization.
Queue (FIFO)
------------
What is a queue?
A First-In-First-Out (FIFO) data structure. Elements are added at the back and removed from the front. Think of a line at a store—first person in line is served first.
**Queue operations visualization**::

   Initial state:
   Front                                          Back
   ┌───────────────────────────────────────────────┐
   │                     Empty                     │
   └───────────────────────────────────────────────┘

   After enqueue(3.14):
   ┌───────────────────────────────────────────────┐
   │ 3.14                                          │
   └───────────────────────────────────────────────┘

   After enqueue(2.71):
   ┌───────────────────────────────────────────────┐
   │ 3.14 → 2.71                                   │
   └───────────────────────────────────────────────┘

   After enqueue(1.41):
   ┌───────────────────────────────────────────────┐
   │ 3.14 → 2.71 → 1.41                            │
   └───────────────────────────────────────────────┘

   After dequeue() - removes 3.14:
   ┌───────────────────────────────────────────────┐
   │ 2.71 → 1.41                                   │
   └───────────────────────────────────────────────┘

   After dequeue() - removes 2.71:
   ┌───────────────────────────────────────────────┐
   │ 1.41                                          │
   └───────────────────────────────────────────────┘

**Circular queue implementation** (efficient use of array)::

   Array-based circular queue (capacity=5, one slot kept free
   so front == back unambiguously means "empty"):

   front=0, back=0 (empty)
   ┌───┬───┬───┬───┬───┐
   │   │   │   │   │   │
   └───┴───┴───┴───┴───┘
     0   1   2   3   4

   After enqueue(A), enqueue(B), enqueue(C):
   front=0, back=3
   ┌───┬───┬───┬───┬───┐
   │ A │ B │ C │   │   │
   └───┴───┴───┴───┴───┘
     ↑           ↑
   front        back

   After dequeue() twice, then enqueue(D), enqueue(E), enqueue(F):
   front=2, back=1 (back wraps around past the end)
   ┌───┬───┬───┬───┬───┐
   │ F │   │ C │ D │ E │
   └───┴───┴───┴───┴───┘
         ↑   ↑
       back front

**Key operations:**

- ``enqueue(x)`` or ``push(x)``: Add element to back, O(1)
- ``dequeue()`` or ``pop()``: Remove element from front, O(1)
- ``front()``: View front element, O(1)
How is a queue implemented?

.. code-block:: cpp

   #include <queue>

   std::queue<double> q;
   q.push(3.14);              // Add to back
   q.push(2.71);
   double front = q.front();  // View front: 3.14
   q.pop();                   // Remove from front
Where are queues used?
- Task scheduling (CPU scheduling, print queues)
- Breadth-first search (BFS) in graphs
- Buffering (data streams, I/O)
- Simulation of real-world queues
- Message passing systems
Priority queue (heap)
---------------------
What is a priority queue?
A data structure where each element has a priority, and elements are removed in priority order (not insertion order). Typically implemented using a binary heap.
.. code-block:: cpp

   #include <queue>
   #include <vector>

   // Max heap (largest first)
   std::priority_queue<double> max_heap;
   max_heap.push(3.14);
   max_heap.push(2.71);
   max_heap.push(5.0);
   double largest = max_heap.top();  // 5.0
   max_heap.pop();

   // Min heap (smallest first)
   std::priority_queue<double, std::vector<double>,
                       std::greater<double>> min_heap;
What is a binary heap?
A complete binary tree where each parent node has a value greater than or equal to (max heap) or less than or equal to (min heap) its children.
**Max heap tree visualization**::

                100
              /     \
            80       90
           /  \     /  \
         70    60  85   75
        /  \   /
      50   40 55

   Properties:
   - Parent ≥ all descendants (max heap)
   - Complete binary tree (filled left to right)
   - Root contains maximum value

**Min heap tree visualization**::

                 10
               /    \
             20      15
            /  \    /  \
          30    25 40   18
         /  \
       50    45

   Properties:
   - Parent ≤ all descendants (min heap)
   - Root contains minimum value

**Array representation of heap**::

   Max heap tree:       100
                      /     \
                    80       90
                   /  \     /  \
                 70    60  85   75

   Stored as array:

   Index:    0    1    2    3    4    5    6
           ┌────┬────┬────┬────┬────┬────┬────┐
   Array:  │100 │ 80 │ 90 │ 70 │ 60 │ 85 │ 75 │
           └────┴────┴────┴────┴────┴────┴────┘

   Relationships:
   - Parent of i: (i-1)/2
   - Left child of i: 2i+1
   - Right child of i: 2i+2

   Example: Element at index 1 (80)
   - Parent: (1-1)/2 = 0 → 100
   - Left child: 2(1)+1 = 3 → 70
   - Right child: 2(1)+2 = 4 → 60

**Insert operation (bubble up)**::

   Insert 95 into the max heap [100, 80, 90, 70, 60, 85, 75].
   The next free position in a complete tree is the left child of 70:

   Step 1: Add 95 at the end       Step 2: 95 > parent 70, swap

                100                              100
              /     \                          /     \
            80       90                      80       90
           /  \     /  \                    /  \     /  \
         70    60  85   75                95    60  85   75
        /                                /
      95                               70

   Step 3: 95 > parent 80, swap     Step 4: 95 < parent 100, done

                100                              100
              /     \                          /     \
            95       90                      95       90
           /  \     /  \                    /  \     /  \
         80    60  85   75                80    60  85   75
        /                                /
      70                               70

**Extract max operation (bubble down)**::

   Remove the root (100) from [100, 80, 90, 70, 60, 85, 75]:

   Step 1: Move last element       Step 2: 75 < max(80, 90),
           (75) to the root                swap with 90

                75                               90
              /    \                           /    \
            80      90                       80      75
           /  \    /                        /  \    /
         70    60 85                      70    60 85

   Step 3: 75 < 85, swap            Step 4: 75 is a leaf,
                                            heap property restored
                90
              /    \                             90
            80      85                         /    \
           /  \    /                         80      85
         70    60 75                        /  \    /
                                          70    60 75

What are the time complexities for heap operations?
- Insert: O(log n) (add at end, then bubble up)
- Extract max/min: O(log n) (remove root, move last to root, bubble down)
- Peek max/min: O(1) (just view root)
- Build heap from array: O(n) (heapify algorithm)
Where are heaps used?
- Dijkstra's shortest path algorithm
- Heap sort (O(n log n) sorting)
- Finding k largest/smallest elements
- Job scheduling with priorities
- Huffman coding for compression
- Event-driven simulation
Practical example in scientific computing:

.. code-block:: cpp

   // Find the k smallest distances in a large dataset
   std::priority_queue<double> max_heap;  // Holds the k smallest so far
   std::size_t k = 10;
   for (double distance : large_dataset) {
       if (max_heap.size() < k) {
           max_heap.push(distance);
       } else if (distance < max_heap.top()) {
           max_heap.pop();   // Evict the largest of the k
           max_heap.push(distance);
       }
   }
   // max_heap now contains the k smallest distances
Hash tables
-----------
What is a hash table?
A data structure that maps keys to values using a hash function. Provides average O(1) insertion, deletion, and lookup.
**Hash table visualization**::

   Hash function: hash(key) % table_size
   Keys to insert: "pi"→3.14, "e"→2.71, "phi"→1.61, "sqrt2"→1.41

   Step 1: Calculate hash values
   hash("pi")    = 17 → 17 % 8 = 1
   hash("e")     =  5 →  5 % 8 = 5
   hash("phi")   = 22 → 22 % 8 = 6
   hash("sqrt2") = 29 → 29 % 8 = 5  (collision!)

   Hash table (size=8):

   Index │   Key   │ Value
   ──────┼─────────┼───────
     0   │         │
     1   │ "pi"    │ 3.14
     2   │         │
     3   │         │
     4   │         │
     5   │ "e"     │ 2.71   ← collision at index 5
     6   │ "phi"   │ 1.61
     7   │         │

**Collision handling - Chaining**::

   Using linked lists at each index:

   Index │ Bucket (linked list)
   ──────┼─────────────────────────────────────────
     0   │ → NULL
     1   │ → ["pi", 3.14] → NULL
     2   │ → NULL
     3   │ → NULL
     4   │ → NULL
     5   │ → ["e", 2.71] → ["sqrt2", 1.41] → NULL
     6   │ → ["phi", 1.61] → NULL
     7   │ → NULL

   At index 5, "sqrt2" is chained behind "e" (the collision).

**Collision handling - Open Addressing (Linear Probing)**::

   When a collision occurs, probe the next slots.
   Try to insert "sqrt2" at index 5 (occupied):

   Index │   Key   │ Value │ Probing
   ──────┼─────────┼───────┼──────────────────
     0   │         │       │
     1   │ "pi"    │ 3.14  │
     2   │         │       │
     3   │         │       │
     4   │         │       │
     5   │ "e"     │ 2.71  │ Occupied, try 5+1
     6   │ "phi"   │ 1.61  │ Occupied, try 5+2
     7   │ "sqrt2" │ 1.41  │ Empty! Insert here

**Hash table lookup visualization**::

   Looking up "sqrt2" (with linear probing):

   1. Calculate hash: hash("sqrt2") % 8 = 5
   2. Check index 5: "e" ≠ "sqrt2", continue
   3. Check index 6: "phi" ≠ "sqrt2", continue
   4. Check index 7: "sqrt2" = "sqrt2", found! Return 1.41

.. code-block:: cpp

   #include <string>
   #include <unordered_map>

   std::unordered_map<std::string, double> constants;
   constants["pi"] = 3.14159;
   constants["e"] = 2.71828;
   double pi = constants["pi"];         // O(1) average lookup
   bool exists = constants.count("e");  // Check existence
How does a hash table work?
1. Hash function converts key to integer (hash code)
2. Hash code mapped to array index (hash code % table_size)
3. Value stored at that index
4. Collisions handled by chaining or open addressing
What are hash collisions?
When two different keys hash to the same index. Common solutions:
- **Chaining**: Store multiple elements at same index using linked list
- **Open addressing**: Find next available slot (linear probing, quadratic probing)
Where are hash tables used?
- Database indexing
- Caching (memoization)
- Symbol tables in compilers
- Counting frequencies
- Detecting duplicates
- Implementing dictionaries and sets
Trees
-----
Binary search tree (BST)
^^^^^^^^^^^^^^^^^^^^^^^^
What is a binary search tree?
A binary tree where for each node, all left descendants have smaller values and all right descendants have larger values. Enables efficient searching.
**BST structure visualization**::

   Insert sequence: 50, 30, 70, 20, 40, 60, 80

          50
        /    \
      30      70
     /  \    /  \
   20    40 60    80

   BST property: Left < Node < Right
   - 20, 30, 40 < 50 < 60, 70, 80
   - 20 < 30 < 40
   - 60 < 70 < 80

**BST search visualization**::

   Search for 60 in the tree above:

   Step 1: Start at the root (50); 60 > 50, go right
   Step 2: At 70; 60 < 70, go left
   Step 3: At 60; found!

          50
        /    \
      30      70
     /  \    /  \
   20    40 60    80
             ↑
           Found! Return 60

   Comparisons: 3 (log₂ 7 ≈ 2.8)

**BST insert visualization**::

   Insert 55 into the existing BST:

   Step 1: Start at the root (50); 55 > 50, go right
   Step 2: At 70; 55 < 70, go left
   Step 3: At 60; 55 < 60, go left
   Step 4: Left child of 60 is NULL; insert here

          50
        /    \
      30      70
     /  \    /  \
   20    40 60    80
           /
         55  ← New node inserted

**BST in-order traversal (gives sorted sequence)**::

          50
        /    \
      30      70
     /  \    /  \
   20    40 60    80

   In-order: Left → Node → Right
   Sequence: 20, 30, 40, 50, 60, 70, 80 (sorted!)

**BST delete visualization**::

   Delete 30 (a node with two children):

   Step 1: Find the in-order successor
           (leftmost node of 30's right subtree) → 40

          50
        /    \
      30✗     70
     /  \    /  \
   20    40 60    80
          ↑
      successor

   Step 2: Replace 30 with 40, then remove 40
           from its original position

          50
        /    \
      40      70
     /       /  \
   20      60    80

**Balanced vs unbalanced BST**::

   Balanced BST (height 3, search O(log n)):

          50
        /    \
      30      70
     /  \    /  \
   20    40 60    80

   Unbalanced BST, worst case (height 5, search O(n)):

   10
     \
      20
        \
         30
           \
            40
              \
               50

**Properties:**
- Left subtree < node < right subtree
- In-order traversal gives sorted sequence
- Average O(log n) for search, insert, delete
- Worst case O(n) if tree becomes unbalanced
When is a BST useful?
- Maintaining sorted data with frequent insertions
- Range queries (find all values between x and y)
- Finding predecessor/successor
- Implementation of sets and maps
Balanced trees
^^^^^^^^^^^^^^
What is a balanced tree?
A tree that maintains approximately equal height in left and right subtrees, ensuring O(log n) operations even in worst case. Examples: AVL trees, Red-Black trees.
C++ ``std::map`` and ``std::set`` use balanced trees (typically Red-Black trees).
.. code-block:: cpp

   #include <map>
   #include <set>