.. _data-structures:

Data structures
===============
This page covers fundamental data structure design principles commonly used
in scientific computing, systems programming, and algorithm implementation.
This section is co-authored with Claude.

Fundamental concepts
--------------------

What is a data structure?
    A data structure is a way of organizing and storing data to enable
    efficient access and modification. Different data structures are
    optimized for different operations (insertion, deletion, searching,
    sorting).

What is time complexity?
    Time complexity describes how the runtime of an algorithm grows with
    input size, typically expressed using Big O notation:

    - O(1): Constant time, independent of input size
    - O(log n): Logarithmic time, grows slowly (e.g., binary search)
    - O(n): Linear time, proportional to input size
    - O(n log n): Linearithmic time (e.g., efficient sorting)
    - O(n²): Quadratic time, grows rapidly with input size

What is space complexity?
    Space complexity describes how much memory an algorithm or data
    structure requires relative to the input size. Trade-offs between time
    and space complexity are common in algorithm design.

Summary table
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 15 35

   * - Data Structure
     - Access
     - Insert
     - Delete
     - Use Case
   * - Array/Vector
     - O(1)
     - O(n)
     - O(n)
     - Random access, cache-friendly
   * - Linked List
     - O(n)
     - O(1)
     - O(1)
     - Frequent insertions, sparse data
   * - Stack
     - O(n)
     - O(1)
     - O(1)
     - LIFO, function calls, backtracking
   * - Queue
     - O(n)
     - O(1)
     - O(1)
     - FIFO, BFS, task scheduling
   * - Heap
     - O(n)
     - O(log n)
     - O(log n)
     - Priority queue, k-largest/smallest
   * - Hash Table
     - O(1)
     - O(1)
     - O(1)
     - Fast lookups, caching
   * - BST (balanced)
     - O(log n)
     - O(log n)
     - O(log n)
     - Sorted data, range queries
   * - Graph (adj list)
     - O(degree)
     - O(1)
     - O(degree)
     - Networks, relationships

Array-based structures
----------------------

Arrays
^^^^^^

What is an array?
    A contiguous block of memory storing elements of the same type. Arrays
    provide O(1) random access by index but have a fixed size.

**Memory layout visualization**::

    Index:     0      1      2      3      4
            ┌──────┬──────┬──────┬──────┬──────┐
    data[]  │ 3.14 │ 2.71 │ 1.41 │ 5.00 │ 7.89 │
            └──────┴──────┴──────┴──────┴──────┘
    Address: 0x100  0x108  0x110  0x118  0x120
             (each double = 8 bytes)

    Access data[2]: O(1)
    Address calculation: base_address + (index × element_size)
                         0x100 + (2 × 8) = 0x110

.. code-block:: cpp

    // C++ fixed-size array
    double data[5] = {3.14, 2.71, 1.41, 5.00, 7.89};
    double x = data[2];  // O(1) access: 1.41

    // Dynamic array (size chosen at runtime)
    double* buffer = new double[n];  // Heap allocation
    delete[] buffer;                 // Must be freed manually

What are the advantages of arrays?
    - O(1) random access by index
    - Cache-friendly (contiguous memory)
    - Simple and efficient for fixed-size data
    - Predictable memory layout

What are the disadvantages of arrays?
    - Fixed size (in C-style arrays)
    - Expensive insertion/deletion in the middle (O(n))
    - No bounds checking (in C/C++)
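The missing bounds checking is easy to demonstrate. The following minimal
sketch (not part of the original example) contrasts unchecked
``operator[]`` with the checked ``std::array::at``, which throws
``std::out_of_range`` on an invalid index:

.. code-block:: cpp

    #include <array>
    #include <iostream>
    #include <stdexcept>

    int main() {
        std::array<double, 5> data = {3.14, 2.71, 1.41, 5.00, 7.89};

        double x = data[2];  // Unchecked access, like a C array: O(1)

        try {
            double y = data.at(7);  // Checked access: throws on bad index
            std::cout << y << '\n';
        } catch (const std::out_of_range& e) {
            std::cout << "caught: " << e.what() << '\n';
        }
        std::cout << x << '\n';
        return 0;
    }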
Dynamic arrays (vectors)
^^^^^^^^^^^^^^^^^^^^^^^^

What is a dynamic array?
    A resizable array that grows automatically when capacity is exceeded.
    In C++, this is ``std::vector``. It provides array-like performance
    with dynamic sizing.

**Vector growth visualization**::

    Initial state (capacity=2):
    ┌─────┬─────┐
    │ 3.14│ 2.71│
    └─────┴─────┘
    size=2, capacity=2

    After push_back(1.41) - capacity exceeded, reallocate:

    Step 1: Allocate a new array (capacity = 2 × 2 = 4)
    ┌─────┬─────┬─────┬─────┐
    │     │     │     │     │
    └─────┴─────┴─────┴─────┘

    Step 2: Copy the existing elements
    ┌─────┬─────┬─────┬─────┐
    │ 3.14│ 2.71│     │     │
    └─────┴─────┴─────┴─────┘

    Step 3: Add the new element
    ┌─────┬─────┬─────┬─────┐
    │ 3.14│ 2.71│ 1.41│     │
    └─────┴─────┴─────┴─────┘
    size=3, capacity=4

    The old array is deallocated.

.. code-block:: cpp

    #include <vector>

    std::vector<double> data;
    data.push_back(3.14);  // O(1) amortized
    data.push_back(2.71);
    double x = data[0];    // O(1) access
    data.resize(1000);     // Change size

How does a vector handle growth?
    When capacity is exceeded, the vector allocates a new, larger array
    (typically 1.5x or 2x the current size), copies all elements, and
    deallocates the old array. This makes ``push_back`` O(1) amortized
    time.

.. code-block:: cpp

    std::vector<double> data;
    data.reserve(1000);  // Pre-allocate to avoid reallocation

When should you use vectors vs arrays?
    - Use ``std::vector`` for dynamic sizing and automatic memory
      management
    - Use fixed arrays for small, compile-time known sizes
    - Use ``std::array`` for type-safe fixed-size containers

Linked structures
-----------------

Linked lists
^^^^^^^^^^^^

What is a linked list?
    A sequence of nodes where each node contains data and a pointer to the
    next node. Unlike arrays, linked lists store elements non-contiguously
    in memory.

**Singly linked list visualization**::

    head
     ↓
    ┌────┬────┐    ┌────┬────┐    ┌────┬────┐
    │3.14│ ●──┼───→│2.71│ ●──┼───→│1.41│ ●──┼───→ NULL
    └────┴────┘    └────┴────┘    └────┴────┘
      Node 1         Node 2         Node 3
     (0x200)        (0x350)        (0x1A0)

    Each node: [data | next pointer]
    Memory locations are scattered (not contiguous)

**Doubly linked list visualization**::

    head
     ↓
    ┌────┬────┬────┐    ┌────┬────┬────┐    ┌────┬────┬────┐
    │NULL│3.14│ ●──┼───→│ ●←─┤2.71│ ●──┼───→│ ●←─┤1.41│NULL│
    └────┴────┴────┘    └────┴────┴────┘    └────┴────┴────┘

    Each node: [prev | data | next]

.. code-block:: cpp

    struct Node {
        double data;
        Node* next;
    };

    // Creating a linked list
    Node* head = new Node{3.14, nullptr};
    head->next = new Node{2.71, nullptr};
    head->next->next = new Node{1.41, nullptr};

**Insertion at front (O(1))**::

    Before inserting 5.00:
    head
     ↓
    ┌────┬────┐    ┌────┬────┐
    │3.14│ ●──┼───→│2.71│NULL│
    └────┴────┘    └────┴────┘

    After inserting 5.00 at the front:
    head
     ↓
    ┌────┬────┐    ┌────┬────┐    ┌────┬────┐
    │5.00│ ●──┼───→│3.14│ ●──┼───→│2.71│NULL│
    └────┴────┘    └────┴────┘    └────┴────┘
     New node       Old head

**Accessing element at index i (O(n))**::

    To access index 2, must traverse:
    head → Node 0 → Node 1 → Node 2
    (No direct address calculation like arrays)

What are the advantages of linked lists?
    - O(1) insertion/deletion at the front
    - No reallocation needed when growing
    - Can grow to any size (limited by available memory)
    - Efficient for frequent insertions/deletions

What are the disadvantages of linked lists?
    - O(n) access by index (must traverse)
    - Extra memory for pointers (overhead)
    - Poor cache performance (non-contiguous)
    - More complex memory management

When are linked lists used in scientific computing?
    Linked lists are rarely used in numerical computing due to poor cache
    performance, but they appear in:

    - Sparse matrix storage (linked lists of non-zero entries)
    - Memory allocation algorithms
    - Graph representations (adjacency lists)
    - Undo/redo stacks in applications
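To make the O(1) front insertion and O(n) traversal concrete, here is a
minimal sketch building on the ``Node`` struct above. The ``push_front``
helper is illustrative, not a library function:

.. code-block:: cpp

    #include <iostream>

    struct Node {
        double data;
        Node* next;
    };

    // O(1): the new node simply becomes the head.
    Node* push_front(Node* head, double value) {
        return new Node{value, head};
    }

    int main() {
        Node* head = nullptr;
        head = push_front(head, 2.71);
        head = push_front(head, 3.14);  // list: 3.14 -> 2.71

        // O(n): must follow pointers one node at a time.
        for (Node* p = head; p != nullptr; p = p->next) {
            std::cout << p->data << '\n';
        }

        // Free the nodes to avoid leaking memory.
        while (head != nullptr) {
            Node* next = head->next;
            delete head;
            head = next;
        }
        return 0;
    }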
Stack (LIFO)
------------

What is a stack?
    A Last-In-First-Out (LIFO) data structure. Elements are added and
    removed from the same end (the "top"). Think of a stack of plates: you
    add and remove from the top.

**Stack operations visualization**::

    Initial state:     After push(3.14):   After push(2.71):
    ┌─────────┐        ┌─────────┐         ┌─────────┐
    │  Empty  │        │  3.14   │ ← top   │  2.71   │ ← top
    └─────────┘        └─────────┘         ├─────────┤
                                           │  3.14   │
                                           └─────────┘

    After push(1.41):  After pop():        After pop():
    ┌─────────┐        ┌─────────┐         ┌─────────┐
    │  1.41   │ ← top  │  2.71   │ ← top   │  3.14   │ ← top
    ├─────────┤        ├─────────┤         └─────────┘
    │  2.71   │        │  3.14   │
    ├─────────┤        └─────────┘
    │  3.14   │
    └─────────┘

**Function call stack example**::

    main() called:   foo(5) called:   bar(15) called:
    ┌──────────┐     ┌──────────┐     ┌──────────┐
    │  main    │     │  a=5     │     │  x=15    │
    │  locals  │     │  b=10    │     │  c=16    │
    │  return  │     │  return  │     │  return  │
    └──────────┘     ├──────────┤     ├──────────┤
                     │  main    │     │  a=5     │
                     │  locals  │     │  b=10    │
                     │  return  │     │  return  │
                     └──────────┘     ├──────────┤
                                      │  main    │
                                      │  locals  │
                                      │  return  │
                                      └──────────┘
                                      (stack grows downward)

**Key operations:**

- ``push(x)``: Add element to top, O(1)
- ``pop()``: Remove the top element, O(1)
- ``peek()/top()``: View the top element without removing it, O(1)

How is a stack implemented?
    Stacks can be implemented using arrays or linked lists:

.. code-block:: cpp

    #include <stack>
    #include <vector>

    std::stack<double> s;
    s.push(3.14);          // Add to top
    s.push(2.71);
    double top = s.top();  // View top: 2.71
    s.pop();               // Remove top

    // Custom implementation with a vector
    std::vector<double> stack;
    stack.push_back(3.14);    // push
    double x = stack.back();  // peek
    stack.pop_back();         // pop

Where are stacks used?
    - Function call management (call stack)
    - Expression evaluation (postfix notation)
    - Backtracking algorithms (DFS, maze solving)
    - Undo mechanisms in editors
    - Recursive algorithm simulation

What is the function call stack?
    When functions are called, the CPU uses a stack to store:

    - Return addresses
    - Function parameters
    - Local variables

    This enables function calls and returns in LIFO order. See the memory
    layout section in the C++ documentation for a visualization.
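As a small worked example of the expression-handling use case above, here
is a sketch that checks for balanced parentheses with ``std::stack``. The
``balanced`` helper is illustrative, not from the original text:

.. code-block:: cpp

    #include <iostream>
    #include <stack>
    #include <string>

    // Returns true if every '(' has a matching ')'.
    bool balanced(const std::string& expr) {
        std::stack<char> s;
        for (char c : expr) {
            if (c == '(') {
                s.push(c);                    // Remember the open bracket
            } else if (c == ')') {
                if (s.empty()) return false;  // Nothing to match
                s.pop();                      // Match the most recent '('
            }
        }
        return s.empty();                     // Everything matched
    }

    int main() {
        std::cout << balanced("(a + b) * (c - d)") << '\n';  // 1
        std::cout << balanced("(a + b))") << '\n';           // 0
        return 0;
    }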
Queue (FIFO)
------------

What is a queue?
    A First-In-First-Out (FIFO) data structure. Elements are added at the
    back and removed from the front. Think of a line at a store: the first
    person in line is served first.

**Queue operations visualization**::

    Initial state:
    Front                                        Back
    ┌───────────────────────────────────────────────┐
    │                     Empty                     │
    └───────────────────────────────────────────────┘

    After enqueue(3.14):
    ┌───────────────────────────────────────────────┐
    │ 3.14                                          │
    └───────────────────────────────────────────────┘

    After enqueue(2.71):
    ┌───────────────────────────────────────────────┐
    │ 3.14 → 2.71                                   │
    └───────────────────────────────────────────────┘

    After enqueue(1.41):
    ┌───────────────────────────────────────────────┐
    │ 3.14 → 2.71 → 1.41                            │
    └───────────────────────────────────────────────┘

    After dequeue() - removes 3.14:
    ┌───────────────────────────────────────────────┐
    │ 2.71 → 1.41                                   │
    └───────────────────────────────────────────────┘

    After dequeue() - removes 2.71:
    ┌───────────────────────────────────────────────┐
    │ 1.41                                          │
    └───────────────────────────────────────────────┘

**Circular queue implementation** (efficient use of an array)::

    Array-based circular queue (capacity=5):

    front=0, back=0 (empty)
    ┌───┬───┬───┬───┬───┐
    │   │   │   │   │   │
    └───┴───┴───┴───┴───┘
      0   1   2   3   4

    After enqueue(A), enqueue(B), enqueue(C):
    front=0, back=3
    ┌───┬───┬───┬───┬───┐
    │ A │ B │ C │   │   │
    └───┴───┴───┴───┴───┘
      ↑           ↑
    front        back

    After dequeue() twice, then enqueue(D), enqueue(E), enqueue(F):
    front=2, back=1 (back wraps around)
    ┌───┬───┬───┬───┬───┐
    │ F │   │ C │ D │ E │
    └───┴───┴───┴───┴───┘
          ↑   ↑
       back   front

**Key operations:**

- ``enqueue(x)`` or ``push(x)``: Add element to the back, O(1)
- ``dequeue()`` or ``pop()``: Remove element from the front, O(1)
- ``front()``: View the front element, O(1)

How is a queue implemented?

.. code-block:: cpp

    #include <queue>

    std::queue<double> q;
    q.push(3.14);              // Add to back
    q.push(2.71);
    double front = q.front();  // View front: 3.14
    q.pop();                   // Remove from front

Where are queues used?
    - Task scheduling (CPU scheduling, print queues)
    - Breadth-first search (BFS) in graphs
    - Buffering (data streams, I/O)
    - Simulation of real-world queues
    - Message passing systems
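The circular layout sketched above can be implemented in a few lines. The
following fixed-capacity queue is a minimal illustration; the
``CircularQueue`` class and its interface are inventions for this sketch,
not a standard container:

.. code-block:: cpp

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Fixed-capacity FIFO backed by one array; indices wrap around.
    class CircularQueue {
        std::vector<double> buf;
        std::size_t front = 0, count = 0;
    public:
        explicit CircularQueue(std::size_t capacity) : buf(capacity) {}

        bool enqueue(double x) {
            if (count == buf.size()) return false;  // Full
            buf[(front + count) % buf.size()] = x;  // Back wraps around
            ++count;
            return true;
        }

        bool dequeue(double& x) {
            if (count == 0) return false;           // Empty
            x = buf[front];
            front = (front + 1) % buf.size();       // Front wraps around
            --count;
            return true;
        }
    };

    int main() {
        CircularQueue q(5);
        q.enqueue(3.14);
        q.enqueue(2.71);
        double x;
        while (q.dequeue(x)) std::cout << x << '\n';  // 3.14, then 2.71
        return 0;
    }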
Priority queue (heap)
---------------------

What is a priority queue?
    A data structure where each element has a priority, and elements are
    removed in priority order (not insertion order). Typically implemented
    using a binary heap.

.. code-block:: cpp

    #include <queue>
    #include <vector>

    // Max heap (largest first)
    std::priority_queue<double> max_heap;
    max_heap.push(3.14);
    max_heap.push(2.71);
    max_heap.push(5.0);
    double largest = max_heap.top();  // 5.0
    max_heap.pop();

    // Min heap (smallest first)
    std::priority_queue<double, std::vector<double>,
                        std::greater<double>> min_heap;

What is a binary heap?
    A complete binary tree where each parent node has a value greater than
    or equal to (max heap) or less than or equal to (min heap) its
    children.

**Max heap tree visualization**::

                100
              /     \
            80       90
           /  \     /  \
         70    60  85   75
        /  \   /
      50   40 55

    Properties:
    - Parent ≥ all descendants (max heap)
    - Complete binary tree (filled left to right)
    - Root contains the maximum value

**Min heap tree visualization**::

                10
              /    \
            20      15
           /  \    /  \
         30   25  40   18
        /  \
      50    45

    Properties:
    - Parent ≤ all descendants (min heap)
    - Root contains the minimum value

**Array representation of a heap**::

    Max heap tree:
                100
              /     \
            80       90
           /  \     /  \
         70    60  85   75

    Stored as an array:
    Index:   0    1    2    3    4    5    6
           ┌────┬────┬────┬────┬────┬────┬────┐
    Array: │100 │ 80 │ 90 │ 70 │ 60 │ 85 │ 75 │
           └────┴────┴────┴────┴────┴────┴────┘

    Relationships:
    - Parent of i: (i-1)/2
    - Left child of i: 2i+1
    - Right child of i: 2i+2

    Example: Element at index 1 (80)
    - Parent: (1-1)/2 = 0 → 100
    - Left child: 2(1)+1 = 3 → 70
    - Right child: 2(1)+2 = 4 → 60

**Insert operation (bubble up)**::

    Insert 95 into the max heap:

    Step 1: Add 95 at the end (left child of 70)
                100
              /     \
            80       90
           /  \     /  \
         70    60  85   75
        /
      95           95 > 70, swap!

    Step 2: Compare with parent (80)
                100
              /     \
            80       90
           /  \     /  \
         95    60  85   75
        /
      70           95 > 80, swap!

    Step 3: Compare with parent (100)
                100
              /     \
            95       90
           /  \     /  \
         80    60  85   75
        /
      70           95 < 100, done!

**Extract max operation (bubble down)**::

    Remove the root (100):

    Step 1: Replace the root with the last element (75)
                 75
              /     \
            80       95
           /  \     /  \
         70    60  85   90

    Step 2: Compare with children
            75 < max(80, 95), swap with 95!
                 95
              /     \
            80       75
           /  \     /  \
         70    60  85   90

    Step 3: Continue bubbling down
            75 < 90, swap!
                 95
              /     \
            80       90
           /  \     /  \
         70    60  85   75

    Step 4: Done - heap property restored

What are the time complexities for heap operations?
    - Insert: O(log n) (add at the end, then bubble up)
    - Extract max/min: O(log n) (remove root, move last to root, bubble
      down)
    - Peek max/min: O(1) (just view the root)
    - Build heap from array: O(n) (heapify algorithm)

Where are heaps used?
    - Dijkstra's shortest path algorithm
    - Heap sort (O(n log n) sorting)
    - Finding the k largest/smallest elements
    - Job scheduling with priorities
    - Huffman coding for compression
    - Event-driven simulation

Practical example in scientific computing:

.. code-block:: cpp

    #include <cstddef>
    #include <queue>

    // Find the k smallest distances in a large dataset
    // (large_dataset stands for any iterable range of doubles)
    std::priority_queue<double> max_heap;  // Holds the k smallest so far
    std::size_t k = 10;
    for (double distance : large_dataset) {
        if (max_heap.size() < k) {
            max_heap.push(distance);
        } else if (distance < max_heap.top()) {
            max_heap.pop();           // Drop the largest of the k
            max_heap.push(distance);
        }
    }
    // max_heap now contains the k smallest distances
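The array representation and the bubble up/down operations described above
are exactly what the standard heap algorithms perform on a
``std::vector``; ``std::priority_queue`` is essentially a convenience
wrapper around them. A minimal sketch:

.. code-block:: cpp

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> heap;

        // Insert: append, then bubble up - O(log n)
        for (double x : {70.0, 100.0, 80.0, 90.0}) {
            heap.push_back(x);
            std::push_heap(heap.begin(), heap.end());
        }

        std::cout << "max: " << heap.front() << '\n';  // 100 (root, index 0)

        // Extract max: move root to the back, bubble down, pop - O(log n)
        std::pop_heap(heap.begin(), heap.end());
        heap.pop_back();

        std::cout << "new max: " << heap.front() << '\n';  // 90
        return 0;
    }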
Hash tables
-----------

What is a hash table?
    A data structure that maps keys to values using a hash function.
    Provides average O(1) insertion, deletion, and lookup.

**Hash table visualization**::

    Hash function: hash(key) % table_size

    Keys to insert: "pi"→3.14, "e"→2.71, "phi"→1.61, "sqrt2"→1.41

    Step 1: Calculate hash values
    hash("pi")    = 17 → 17 % 8 = 1
    hash("e")     =  5 →  5 % 8 = 5
    hash("phi")   = 22 → 22 % 8 = 6
    hash("sqrt2") = 29 → 29 % 8 = 5  (collision!)

    Hash table (size=8):
    Index  │ Key     │ Value
    ───────┼─────────┼───────
      0    │         │
      1    │ "pi"    │ 3.14
      2    │         │
      3    │         │
      4    │         │
      5    │ "e"     │ 2.71   ← collision at index 5
      6    │ "phi"   │ 1.61
      7    │         │

**Collision handling - chaining**::

    Using linked lists at each index:

    Index  │ Bucket (linked list)
    ───────┼─────────────────────────────
      0    │ → NULL
      1    │ → ["pi", 3.14] → NULL
      2    │ → NULL
      3    │ → NULL
      4    │ → NULL
      5    │ → ["e", 2.71] → ["sqrt2", 1.41] → NULL
           │        ↑               ↑
           │      First      Chained (collision)
      6    │ → ["phi", 1.61] → NULL
      7    │ → NULL

**Collision handling - open addressing (linear probing)**::

    When a collision occurs, probe the next slots:

    Try to insert "sqrt2" at index 5 (occupied):

    Index  │ Key     │ Value   │ Probing
    ───────┼─────────┼─────────┼─────────
      0    │         │         │
      1    │ "pi"    │ 3.14    │
      2    │         │         │
      3    │         │         │
      4    │         │         │
      5    │ "e"     │ 2.71    │ Occupied, try 5+1
      6    │ "phi"   │ 1.61    │ Occupied, try 5+2
      7    │ "sqrt2" │ 1.41    │ Empty! Insert here

**Hash table lookup visualization**::

    Looking up "sqrt2" (with linear probing):
    1. Calculate hash: hash("sqrt2") % 8 = 5
    2. Check index 5: "e" ≠ "sqrt2", continue
    3. Check index 6: "phi" ≠ "sqrt2", continue
    4. Check index 7: "sqrt2" = "sqrt2", found! Return 1.41

.. code-block:: cpp

    #include <string>
    #include <unordered_map>

    std::unordered_map<std::string, double> constants;
    constants["pi"] = 3.14159;
    constants["e"] = 2.71828;

    double pi = constants["pi"];         // O(1) average lookup
    bool exists = constants.count("e");  // Check existence

How does a hash table work?
    1. A hash function converts the key to an integer (hash code)
    2. The hash code is mapped to an array index (hash code % table_size)
    3. The value is stored at that index
    4. Collisions are handled by chaining or open addressing

What are hash collisions?
    When two different keys hash to the same index. Common solutions:

    - **Chaining**: Store multiple elements at the same index using a
      linked list
    - **Open addressing**: Find the next available slot (linear probing,
      quadratic probing)

Where are hash tables used?
    - Database indexing
    - Caching (memoization)
    - Symbol tables in compilers
    - Counting frequencies
    - Detecting duplicates
    - Implementing dictionaries and sets
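As a toy sketch of the chaining strategy above (real implementations such
as ``std::unordered_map`` also manage load factors and rehashing; the
``ChainedMap`` class here is purely illustrative), each bucket holds a
list of key-value pairs:

.. code-block:: cpp

    #include <functional>
    #include <iostream>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy hash table with chaining: one bucket list per index.
    class ChainedMap {
        std::vector<std::list<std::pair<std::string, double>>> buckets;

        std::size_t index(const std::string& key) const {
            return std::hash<std::string>{}(key) % buckets.size();
        }
    public:
        explicit ChainedMap(std::size_t n) : buckets(n) {}

        void insert(const std::string& key, double value) {
            for (auto& kv : buckets[index(key)]) {
                if (kv.first == key) { kv.second = value; return; }  // Update
            }
            buckets[index(key)].push_back({key, value});  // Append to chain
        }

        bool find(const std::string& key, double& value) const {
            for (const auto& kv : buckets[index(key)]) {  // Walk the chain
                if (kv.first == key) { value = kv.second; return true; }
            }
            return false;
        }
    };

    int main() {
        ChainedMap m(8);
        m.insert("pi", 3.14159);
        m.insert("e", 2.71828);
        double v;
        if (m.find("pi", v)) std::cout << v << '\n';
        return 0;
    }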
Trees
-----

Binary search tree (BST)
^^^^^^^^^^^^^^^^^^^^^^^^

What is a binary search tree?
    A binary tree where, for each node, all left descendants have smaller
    values and all right descendants have larger values. This enables
    efficient searching.

**BST structure visualization**::

    Insert sequence: 50, 30, 70, 20, 40, 60, 80

              50
            /    \
          30      70
         /  \    /  \
        20   40 60   80

    BST property: Left < Node < Right
    - 20, 30, 40 < 50 < 60, 70, 80
    - 20 < 30 < 40
    - 60 < 70 < 80

**BST search visualization**::

    Search for 60:

    Step 1: Start at the root (50)
    Step 2: 60 > 50, go right (to 70)
    Step 3: 60 < 70, go left (to 60)
    Step 4: Found! Return 60

              50
            /    \
          30      70
         /  \    /  \
        20   40 60✓  80

    Comparisons: 3 (log₂ 7 ≈ 2.8)

**BST insert visualization**::

    Insert 55 into the existing BST:

    Step 1: Start at the root (50)
    Step 2: 55 > 50, go right (to 70)
    Step 3: 55 < 70, go left (to 60)
    Step 4: 55 < 60, go left (NULL position)
    Step 5: Insert at the NULL position

              50
            /    \
          30      70
         /  \    /  \
        20   40 60   80
               /
             55  ← New node inserted

**BST in-order traversal (gives a sorted sequence)**::

              50
            /    \
          30      70
         /  \    /  \
        20   40 60   80

    In-order: Left → Node → Right
    Sequence: 20, 30, 40, 50, 60, 70, 80  (sorted!)

**BST delete visualization**::

    Delete 30 (a node with two children):

    Step 1: Find the in-order successor
            (leftmost node of 30's right subtree → 40)

              50
            /    \
          30✗     70
         /  \    /  \
        20   40← 60  80
              ↑
          Successor

    Step 2: Replace 30 with 40, removing 40 from its old position

              50
            /    \
          40      70
         /       /  \
        20      60   80

**Balanced vs unbalanced BST**::

    Balanced BST:          Unbalanced BST (worst case):

              50               10
            /    \               \
          30      70              20
         /  \    /  \               \
        20   40 60   80              30
                                       \
    Height: 3                           40
    Search: O(log n)                      \
                                           50

                                Height: 5
                                Search: O(n)

**Properties:**

- Left subtree < node < right subtree
- In-order traversal gives a sorted sequence
- Average O(log n) for search, insert, delete
- Worst case O(n) if the tree becomes unbalanced

When is a BST useful?
    - Maintaining sorted data with frequent insertions
    - Range queries (find all values between x and y)
    - Finding predecessor/successor
    - Implementing sets and maps

Balanced trees
^^^^^^^^^^^^^^

What is a balanced tree?
    A tree that maintains approximately equal height in its left and right
    subtrees, ensuring O(log n) operations even in the worst case.
    Examples: AVL trees, Red-Black trees. C++ ``std::map`` and ``std::set``
    use balanced trees (typically Red-Black trees).

.. code-block:: cpp

    #include <iostream>
    #include <map>
    #include <string>

    std::map<std::string, double> sorted_map;  // Keys kept sorted
    sorted_map["b"] = 2.0;
    sorted_map["a"] = 1.0;
    sorted_map["c"] = 3.0;

    // Iterate in sorted order: a, b, c
    for (const auto& [key, value] : sorted_map) {
        std::cout << key << ": " << value << std::endl;
    }

Graphs
------

What is a graph?
    A collection of nodes (vertices) connected by edges. Graphs model
    relationships, networks, and connectivity.

**Graph types visualization**::

    Undirected graph:        Directed graph (digraph):
    A — B — C                A → B → C
    |   |   |                ↑   ↓   ↓
    D — E — F                D ← E → F

    Edges have no direction  Edges have direction (arrows)

    Weighted graph:          Complete graph (all connected):
        A --5-- B                   A
       /|       |\                 /|\
     2/ |3     1| \4              / | \
     /  |       |  \             /  |  \
    D   C --6-- E   F           B---C---D

    Edges have weights/costs Every node connected to all others

**Graph representations**::

    Example graph:
    A — B
    |   |
    C — D

    Adjacency matrix (2D array):
         A  B  C  D
       ┌────────────┐
    A  │ 0  1  1  0 │
    B  │ 1  0  0  1 │
    C  │ 1  0  0  1 │
    D  │ 0  1  1  0 │
       └────────────┘

    matrix[i][j] = 1 if an edge exists, 0 otherwise
    Space: O(n²), Edge check: O(1)

    Adjacency list (array of lists):
    A: [B, C]
    B: [A, D]
    C: [A, D]
    D: [B, C]

    Space: O(n + m), Edge check: O(degree)
    (n = nodes, m = edges)

How are graphs represented?

**Adjacency matrix** (2D array):

.. code-block:: cpp

    // For n nodes
    std::vector<std::vector<int>> adj_matrix(n, std::vector<int>(n, 0));
    adj_matrix[i][j] = 1;  // Edge from i to j
    // Space: O(n²), Edge check: O(1)

**Adjacency list** (array of lists):

.. code-block:: cpp

    std::vector<std::vector<int>> adj_list(n);
    adj_list[i].push_back(j);  // Edge from i to j
    // Space: O(n + m), Edge check: O(degree)

**Weighted graph with adjacency list**::

    Graph with weights:
        A --5-- B
       /|       |\
     2/ |3     1| \4
     /  |       |  \
    D   C --6-- E   F

    Adjacency list representation:
    A: [(B,5), (C,3), (D,2)]
    B: [(A,5), (E,1), (F,4)]
    C: [(A,3), (E,6)]
    D: [(A,2)]
    E: [(B,1), (C,6)]
    F: [(B,4)]

.. code-block:: cpp

    // Weighted graph
    struct Edge {
        int dest;
        double weight;
    };
    std::vector<std::vector<Edge>> graph(n);
    graph[u].push_back({v, weight});  // Edge from u to v
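Combining the adjacency list with the queue from earlier, here is a
minimal BFS sketch (node labels A-D mapped to indices 0-3; the traversal
visualizations below summarize the idea):

.. code-block:: cpp

    #include <iostream>
    #include <queue>
    #include <vector>

    // Visit nodes level by level from `start`, using a queue.
    void bfs(const std::vector<std::vector<int>>& adj_list, int start) {
        std::vector<bool> visited(adj_list.size(), false);
        std::queue<int> q;
        visited[start] = true;
        q.push(start);

        while (!q.empty()) {
            int u = q.front();
            q.pop();
            std::cout << u << ' ';
            for (int v : adj_list[u]) {  // Neighbors of u
                if (!visited[v]) {
                    visited[v] = true;   // Mark before enqueueing
                    q.push(v);
                }
            }
        }
    }

    int main() {
        // A=0, B=1, C=2, D=3 with edges A-B, A-C, B-D, C-D
        std::vector<std::vector<int>> adj_list = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
        bfs(adj_list, 0);  // Prints: 0 1 2 3
        std::cout << '\n';
        return 0;
    }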
**Graph traversal visualization**::

    Breadth-First Search (BFS):    Depth-First Search (DFS):
    Start at A, visit level        Start at A, go deep first
    by level                       (numbers show visit order)
    (numbers show BFS level)

         A (0)                          A (1)
        /     \                        /     \
      B(1)    C(1)                   B(2)    C(5)
      /  \                           /  \
    D(2)  E(2)                     D(3)  E(6)
      |                              |
    F(3)                           F(4)

    Visit order: A,B,C,D,E,F       Visit order: A,B,D,F,E,C
    Uses a queue                   Uses a stack (or recursion)

**Real-world graph example**::

    Molecular structure (benzene ring):

      C₁ — C₂
     /       \
    C₆       C₃
     \       /
      C₅ — C₄

    Vertices: Carbon atoms
    Edges: Chemical bonds
    Properties: Ring structure, all nodes have degree 2

Where are graphs used in scientific computing?
    - Molecular structures (atoms as nodes, bonds as edges)
    - Neural networks (nodes as neurons, edges as connections)
    - Finite element meshes (nodes as points, edges as connections)
    - Dependency graphs (task scheduling)
    - Network analysis (social networks, protein interactions)

Choosing the right data structure
---------------------------------

General guidelines
^^^^^^^^^^^^^^^^^^

For random access by index:
    Use ``std::vector`` or arrays

For frequent insertions/deletions at the ends:
    Use ``std::deque`` (double-ended queue)

For LIFO access (Last-In-First-Out):
    Use ``std::stack``

For FIFO access (First-In-First-Out):
    Use ``std::queue``

For priority-based access:
    Use ``std::priority_queue`` (heap)

For key-value lookups:
    Use ``std::unordered_map`` (hash table) for average O(1); use
    ``std::map`` (balanced tree) for sorted keys

For sorted data with frequent updates:
    Use ``std::set`` or ``std::map``

For large datasets in scientific computing:
    - Vectors for contiguous data (best cache performance)
    - Sparse matrices for mostly-zero data
    - Heaps for k-largest/k-smallest problems
    - Hash tables for fast lookups without ordering

Performance considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^

Cache performance matters:
    Modern CPUs can be orders of magnitude faster (roughly 100x) at
    accessing contiguous memory (arrays, vectors) than scattered memory
    (linked lists, trees). For scientific computing, prefer arrays and
    vectors.

Memory overhead:
    - Array/vector: minimal overhead
    - Linked list: 8-16 bytes per element (pointer overhead)
    - Hash table: typically 2x memory for good performance
    - Tree: 16-24 bytes per element (2-3 pointers)

Trade-offs:
    - Fast insertion vs fast access
    - Memory usage vs time complexity
    - Simplicity vs flexibility
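The cache claim above is easy to test for yourself. The following sketch
(timing numbers vary by machine and compiler; the comparison, not the
absolute values, is the point) sums the same data stored contiguously in a
``std::vector`` versus node-by-node in a ``std::list``:

.. code-block:: cpp

    #include <chrono>
    #include <iostream>
    #include <list>
    #include <numeric>
    #include <vector>

    // Time summing n doubles stored in the given container.
    template <typename Container>
    double time_sum_ms(const Container& c) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sum = std::accumulate(c.begin(), c.end(), 0.0);
        auto t1 = std::chrono::steady_clock::now();
        (void)sum;  // Keep the result alive so the loop is not optimized away
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        const std::size_t n = 10'000'000;
        std::vector<double> vec(n, 1.0);  // Contiguous: cache-friendly
        std::list<double> lst(n, 1.0);    // One scattered heap node per element

        std::cout << "vector: " << time_sum_ms(vec) << " ms\n";
        std::cout << "list:   " << time_sum_ms(lst) << " ms\n";
        return 0;
    }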
Scientific computing applications
---------------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Data Structure
     - Scientific Use Cases
   * - Arrays
     - Fixed-size atomic coordinates, lattice parameters, transformation
       matrices, grid points in finite difference methods, wave function
       coefficients
   * - Vectors
     - Growing trajectory data, time series measurements, molecular
       dynamics positions over time, adaptive mesh refinement, particle
       tracking
   * - Linked Lists
     - Sparse matrix row storage, event chains in molecular dynamics,
       undo/redo operations in simulation software
   * - Stacks
     - Recursion in tree traversals, backtracking algorithms, depth-first
       search in molecular conformations, expression parsing
   * - Queues
     - Breadth-first search in crystal structures, task scheduling in
       parallel simulations, buffering data streams from instruments
   * - Heaps (Priority Queues)
     - Event-driven molecular dynamics, Monte Carlo sampling, A*
       pathfinding in energy landscapes, Dijkstra's algorithm for phonon
       dispersion
   * - Hash Tables
     - Atom type lookups by atomic number, symmetry operation caching,
       memoization in quantum chemistry, unique structure identification,
       element property databases
   * - Binary Search Trees
     - Range queries in spatial databases, ordered traversal of energy
       levels, interval trees for overlap detection
   * - Graphs
     - Crystal structure connectivity, molecular bond networks, finite
       element meshes, phonon dispersion relations, reaction pathways,
       protein folding landscapes

Cache-friendly patterns for performance
---------------------------------------

Why cache locality matters
^^^^^^^^^^^^^^^^^^^^^^^^^^

Modern CPUs access memory in cache lines (typically 64 bytes). Accessing
data sequentially can be orders of magnitude faster than random access
because sequential reads hit the cache.

.. code-block:: text

    CPU memory hierarchy (approximate access times):
    L1 Cache:  ~1 ns    (32-64 KB per core)
    L2 Cache:  ~3 ns    (256-512 KB per core)
    L3 Cache:  ~12 ns   (8-32 MB shared)
    RAM:       ~100 ns  (8-64 GB)

    Random access breaks cache locality!

Structure of arrays vs array of structures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For numerical computations with SIMD vectorization:

.. code-block:: cpp

    // Array of Structures (AoS) - BAD for vectorization
    struct Particle {
        double x, y, z;     // position
        double vx, vy, vz;  // velocity
        double mass;
    };
    std::vector<Particle> particles(1000000);

    // To update all x positions, the CPU must skip over y, z, vx, vy, vz,
    // and mass. It cannot vectorize efficiently - the data is not contiguous!
    for (auto& p : particles) {
        p.x += p.vx * dt;  // Strided memory access
    }

.. code-block:: cpp

    // Structure of Arrays (SoA) - GOOD for vectorization
    struct ParticleSystem {
        std::vector<double> x, y, z;     // positions
        std::vector<double> vx, vy, vz;  // velocities
        std::vector<double> mass;
    };
    ParticleSystem particles;
    particles.x.resize(1000000);
    particles.vx.resize(1000000);

    // All x positions are contiguous - perfect for SIMD!
    // The CPU can process 4-8 values at once with AVX/AVX-512.
    for (size_t i = 0; i < particles.x.size(); ++i) {
        particles.x[i] += particles.vx[i] * dt;  // Contiguous memory
    }

Illustrative performance comparison for 10^6 particles:

- **AoS**: ~50 ms (cache misses, no vectorization)
- **SoA**: ~5 ms (cache-friendly, SIMD speedup)
- **Result**: roughly 10x faster with SoA
Contiguous memory access patterns
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: cpp

    // GOOD: Row-major access (cache-friendly)
    std::vector<std::vector<double>> matrix(1000, std::vector<double>(1000));
    for (size_t i = 0; i < 1000; ++i) {
        for (size_t j = 0; j < 1000; ++j) {
            matrix[i][j] = compute(i, j);  // Sequential in memory
        }
    }

.. code-block:: cpp

    // BAD: Column-major access (cache-unfriendly)
    for (size_t j = 0; j < 1000; ++j) {
        for (size_t i = 0; i < 1000; ++i) {
            matrix[i][j] = compute(i, j);  // Jumps 1000 elements each time!
        }
    }

.. code-block:: cpp

    // BETTER: Use flat 1D arrays for 2D/3D data
    std::vector<double> matrix(1000 * 1000);  // Contiguous allocation
    for (size_t i = 0; i < 1000; ++i) {
        for (size_t j = 0; j < 1000; ++j) {
            size_t idx = i * 1000 + j;  // Row-major indexing
            matrix[idx] = compute(i, j);
        }
    }

Key principles for scientific computing:

1. **Prefer arrays/vectors** over linked structures for numerical data
2. **Use SoA layout** when processing the same property across many
   objects
3. **Access memory sequentially** in nested loops (row-major in C++)
4. **Align data** to cache line boundaries (64 bytes) for optimal
   performance
5. **Minimize pointer chasing** - each indirection can cost on the order
   of 100 cycles

GPU and parallel computing considerations
-----------------------------------------

GPU-friendly data structures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 25 55

   * - Data Structure
     - GPU Compatibility
     - Considerations
   * - Arrays
     - ✓ Excellent
     - Perfect for coalesced memory access, native CUDA support, minimal
       overhead
   * - Vectors (std::vector)
     - ✓ Good
     - Copy to the device with ``cudaMemcpy``, use raw pointers in
       kernels, manage capacity carefully
   * - Hash Tables
     - ⚠ Challenging
     - Require special GPU implementations (e.g., NVIDIA's cuCollections,
       or custom kernels) and atomic operations for concurrent access
   * - Linked Lists
     - ✗ Poor
     - Pointer chasing kills parallelism, threads diverge, terrible
       memory access pattern
   * - Trees (BST, Heap)
     - ✗ Poor
     - Similar issues to linked lists; flatten to arrays or use
       GPU-specific implementations
   * - Graphs
     - ⚠ Moderate
     - Use the CSR (Compressed Sparse Row) format; adjacency arrays work
       well, pointer-based adjacency lists do not

Memory coalescing patterns
^^^^^^^^^^^^^^^^^^^^^^^^^^

GPUs achieve peak performance when the threads in a warp (32 threads)
access contiguous memory:

.. code-block:: cpp

    // GOOD: Coalesced access (threads access adjacent elements)
    __global__ void add_vectors(double* a, double* b, double* c, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            c[idx] = a[idx] + b[idx];  // Thread 0: a[0], Thread 1: a[1], ...
        }
    }
    // All 32 threads in a warp access consecutive memory - FAST!

.. code-block:: cpp

    // BAD: Strided access (threads skip elements)
    __global__ void transpose_naive(double* in, double* out, int width) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Reading is coalesced, but writing is strided!
        out[x * width + y] = in[y * width + x];
        // Thread 0 writes out[0], out[width], out[2*width], ...
    }
    // Threads in a warp write to locations 'width' apart - SLOW!

Structure of arrays on GPU
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: cpp

    // Host code - SoA layout
    struct ParticleSystemGPU {
        double *x, *y, *z;     // Device pointers
        double *vx, *vy, *vz;
        int num_particles;
    };

    // Allocate device memory
    cudaMalloc(&particles.x, n * sizeof(double));
    cudaMalloc(&particles.vx, n * sizeof(double));

    // Kernel - perfect for coalescing!
    __global__ void update_positions(double* x, double* vx, double dt, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            x[idx] += vx[idx] * dt;  // Warp accesses x[0], x[1], x[2], ...
        }
    }
Parallel algorithm patterns
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: cpp

    // Pattern 1: Map (element-wise operations)
    // Perfect for GPU: O(n) work, no dependencies
    __global__ void square_elements(double* data, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] = data[idx] * data[idx];
    }

.. code-block:: cpp

    // Pattern 2: Reduce (sum, max, min)
    // Good for GPU with shared memory optimization
    __global__ void reduce_sum(double* input, double* output, int n) {
        __shared__ double shared[256];
        int tid = threadIdx.x;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        shared[tid] = (idx < n) ? input[idx] : 0.0;
        __syncthreads();

        // Tree reduction in shared memory
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) shared[tid] += shared[tid + s];
            __syncthreads();
        }
        if (tid == 0) output[blockIdx.x] = shared[0];
    }

.. code-block:: cpp

    // Pattern 3: Scan (prefix sum)
    // Challenging but powerful on GPU
    // Use thrust::exclusive_scan() or the CUB library

When to use CPU vs GPU
^^^^^^^^^^^^^^^^^^^^^^

**Use the GPU when:**

- Processing large arrays (> 10^5 elements)
- Performing element-wise operations (map, element-wise math)
- Memory access patterns are regular
- Arithmetic intensity is high (many FLOPs per memory access)
- The algorithm is data parallel (same operation on different data)

**Use the CPU when:**

- Data sets are small (< 10^4 elements)
- Memory access is irregular (trees, graphs, hash tables)
- There is a lot of branching and conditional logic
- The computation has sequential dependencies
- Frequent host-device transfers would dominate the runtime

Practical example for ptychography reconstruction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ptychography involves computing exit waves (probe × transmission) across
many probe positions, followed by an FFT to generate diffraction patterns.
With the right data structures, this workload is ideally suited to GPU
acceleration.

.. code-block:: cpp

    // CPU: Use vectors for probe position management
    // (irregular scanning patterns, adaptive step sizes)
    struct PtychographySetup {
        std::vector<double> probe_x;  // Probe positions (may be irregular)
        std::vector<double> probe_y;
        std::vector<int> measurement_order;  // Adaptive ordering
        int num_positions;
        int probe_size;  // e.g., 256x256
    };

    void generate_probe_positions(PtychographySetup& setup, double step_size) {
        // Complex logic: spiral scans, adaptive sampling, overlap constraints.
        // May skip regions or vary the step size based on convergence.
        for (int i = 0; i < setup.num_positions; ++i) {
            // Branching logic, unpredictable access patterns
            if (needs_refinement(i)) {
                add_extra_positions(setup, i);
            }
        }
    }
.. code-block:: cpp

    // GPU: Use SoA arrays for exit wave and FFT computation
    // Regular, predictable, massively parallel across all probe positions
    #include <cufft.h>

    struct ProbeDataGPU {
        cufftDoubleComplex* probe;         // Incident probe (256×256 complex)
        cufftDoubleComplex* transmission;  // Object transmission function
        cufftDoubleComplex* exit_waves;    // probe × transmission per position
        cufftDoubleComplex* diffraction;   // FFT results per position
        double* probe_positions_x;         // Flattened probe coordinates
        double* probe_positions_y;
        int num_positions;
        int probe_width, probe_height;
    };

    // Kernel: Compute exit wave = probe × transmission at each position
    __global__ void compute_exit_waves(
        cufftDoubleComplex* probe,
        cufftDoubleComplex* transmission,
        cufftDoubleComplex* exit_waves,
        double* probe_x, double* probe_y,
        int num_positions, int probe_width, int probe_height,
        int transmission_width)
    {
        int pos_idx = blockIdx.z;  // Which probe position
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int x = blockIdx.x * blockDim.x + threadIdx.x;

        if (pos_idx >= num_positions || x >= probe_width || y >= probe_height)
            return;

        // Position within the transmission function
        int trans_x = (int)(probe_x[pos_idx]) + x;
        int trans_y = (int)(probe_y[pos_idx]) + y;

        // Flattened array indices
        int probe_idx = y * probe_width + x;
        int trans_idx = trans_y * transmission_width + trans_x;
        int exit_idx = pos_idx * probe_width * probe_height + probe_idx;

        // Complex multiplication: exit = probe × transmission
        // Perfect for coalescing - threads in a warp access adjacent memory
        cufftDoubleComplex p = probe[probe_idx];
        cufftDoubleComplex t = transmission[trans_idx];
        exit_waves[exit_idx].x = p.x * t.x - p.y * t.y;  // Real part
        exit_waves[exit_idx].y = p.x * t.y + p.y * t.x;  // Imaginary part
    }

    // Use cuFFT for a batched FFT across all probe positions
    void compute_diffraction_patterns(ProbeDataGPU& data) {
        cufftHandle plan;
        int n[2] = {data.probe_height, data.probe_width};
        int batch = data.num_positions;

        // Create a batched FFT plan - process all positions in parallel
        cufftPlanMany(&plan, 2, n,
                      NULL, 1, 0, NULL, 1, 0,
                      CUFFT_Z2Z, batch);

        // Execute the FFT on all exit waves simultaneously:
        // the GPU processes 100+ probe positions in parallel
        cufftExecZ2Z(plan, data.exit_waves, data.diffraction, CUFFT_FORWARD);
        cufftDestroy(plan);
    }

.. code-block:: cpp

    // Kernel: Compute the error metric (compare with measured intensities)
    __global__ void compute_error(
        cufftDoubleComplex* diffraction,
        double* measured_intensity,
        double* error,
        int num_positions, int probe_size)
    {
        int pos_idx = blockIdx.y;
        int pixel_idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (pos_idx >= num_positions || pixel_idx >= probe_size) return;

        int idx = pos_idx * probe_size + pixel_idx;

        // Compute intensity: |FFT|²
        double intensity = diffraction[idx].x * diffraction[idx].x +
                           diffraction[idx].y * diffraction[idx].y;

        // Error metric (e.g., least squares)
        double diff = intensity - measured_intensity[idx];
        error[idx] = diff * diff;
        // Threads process different probe positions in parallel
    }

Performance breakdown:

- **Probe positions**: 100-1000 positions
- **Probe size**: 256×256 complex numbers (1 MB per position)
- **Total data**: 100 MB - 1 GB of exit waves
- **GPU FFT**: ~1 ms per batch of 100 positions (cuFFT is highly
  optimized)
- **CPU FFT**: ~100 ms (single-threaded FFTW)
- **Speedup**: on the order of 100x with GPU + batched FFT
Why does this work well on the GPU?

1. **Arrays of complex numbers** - perfect memory layout for coalescing
2. **Same operation across all positions** - data parallelism
3. **Batched FFT** - cuFFT processes multiple FFTs simultaneously
4. **No branching** - all threads follow the same code path
5. **High arithmetic intensity** - complex multiplication + FFT (many
   FLOPs per memory access)

Key takeaways:

1. **Arrays and vectors** are the only truly GPU-friendly structures
2. **Memory coalescing** is critical for GPU performance (large speedups
   are possible)
3. **SoA layout** enables vectorization on both CPU (SIMD) and GPU
   (coalescing)
4. **Minimize host-device transfers** - PCIe bandwidth (~10 GB/s) is far
   below device memory bandwidth (hundreds to ~1000 GB/s)
5. **Hybrid approach**: Use the CPU for irregular algorithms and the GPU
   for regular numerical kernels