What is indexing?
An index is a special data structure that stores a small, searchable reference to larger pieces of data. Think of it like the index at the back of a book: instead of flipping through every page to find a topic, you look up the topic in the index and it tells you the exact page number. In tech, an index points to the location of rows in a database table, files on a disk, or documents in a search engine, allowing the system to find what you need without scanning everything.
Let's break it down
- Key: The value you want to search for (e.g., a customer’s ID, a word in a document).
- Pointer/Reference: A tiny “address” that tells the system where the full record lives.
- Structure: Most indexes are built as trees (like B‑trees) or hash tables, which keep the keys sorted or grouped for fast lookup.
- Creation: When you tell a database or search engine to create an index, it reads the existing data, extracts the keys, and builds the structure, storing it separately from the main data.
Why does it matter?
Without an index, the system must examine every single record to answer a question-a process called a full table scan or linear search. That can be slow and wasteful, especially with millions of rows or documents. An index reduces the work to a few steps, turning minutes of waiting into milliseconds, which makes applications feel responsive and saves computing resources.
Where is it used?
- Relational databases (MySQL, PostgreSQL, SQL Server) to speed up SELECT queries.
- NoSQL stores (MongoDB, Elasticsearch) for fast lookups on specific fields or full‑text search.
- File systems (NTFS, ext4) use indexes to locate files quickly.
- Search engines (Google, Bing) build massive inverted indexes to map words to web pages.
- Programming languages often provide indexed collections like arrays or hash maps for quick access.
Good things about it
- Speed: Queries that use an index can be orders of magnitude faster.
- Efficiency: Less CPU and I/O are needed, freeing resources for other tasks.
- Scalability: Proper indexing lets applications handle growing data volumes without degrading performance.
- Flexibility: Different index types (B‑tree, hash, GiST, full‑text) can be chosen to match specific query patterns.
Not-so-good things
- Storage cost: Indexes take up extra disk space, sometimes as much as the data they reference.
- Write overhead: Every INSERT, UPDATE, or DELETE must also modify the index, slowing down write‑heavy workloads.
- Maintenance: Over time indexes can become fragmented and may need rebuilding or re‑optimizing.
- Complexity: Choosing the wrong columns or too many indexes can actually hurt performance rather than help.