Image by Michael Dziedzic via Unsplash (copyright-free)
In most of our articles, we try to share the most common tools and techniques we use for tuning SQL, making it more performant in terms of resource consumption and response time. While we deliberately avoid going too deep into database internals, we may not have spent enough time on some of the more basic concepts. These basics are the key to understanding how databases work and to writing code that is easy for the database to optimize, so it can take full advantage of its features and indexes.
In this article we will focus on:
- What is SQL?
- What is the SQL Engine and Query Optimizer?
- What are statistics and how are they used?
What is SQL?
Originally, the users of DBMSs (database management systems) were programmers. Accessing the stored data required writing a program in a language such as COBOL. How data was stored depended on the vendor's hardware and on the data structures used to keep the data. Simply put, you needed a developer just to read a table (assuming the data was even physically stored in a table format :) ).
Allowing users to access data on an ad hoc basis required giving them a language to express their requests without needing to know anything about the underlying hardware. A single request to a database is called a query, and such a language is called a query language. Many query languages were developed for this purpose, but one of them became the most popular: Structured Query Language, invented at IBM in the 1970s. SQL became an ANSI standard in 1986 and an ISO standard in 1987, and it is used today in many DBMSs.
Declarative query languages let users express what data to retrieve and leave it to the database engine underneath to retrieve it. They operate at a more general level, requiring only a description of the task to be completed rather than the specifics of how to complete it. They deal with results rather than process, abstracting away the finer details of each task.
The most important paradigm to understand is this: in SQL, you do not write the actual algorithm for retrieving data (as you would in Java, for example, although even object-oriented languages are now adding declarative constructs, e.g. Java streams); you just write a query asking for the data. The database engine then has to generate the algorithm that finds the data for you.
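The contrast can be sketched in a few lines. This is an illustrative toy, not code from the article: the table, its columns, and the data are invented, and Python with SQLite stands in for a full DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ana", 90000), ("Bo", 60000), ("Cy", 120000)])

# Imperative style: we spell out the algorithm ourselves -- fetch
# every row and filter by hand.
high_paid = []
for name, salary in conn.execute("SELECT name, salary FROM employees"):
    if salary > 80000:
        high_paid.append(name)

# Declarative style: we state WHAT we want; the engine decides HOW
# (full scan vs. index seek, evaluation order, and so on).
declarative = [row[0] for row in conn.execute(
    "SELECT name FROM employees WHERE salary > 80000")]
```

Both produce the same result, but only in the second case is the engine free to pick a better algorithm if, say, an index on salary appears later.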
What is the SQL Engine and Query Optimizer?
The Storage Engine and the Relational Engine, which contains the Query Optimizer, are two core components of a Database Engine. The Storage Engine is the software module that creates, reads, and updates data between disk and memory while maintaining data integrity (rollback journals and write-ahead logs). The Query Optimizer analyzes an SQL statement to determine the most efficient way to extract the requested data.
To process an SQL statement, a DBMS performs the following five steps:
- The DBMS first parses an SQL statement. It breaks the statement up into individual words (called tokens), makes sure that the statement has a valid verb and valid clauses, and so on. Syntax errors and misspellings can be detected in this step.
- The DBMS validates the statement. It checks the statement against the system catalogue. Do all the tables named in the statement exist in the database? Do all the columns exist and are the column names unambiguous? Does the user have the required privileges to execute the statement? Certain semantic errors can be detected in this step.
- The DBMS generates multiple candidate access plans for the statement (including serial and parallel plans, depending on settings). An access plan is a logical representation of the steps required to carry out the statement; once compiled, it is the DBMS equivalent of executable code.
- The DBMS optimizes the access plan. It explores various ways to carry out the access plan. Can an index be used to speed up a search? Should the DBMS first apply a search condition to Table A and then join it to Table B, or should it begin with the join and apply the search condition afterwards? Can a sequential search through a table be avoided or reduced to a subset of the table? After exploring the alternatives, the DBMS chooses one of them. This is why SQL is a declarative language: you don't need to know the algorithms for joining, filtering, parsing, or retrieving the data in the most optimal way. You just have to know what data you need and leave the "how" to the DB Engine.
- The DBMS executes the statement by running the execution plan.
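As a rough illustration of the steps above, SQLite (used here through Python's sqlite3 module as a stand-in for a full DBMS) performs parsing, validation, and planning when a statement is prepared, and EXPLAIN QUERY PLAN exposes the plan the optimizer chose. The table and index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
conn.execute("CREATE INDEX ix_val ON t(val)")

# Steps 1-4 (parse, validate, generate and optimize plans) happen when
# the statement is prepared; EXPLAIN QUERY PLAN shows the chosen plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM t WHERE val = 42").fetchall()
# The last column of each plan row describes the step, e.g. a search
# using the ix_val index rather than a full table scan.

# Step 1 in action: a misspelled verb is rejected during parsing.
try:
    conn.execute("SELEC id FROM t")
except sqlite3.OperationalError as e:
    parse_error = str(e)  # mentions a syntax error near 'SELEC'
```

Only step 5 (execution) actually touches the data; everything before it operates on the statement text and the catalogue.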
The process above is cost-based optimization. For complex queries, big tables, or many possible access patterns (because of indexes or indexed views), the optimizer might take more time to produce the best possible execution plan. Still, at some point it has to make a decision and settle for a 'good enough' plan (because we can't wait forever). The execution plan is then compiled, cached, and kept in memory for later reuse. By caching execution plans, the database saves the time and resources of repeating the first four steps above. As a real-life example, it's not unusual to see 250+ milliseconds of CPU time spent just compiling a complex plan, which means you can only compile four such queries per second per CPU core. That's when plan cache pollution due to non-parameterized queries (too often caused by a wrong ORM configuration) can make a big difference on very busy OLTP systems.
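The difference between a plan-cache-friendly query and a cache-polluting one can be sketched with Python's sqlite3 module, whose small per-connection statement cache plays the role of a server-side plan cache here. The orders table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# Cache-polluting: every distinct literal produces a distinct SQL text,
# so each call must be parsed, validated, and planned from scratch.
for i in range(3):
    conn.execute(f"SELECT total FROM orders WHERE id = {i}")

# Cache-friendly: one SQL text, many parameter values -- the prepared
# statement (and its plan) is compiled once and reused.
for i in range(3):
    rows = conn.execute(
        "SELECT total FROM orders WHERE id = ?", (i,)).fetchall()
```

sqlite3 keys its statement cache by the exact SQL text, which is why only the parameterized form benefits; server DBMSs apply the same idea to their shared plan cache.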
A database engine can support more than 130 physical operators (sort, hash match, bitmap filter, and so on) that may appear in a plan, even for an SQL statement that just joins two tables. When the engine builds the data access path, we are not telling it which indexes or physical operators to use, or in which order to use them. We are just specifying what we want.
What are statistics and how are they used?
The Query Optimizer uses statistics to create query plans that improve query performance. Statistics for query optimization are binary large objects (BLOBs) that contain statistical information about the distribution of values in one or more columns of a table or indexed view. The Query Optimizer uses these statistics to estimate the cardinality: the number of rows to be retrieved or processed in the query result. These cardinality estimates enable the Query Optimizer to create a high-quality query plan. For example, depending on your predicates, the Query Optimizer could use cardinality estimates to choose an index seek operator instead of the more resource-intensive index scan operator if doing so improves query performance.
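A minimal way to watch statistics and indexes change the chosen operator, again using SQLite through Python's sqlite3 module as a stand-in for a full DBMS (the sales table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
# 100 rows, each with a distinct region value -- an equality predicate
# on region is therefore highly selective.
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("r%d" % i, float(i)) for i in range(100)])

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output describes each step.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT amount FROM sales WHERE region = 'r7'")  # full SCAN

conn.execute("CREATE INDEX ix_region ON sales(region)")
conn.execute("ANALYZE")  # collect statistics for the optimizer

after = plan("SELECT amount FROM sales WHERE region = 'r7'")   # index SEARCH
```

With no index the only option is a scan; once the index and its statistics exist, the optimizer estimates that the predicate matches very few rows and switches to a search on ix_region.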
There are the following types of statistics:
- System-created statistics.
- Index statistics that are created together with the index.
- Statistics that the DBMS creates if none are available for a column. If you are filtering a table on a column that has no index or statistics, the Engine will usually create them so it can estimate the number of rows returned and build an execution plan (though this is not always the case).
- User-created statistics, which we will focus on in the next chapter. Unlike system-created statistics (index statistics aside), these can be created on multiple columns.
All relational databases have statistics; it's just that some cloud-native databases, such as Redshift or Snowflake, do not expose this data to the end user. It's also worth mentioning that this is rarely the case for NoSQL: databases like MongoDB have a much simpler query optimizer and do not leverage statistics to come up with the best possible algorithm.
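For a concrete (if simplified) look at such metadata, SQLite stores its statistics in an ordinary table, sqlite_stat1, that you can query after running ANALYZE. A sketch via Python's sqlite3 module, with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER)")
# 100 rows, 10 distinct values in column c.
conn.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(100)])
conn.execute("CREATE INDEX ix_c ON t(c)")
conn.execute("ANALYZE")  # gathers statistics into sqlite_stat1

stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'ix_c'").fetchall()
# stat reads '100 10': ~100 rows in the table, and on average ~10 rows
# per distinct value of c -- exactly the kind of cardinality input the
# optimizer needs.
```

SQLite's format is far simpler than a full density-vector-plus-histogram object, but the purpose is the same: give the planner row counts and selectivity per column.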
Statistics have two main components:
- Density vector, which gives a general overview of the cardinality of the data in columns.
- Histogram, which is created only on the leading column of the index or statistics. The histogram is built from reading a portion of the table (the sample size can be configured), from which up to 200 step values are sampled; for each step we store the metadata shown in the picture below.
Density is the information about the number of duplicates in a given column or combination of columns and it is calculated as 1/(number of distinct values). The Query Optimizer uses densities to enhance cardinality estimates for queries that return multiple columns from the same table or indexed view. As the density decreases, the selectivity of a value increases. In general, this information is used by the Query Optimizer to know what amount of data will be retrieved after doing an aggregation (which groups data) or while doing a distinct. A density vector provides generic information about what is the density of a specific column or combination of columns.
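The arithmetic is simple enough to show directly. The column contents below are invented for the example:

```python
# Density = 1 / (number of distinct values).
# Lower density => higher selectivity of any single value.
countries = ["US"] * 500 + ["DE"] * 300 + ["FR"] * 200  # 3 distinct values
user_ids = list(range(1000))                            # 1000 distinct values

density_country = 1 / len(set(countries))  # 1/3    -> low selectivity
density_user_id = 1 / len(set(user_ids))   # 1/1000 -> high selectivity

# The optimizer can estimate the average number of rows per distinct
# value (e.g. per GROUP BY group) as total_rows * density:
est_rows_per_country = len(countries) * density_country  # about 333
```

This is why a filter on user_id is expected to return far fewer rows than a filter on country, even before any histogram is consulted.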
A histogram measures the frequency of occurrence for each distinct value in a data set. The Query Optimizer computes the histogram on the column values in the first key column of the statistics object, selecting the column values by statistically sampling the rows or by performing a full scan of all rows in the table or view. If the histogram is created from a sampled set of rows, the stored totals for the number of rows and the number of distinct values are estimates and need not be whole numbers.
To create the histogram, the Query Optimizer sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps. Each histogram step includes a range of column values followed by an upper bound column value. The range includes all possible column values between boundary values, excluding the boundary values themselves. The lowest of the sorted column values is the upper boundary value for the first histogram step.
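A toy version of this construction can be sketched in a few lines. Real engines choose step boundaries much more carefully; the bucketing rule below (keep every distinct value as a boundary until the step budget runs out) is a deliberate oversimplification, and the column names mirror the metadata discussed here.

```python
from collections import Counter

def build_histogram(values, max_steps=200):
    """Toy histogram: each step stores an upper-bound value
    (RANGE_HI_KEY), the row count equal to that bound (EQ_ROWS), the
    row count strictly inside the step's range (RANGE_ROWS), and the
    average rows per distinct in-range value (AVG_RANGE_ROWS)."""
    freq = Counter(values)
    steps, range_rows, range_distinct = [], 0, 0
    for v in sorted(freq):  # the lowest value bounds the first step
        if len(steps) < max_steps:
            avg = range_rows / range_distinct if range_distinct else 0
            steps.append({"RANGE_HI_KEY": v, "EQ_ROWS": freq[v],
                          "RANGE_ROWS": range_rows, "AVG_RANGE_ROWS": avg})
            range_rows = range_distinct = 0
        else:  # oversimplified: overflow values are merely accumulated
            range_rows += freq[v]
            range_distinct += 1
    return steps

hist = build_histogram([1, 1, 2, 5, 5, 5, 9])
```

With a 200-step budget and only four distinct values, every distinct value becomes its own boundary, so each EQ_ROWS is an exact count and the ranges are empty.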
If we take the statistics from the picture above, a few simple examples show how this works in practice:
- For a query like "select * from dbo.Tiposdecambio where TCA_MONEDASCORRESPONSAL_ID=55", the Engine knows it needs to return about 151 rows, just by finding RANGE_HI_KEY=55 and reading EQ_ROWS=151.1951.
- If we wrote a query like "select * from dbo.Tiposdecambio where TCA_MONEDASCORRESPONSAL_ID=100", the Engine would have to estimate: the value 100 is not a sampled step, so it does not appear in RANGE_HI_KEY and there is no EQ_ROWS for it. We do know it falls between RANGE_HI_KEY 55 and 295, so the estimated number of rows returned is AVG_RANGE_ROWS=263.3754. With this information, the Query Optimizer knows roughly how much data will be retrieved.
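The two lookups above can be turned into a small estimation routine over the histogram metadata. The step values below are copied from the example; the EQ_ROWS value for step 295 is not shown in the article, so it is a placeholder, and the routine is of course a drastic simplification of what a real optimizer does:

```python
import bisect

histogram = [
    {"RANGE_HI_KEY": 55, "EQ_ROWS": 151.1951, "AVG_RANGE_ROWS": 0.0},
    # EQ_ROWS for step 295 is not given in the article -- placeholder:
    {"RANGE_HI_KEY": 295, "EQ_ROWS": 1.0, "AVG_RANGE_ROWS": 263.3754},
]

def estimate_rows(value):
    """Estimate rows matching `col = value` from the histogram."""
    keys = [s["RANGE_HI_KEY"] for s in histogram]
    i = bisect.bisect_left(keys, value)
    if i < len(histogram) and keys[i] == value:
        return histogram[i]["EQ_ROWS"]         # exact step boundary
    if i < len(histogram):
        return histogram[i]["AVG_RANGE_ROWS"]  # inside a step's range
    return 0.0                                 # beyond the last step

estimate_rows(55)   # exact boundary  -> EQ_ROWS
estimate_rows(100)  # between 55, 295 -> AVG_RANGE_ROWS
```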
For most physical and logical operators, you can find formulas on the Internet describing how these statistics are used by the physical operations.
We hope this has given you a glimpse into how even the database engine estimates and makes assumptions and predictions while creating its algorithms. Sometimes the resulting algorithm is not the most optimal; this can be caused by statistics that are not good enough, or by the engine having to make too many predictions because the SQL is too complicated. That is why we often break an SQL statement into smaller, simpler statements and use temporary tables to split up large code (temporary tables also get statistics of their own): the database has to do less estimation when the code is simple, which helps it create a better execution plan and deliver a faster response time.
We hope you have taken a big step by understanding that SQL is a declarative language, and how we can sometimes use statistics to help the database make better predictions and thus build more efficient execution plans.
- Query processing architecture https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-ver15
- Statistics https://docs.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver15
- Declarative languages- SQL https://en.wikipedia.org/wiki/Declarative_programming
- Recompiling a stored procedure https://docs.microsoft.com/en-us/sql/relational-databases/stored-procedures/recompile-a-stored-procedure?view=sql-server-ver15
- Physical and logical operators https://docs.microsoft.com/en-us/sql/relational-databases/showplan-logical-and-physical-operators-reference?view=sql-server-ver15