Several different strategies are available for query access. It is the optimizer's job to choose the most efficient access method for any data modification language (DML) statement. Because SQL is independent of the physical layout of the data, the optimizer is a critical component in the performance of any system. Any optimizer limitations or deficiencies can have a serious impact on the applications using it.
A look into the Sybase System 11 optimizer yields insight into its limitations and shows how you can avoid the more common problems. Adaptive Server 11.5x includes parallel and hash parallel scans to improve performance. The addition of parallel queries has increased the complexity and number of permutations the optimizer can consider, placing a higher burden on the optimizer and highlighting other limitations.
The optimizer's current design can lead to inefficient access strategies. We'll discuss a possible short-term improvement you can make until the System 12 release eliminates some of the current deficiencies.
System design should attempt to avoid the limitations of the hardware or software with which you are working. Database architecture varies from product to product, and if you don't understand your product's architecture, your systems won't meet user expectations. Unfortunately, it's difficult to predict product limitations because they change with application, hardware, and software releases. You must fully test any component change for stability and performance before releasing it into a production system.
The Sybase optimizer uses three access strategies: table scan, clustered index, and nonclustered index.
In Figure 1, Ent_1 has 250 rows, a unique integer key on Col_A, an index on Col_B, and uses eight data pages. Ent_2 has 7,500 rows, a unique integer key on Col_C, and a foreign key of Col_A to Ent_1/Col_A and uses 235 data pages. Ent_3 has 150,000 rows, a unique integer key on Col_E, a foreign key of Col_A to Ent_1/Col_A, and a foreign key of Col_C to Ent_2/Col_C and uses 5,000 data pages.

With no index on the tables, the optimizer does a table scan. The optimizer starts at the first page using the first column entry in the sysindexes table and then reads each page using the data page's next page pointer. A table scan of Ent_1 requires reading eight pages, 235 pages for Ent_2 and 5,000 pages for Ent_3.
The optimizer uses the nonclustered index if the number of pages to be selected is less than a table scan. To access one row, the optimizer starts on the root index page using the root column entry in the sysindexes table. Then it reads the leaf index page that has the maximum key value greater than or equal to the selection value. Finally, the optimizer reads the data page to which the leaf index page points that is equal to the selection value.
To obtain the next value, the optimizer reads the next data page to which the leaf index page points. This is likely be a new page because the data pages aren't in the key sequence of the nonclustered index. (See Table 1.)
|
Ent_1.Col_A: 250 index entries, each consisting of 11 bytes and requiring two leaf pages and one root page (index scan requires approximately 250 pages to be read) |
|
Ent_1.Col_B: 250 index entries, each consisting of 57 bytes and requiring eight leaf pages and one root page |
|
Ent_2.Col_C: 7,500 entries, each consisting of 11 bytes and requiring 42 leaf pages and one root page (approximately 7500 pages to be read) |
|
Ent_3.Col_E: 150,000 entries of 11 bytes and will require 830 leaf pages, seven intermediate pages and one root page (approximately 150,000 pages to be read) |
With a clustered index on the tables, the optimizer uses the clustered index if the number of pages to be selected is less than a table scan. To access one row, the optimizer positions on the index root page using the root column entry in the sysindexes table and reads the data page that has the maximum key value less than or equal to the selection value.
To obtain the next value, the optimizer reads the next row on the page as the data pages are in the key sequence. (See Table 2.)
| Clustered index on: |
| Ent_1.Col_A will have eight entries each of nine bytes and require one index root page (require approximately eight pages to be read) |
| Ent_2.Col_C will have 235 entries of nine bytes and will require two intermediate index pages and one root page (235 pages to be read) |
| Ent_3.Col_E will have 5,000 entries of nine bytes and will require 23 leaf index pages and one index root page (5000 pages to be read) |
Each access strategy reads a different number of pages to satisfy a query, generating the following:
For all rows selected in a table, a clustered index search is the same as a table scan. Thus, a clustered index is generally preferable to a table scan. A nonclustered index is only used if the expected number of rows returned is less than the number of pages in the table. The ratio depends on the row size, as Table 3 shows.
You should use indexes to minimize the number of page accesses required to satisfy a particular query. Clustered indexes are more efficient for range selections than nonclustered indexes, but because you can apply only one to each table, you should reserve it for low-cardinality columns used for critical range selections. You should avoid using too many indexes because update processing deteriorates as you add each additional index. You must take particular care when adding nonclustered indexes to increase efficiency for range selections or joins; often, the optimizer ignores these indexes if the expected number of rows is greater than a table scan. (See Table 3.)
| Row | % of pages | Nonclustered | Nonclustered | Nonclustered | Nonclustered | Nonclustered | Nonclustered |
|---|---|---|---|---|---|---|---|
| size | to rows | equality | closed range | Two row join | Five row join | Two row join | Five row join |
| default | default | equality default | equality default | Five percent selectivity | Five percent selectivity | ||
| 100 | 7 | table scan | table scan | table scan | table scan | table scan | table scan |
| 150 | 10 | index | table scan | table scan | table scan | index | table scan |
| 200 | 12 | index | table scan | table scan | table scan | index | table scan |
| 300 | 20 | index | table scan | index | table scan | index | table scan |
| 400 | 25 | index | index | index | table scan | index | index |
| 500 | 33 | index | index | index | table scan | index | index |
| 1000 | 100 | index | index | index | index | index | index |
| Table 3. Ratio of rows to number of tables. | |||||||
The optimizer should transform any DML statement into the most efficient access plan without you having to tune the query or know anything about the physical layout of the data. Optimizers are either rule-based or cost-based. A rule-based optimizer weights each operator or join depending on predefined rules to obtain the most efficient plan. A cost-based optimizer uses table distribution data to estimate the relative cost of each join and operator to obtain the most efficient plan. Sybase uses a cost-based optimizer.
The Sybase optimizer holds distribution statistics for the indexes. Statistics determine the relative disk and CPU costs for each access plan being considered so the optimizer can choose the cheapest plan to satisfy the query. It predicts each type of I/O, both physical and logical, then weights them by the server default ratio. It subsequently applies any CPU costs to give a total cost for the query. The addition of parallel processing in System 11.5x has added additional cost factors for CPU and I/O.
The distribution page holds a statistical histogram of the index key value distribution and a density value for the index key. If a valid search argument (SARG) is not present, the optimizer won't use the distribution statistics, and a table scan results. The histogram is the most accurate and is used for key lookups by a known value. For joins or in places where a key lookup is by an unknown value, the optimizer uses the density value. If no statistics are available, the optimizer reverts to a syntactical costing where each operator has a fixed percentage selectivity. The equality operator (=) has a fixed percentage of 10 percent; closed range operators (> AND <, or BETWEEN) have a fixed percentage of 25 percent; and open range operators (> or <) are at a fixed percentage of 33 percent. For example, if Ent_2 has 670 rows with 335 unique values in Col_A each occurring twice, we have:
Number of distribution steps = 2016 - (1 X 2) / (4 + 2) = 335
Number of rows per step = 670/335 = 2
Number of key values per step = 335/335 = 1
Density = (12 + 12 + ...)/3552 = 335/3352 = 0.002985
| 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | ... |
Each highlighted value represents a distribution step that will be entered in the distribution page:
Estimated rows using distribution = 1 X 2.0 = 2.0
Estimated rows using density = 670 X 0.002985 = 1.99
Estimated rows using equality default = 670 X 0.10 = 67
The use of defaults is inaccurate compared to distribution/density values and should be avoided.
The distribution page statistics are static and only produced when an index is created or when an update statistic is issued on the table. As a result, it's important that you create indexes when representative data exists in the table. Update statistics are performed after periods of data modification that affect the data distribution.
Optimizers are complex and designed to choose quickly between many permutations. They're heavily dependent upon precise interpretation of SQL syntax. This often leads to inefficient access paths.
The structure of the WHERE clause dictates how the optimizer accesses the data. In the absence of a WHERE clause, the optimizer performs a table scan even if a helpful index exists. There are exceptions, such as Min/Max, count(*), and order by, where an index scan is more efficient than a table scan. For example, a count(*) tallies the number of entries in a nonclustered index that requires fewer page accesses than a table scan.
The optimizer checks the syntax of each WHERE clause restriction to determine if it's
valid. It expects a column restriction to be in the form of:
column selective operator expression
and a join to be in the form of:
column selective operator column
A selective operator is one that defines a range or equality for selection. These include
=, >=, <=, <, >, Between, and Like. The NOT(!) and OR operators are nonselective and may result in table scans.
Some of the most common errors occur when the programmer, writing SQL, expects
efficient access but finds the optimizer choosing an inefficient strategy. This occurs because the SQL isn't in a form the optimizer can use. The seven most common problems we've experienced are the use of expressions; functions, different data types, the OR operator, the NOT operator, a missing join clause, and an invalid join involving a group.
Density values are kept for each index and the optimizer uses them to assess the relative costs of joins. Any inaccuracy in these values can join tables in an inefficient order or cause the optimizer to ignore the most efficient index.
If the Ent_2 table has 67,000 rows, 33,500 unique values in Col_A, and each value occurs twice, we have:
| 1 | 1 | 2 | ... | 100 | ... | 200 | ... | 300 | ... | 400 | ... |
Each highlighted value represents a distribution step and is entered in the distribution table. If we calculate the Density and Rows per join we have:
Number of key values per step = 33,500/335 = 100
Density = (12 + 12 + 12 + ...) / 3352 = 335/3352 = 0.002985
Rows per join = 67000 X 0.002985 = 199.99
The result should be two, but because each distribution entry represents 100 key values, the result is 100 times the correct value.
If the Ent_2 table distribution is skewed and has 670 rows with 112 values in Col_A with each value occurring twice and 112 values in Col_A with each value occurring four times, we have:
| 1 | 1 | 3 | 3 | 3 | 3 | 4 | 4 | 6 | 6 | 6 | 6 | 7 | 7 | 9 | 9 | ... |
Each highlighted value represents a distribution step and is entered in the distribution table. If we calculate Density and Rows per join we have:
Number of key values per step = 224/335 = 0.668
Density = (22 + ... + 12 + ý)/ 3352 = (112 X 4 + 112)/3352 = 0.00498
Estimated rows = 670 X 0.00498 = 3.34
The algorithm weights the skewed rows excessively because one third of the rows returns zero rows, another third returns two rows, and the rest returns four rows. On average, any one row returns two rows -- not 3.34.
The density algorithm that the Sybase optimizer uses chooses an inefficient join plan as the data distributions become skewed; slow response times result. A more skew-tolerant density algorithm significantly improves response times and reduces the need for many of the tuning techniques currently used in the industry. What follows is a new algorithm we've developed that's more tolerant of skew. It's a possible short-term improvement that will buy time until we see the System 12 release that will attempt to eliminate some of the current deficiencies.
The current algorithm neglects to take into account that the distribution page entries are a sample and reflect one or more values. What the optimizer is actually calculating is the number of rows represented by a single distribution entry.
To resolve the inaccuracy, there are five solutions:
Make the distribution page information more accurate. Sybase plans to hold the distribution information in system tables in System 12. We will not see System 12 for some time, and many systems will not be converted until more than a year after the initial release. This is not a satisfactory solution to immediate problems.
Use different coding techniques where problems exist. This is how the majority of companies are dealing with the problem. Because of syntax restrictions or hard-coded index selections, such a solution isn't satisfactory.
Make the density algorithm more accurate.
Add an option on each index to specify the use of arithmetic average or density. As the distribution changes, the density algorithm produces accurate plans. A simple algorithm of arithmetic average for unknown values and an adjusted arithmetic average for known values provides much more accurate plans.
Use a new algorithm.
The new algorithm calculates the density of an even distribution and weights the density by the distribution skew. The new density formula is:
Join Density = S / Sum(n) X R
where S is the number of steps, n is the distribution value and R is the number of rows in outer table.
The algorithm is impractical because the parent table isn't known at the time the index is created. The number of rows for the parent table can be passed as an option for index creation or update statistics. R can also be substituted for the number of distinct key values, although this doesn't account for key values that aren't in the index. The value can exceed one if this is used with extreme distributions, so a limit of one should be placed on the density value.
To illustrate the use of these algorithms, consider the Ent_1 table with 1,000 rows and Ent_2 table with 32,000 rows consisting of 800 unique values in Col_A. Also, consider each value occurring forty times. Using the adjusted algorithm we have:
Density = (334 X 334)/(3342 X 800) = 0.00125
Estimated rows = 32,000 X 0.00125 = 40.0
Using the new algorithm we have:
Density = 334/(334 X 800) = 0.00125
Estimated rows = 32,000 X 0.00125 = 40.0
If we consider that 200 rows correspond to 0 rows and 800 rows correspond to 40 rows, then the expected number of rows for an unknown value is 32,000/1000 = 32. If the value exists in the distribution page, the expected number of rows is 32,000/800 = 40. The distribution page measures the density of key values that exist in the inner table; all outer key values aren't represented in the distribution page because of the sampling or data distribution.
These algorithms still have two main areas of inaccuracy:
Small tables. This inaccuracy occurs when the number of distribution steps is greater than the number of rows. It causes duplicate entries in the distribution page which gets weighted and causes inaccuracy.
Uneven distributions. This problem arises when the number of unique distribution steps decreases at the same ratio as the number of distinct key values and causes the adjustment to be less significant. As a result, inaccuracy follows.
The optimizer is a critical component in any Sybase RDBMS because it converts the user request into the physical access strategy. Any limitations or deficiencies during this translation will cause the overall RDBMS performance to decline as more system resources are being used than necessary. Any optimizer improvements can decrease response times and increase concurrency without the need for additional hardware or specialized system tuning.
Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.