New Features of Informix Dynamic Server 7.3:
Correlated Subquery Enhancements And Optimizer
Directives
By Kevin Fennimore
Informix’s newest version of their Dynamic Server
database engine, version 7.3, is one of the most exciting and feature-rich
since the introduction of Informix OnLine Dynamic Server. An emphasis was
put on Reliability, Availability and Serviceability, or RAS. These three
elements are nice but the new functionality is what is most exciting. This
new functionality is designed to improve performance and porting from Other
database platforms. Some of these improvements include:
-
Enhanced SQL
-
String manipulation routines such as upper, lower and improved
substring functions
-
NVL function – null value function
-
Decode/Case statements
-
Keyfirst scans
-
Select first N rows
-
Correlated Subquery enhancements
-
Optimizer directives
Being a performance enthusiast I am most interested and excited
about the last two improvements: Correlated Subquery enhancements and Optimizer
Directives.
Correlated Subquery Enhancements
In past articles and presentations on performance I have
advised against using correlated subqueries(CSQs) because they perform
poorly. I still maintain that they should be used sparingly and avoided
whenever possible. However, consider the example where we are looking to
update all of the rows from one table where a value exists in another table,
as in:
update customers set stat = "A"
where exists (
select "X"
from orders o
where o.custid = customer.custid
and o.cmpny = customers.cmpny
and o.stat = "OPEN" )
In this example we want to update all of the customer statuses
where the customer has an "OPEN" order. The only way to perform this is
with the above SQL. The problem with this SQL is that Informix satisfies
it by executing the subquery, on orders, for each record in the customers
table. If the customers table has a million rows (don’t we all want a million
customers?), the subquery will get executed a million times. If there are
only a handful of customers having orders that are open, this SQL is extremely
inefficient.
The most efficient way to execute this statement wold
be to search the orders table, preferably using an index, for all orders
with a status of open, then search the customers table for the associated
records and update them. Prior to version 7.3, you were out of luck because
Informix could not execute it this way. However, with version 7.3, this
has changed and the query can now be satisfied more efficiently.
Let’s look at the "Set Explain" output for a query in
version 7.2:
QUERY:
update orders set ship_charge = 0
where exists (
select "X" from customer c
where c.customer_num = orders.customer_num
and c.state = "MD" )
1) informix.orders: SEQUENTIAL SCAN
Filters: EXISTS <subquery>
Subquery:
Estimated Cost: 1
Estimated # of Rows Returned: 1
1) informix.c: INDEX PATH
Filters: informix.c.state = 'MD'
(1) Index Keys: customer_num
Lower Index Filter:
c.customer_num = orders.customer_num
Notice the "SEQUENTIAL SCAN" on the orders table and then
the "INDEX PATH" into the customers table in the subquery. What happens
here is that the database engine will read all of the orders records
and perform the subquery into the customers table for each order. This
can be very inefficient.
To fix this problem, Informix implemented a feature
called "Subquery Flattening". The concept here is that the query is turned
into a join, optimized and executed as a join. Considering the query in
the set explain output from above, if we wanted to select the orders for
customers living in Maryland (MD), we would do the following SQL:
select orders.*
from orders, customers c
where c.customer_num = orders.customer_num
and c.state = "MD"
In this case, the Informix optimizer would choose to search
the orders table for states equal to "MD" and then search the customers
table. Now let’s look at the "Set Explain" output from the same query under
version 7.3:
QUERY:
update orders set ship_charge = 0
where exists (
select "X" from customer c
where c.customer_num = orders.customer_num
and c.state = "MD" )
1) informix.c: SEQUENTIAL SCAN
Filters: informix.c.state = 'MD'
2) informix.orders: INDEX PATH
(1) Index Keys: customer_num
Lower Index Filter:
orders.customer_num = c.customer_num
NESTED LOOP JOIN
Notice that there is no subquery in the explain output and
the optimizer is now choosing to start with a sequential scan of customers
based on state equal to "MD". The performance could be further enhanced
by adding an index to customers on state. Based on this output
we can see that the optimizer is now choosing the path that is the most
efficient – find the customers in Maryland and then search for their orders.
We are no longer reading all of the orders records and then doing
a subquery into the customers table.
Another new feature that facilitates "Subquery Flattening"
is the Skip Duplicate index scan. This scan is done in order to prevent
multiple searches into a secondary table for the same value. Consider the
following set explain output:
QUERY:
update orders set backlog = "Y"
where exists (
select "X" from items
where orders.order_num = items.order_num
and stock_num = 6 and manu_code =
"SMT" )
1) informix.items: INDEX PATH (Skip
Duplicate)
Filters: (items.stock_num=6 AND items.manu_code='SMT'
)
(1) Index Keys: order_num
(2) informix.orders: INDEX PATH
(1) Index Keys: order_num
Lower Index Filter:
orders.order_num = items.order_num
NESTED LOOP JOIN
In this case the database engine will search the items table
for records that meet the criteria and then search into the orders table
for associated records based on order_num. There is potential for
the same order_num to meet the search criteria which would result
in the same search into the orders table and the same record being updated
multiple times. By performing the "Skip Duplicate" scan, all of the duplicates
are skipped which ensures that only unique values are returned and multiple
scans for the same order_num are avoided.
The next CSQ enhancement is the concept of First
Row/Semi Join. This involves only scanning into the second table of a join
to find the first row. This is possible when a CSQ is flattened and the
second table being joined is part of an exists clause. The database
engine is able to search into the table and stop the search after one row
is found. If one row or one thousand rows are found, the exists condition
is true so the engine is now smart enough to stop after the first row.
An example of this can be observed in the following set explain output:
QUERY:
UPDATE PS_JRNL_LN SET jrnl_line_status
= ‘3’
WHERE BUSINESS_UNIT='ABC'
AND PROCESS_INSTANCE=5960
AND EXISTS (
SELECT 'X'
FROM PS_COMBO_SEL_06 A
WHERE A.SETID='ABC'
AND A.COMBINATION='OVERHEAD'
AND A.CHARTFIELD='ACCOUNT'
AND PS_JRNL_LN.ACCOUNT BETWEEN A.RANGE_FROM_06
AND A.RANGE_TO_06)
Estimated Cost: 79
Estimated # of Rows Returned: 1
1) sysadm.ps_jrnl_ln: INDEX PATH
(1) Index Keys: process_instance
business_unit
Lower Index Filter: (ps_jrnl_ln.business_unit
= 'ABC' AND ps_jrnl_ln.process_instance = 5960 )
2) informix.a: INDEX PATH (First
Row)
Filters: (informix.a.range_to_06
>= ps_jrnl_ln.account AND a.tree_effdt = <subquery> )
(1) Index Keys: setid chartfield
combination range_from_06 range_to_06
Lower Index Filter: (a.setid = 'ABC'
AND (a.combination = 'OVERHEAD' AND a.chartfield = 'ACCOUNT' ) )
Upper Index Filter:
a.range_from_06 <= ps_jrnl_ln.account
NESTED LOOP JOIN (Semi Join)
In this case the database engine is going to read into the
first table and then only read into the second table, ps_combo_sel_06,
to find the first row. This is also noted as a "Semi Join" because it is
not actually joining every row in the second table.
Another improvement that was made specifically for
one software package was the "Predicate Promotion Across Control Blocks".
This is a long technical name for a very simple process. It basically means
to substitute constants into CSQs whenever possible in order to make then
non-correlated. Consider the following example:
select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X"
from PS_SP_BU_GL_NONVW P
where P.business_unit = ps_jrnl_ln.business_unit
)
This query is a CSQ which could be rewritten as a non-correlated
subquery as follows:
select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X"
from PS_SP_BU_GL_NONVW P
where P.business_unit = ‘ABC’
)
This is how most of us would write this query in the first
place but sometimes software packages generate queries without much thought
and they come out like the one above. In this case, Informix can now change
the CSQ into a non-CSQ as evidenced by the following set explain output:
QUERY:
select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X"
from PS_SP_BU_GL_NONVW P
where P.business_unit = ps_jrnl_ln.business_unit)
1) ps_jrnl_ln: INDEX PATH
Filters: NOT EXISTS <subquery>
(1) Index Keys: process_instance business_unit
Lower Index Filter: (ps_jrnl_ln.business_unit
= 'ABC' AND ps_jrnl_ln.process_instance = 5960 )
Subquery:
1) ps_bus_unit_tbl_gl: INDEX PATH
(1) Index Keys: business_unit (Key-Only)
Lower Index Filter:
ps_bus_unit_tbl_gl.business_unit
= 'ABC'
2) ps_bus_unit_tbl_fs: INDEX PATH
(1) Index Keys: business_unit descr
(Key-Only)
Lower Index Filter: ps_bus_unit_tbl_fs.business_unit
= ps_bus_unit_tbl_gl.business_unit
NESTED LOOP JOIN
Notice in the subquery that the "Lower Index Filter" is on
business_unit equal to ‘ABC’ not ps_jrnl_ln.business_unit.
Another nice addition, known as "Constant Subquery Optimization",
is that, at run time, the database engine can now stop processing an exists
or not exists clause on the first row of a query based on the following:
-
The query is a NOT EXISTS and a row is found, or
-
The query is an EXISTS and no row is found
In the above example, if the subquery returns no rows, the
query will stop on the first record from the ps_jrnl_ln table.
These enhancements work for all SQL statements which can
contain CSQs except for those involving views. Views will be handled in
a later release of the database. Of course there is a simple workaround
for this problem which is to not use views. All in all the enhancements
to the CSQs are excellent and produce tremendous results, especially in
applications that use a large number of them. One such application is PeopleSoft.
There are several processes within PeopleSoft which heavily use CSQs. In
some of the beta testing of 7.3 one such process, the AR Posting, was reduced
from four and one half hours to twenty three minutes simply by upgrading
the database engine to version 7.3.
Optimizer Directives
Optimizer directives are comments which can be added to
select statements in order to tell the optimizer which paths it should
consider. They are similar to the "Hints" available in that Other database
but of course they are from Informix and the are better. Informix’s directives
have more features and force the optimizer to choose certain paths as opposed
to making hints to influence the optimizer’s choices which, as I understand
it, may be ignored. Informix’s main intent for providing these directives
is to help developers resolve the few times that they have problems with
the database engine not choosing the proper path. I think Informix would
like their optimizer to be right one hundred percent of the time but we
all know that this is not possible. In cases where a developer finds a
case where they must use directives, I am sure that Informix would like
to know about it so that they can try to fix it and continue to improve
the optimizer.
Directives limit the set of paths from which the optimizer
has to choose. If the following circle was the set of paths the optimizer
could choose, a directive would limit the choices to those in the smaller
circle. In the case of a negative directive, the choices would be limited
to anything outside of the smaller circle.
The syntax for a directive is as follows:
SELECT --+ directive text
SELECT {+ directive text }
UPDATE --+ directive text
UPDATE {+ directive text }
DELETE --+ directive text
DELETE {+ directive text }
The directive must follow the SQL command (select,
update, delete) and must be a comment followed immediately by a plus (+)
sign. The following is a list of directives available, please see the manual
for a full list and specific syntax:
Access Methods
index - forces use of a subset of specified indexes
index_one - forces use of one of the specified indexes
index_all - forces use of all of the specified indexes
avoid_index - avoids use of specified indexes
full - forces sequential scan of specified table
avoid_full - avoids sequential scan of specified table
Join Order
ordered - forces table order to follow the from clause
Join Methods
use_nl - forces nested loop join on specified tables
use_merge - forces sort merge join on specified tables
use_hash - forces hash join on specified tables
avoid_nl - avoids nested loop join on specified tables
avoid_merge - avoids sort merge join on specified tables
avoid_hash - avoids hash join on specified tables
An example of the ordered directive would be:
QUERY:
select --+ ordered
customer.lname, orders.order_num,
items.total_price
from customer, orders, items
where customer.customer_num = orders.customer_num
and orders.order_num = items.order_num
and items.stock_num = 6 and items.manu_code
= "SMT"
DIRECTIVES FOLLOWED:
ORDERED
DIRECTIVES NOT FOLLOWED:
1) customer: SEQUENTIAL SCAN
2) orders: INDEX PATH
(1) Index Keys: customer_num
Lower Index Filter: orders.customer_num
=customer.customer_num
NESTED LOOP JOIN
3) items: INDEX PATH
Filters: items.order_num = orders.order_num
(1) Index Keys: stock_num manu_code
Lower Index Filter: (items.stock_num
= 6 AND items.manu_code = 'SMT' )
NESTED LOOP JOIN
As you can see, the directive forced the database engine
to read from the tables as they are ordered in the from clause. Another
example of directives using the ordered, index and avoid_index
is:
QUERY:
select --+ ordered index(customer,
zip_ix) avoid_index(orders," 101_4")
customer.lname, orders.order_num,
items.total_price
from customer c, orders o, items i
where c.customer_num = o.customer_num
and o.order_num = i.order_num
and stock_num = 6 and manu_code =
"SMT"
Directives Examples : INDEX (cont.)
DIRECTIVES FOLLOWED:
ORDERED
INDEX ( customer zip_ix )
AVOID_INDEX ( orders 101_4 )
DIRECTIVES NOT FOLLOWED:
1)customer: INDEX PATH
(1) Index Keys: zipcode
2)orders: SEQUENTIAL SCAN
DYNAMIC HASH JOIN (Build Outer)
Dynamic Hash Filters:c.customer_num
=o.customer_num
3)items: INDEX PATH
Filters:i.order_num =o.order_num
(1) Index Keys: stock_num manu_code
Lower Index Filter: (i.stock_num =
6 AND i.manu_code = 'SMT' )
NESTED LOOP JOIN
In this case we force the database engine to use the zipcode
index on the customer table and avoid using the index on the orders table
which forces a DYNAMIC HASH JOIN to occur. Again we are using the ordered
directive which forces the database engine to read from the tables following
the order of the from clause. This is obviously very inefficient but illustrates
the use of the directives.
As you can see by the above examples, you can do many
powerful things with directives, including making the database choose inefficient
query plans. The directives are nice features and can really help developers
in testing queries and resolving performance problems. However, they can
also cause performance problems when not used properly and should only
be used when other methods of optimization are unsuccessful.
Kevin Fennimore
1-888-UCI-FOR-U