Back to Current Newsletter Washington Area Informix Users Group


New Features of Informix Dynamic Server 7.3:
Correlated Subquery Enhancements And Optimizer Directives
By Kevin Fennimore

 Informix’s newest version of their Dynamic Server database engine, version 7.3, is one of the most exciting and feature-rich since the introduction of Informix OnLine Dynamic Server. An emphasis was put on Reliability, Availability and Serviceability, or RAS. These three elements are nice but the new functionality is what is most exciting. This new functionality is designed to improve performance and porting from Other database platforms. Some of these improvements include:

Being a performance enthusiast I am most interested and excited about the last two improvements: Correlated Subquery enhancements and Optimizer Directives.

 Correlated Subquery Enhancements

In past articles and presentations on performance I have advised against using correlated subqueries(CSQs) because they perform poorly. I still maintain that they should be used sparingly and avoided whenever possible. However, consider the example where we are looking to update all of the rows from one table where a value exists in another table, as in:

update customers set stat = "A"
where exists ( select "X"
from orders o
where o.custid = customer.custid
and o.cmpny = customers.cmpny
and o.stat = "OPEN" )
In this example we want to update all of the customer statuses where the customer has an "OPEN" order. The only way to perform this is with the above SQL. The problem with this SQL is that Informix satisfies it by executing the subquery, on orders, for each record in the customers table. If the customers table has a million rows (don’t we all want a million customers?), the subquery will get executed a million times. If there are only a handful of customers having orders that are open, this SQL is extremely inefficient.

The most efficient way to execute this statement wold be to search the orders table, preferably using an index, for all orders with a status of open, then search the customers table for the associated records and update them. Prior to version 7.3, you were out of luck because Informix could not execute it this way. However, with version 7.3, this has changed and the query can now be satisfied more efficiently.

Let’s look at the "Set Explain" output for a query in version 7.2:

QUERY:
update orders set ship_charge = 0
where exists ( select "X" from customer c
where c.customer_num = orders.customer_num
and c.state = "MD" )

1) informix.orders: SEQUENTIAL SCAN

Filters: EXISTS <subquery>

Subquery:
Estimated Cost: 1
Estimated # of Rows Returned: 1

1) informix.c: INDEX PATH

Filters: informix.c.state = 'MD'
(1) Index Keys: customer_num Lower Index Filter: c.customer_num = orders.customer_num
Notice the "SEQUENTIAL SCAN" on the orders table and then the "INDEX PATH" into the customers table in the subquery. What happens here is that the database engine will read all of the orders records and perform the subquery into the customers table for each order. This can be very inefficient.

 To fix this problem, Informix implemented a feature called "Subquery Flattening". The concept here is that the query is turned into a join, optimized and executed as a join. Considering the query in the set explain output from above, if we wanted to select the orders for customers living in Maryland (MD), we would do the following SQL:

select orders.*
from orders, customers c
where c.customer_num = orders.customer_num
and c.state = "MD"
In this case, the Informix optimizer would choose to search the orders table for states equal to "MD" and then search the customers table. Now let’s look at the "Set Explain" output from the same query under version 7.3: QUERY:
update orders set ship_charge = 0
where exists ( select "X" from customer c
where c.customer_num = orders.customer_num
and c.state = "MD" )
 
1) informix.c: SEQUENTIAL SCAN Filters: informix.c.state = 'MD' 2) informix.orders: INDEX PATH (1) Index Keys: customer_num Lower Index Filter: orders.customer_num = c.customer_num NESTED LOOP JOIN
Notice that there is no subquery in the explain output and the optimizer is now choosing to start with a sequential scan of customers based on state equal to "MD". The performance could be further enhanced by adding an index to customers on state. Based on this output we can see that the optimizer is now choosing the path that is the most efficient – find the customers in Maryland and then search for their orders. We are no longer reading all of the orders records and then doing a subquery into the customers table.

Another new feature that facilitates "Subquery Flattening" is the Skip Duplicate index scan. This scan is done in order to prevent multiple searches into a secondary table for the same value. Consider the following set explain output:

QUERY:
update orders set backlog = "Y"
where exists ( select "X" from items
where orders.order_num = items.order_num
and stock_num = 6 and manu_code = "SMT" )

1) informix.items: INDEX PATH (Skip Duplicate)

Filters: (items.stock_num=6 AND items.manu_code='SMT' )
(1) Index Keys: order_num
 
(2) informix.orders: INDEX PATH (1) Index Keys: order_num
Lower Index Filter:
orders.order_num = items.order_num
NESTED LOOP JOIN
In this case the database engine will search the items table for records that meet the criteria and then search into the orders table for associated records based on order_num. There is potential for the same order_num to meet the search criteria which would result in the same search into the orders table and the same record being updated multiple times. By performing the "Skip Duplicate" scan, all of the duplicates are skipped which ensures that only unique values are returned and multiple scans for the same order_num are avoided.

 The next CSQ enhancement is the concept of First Row/Semi Join. This involves only scanning into the second table of a join to find the first row. This is possible when a CSQ is flattened and the second table being joined is part of an exists clause. The database engine is able to search into the table and stop the search after one row is found. If one row or one thousand rows are found, the exists condition is true so the engine is now smart enough to stop after the first row. An example of this can be observed in the following set explain output:

QUERY:
UPDATE PS_JRNL_LN SET jrnl_line_status = ‘3’
WHERE BUSINESS_UNIT='ABC'
AND PROCESS_INSTANCE=5960
AND EXISTS ( SELECT 'X' FROM PS_COMBO_SEL_06 A
WHERE A.SETID='ABC'
AND A.COMBINATION='OVERHEAD'
AND A.CHARTFIELD='ACCOUNT'
AND PS_JRNL_LN.ACCOUNT BETWEEN A.RANGE_FROM_06
AND A.RANGE_TO_06)
 
Estimated Cost: 79
Estimated # of Rows Returned: 1

1) sysadm.ps_jrnl_ln: INDEX PATH

(1) Index Keys: process_instance business_unit Lower Index Filter: (ps_jrnl_ln.business_unit = 'ABC' AND ps_jrnl_ln.process_instance = 5960 )
 
2) informix.a: INDEX PATH (First Row)  Filters: (informix.a.range_to_06 >= ps_jrnl_ln.account AND a.tree_effdt = <subquery> )

 (1) Index Keys: setid chartfield combination range_from_06 range_to_06

Lower Index Filter: (a.setid = 'ABC' AND (a.combination = 'OVERHEAD' AND a.chartfield = 'ACCOUNT' ) )
Upper Index Filter: a.range_from_06 <= ps_jrnl_ln.account
NESTED LOOP JOIN (Semi Join)
In this case the database engine is going to read into the first table and then only read into the second table, ps_combo_sel_06, to find the first row. This is also noted as a "Semi Join" because it is not actually joining every row in the second table.

 Another improvement that was made specifically for one software package was the "Predicate Promotion Across Control Blocks". This is a long technical name for a very simple process. It basically means to substitute constants into CSQs whenever possible in order to make then non-correlated. Consider the following example:

select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X" from PS_SP_BU_GL_NONVW P
where P.business_unit = ps_jrnl_ln.business_unit
)
This query is a CSQ which could be rewritten as a non-correlated subquery as follows: select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X" from PS_SP_BU_GL_NONVW P
where P.business_unit = ‘ABC’
)
This is how most of us would write this query in the first place but sometimes software packages generate queries without much thought and they come out like the one above. In this case, Informix can now change the CSQ into a non-CSQ as evidenced by the following set explain output: QUERY:
select * from ps_jrnl_ln
where business_unit = 'ABC’
and process_instance = 5960
and not exists
( select "X" from PS_SP_BU_GL_NONVW P
where P.business_unit = ps_jrnl_ln.business_unit)

1) ps_jrnl_ln: INDEX PATH

Filters: NOT EXISTS <subquery>
(1) Index Keys: process_instance business_unit Lower Index Filter: (ps_jrnl_ln.business_unit = 'ABC' AND ps_jrnl_ln.process_instance = 5960 )

Subquery:
1) ps_bus_unit_tbl_gl: INDEX PATH

(1) Index Keys: business_unit (Key-Only)
Lower Index Filter: ps_bus_unit_tbl_gl.business_unit = 'ABC'

2) ps_bus_unit_tbl_fs: INDEX PATH

(1) Index Keys: business_unit descr (Key-Only)
Lower Index Filter: ps_bus_unit_tbl_fs.business_unit = ps_bus_unit_tbl_gl.business_unit

NESTED LOOP JOIN

Notice in the subquery that the "Lower Index Filter" is on business_unit equal to ‘ABC’ not ps_jrnl_ln.business_unit.

Another nice addition, known as "Constant Subquery Optimization", is that, at run time, the database engine can now stop processing an exists or not exists clause on the first row of a query based on the following:

  1. The query is a NOT EXISTS and a row is found, or
  2. The query is an EXISTS and no row is found
In the above example, if the subquery returns no rows, the query will stop on the first record from the ps_jrnl_ln table.

These enhancements work for all SQL statements which can contain CSQs except for those involving views. Views will be handled in a later release of the database. Of course there is a simple workaround for this problem which is to not use views. All in all the enhancements to the CSQs are excellent and produce tremendous results, especially in applications that use a large number of them. One such application is PeopleSoft. There are several processes within PeopleSoft which heavily use CSQs. In some of the beta testing of 7.3 one such process, the AR Posting, was reduced from four and one half hours to twenty three minutes simply by upgrading the database engine to version 7.3.

Optimizer Directives

Optimizer directives are comments which can be added to select statements in order to tell the optimizer which paths it should consider. They are similar to the "Hints" available in that Other database but of course they are from Informix and the are better. Informix’s directives have more features and force the optimizer to choose certain paths as opposed to making hints to influence the optimizer’s choices which, as I understand it, may be ignored. Informix’s main intent for providing these directives is to help developers resolve the few times that they have problems with the database engine not choosing the proper path. I think Informix would like their optimizer to be right one hundred percent of the time but we all know that this is not possible. In cases where a developer finds a case where they must use directives, I am sure that Informix would like to know about it so that they can try to fix it and continue to improve the optimizer.

Directives limit the set of paths from which the optimizer has to choose. If the following circle was the set of paths the optimizer could choose, a directive would limit the choices to those in the smaller circle. In the case of a negative directive, the choices would be limited to anything outside of the smaller circle.

 


The syntax for a directive is as follows:

The directive must follow the SQL command (select, update, delete) and must be a comment followed immediately by a plus (+) sign. The following is a list of directives available, please see the manual for a full list and specific syntax: Access Methods index - forces use of a subset of specified indexes
index_one - forces use of one of the specified indexes
index_all - forces use of all of the specified indexes
avoid_index - avoids use of specified indexes
full - forces sequential scan of specified table
avoid_full - avoids sequential scan of specified table
Join Order ordered - forces table order to follow the from clause Join Methods use_nl - forces nested loop join on specified tables
use_merge - forces sort merge join on specified tables
use_hash - forces hash join on specified tables
avoid_nl - avoids nested loop join on specified tables
avoid_merge - avoids sort merge join on specified tables
avoid_hash - avoids hash join on specified tables
 
An example of the ordered directive would be: QUERY:
select --+ ordered
customer.lname, orders.order_num, items.total_price
from customer, orders, items
where customer.customer_num = orders.customer_num
and orders.order_num = items.order_num
and items.stock_num = 6 and items.manu_code = "SMT"

DIRECTIVES FOLLOWED:
ORDERED
DIRECTIVES NOT FOLLOWED:

1) customer: SEQUENTIAL SCAN
2) orders: INDEX PATH

(1) Index Keys: customer_num
Lower Index Filter: orders.customer_num =customer.customer_num
NESTED LOOP JOIN

3) items: INDEX PATH

Filters: items.order_num = orders.order_num
(1) Index Keys: stock_num manu_code
Lower Index Filter: (items.stock_num = 6 AND items.manu_code = 'SMT' )

NESTED LOOP JOIN

As you can see, the directive forced the database engine to read from the tables as they are ordered in the from clause. Another example of directives using the ordered, index and avoid_index is: QUERY:
select --+ ordered index(customer, zip_ix) avoid_index(orders," 101_4")
customer.lname, orders.order_num, items.total_price
from customer c, orders o, items i
where c.customer_num = o.customer_num
and o.order_num = i.order_num
and stock_num = 6 and manu_code = "SMT"

Directives Examples : INDEX (cont.)
DIRECTIVES FOLLOWED:
ORDERED
INDEX ( customer zip_ix )
AVOID_INDEX ( orders 101_4 )
DIRECTIVES NOT FOLLOWED:

1)customer: INDEX PATH

(1) Index Keys: zipcode

2)orders: SEQUENTIAL SCAN
DYNAMIC HASH JOIN (Build Outer)

Dynamic Hash Filters:c.customer_num =o.customer_num

3)items: INDEX PATH
Filters:i.order_num =o.order_num

(1) Index Keys: stock_num manu_code
Lower Index Filter: (i.stock_num = 6 AND i.manu_code = 'SMT' )

NESTED LOOP JOIN

In this case we force the database engine to use the zipcode index on the customer table and avoid using the index on the orders table which forces a DYNAMIC HASH JOIN to occur. Again we are using the ordered directive which forces the database engine to read from the tables following the order of the from clause. This is obviously very inefficient but illustrates the use of the directives.

As you can see by the above examples, you can do many powerful things with directives, including making the database choose inefficient query plans. The directives are nice features and can really help developers in testing queries and resolving performance problems. However, they can also cause performance problems when not used properly and should only be used when other methods of optimization are unsuccessful.

Kevin Fennimore

1-888-UCI-FOR-U