January 1996 Newsletter
Volume 6, No. 1
Highlights of This Issue
Informix Invites You To View The Future of Technology
Announcing the International Informix Users Group, by Walt Hultgren
The Quality of Your SQL, by Gavin Nour
Forum 96 - Optimizing Your Informix Database and
Applications
Forum 96 is a one day event for programmers, database administrators, system analysts, and
managers who want to get the most out of their Informix applications and databases. Meet and
network with other Informix users, see the newest technology, and learn key technical tips that
will enable you to optimize your databases and applications. This is our third one day Technical
Forum. The past two Forums averaged over 200 participants, speakers offering many technical
tips, and exhibitors with the newest products. If you are using Informix 4GL, SE, OnLine 5.X, or
moving to NewEra or OnLine 7.X, the Forum will have something for you.
Location: Sheraton Premier at Tysons Corner
8661 Leesburg Pike, Vienna, Virginia
Date: Friday, March 1, 1996
Time: 8:00 am to 5:00 pm
Keynote: "Future Directions for Informix"
- Bob Macdonald, Vice President, INFORMIX Software Inc.
Technical Session Topics:
- Optimizing Informix 4GL
- OnLine 5.X Tuning
- 4GL to NewEra Migration
- OnLine 7.1 Tuning
- NewEra Implementation
- OnLine 8.X - New Features
Lunch Session: "New Features in NewEra 3.0"
- Anne Buzbee, Senior Marketing Manager for Tools, INFORMIX Software, Inc.
Closing Talk and Technical Q&A Session:
Jonathan Leffler, author of "Using Informix SQL", and one of the most helpful contributors to the
Internet newsgroup comp.databases.informix will be our closing guest speaker, and will respond
to your questions in our Questions and Answer session.
Lunch is sponsored by INFORMIX Federal Division
Breaks are sponsored by INFORMIX Commercial Division
Exhibitors
- Advanced DataTools Corporation
- Arbor Software Corporation
- BMC Software
- Concepts Dynamic, Inc.
- Crystal Services, Inc.
- INFORMIX Software, Inc.
- Performance Software, Inc.
- Platinum Technology, Inc.
- Pure Software, Inc.
- Pyramid Technology
- SelectStar, Inc.
- Summit Data Group
- The Technology Group
December User Group Meeting
Our December meeting featured presentations from Compuware Corporation, BMC Software, and
our new user group Web Pages, by Lester Knutsen. Compuware presented Eco-Tools which allows
Informix database administrators to pro-actively manage distributed, mission-critical Informix
databases. BMC Software presented Patrol for Informix OnLine databases to bring automated, 24
hour, event driven management to database administration. Then we explored the World Wide Web,
looking at Informix resources on the Internet, and introduced our user group Web pages.
WAIUG Newsletter Articles Around the World
Did you know that two of our WAIUG newsletter articles appeared in Informix Tech Notes? We are
also sharing articles with other user groups around the world and benefiting from articles from other
uses groups. Please let us know if you would like to contribute an article.
Newsletter Sponsorship
We would like to thank the following companies for sponsoring this newsletter:
Business Systems Support Group, Inc
Information Concepts, Inc.
Marriott
Price Waterhouse LLP
Summit Data Group
New Benefits for Members
In addition to this newsletter and our local activities, there is a new reason to be a member of the
Washington Area Informix Users Group. All current members will automatically become members
of the International Informix Users group for one year. Some of the benefits this includes are
discounts to the Informix World Wide User Conference in Chicago in July, and full access to the
members-only section of the IIUG Web Pages. Other discount programs are being worked on as
well. Have you renewed your membership for 1996? Renew your membership and register for
Forum 96 at the same time and save $10. See the registration form on the back of this newsletter.
Membership dues are $20.00. We also have a Corporate Membership Program. For membership
information, please call our Membership Director, John Petruzzi, at 703-490-4598.
Informix is the database company with its eye on the future - from its beginning as the pioneer of
database technology for UNIX to its position today as the acknowledged leader in scalable,
high-performance database technology for open systems.
Informix's acquisition of Illustra Information Technologies, Inc., will result in the creation of
INFORMIX-Universal Server, the first database server for handling all information in your enterprise.
Illustra has taken the power of relational database technology and applied it to the rich diversity of
information that is at the heart of next-generation application environments like the World Wide Web.
Its ability to manage and process any type of information - documents, photos, web html pages,
geo-spatial data, sound, and video - has made Illustra the database for cyberspace.
The combination of Illustra's sophisticated abilities with Informix's powerful features for
mission-critical transaction processing and data warehousing will make INFORMIX-Universal Server
the information foundation for the future of your enterprise.
To hear how the INFORMIX-Universal will benefit your organization, please join us on
February 6 at The Washington Press Club, Washington D.C. for a worldwide satellite
broadcast.
Seating is limited, so please RSVP at 1-800-529-6999.
by Walt Hultgren
I am pleased to officially announce the formation of the International Informix Users Group (IIUG)!
IIUG was formally established at the 1995 Informix Worldwide User Conference in San Jose,
California. A number of people from the Informix user community have been working for some time
on starting such a group, and it's exciting to have taken this first step.
Mission and Goals
The mission of IIUG is to support Informix users worldwide by working to achieve several broad
goals:
- To promote the growth of local and regional Informix User Groups;
- To provide a communication channel to disseminate news and other information efficiently to
the Informix user community;
- To give users a coherent collective voice to articulate the issues that they find important as they
utilize Informix products in their particular environments;
- To foster the professional development of individual Informix users through special training,
documentation, and other programs;
- To give an organizational home to those Informix users without access to a local group, and to
allow local group members to participate in a global organization.
A number of possible programs have been proposed to meet these goals. We will be investigating
these and other proposals to determine which will provide real benefit to IIUG members and to the
Informix community at large, yet will be feasible.
Board of Directors
During the inaugural meeting in San Jose, a President and Board of Directors were elected and
charged with the task of creating a Charter and a set of By Laws. They will also finish defining the
internal structure of the Group. The following people are guiding the operation of IIUG during its
first year:
President:
- Walt Hultgren, Atlanta, Georgia, USA
Board of Directors:
- Clem Akins, Muscle Shoals, Alabama, USA
- Tom Bondur, Fremont, California, USA
- Robert David, Sunnyvale, California, USA
- Carlton Doe, Salt Lake City, Utah, USA
- Cathy Kipp, Fort Collins, Colorado, USA
- Lester Knutsen, Washington, DC, USA
- Gavin Nour, Sydney, New South Wales, Australia
- Mike Reetz, Boulder, Colorado, USA
- Kerry Sainsbury, Auckland, New Zealand
- Nancy Twomey, Menlo Park, California, USA
- Malcolm Weallans, London, England
Each of these people brings unique strengths and experience to the Board.
Membership
Membership in IIUG is open to everyone within the Informix User Community, including end-users,
vendors, and employees of Informix Software, Inc. As a way of introducing IIUG to Informix users
worldwide, we have declared the period continuing through December 31, 1996, to be a Free Charter
Membership Period. During that time, membership in IIUG will be free of charge to anyone
completing a Membership Application or joining through their local user group. We'll post more
details soon on how to join IIUG.
Current Programs
Establishing the full set of activities and benefits that we envision for IIUG will take time. However,
there are several programs that we have already been able to put in place:
IIUG Internet Home Site
We have registered our own Internet domain, "IIUG.ORG", and have set up a home site on the
Internet, iiug.iiug.org (IP 128.117.71.100). This site can be referred to as simply "iiug.org" in most
cases. The two functions that iiug.org will serve initially will be to house our FTP archive and Web
pages. A separate message will be posted with more details about these facilities.
Resources for Local and Regional User Groups
There are a number of resources that IIUG is offering to local and regional Informix user groups free
of charge: a method for granting free IIUG memberships for all local group members at one time,
space on www.iiug.org for local group Web pages as well as assistance in developing them, space
on ftp.iiug.org for separate FTP archives maintained by local groups, plus other resources. We will
soon be sending details on these offerings to each local and regional group.
1996 Informix Worldwide User Conference
IIUG will be coordinating the various activities related to user groups at this year's Informix
Worldwide User Conference in Chicago. Many of these activities will be brought back from last year,
and there'll be some new features as well. Prominent among these will be registration discounts for
IIUG members, and a special IIUG Call for Papers.
Informix User Group Liaison
Informix Software has created a permanent position, the User Group Liaison, to help promote the
growth and success of all Informix user groups. For more information, contact Christine Shannon
by sending e-mail to "usergrp@informix.com". She has been working closely with IIUG to help get
us going, and has already contacted many local group representatives.
More Information
As we continue with the process of turning the International Informix Users Group into a first-rate
organization, we'll make sure to let everyone know of our progress through various channels,
including postings to this forum. For more immediate information, visit our Web site at
"http://www.iiug.org/" or send e-mail to "info@iiug.org".
It has taken a great deal of effort to bring IIUG this far, yet much formative work remains to be done.
If you would like to get involved in any of our activities, feel free to contact me directly.
Walt Hultgren
President, International Informix Users Group
Walt Hultgren Internet: walt@rmy.emory.edu (IP 128.140.8.1)
Emory University UUCP: {...,gatech,rutgers,uunet}!emory!rmy!walt
954 Gatewood Road, NE BITNET: walt@EMORY
Atlanta, GA 30329 USA Voice: +1 404 727 0648
The Quality of Your SQL
by Gavin Nour
Some organisations are suprised when I suggest that all SQL should be passed
through Quality Assurance (QA) by analysing the output of the command SET
EXPLAIN ON.
This article will discuss why we should QA SQL, explain the role of the
Informix optimiser (including how to read the output from SET EXPLAIN) and
offer some tips to improve SQL performance.
The subject should be pertinent to all development where an Informix database
is used, regardless of the tool. The SQL development could be via an Informix
tool such as 4GL, I-SQL, ESQL or NEWERA or it could be via a third party
product accessing an Informix database.
Firstly, why QA your SQL? Generally I recommend placing strict controls on ALL
SQL development. This becomes especially important today as we are finding the
size of databases are growing at an alarming rate. The larger our databases
become the more visible our mistakes become and the more problems we have.
Large databases are less forgiving, especially in the area of performance.
Paying careful attention in the development phase will of course reward us in
the future.
Quality assurance will not only help us avoid performance problems, it will
also help us avoid other serious problems such as filling the root database
space unnecessarily with large temporary tables (possibly causing other
INFORMIX applications to fail) or filling UNIX filesystem space with large sort
files (possibly causing any UNIX process to fail) or even filling our logical
logs with long transactions and having to restore the database from an archive.
One of the best ways to avoid such problems is to enforce SQL development
standards. Usually the best approach is to separate the SQL Quality Assurance
process from the Development process. Many large organisations do this and
have a team of staff devoted to general QA others have just one person. My
opinion is that usually the DBA is in the best position to QA SQL statements as
he/she is most familiar with the database and knows how the engine goes about
its task. Also the DBA should be aware of the SQLs being executed as this may
highlight the need to alter the database schema (eg.; add an index).
Regardless of who performs the QA, the most important thing is that it gets
done and that both the developer and the person responsible for QAing the SQL
understands what to look out for.
Undoubtably the best tool to use for optimising a query or for the QA process
is the command SET EXPLAIN ON.
We should remember first that although we can optimise the queries proper
database design plays a major part in the query performance and often the
redesign of tables and an effective indexing strategy will solve many problems.
Maybe a future article will discuss database schema
optimisation. This article
will focus on the SET EXPLAIN ON option and endeavour to explain (pun intended)
how this command can help you optimise queries. First we have to;
Understanding the Informix Optimiser
What is the optimiser?
The optimiser is the part of the INFORMIX engine that
has the job of trying to anticipate the best route to take before running a
given query. Because there are often many paths that can be taken to get to
the data, some paths will take much longer than others. This is especially
true when there are many tables involved in a join. Developers usually develop
the query and then leave the task of finding the best route to the optimiser
and assume that the engine will use the best method available. Unfortunately
because the optimiser does a good job most of the time we tend to forget that
the engine does not always make the right decision. Sometimes we know more
than the engine does and can see a better way of doing things. The output from
SET EXPLAIN ON explains how the optimiser has chosen to access the data. With
this information available we may discover the optimiser has chosen a path that
will be time consuming and we can takes steps to restructure the query.
Sometimes a small alteration will influence the optimiser to take an
alternative route. Other times we realise that an index can be added to
improve performance or we may just find out that it is fine the way it is but
we need to add more temporary space to cater for a sort.
How does the optimiser make its decision?
The answer to this is very complex.
The main aim of the optimiser is to reduce I/O by limiting the amount of data
to search through in order to obtain the requested data in the most efficient
way. The optimiser makes its decisions based on information in the system
catalogues. In version 5 this information consists of :
- the number of rows in each table used in the query (Systables.nrows)
- how many pages are used for data and how many pages are used for indexes
(systables.npused)
- whether columns values are unique or not.(sysconstraints)
- what indexes exist (sysindexes)
- whether the indexes are ascending or descending (sysindexes).
Not req in V7.
- whether the data is in the same order as the index; i.e., clustered
(sysindexes.clust)
- how many levels there are in an index
- what the second largest and second lowest values are for each column. This
gives the optimiser a rough idea of the range of values (syscolumns.colmin
and colmax). Version 7 can obtain more detail about the distribution of
data (sysdistrib table).
Using all of this information the optimiser will determine all possible routes
and then weigh each method with an estimated cost. The cost will be based on
several considerations including disk access, CPU resources required and
network access. In the process the optimiser will determine in what order to
join tables, whether to perform sequential scans; whether it should create
temporary tables; whether it can use an index for the select list, for the
filter, for the sort or for a group by etc. Once the optimiser selects a plan
which it believes is the most cost effective it passes the query on for
processing and if SET EXPLAIN ON is in effect the chosen method will be
recorded in a file.
To illustrate how important it is for the optimiser to have the right
information available we can look at a very simple query and a decision that
would have to be made by the optimiser to perform a join :
Lets say we have two tables, one with 500,000 rows (tab2) and the other with
200 rows, both with a unique index on the joining column. A simple
select to find related rows from the two tables would be:
SELECT * FROM tab1, tab2
WHERE tab1.col1=tab2.col2
Hopefully the optimiser would choose to select from the smaller table first and
then join to the second using the index. This would result in 1000 disk reads
(assuming one row per page and it takes three index reads and one data read per
row in table two). ie. 200 for table one plus 200 x 4 for table two.
Now, lets say the optimiser did not have accurate information available about
the number of rows and the number of unique values and chose to select from
table two first. This would result in two million reads! (assuming it takes
just two index reads and one data read per row in table 1). ie. 500,000 plus
500,000 x 3.
Of course if the optimiser had even less accurate information and didn't know
about an index and used a sequential scan for each row returned from table one
we would have to perform around one billion reads! (200 * 500,000).
We can see from this example using a very simple query that the wrong
information and the wrong decision could have dramatic effects on performance.
Eg; 1000 reads verses 1,000,000,000. If there are many tables involved along
with more filter conditions in the WHERE clause the decision process becomes
much more complex and the importance of accurate statistics is magnified.
How accurate are your statistics?
It is very important to remember that the
information in the system catalogues used by the optimiser is only updated when
the UPDATE STATISTICS command is executed. The optimiser is only as good as
the information it has been provided with, so, rule number one is: execute
UPDATE STATISTICS as regularly as possible. More often for very dynamic tables
than for tables which rarely change. If the system catalogue information is
out of date the optimiser may make the wrong decision and severe performance
problems could be experienced. It is surprising to see how many sites fail to
do this.
We can execute UPDATE STATISTICS for the whole database, for individual tables,
for columns in a table and for stored procedures.
What I usually recommend is that, instead of relying on memory, automate the
execution of UPDATE STATISTICS every night via a cron command. Be careful when
running the UPDATE STATISTICS command whilst other applications are running
because the system tables are momentarily locked when the catalogues are
updated. This can result in other processes receiving errors. Do not update
too often because the command takes time.
When does the engine perform the optimisation?
The optimisation occurs
whenever an SQL statement is prepared, providing there are no host variables.
If there are host variables the optimiser does not have all the information
required until the variables are passed (when the cursor is opened) and in this
case the optimisation occurs on the OPEN statement. So, with standard SQL
(i.e. not prepared) the query is optimised whenever it is executed. If an SQL
statement is used repeatedly within a program it is best to PREPARE the SQL
statement to improve performance. This way the SQL is only optimised once
within a program instead of every time it is used (unless specifically
requested to reoptimise by using the re-optimisation keyword). With Stored
Procedures the SQL is optimised when the procedure is created or when UPDATE
STATISTICS is executed for that stored procedure.
Suddenly slow?
Even after running UPDATE STATISTICS regularly in production we
may notice one day that a query which previously took 10 minutes suddenly takes
one hour. This can be the result of the optimiser suddenly choosing a new path
based on new information in the system catalogues. This may mean the query
needs restructuring to influence the optimiser to make a better decision.
Because of this we should have the ability to turn SET EXPLAIN ON for any
program in production instead of being forced to recompile the program to find
out the query plan. I usually recommend building a simple function into all
programs which enables either a parameter to be passed or an environment
variable to be set to turn SET EXPLAIN ON.
Development Verses Production
Remember, not only can the statistics used by the
optimiser change but also if you are testing a query in development on a small
database the statistics may be quite different to what is in production. A
query may appear to perform well in development but once it is in production
the optimiser may choose (rightly or wrongly) to take a totally different
route. The only way to get around this problem in the QA process is to run the
query against a test database which is of a similar size to production with the
same data, or alternatively (providing the database schemas are the same) one
might consider manually updating some of the columns in the system catalogue
tables after running UPDATE STATISTICS on the test database just so that the
optimiser makes the same decisions, however this is definitely not recommended
and INFORMIX may not support you if corruption is caused.
Controlling the optimiser
Optimisation level
In version 5.x and above we have the ability to influence
how much information the engine attempts to obtain. We can do this by using
the command SET OPTIMISATION HIGH or LOW. HIGH (which is the default) causes
the engine to examine all access paths and LOW causes the engine to eliminate
less likely options at an earlier stage in order to improve optimisation time.
LOW can be useful if many tables are used in a join (e.g. five or more) as the
optimisation time can hinder performance. The downside to using OPTIMISATION
LOW is that an access path may be eliminated too early when in fact it would
have been the most suitable option.
OPTCOMPIND
Beginning with version 7 we have a little more control over the
optimiser. Previous versions assumed that if an index existed this would be
the most cost efficient access path. Now we can specify that we would like the
optimiser to compare the cost of an index read verses the cost of a table scan
so that the most cost efficient path can be chosen. This is done by setting
OPTCOMPIND=2 (default). We can still emulate the functionality of previous
versions by setting OPTCOMPIND=0. Setting OPTCOMPIND=1 instructs the optimiser
to work like it would with it set to 2 unless REPEATABLE READ is set in which
case it works like it would if it was to 0. The reason for this option is that
a sequential scan with repeatable read will effectively place a shared lock on
the whole table (as it scans all rows) during the read. Usually undesirable!
Obtaining data distribution for the optimiser
Beginning with version 6,
INFORMIX introduced data distribution analysis and storage in order to help the
optimiser make more informed decisions. Data distribution analysis is used to
give the optimiser a better understanding of the values contained in a column,
e.g.; how many unique values there are in each area of the table. This is
done by sampling the data in a column and storing information about sections of
the table in various bins. This information can be extremely valuable to the
optimiser when dealing with large tables. To generate distribution for a
column the UPDATE STATISTICS command is used. The amount of data sampled is
controlled by the keywords MEDIUM and HIGH e.g. UPDATE STATISTICS HIGH FOR
table(column). MEDIUM merely samples the data and is very quick whereas HIGH
evaluates all the rows in the table and is therefore slower but more accurate.
LOW is like using previous versions of UPDATE STATISTICS and does not obtain
any distribution data at all. The distribution analysis can be influenced
further by the use of the keyword RESOLUTION and by specifying a resolution
value (number of bins) and a confidence value (level of sampling). Refer to
the manual for more information on using these parameters.
An UPDATE STATISTICS strategy for ONLINE DSA Version 7
The recommended UPDATE STATISTICS strategy for INFORMIX DSA version 7 is to do
the following (in the same order):
1) run UPDATE STATISTICS MEDIUM on all tables using DISTRIBUTIONS ONLY without
listing columns and using the default RESOLUTION parameters. Prior to 7.10.UD1
it is better to actually list the columns that do NOT head an index for better
performance. 2) run UPDATE STATISTICS HIGH for all columns that head an index
or the columns that are definitely part of a query. Execute a separate command
for each column. 3) run UPDATE STATISTICS LOW for all other columns in
composite indexes.
Step one is very fast as it just samples data, but still gathers useful
distribution information. This is used first to obtain information about all
the columns that are not in an index. From 7.10.UD1 it is not worth specifying
each column due to the speed of MEDIUM. Step two is to get as much
distribution information as possible for all columns which are important for
query performance (eg joining columns). Because the index can be used (in
version 7) the statistics can be gathered relatively quickly, however do note
that only one column should be specified at a time with
HIGH so that it can make use of an index. In the final step the remaining
columns in all the indexes which are not at the beginning of the index can be
referenced in the one statement (per table) using the LOW parameter.
Do not re-run UPDATE STATISTICS for large static tables.
Once the UPDATE STATISTICS steps have been completed you can view any data
distribution profiles by using the -hd option with dbschema, e.g.: dbschema -d
databasename -hd tablename
TIP: This command is a very useful tool to help determine a good fragmentation
strategy.
Examining the Optimiser's Choice
Once we are comfortable knowing that we have supplied the optimiser with enough
information we can see what query plan the INFORMIX optimiser has chosen. To
do this use SET EXPLAIN ON within the query or within the NewEra, 4GL or esql
program. When this is set the optimiser will write output for all queries (for
the same process) to a file called "sqexplain.out" in the
current directory
(usually, however the filename and location depends on the operating system and
whether the query is executed on a remote host). Once the query is submitted
it is at the point of no return. The only way to examine the optimiser output
without completing the query is to hit the interrupt key just after the query
starts (after the status line in dbaccess says explain set).
Typical SET EXPLAIN output:
QUERY:
------
select cust_id, order.* from orders, customers where order_date
> "01/12/1995" AND order_date < "01/01/1996" AND
customers.cust_id = orders.cust_id order by order_date DESC
Estimated Cost: 10
Estimated # of Rows Returned: 200
Temporary Files Required For: Order By
1) informix.orders: INDEX PATH
(1) Index keys: order_date
Lower Index Filter: informix.orders.order_date > "01/12/1995"
Upper Index Filter: informix.orders.order_date < "01/01/1996"
2) informix.customers: INDEX PATH
(1) Index keys: cust_id (Key-Only)
Lower Index Filter: informix.customers.cust_id = informix.orders.cust_id
Understanding the SET EXPLAIN output:
Query: {LOW}
This section of the output shows the actual query which was optimised. LOW is
displayed if SET OPTIMISATION is currently set to LOW. Note that sqexplain.out
is appended to if the file already exists.
Estimated Cost:
This value is simply a number the optimiser has assigned to the chosen access
method. The value is not meaningful except to the optimiser as it bears no
relationship to real time. It cannot be compared to the estimated cost of
other queries and is best ignored. It can be used however to compare changes
made for the same query (e.g.; an index change).
Estimated # of Rows Returned:
This is the optimisers estimate based on information in the system catalogue
tables. Remember that the catalogue information is fairly limited (especially
prior to version 7) so this value will often be inaccurate (more so if the
query involves a join). In ONLINE DSA Version 7 distribution information can
be obtained for the data which helps the optimiser estimate the number of rows
more accurately.
Temporary Files Required For: Order By Group By
When this is displayed there is a GROUP BY or an ORDER BY statement in the
query and the optimiser has determined that there is no corresponding index
available to obtain the data in the required order. A temporary file will be
created to order the result set. This could be very large (depending on the
size of tables) so check available disk space and be aware of the affect this
sort could have on performance. Indexes cannot be used when the columns to be
sorted are from more than one table. Note that in version 7 the optimiser can
choose to traverse an index in the direction of the ORDER BY regardless of
whether the INDEX is in the same order as the ORDER BY. Prior to version 7 the
ability of the optimiser to use the index for an ORDER BY was dependant on
whether the ASCENDING/DESCENDING values on the index and the ORDER BY matched.
1) owner.table: INDEX PATH (Key-Only)
This is the table that the optimiser has chosen to read first (indicated by the
1 ). Subsequent table accesses (for a nested loop join for example) will be
displayed further down in the explain output and will be indicated by a higher
number. For each row returned at this level the engine will query the tables
at a lower level. INDEX PATH indicates an index will be used to access this
table.
The (Key-Only) notation (with ONLINE only) indicates that only the index will
be read and the actual data value (row) will not be read from this table. Key
only access would generally be very efficient (prior to version DSA 7.2 ) due
to the smaller size of the index compared to the row. Not only is the read for
the data row eliminated but more index key values are likely to fit on the one
page which in turn reduces I/O. This type of access is only achieved if no
columns are selected from the same table. Tip: Avoid the use of SELECT * if
possible and only select the required columns. Note that with ONLINE DSA 7.2
key only reads are in fact slower in most cases due to the read ahead
capabilities.
(1) Index keys: column_name Lower
Index Filter: owner.table.column > x Upper
Index Filter: owner.table.column < y
Column_name is the name of the column to be used in the INDEX PATH read
Lower Index Filter shows the first key value (x) where the index read will
begin. Upper Index Filter shows the key value (y) where the index read will
stop.
1) owner.table: SEQUENTIAL SCAN (Serial, fragments: ALL)
In the above case all rows will be read from this table using a sequential
scan.
The section in brackets relates to version 7. If Parallel is displayed
instead of Serial then the engine will perform a parallel scan (this
behaviour is influenced by the PDQPRIORITY setting). The ALL notation
indicates all fragments have to be scanned because the optimiser could not
eliminate fragments after examining the WHERE predicates. NONE indicates the
opposite, that is the optimiser has eliminated all fragments and therefore none
need to be examined. A number (or list of numbers) indicates the engine will
only examine the fragment/s listed (numbers are relative to the order in the
sysfragments table) Pay special attention if the sequential scan is being
performed at a lower level in the query plan (indicated by a higher number) as
this could mean the whole table is being scanned for each row returned in a
previous step. Often this is one of the warning bells when optimising/QAing a
query. Sequential scans are not so bad when they are for small tables or when
they are in the first step of a query plan providing the engine does not have
to scan a large table in order to retrieve just a fraction of the table.
AUTOINDEX PATH: owner.table.column
Used more in Version 4. In order to avoid sequential access a temporary index
is built on this column to perform a join. Used if an index does not exist on
this join column and is generally an indication that you need a permanent
index.
SORT SCAN: owner.table.column
Used in combination with a sequential scan when no index is available on the
join column. This column will be sorted for later use with a join (See MERGE
JOIN).
MERGE JOIN Merge Filters: owner.table.column =
owner.table.column
A Merge join will be used to join the results of the two previous selections
sets which were prepared for a join. Once the join columns are obtained in the
appropriate order (possibly via a SORT SCAN if an index does not exist) the
server sequentially reads both result sets and merges them before accessing the
rows. Considered faster than a nested loop join in many cases.
DYNAMIC HASH JOIN (Build Outer)
Dynamic Hash Filters: owner.tab1.col = owner.tab2.column
...
Version 7 only. A Hash join will be used to join the two preceding tables in
the explain output. The Build Outer notation tells us which table is used
first. The Filter shows how the tables will be joined. Where some complex
queries cannot use an index the hash join takes over. A Hash join is also used
instead of a sort-merge join and is considered more efficient. Whereas a
sort-merge join sorts both tables a hash join typically only sorts one. Hash
joins are favoured with large amounts of data especially for PDQ with
fragmentation. Rows are placed in a hash table after using an internal hash
algorithm. The cost of a hash join can be lower than using an index,
especially when more than around 15% of data from a large table needs to be
scanned. When the data is not clustered (in a different order to the index)
the cost of traversing the index in addition to retrieving the actual rows (in
a different order) is quite high compared to a table scan with a hash join.
OPTCOMPIND=2 (which is default) will cause the optimiser to consider hash joins
instead of an index. Note OPTCOMPIND ought to be set to 1 when REPEATABLE READ
is being used (and arguably should be the default).
SQL Query Quality Assurance and Optimisation
(Some Tips, Some Warnings, Some Things To Check and Some Things to Avoid):
- Avoid sequential scans on a large table if it is not in the first position
of the query plan as the whole table scan is repeated for every row in the
preceding table. This can severely affect performance. This not only affects
the query in action but can also impact other tasks by changing the recently
used pages in shared memory (as it keeps reading the whole table into shared
memory). This increases disk I/O all round as other processes are forced to
read from disk. Consider adding an index if the query cannot be restructured
and performance is slow.
- Avoid potentially large temporary sort files. This can consume all
available CPU resources, increases disk I/O and consume all available disk
space. Consider adding an index on the columns being sorted. A hint: if the
optimiser is not using an existing index for the ORDER BY this may be because
the column being ordered is not in the WHERE clause. In this case the
optimiser can be influenced to use the index instead of creating a temporary
file by adding a dummy WHERE condition on the ORDER BY column (e.g.; AND
order_num>=0). Prior to version 7 the index has to be in the same order as the
ORDER BY. When sort files are used check $PSORT_DBTEMP and $DBSPACETEMP
settings, these can help significantly to improve performance by enabling the
engine to use more than one disk for the sort file.
- Correlated subqueries. Subqueries that make reference (in the WHERE
clause) to a selected column in the main query can severely affect performance.
This causes the subquery to be executed repeatedly for each main table row. Be
very cautious when using statements like EXISTS with large tables as logical
logs can fill very quickly in the process. Temporary table space is allocated
and logged for each row in the main table (even if no rows are returned in the
final select). The worst affect of this is filling the logs and having to
restore from an archive. Correlated subqueries should be rewritten to use a
join wherever possible. Some more complex rewrites may involve joining to a
temporary table.
- OR statements on different columns can prevent the optimiser from being
able to use an existing index. If an index does exist and the optimiser
chooses a sequential scan in the query plan, consider using a UNION statement
(one for each OR condition) as this will provide the opportunity for the
optimiser to use the index.
- If the query is slow even when an INDEX PATH has been chosen on all levels
of the query plan, do not assume the optimiser has made the right decision.
Check the query plan to see if the tables are being filtered in the right
order. The aim is usually to eliminate as many rows as possible in the early
stages of the query, but unfortunately the optimiser does not always have
enough information to do this correctly (especially in versions prior to 6).
Using UPDATE STATISTICS HIGH in version 6 and above will give the optimiser
more information about the data distribution so that the right tables are
eliminated first. Also note the INDEX reads are not necessarily the best. For
example a sequential scan of a table can be faster than an index scan if the
data pages have to be retrieved and the pages are not in the same physical
order.
- Converting data types and comparing character columns is very expensive.
E.g.; tab.character_col=tab2.integer_col. If possible consider changing the
column types to be numeric. Remember if the join columns are character types
the columns have to be compared byte by byte for each row. Although version 7
handles conversions better, the cost of the conversion overhead is still not
considered by the optimiser.
- Look out for the WHERE predicates which may not be able to use indexes.
These include: OR, LIKE or MATCHES with a wildcard at the beginning (e.g.;
MATCHES *NOUR), functions (e.g.; MONTH, DAY, LENGTH), negative expressions (
e.g.; != NOUR), non initial substring searches (e.g.; postcode[4,5]>10).
- Except for very old versions of the engine the order of tables in a SELECT
list or the order of elements in the WHERE clause will not have an effect,
however in some cases when the optimiser believes the cost is the same for two
different paths, the order of the statements in the WHERE clause may have an
effect. Some tricks have been suggested in the past (such as repeating
predicates in the WHERE clause) to try and force the optimiser to use a
particular index, however this is no longer recommended and will not work with
future versions. E.g. with version 7 the optimiser query rewrite feature
eliminates duplicates in the WHERE clause.
- Avoid long transactions in logged databases. We probably all know by now
that long transactions threaten to destroy our databases by filling the logical
logs. Warning bells should ring with the following statements: LOAD
statements, INSERT INTO xx SELECT yy FROM zz, UPDATE or DELETE statements
spanning many rows. As an aside consider locking the table in exclusive mode
to avoid excessive lock usage (hindering performance and even worse running out
of locks).
- Check the placement of WHERE conditions with joining columns to see if all
combinations of the WHERE predicate have been included. For example the WHERE
predicate; tab1.x=tab2.x and tab1.x >1000 would probably cause the optimiser
to use the index on tab1.x, however the index on tab2.x might be much more
appropriate. Adding the condition; and tab2.x > 1000 would give the optimiser
more choices to evaluate. Another example is tab1.x=tab2.x AND tab2.x=tab3.x.
The optimiser in ver 5 would not consider joining tab1 directly to tab3, so
adding tab1.x=tab3.x would again provide more choices. Note however that the
optimiser in ONLINE DSA V7 rewrites queries so that all possible combinations
are examined and these suggestions are not applicable.
- Select only the columns required, this reduces the communication between
the front end and backend and reduces I/O. Avoid the temptation to SELECT *
(all columns when they not required).
- Use temporary tables when a subset of data is being reselected with
different WHERE predicates. For example a SELECT with the WHERE clause:
orderdate> 01/12/1995 and x=1 might be followed by a SELECT with a WHERE
clause: orderdate> 01/12/1995 and x=2 in this case if the table is very
large performance could be improved by first selecting all rows WHERE
orderdate>01/12/95 into a temporary table and then performing subsequent
selects on the temporary table.
- Use temporary tables to influence the optimiser to take the route that you
know is best. This can be done by first selecting the rows you want from a
large table into a temporary table and then joining the temp table to the rest
of the tables. The optimiser may use a different query plan knowing that the
temporary table is much smaller than the original large table.
- Consider creating indexes on temporary tables. Often overlooked just
because the table is temporary.
- When using temporary tables for subsequent selects with ORDER BYs, create
the temporary table using an ORDER BY. E.g.; SELECT x FROM y ORDER BY x INTO
TEMP temp_tab. This syntax is not available in earlier versions of the engine
(pre 4.1).
- Consider running UPDATE STATISTICS (within the application) for large
temporary tables. Again this is often overlooked just because the table is
temporary.
- Use the WITH NO LOG statement when explicitly creating temporary tables,
this will help performance by eliminating the overhead of writing to the
logical logs. This also avoids the possibility of creating a long transaction
with a large temporary table. In version 7 temporary tables can be created in
a special dbspace which is not logged. Use this feature whenever possible.
- Time commands with the UNIX time or timex command or use; SELECT
CURRENT FROM systables WHERE tabid=1 before and after the query being
tested/optimised.
- When testing/timing queries remember that the second time the query is
executed it is much more likely to be faster because the pages read the first
time round will probably still be in memory. Be aware that this may distort
the test results. Use tbstat -p to monitor disk/buffer usage. When timing
queries consider restarting the instance between tests in order to re-
initialise shared memory.
- Fragmentation and Parallel queries. It is important to fragment the data
in a way that makes it easy for the optimiser to determine which fragments are
active. Making the WHERE predicates clear enough to enable optimiser to
eliminate fragments will reduce I/O and enable parallelism. PDQ and
fragmentation is a whole new subject in its own right (perhaps another
article). Just remember if you are using ONLINE DSA version 7 the trick is to
divide and conquer!
To Summarise
The Informix Cost based optimiser does an excellent job in doing
what it is supposed to do, that is shielding us from the task of having to
think about the best way to retrieve our data. For this reason we often take
it for granted. Just imagine if we had to make the same complex decisions as
the optimiser does every time we wanted to access our data. Nevertheless, it
is important that we understand the optimiser and the consequences of the
decisions it makes. More importantly we need to provide the optimiser with the
information it needs to perform its job effectively.
Informix are constantly refining the optimiser and finding new ways to improve
performance. Some of the comments made in this article may not be valid in
future releases. However this article was checked for accuracy (in Nov 95) by
the brains behind the optimiser (at INFORMIX, Menlo Park, California) and I
wish to publicly thank that person for his valuable time.
I hope this article has been of value to you. Any criticisms, comments or
suggestions would be welcome.
Gavin Nour is the Director of Select Software Solutions Pty Ltd. He has ten years Informix experience
covering a variety of Industries and applications. He was a DP Manager for a leading financial
organisation in the United Kingdom, a technical Support Manager for a leading Informix Distributor
in Australia and spent two years at a leading Informix Software house in Australia before forming his
own company specialising in Informix services.
Gavin is committed to the Informix solution and to helping the Informix community. He is the Secretary
of the NSW Informix User Group and is on the board of Directors for the International Informix User
Group.
Gavin can be contacted in one of the following ways:
Mobile: 0419 803 113 (Australia)
Fax: +61 2 314 6310
Email: nourg@acslink.net.au
Washington Area Informix User Group Web Pages and On-Line
Newsletters
by Lester Knutsen
We are pleased to announce that we have opened a Web Page and converted all our newsletters from the past three years to
WWW documents and that they are available for on-line viewing. This is the place to find the lastest news about our meetings,
future plans, links to other Informix related material and catch-up on back issues of our newsletter. The following is the index
of our newsletters from the past three years that are available on-line. Visit our Home Page at
http://www.access.digex.net/~waiug. (The index is not duplicated here.)
This Newsletter is published by the Washington Area Informix Users Group.
Lester Knutsen, President/Editor
Washington Area Informix Users Group
4216 Evergreen Lane, Suite 136, Annandale, VA 22003
Phone: 703-256-0267
lester@access.digex.net
|