This article was originally written for IDS, I have made some additional comments for XPS.
Introduction
Over the past several months I have found myself repeatedly explaining the concept of tuning a decision support system (DSS) query for an Informixâ database. I figure that I had better jot this down so that I can just hand it out and avoid spending all the time explaining it over again.
Data warehouses and very large databases tend to be used differently than traditional OLTP database systems. In an OLTP environment, the goal is to execute lots of discrete transactions in a short time frame. OLTP administrators support hundreds or thousands of users, and focus on very specific portions of the data (i.e. what is important is being able to update item 4 of the customer’s order as well as the balance in their order master record). An OLTP transaction is most probably performed with one or more UPDATE statements WHERE cust_id/ord_no = ?.
In a DSS environment, you are more likely to have only a handful of users, each of whom are making queries against a large volume of data. In an OLTP model, users like this would be severely chastised. However, in the world of DSS, this sort of activity is encouraged. Typically, a DSS query will scan every row of one or more tables. Also typically, where the size of an OLTP system might be measured in terms of hundred’s of megabytes, a DSS system will be measured in terms of hundred’s of gigabytes.
DSS systems tend to be queried by front-end tools such as Business Objects or SAS, where the SQL to perform the query is generated by the tool; you frequently do not have much control over the query itself.
If you try to support a DSS query using the same data
model, indexed reads, and engine tuning used in an OLTP system, it is likely
that your response time to such a query could be measured in days. In a
recent case, the response time for such a
query was 40 days. In this particular case, using the tuning methods detailed
in this article, I was able to bring the query result time down to about
15 minutes. All without changing the query or changing the schema. This
article will examine specific tuning procedures for DSS queries in order
to shorten their response times.
An Example Query
Imagine you have a query where you join two tables to return a count of last name by state, where name is in Table 1 and address (state) is in Table 2:
select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1,2
Table 1 has 29 million rows and Table 2 has 27 million rows. Table 1 is 4.5 GB and Table 2 is 6GB. Each table has been placed into a single dbspace. There is no index on either key in either table. The machine in use is an 8 processor Sun 6500 with 4GB of memory. A query like this run in an OLTP environment could take upwards of 70 or 80 hours to execute.
Before examining tuning procedures to optimize this DSS
query, you need to understand how to utilize specific monitoring tools
available to you in order to learn how the query will behave. There are
five monitoring tools that are very helpful in examining a query.
Explain Plan
The first monitoring tool is the explain plan. This can be obtained by including a SET EXPLAIN ON; command at the start of your query. The optimizer will then create, or append to, a file called sqexplain.out in which it explains how it plans to resolve the query.
QUERY:
------
select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1,2
Estimated Cost: 121157
Estimated # of Rows Returned: 1
1) informix.b: SEQUENTIAL SCAN (Serial, fragments: ALL)
2) informix.a: SEQUENTIAL SCAN (Serial, fragments: ALL)
Temporary table required for group operation
DYNAMIC HASH JOIN (Build Outer)
Dynamic Hash Filters: informix.b.hh_id = informix.a.hh_id
Notice that the Estimated Cost is very low; somebody forgot to UPDATE STATISTICS.
The initial section of this EXPLAIN plan is the query itself followed by an estimated cost which is derived from the number of instructions the optimizer thinks will be required to resolve the query. The value given here is way too low for this query (a value in the billions would be more normal).
The estimated number of rows returned is an indication that the optimizer thinks that at least one of the tables is empty. A minimum of one row is always returned from a count(*) operation.
Each of the tables will be read using a sequential scan, probably because it costs nothing to read an empty table. The "Serial" is an indication that each fragment of disk that the table resides on will be read sequentially as opposed to the desired "Parallel". No fragments have been eliminated from the query (hence, the "fragments: ALL").
The Temporary table is, of course, required to perform the actual group by operation. A Hash Join will be used to join the two tables, again because the optimizer thinks one of the tables is empty and because here there are no indexes.
What we would really like to see is a higher estimated cost, a higher estimated number of rows (if UPDATE STATISTICS high had been run for state_cd and last_name, we might even get a correct number here). We like the Sequential scan, although we really want a Parallel, not Serial, scan. If we had been able to eliminate some of the fragments through a WHERE condition, we might also have seen only a list of the fragments to be scanned. Finally, we would also like to see a line like:
Maximum Threads: 21
We would actually like to see a higher value indicating more parallelism, but for the sake of this example we will say 21.
This would indicate that we are using PDQPRIORITY. The exact number of threads will vary with the number of dbspaces each table is spread across, and the number of threads assigned to each additional task. As can be seen in the onstat -g sql output below, we have 9 scan threads for one table, 11 scan threads for the second table and a single sqlexec thread to pull it all together. With a higher PDQPRIORITY setting we would see more specialized threads for grouping and sorting.
These key pieces of information can tell you that something
is wrong with the way the optimizer has chosen to solve the query.
Xtree
While the explain plan can show you how the query will be answered, Xtree allows you to monitor the query as it is being executed. You can see how many threads have been initiated for each segment of the process as well as determine a run rate for the query.
Figure 1 below shows an Xtree screen shot for a typical hash join. At the bottom of this figure are two sets of scan threads, the one on the right has 9 threads, the one on the left is running with 11 threads. The table on the right was read into memory in the hash table, the table on the left is being scanned and matched against this hash table while being read.
The numbers to the right of each box indicate how many threads are in use to perform this portion of the query. Within the box, below the meter, is a number indicating how many rows have been processed by this portion of the query in the last second. At the top of the box is the number of rows processed so far. About 114,000 rows were read from the table on the left in the last second, with about 89,000 matches performed in that same second. When a hash table swap occurs, all of the process per second counters will drop to zero; by watching these you can figure out how often and how long the process is spending in a hash table swap.
The numbers next to the scan boxes should match the number of dbspaces the table is fragmented across. On this machine, I would expect to see on the order of 50,000 to 70,000 rows scanned per fragment per second. If this number drops off at the end then one or more of the fragments probably completed before the others; if this occurs in the last 5-10 seconds it’s no big deal, but if this drop off occurs half way through the query you might want to check the fragmentation scheme.
The filter box should be lowering the number of rows scanned
if in fact you are using a filter (e.g. where state_cd
= ‘MA’). The total number filtered should keep pace easily with
the scan rate.
(Insert Figure 1: Xtree Hash Join)
The following Xtree screen shot shows a typical nested
loop join (Figure 2). The table on the left is being sequentially scanned;
a probe into the second table is then performed for each row from this
first table. As you can see, during the second that this snapshot was taken,
the query joined all of 1373 rows, which is not very good. Here, the join
itself limits the scan rate. Also interesting, you can see that more rows
have been joined than read, an indication of duplicate rows. (That’s ok,
multiple people live at the same address).
(Insert Figure 2: Xtree Nested Loop Join)
onstat –g ses and onstat –g sql
onstat –g ses and onstat –g sql (below) reveal some key numbers—specifically, how many and what type of threads will be used to execute the query, as well as how much memory your query has been allocated.
session #RSAM total used id user tty pid hostname threads memory memory 320 informix 6 25979 foo 21 352256 278408 tid name rstcb flags curstk status 48671 sqlexec e041518 ---P--- 1984 e041518 ready 48672 scan_1.0 e03fb98 ------- 984 e03fb98 sleeping(secs: 3) 48673 scan_1.1 e040c98 ------- 984 e040c98 running 48674 scan_1.2 e045918 ------- 984 e045918 running 48675 scan_1.3 e046e58 ------- 984 e046e58 sleeping(secs: 3) 48676 scan_1.4 e0443d8 ------- 984 e0443d8 ready 48677 scan_1.5 e044818 ----R-- 984 e044818 running 48678 scan_1.6 e044c58 ------- 984 e044c58 ready 48679 scan_1.7 e045098 Y------ 936 e045098 cond wait(await_MC1) 48680 scan_1.8 e048398 Y------ 936 e048398 cond wait(await_MC1) 48681 scan_2.0 e042618 Y------ 936 e042618 cond wait(await_MC2) 48682 scan_2.1 e043718 Y------ 936 e043718 cond wait(await_MC2) 48683 scan_2.2 e0410d8 Y------ 936 e0410d8 cond wait(await_MC2) 48684 scan_2.3 e0421d8 Y------ 936 e0421d8 cond wait(await_MC2) 48685 scan_2.4 e046a18 Y------ 936 e046a18 cond wait(await_MC2) 48686 scan_2.5 e041d98 Y------ 936 e041d98 cond wait(await_MC2) 48687 scan_2.6 e047298 Y------ 936 e047298 cond wait(await_MC2) 48688 scan_2.7 e045d58 Y------ 936 e045d58 cond wait(await_MC2) 48689 scan_2.8 e0454d8 Y------ 936 e0454d8 cond wait(await_MC2) 48690 scan_2.9 e03eed8 Y------ 936 e03eed8 cond wait(await_MC2) 48691 scan_2.1 e0487d8 Y------ 936 e0487d8 cond wait(await_MC2) Memory pools count 1 name class addr totalsize freesize #allocfrag #freefrag 320 V e2de018 352256 73848 1420 14 name free used name free used overhead 0 120 scb 0 96 opentable 0 31208 filetable 0 8384 log 0 44856 temprec 0 17680 ralloc 0 38560 gentcb 0 18728 ostcb 0 2024 sqscb 0 7664 rdahead 0 1600 xchg_desc 0 7392 xchg_port 0 2312 xchg_packet 0 5280 xchg_group 0 144 xchg_priv 0 1352 hashfiletab 0 5880 osenv 0 1672 buft_buffer 0 19224 sqtcb 0 24192 fragman 0 36720 shmblklist 0 3320 Sess SQL Current Iso Lock SQL ISAM F.E. Id Stmt type Database Lvl Mode ERR ERR Vers 320 SELECT hercules NL Not Wait 0 0 7.24 Current statement name : slctcur select a.state_cd, b.last_name, count(*) from tab1 a, tab2 b where a.key = b.key Last parsed SQL statement : select a.state_cd, b.last_name, count(*) from tab1 a, tab2 b where a.key = b.keyThe output contains a lot of information, probably more than you need, but there are several key things to look for. The Current Statement name provides you details on the SQL of the query any session is actually running. You can see how many threads and what types were kicked off (e.g. scan_2.3 et al, we got 9 scan threads against one table, 11 against the other and a single sqlexec thread which will perform the hash join with a PDQPRIORITY=1 (since I started the query, I know this). You can also see how much memory has been allocated for the query and is being used under Memory pools and on the top line (Total Memory/Used Memory). This information will confirm what threads have been kicked off for a given query. If there are not multiple scan threads for each table (scan 1.0 through 1.8 is on one table), or if there are NO scan threads for a given table, you know the optimizer may not be doing what you have wanted.
onstat –g mgm
The onstat –g mgm (rgm under XPS) command will also show you memory, but it is especially important for multiple queries as it shows which query is waiting for resources. Memory (Resource) Grant Manager will show what queries are actually running and with what actual PDQ setting, as well as which queries have been "gated" or are waiting for resources (memory or PDQ for example) to run.
For this article I did not provide a snapshot of an onstat
–g mgm, but give it a whirl and read the manual on it—it is fairly
self evident. Of particular interest are the gating factors and what queries
get held up and why (so you can go kick Bob for running everything at PDQPRIORITY=100).
onstat –g lsc
The onstat –g lsc (scn for XPS) command will show if there are any light scans. If there is no output from this command, there are no light scans. Light scans are frequently crucial to DSS queries as we shall see below
onstat –g xqp and xqs (XPS specific)
Under XPS there is no xtree tool. There is however much more detailed information available than from xtree. From the onstat –g rgm output you will see a ‘Plan’ number. Plug this into onstat –g xqp plan_no and out will come the plan that will be used to resolve the query:
XMP Query Plan oper segid brid width misc info ----------------------------------------- scan 4 0 16 tab1 group 4 0 16 group 3 0 16 group 3 0 16 group 2 0 16 insert 1 0 32In this case the query is:
insert into tab_2 select col1, count(unique col2) from tab1 group by 1
This is running on an XPS 8.30.UC2 with 16 processors and 8 GB of DS memory.
Operations are performed in descending order based on segid. So the first operation is the scan/group. The group itself is reported twice, once for an input and once for an output. In this case, the ‘unique’ claused forced a second group. Finally the data is inserted into a final table.
Onstat –g xqs for this plan_id shows:
In the interests of brevity, I have deleted all but the first and last three lines of each segment.
XMP Query Statistics
XMP Query Statistics Cosvr_ID: 1 Plan_ID: 11961 type segid brid information ---- ----- ---- ----------- scan 4 0 inst cosvr time rows_prod rows_scan ---- ----- ---- --------- --------- 0 1 1916 39433808 39433808 1 1 1919 39471825 39471825 2 1 1903 39422536 39422536 …… 13 4 1930 39410387 39410387 14 4 1936 39364283 39364283 15 4 1936 39375704 39375704 -------------------------------------- 16 630463773 630463773In this segment, there were 16 threads initiated. Each read 39.4 million rows in about 32 minutes (see the column ‘time’ it is in seconds). Not great. There was no filter, so the rows_prod is equal to the rows_scan. You might also note the cosvr column – if you know what it means fine, otherwise ignore it.
group 4 inst cosvr time rows_prod rows_cons mem ovfl ---- ----- ---- --------- --------- --- ---- 0 1 1916 39433808 39433808 104 0 1 1 1919 39471825 39471825 104 0 2 1 1903 39422536 39422536 104 0 ….. 13 4 1930 39410387 39410387 104 0 14 4 1936 39364283 39364283 104 0 15 4 1936 39375704 39375704 104 0 -------------------------------------------------------- 16 630463770 630463773 (2048)
In this segment we pushed
rows into the first group (unique) operation. Again, 16 threads, same amount
of time – which was concurrent with the scan time – number of rows produced
is identical to the number of rows pushed in (rows_cons). Amount of memory
for this was 2K (across 16 threads).
group 3 inst cosvr time rows_prod rows_cons mem ovfl ---- ----- ---- --------- --------- --- ---- 0 1 12639 39059408 39432594 49345600 199 1 2 12676 39071879 39442561 49328000 199 2 3 12203 39054649 39429975 49336000 199 ….. 13 2 12672 38998788 39393276 49350400 199 14 3 12103 39003511 39390257 49345600 199 15 4 11942 39011955 39381624 49329600 199 -------------------------------------------------------- 16 624401945 630463770 (3958528)This segment is the second half of the above group segment, and is where the unique rows come out. Notice that this phase used almost 4GB of memory. Notice also that it overflowed. The ovfl is not a count of how many pages overflowed, but of how many segments. It’s hard to tell from this whether the overflow was serious. However, given that the time for this segment took on the order of three hours, I would hazard that it was serious. Obviously this step hurt. This is probably what gated the scan.
group 3 inst cosvr time rows_prod rows_cons mem ovfl ---- ----- ---- --------- --------- --- ---- 0 1 12639 39059408 39059408 104 0 1 2 12676 39071879 39071879 104 0 2 3 12203 39054649 39054649 104 0 ….. 13 2 12672 38998788 38998788 104 0 14 3 12103 39003511 39003511 104 0 15 4 11942 39011955 39011955 104 0 -------------------------------------------------------- 16 624401944 624401945 (2048)
This is the second group
necessitated by the count(*)/group operation. Again, this is just pushing
rows into group operation. All of this time is concurrent with previous
processes.
group 2 inst cosvr time rows_prod rows_cons mem ovfl ---- ----- ---- --------- --------- --- ---- 0 1 13203 21639904 39058139 43422720 199 1 2 13136 20410129 39069022 40452896 199 2 3 13201 22846215 39073721 46132152 199 ….. 13 2 13225 21678939 39022613 43424128 199 14 3 13180 21694541 39009538 43659528 199 15 4 13184 22908683 39014754 46380352 199 -------------------------------------------------------- 16 352937467 624401944 (3958528)This is the second phase of the second group operation. Because of the way this query was crafted, the other 4GB of memory was used for this operation – once strategy to solve this performance problem would be to separate the two groups – perform a select unique into one table and then perform the count off of it.
insert 1 0 inst cosvr time it_count ---- ----- ---- -------- 0 1 2380 10979878 1 2 2380 10979878 2 3 2380 10979878 ….. 29 2 2380 10979878 30 3 2380 10979878 31 4 2380 10979878 -------------------------- 32 351356096
The final insert phase of the query, writing to the
temp table. This phase of the operation was not started until the second
group started issuing output, hence the time is much lower.
Running the Example Query
Frequently, tuning a DSS query is merely a matter of shifting from "OLTP-think" to "DSS-think" which is exactly what we will do in this case.
The optimizer should choose to solve the example query with a hash join. If it does not, then there may be another issue, like the optimizer thinks one of the tables is empty (need to run UPDATE STATISTICS) as is the case in our query here. Let’s assume that PDQPRIORITY is NOT set to anything, which is typical of an OLTP environment. This means that it is effectively off. What will happen when the query is executed?
With these settings, the engine first reads the smaller table (Table 2), applies any filters, and builds a hash table entry for each record. It builds this hash table in memory. It then reads the large table and matches it up against the hash table. As it finds matches it pushes the matched rows out the other side of the join.
Note: If the optimizer determines that a filter will make the result set from the larger table smaller, it may choose to read the larger table first. Of course, the optimizer would need some distribution information to make this determination more accurately, and this can only come by running UPDATE STATISTICS high or medium.
After 37 minutes into this query, Xtree shows both tables
have been scanned and have moved into the hash join. Because insufficient
memory was allocated, the hash join is stalled while it swaps hash table
pages from the temp disk. As more and more memory is needed to build the
hash table, more and more available memory is used up. Since the query
is executing in an OLTP environment, the LRU buffers are filled with all
the data from the table along with the associate buffer management overhead,
spoiling everybody’s day (or rather week/month). Since the hash table memory
requirements are so large, the query probably overflowed to the temp space(s)
and, if these weren’t set up well, took away the temp space for everyone
else as well. And, after running for several days, you would probably just
kill the query anyway.
Basic Tuning Procedures
There are three basic time-consumptive elements to the example query. A table scan, a second table scan, and the join. When a nested loop is used (where the larger table is read using the index for every record in the smaller table) the join is part of the second read. The join from a hash join occurs while the second table is read. In most cases we want to use a hash join.
When tuning this query, it helps to approach it in two parts: the table scans and the join. To begin, let’s obtain a benchmark of how long it takes just to scan each table. This will not only give us the ability to estimate query time but also highlight when a query slows down, as well as ensure that we can read the tables themselves fairly quickly.
SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;
SELECT SUM(some_column) from tab1;
SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;
SELECT SUM(some_other_column) from tab2;
SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;
Results: 17 minutes for Table 1, 20 minutes for Table 2. Roughly 30,000 rows per second. These results are interpolated. The actual read time was 37 minutes for both tables.
The first tuning objective is to improve the read performance.
With all of the data being in one dbspace, the engine is limited
to a single thread to read the entire table. If the data is spread over
multiple dbspaces the database can kick off one thread per dbspace,
or fragment. Using ipload you can generate and run a job which will
unload (and reload) each table. You can then drop each table, recreate
them fragmented over several disks, and reload them. In this particular
instance, the fragmentation expression is not that important, so you can
just use round robin. If it takes 20 minutes to read a table in one dbspace,
spreading the table across 4 dbspaces should bring the scan time
down to about 5 minutes. Spreading the table across 8 dbspaces should
lower the time to about 2.5 minutes.
PDQPRIORITY
Now that we have fixed the tables, the most obvious next step is PDQPRIORITY. With PDQPRIORITY off, Informix Dynamic Serverä is doing nothing in parallel (Note, with XPS PDQPRIORITY does not have as much control, parallelism is still used with it set to 0, PDQPRIORITY now becomes important for allocating memory). One thread is doing all of the reading. If you assume that it takes 5 minutes to read the data from one disk, and that there are 16 dbspaces per table, then the query is automatically starting off with 80 minutes of overhead per table. Just by setting PDQPRIORITY to 1 the engine can kick off one read thread per disk (don’t worry, it won’t try to read both tables at once).
Running the benchmark again with PDQPRIORITY set to 1, either from the operating system (EXPORT PDQPRIORITY=1) or from dbaccess (SET PDQPRIORITY 1), the length of time required to scan each table greatly improves.
Results: 3 minutes 21 seconds for Table 1, 3 minutes 52 seconds for Table 2. Roughly 150,000 rows per second.
Table scan times improve five-fold just by fragmenting
the data. However, it can still be improved.
Light Scans
In an OLTP environment, you want to maximize the cache read and write rates. To do this, you want the engine to find the row it needs in the buffer cache and not have to read it from disk. OLTP buffer cache management has some minimal overhead. However, in a DSS environment, you want to scan the entire table; the odds of the data being in the cache are, therefore, dramatically reduced. The overhead associated with managing the DSS buffer cache becomes onerous. Enter the light scan, a critical element in DSS queries.
The light scan is designed for a DSS query. Rather than use the normal resident memory buffer cache, the light scan uses its own light scan buffers, for which there is much less overhead (primarily because you don’t have to worry about LRU queues). Each query gets its own set of buffer pools. This can have a dramatic affect on our read rate.
To force a light scan the trick is to ensure that the table being read is larger than the resident memory buffer size AND to have the ISOLATION mode set to DIRTY READ. You can also set an environment variable (export LIGHT_SCANS=FORCE).
The number of light scan buffers assigned is a factor of your RA (read ahead) settings. From the Informix Masters Series training manual:
Light_scan_buffers = roundup (( RA_PAGES + RA_THRESHOLD) / (60/2/))
Bumping the RA_PAGES and RA_THRESHOLD values to their maximum of 128 and 120, respectively, gives you more light scan buffer pools, which is a good thing.
In the previous benchmark runs, the engine did not use light scans (the was intentional to show the performance impact of light scans). By running UPDATE STATISTICS on the engine, the optimizer will recognize that the tables being read in our example query are larger than the resident memory buffer pool and will utilize light scans when the benchmark is run again.
UPDATE STATISTICS is of course key to getting the light scan to work, but to what degree? UPDATE STATISTICS medium or high will tend to run for hours—on a large data warehouse, potentially days. The results of this are very useful if you plan to do indexed joins or filtering, but for this hash join you are looking to scan the entire table anyway, so just run the quick UPDATE STATISTICS low and leave the rest for the hash join.
(Note. Setting the environment variable DBUPSPACE can dramatically reduce the time require to run UPDATE STATISTICS)
Results: 43 seconds for Table 1, 50 seconds for Table 2. Roughly 700,000 rows per second.
I have seen scan rates of up to 2 million rows per second on a table with 32 fragments. We could fragment the table over more dbspaces to improve this read rate even more, but is cutting another 20 to 40 seconds off the query time worth the added administrative headache? You decide. The number of fragments you can allocate may be limited by some other factors as well, like the business requirement for a particular fragmentation scheme. For the purposes of this example, these results here work just fine.
(Note. I am told that there is a limit of three scan threads per CPUVP. Hence the upper limit on number of dbspaces should be 3 x CPUVP).
The next tuning efforts focus on the join itself. There
are several options available. Let’s first try running the query with a
nested loop join using the indexes.
Indexes
By adding an index, you are trying to do one of two things: either setup a nested loop join or limit the size of the data read to just the index pages. In other words, you are trying to put into the index whatever the query needs to be successful.
The addition of an index gives the optimizer a different path it can travel to get at the data. If it feels that one of the tables is small enough, it will choose to read that table and, for every row, probe the large table (using the index) to perform the join. At issue is that the speed with which the engine can perform probes can be measured in thousands per second. 30 million rows at 3000 rows/second, which is optimistic for the machine we are using here, comes out to 10,000 seconds or almost 3 hours. This is way too slow. Even fragmenting the indexes and detaching them from the tables will not change the probe rate dramatically enough to make a nested loop faster.
So much for that idea. Let’s try a dynamic hash join.
Hash Join
To recap, in a hash join the smaller of the two tables is scanned and stored in a hash table in memory. If memory fills up, which will happen unless the two tables in question are much smaller, this hash table is swapped out to the temp disk. Once the first table has been read, the second table is read and matched up against the hash table. Periodically, the query will have to swap in the hash table entries that are out on temp disk. This temp to memory swap is expensive and is generally the slowest portion of a hash join.
First thing to note is that building the hash table actually slows down the scan rate to about 300,000 rows per second (results vary).
At 40 PDQ this query took 22 minutes 12 seconds, and the match rate was about 40,000 rows per second. Not only can more memory lower the amount of swapping done to and from temp disk, but it also increases the size of the hash table in memory and allows more matches per second.
Utilizing a hash join method, it turns out that the example query join runs at about 90,000 rows per second (at 80 PDQ) for a projected total of about 5.5 minutes. This would be great except that the entire hash table does not fit into memory, causing the query to have to swap portions of the table from disk which is why the actual results of this query are much different (see below). Therefore, the more memory you can assign to the hash join, the less swapping will occur.
PDQPRIORITY controls memory allocation. Your ultimate PDQPRIORITY setting (which is your personal PDQ_PRIORITY/100 * MAX_PDQPRIORITY/100) * DS_TOTALMEMORY = how much memory you get for your query. If you have a DS_TOTALMEMORY of 1.5GB and take 1% of that (PDQPRIORITY=1) you get 15MB of memory in which to build the hash table. 15MB/20 bytes (our hash table entry) = 786,000 or so entries. That’s a rough guess, but after that many entries the database will have to push the hash table entries out to temp disk. With 16 scan threads the database will probably be reading around 160,000 rows per second, so in 5 seconds it will fill the hash table and have to push to temp disk with the above memory allocation.
(Note, the actual formula for the size of the hash table is 32 + keysize + rowsize. The hash table size would therefore be 32 + 2 + 155 or 189 bytes per tab1 record. * 29 million = 5.4GB. A key improvement here would be to lower the size of the tab1 record to reduce the size of the hash table)
Actual results of this query: 8 minutes 40 seconds (PDQ = 80). During the match phase of the query the match rate was 90,000 rows/second. After 17 million matches the database started swapping. It performed 10 total swaps, each swap about 10 seconds long (100 seconds or 1 minute 40 seconds). Since the entire table did not fit into memory, the query also could not keep up the 90,000 rows/second rate, which is why (sans the swap time) the query runs in 7 minutes not 5.5 minutes (as first projected).
At times, the optimizer can, frustratingly, keep pushing your query down the nested loop path, despite your best intentions and even the use of optimizer hints. In these cases, you can trick the optimizer into a hash join by giving it something it won’t find in the index:
SELECT …
FROM tab1, tab2
WHERE tab1.key + 0 = tab2.key + 0
The "+ 0" is inexpensive and forces the query to perform
as a hash join. This is because tab1.key
is no longer recognizable as an index to the optimizer.
Temp Disks
Not surprising, temp disk is written to and read from
in the same manner as normal tables: 1 thread per fragment. It is therefore
important to have multiple temp disks set up to maximize the number of
read and write threads. It is best to have multiple dbspaces for
your temp disks as well. I recommend 2 temp dbspaces per CPU as
was used in the query above, although as long as you keep the number at
some multiple you can go as far as you like (within reason), 64 temp dbspaces
per CPU is way beyond the point of vanishing returns. Even if you only
have eight temp disks for eight CPUs, break them into two chunks and make
each one it’s own dbspace. Since the temp spaces will be written
to in round robin, the smallest temp space will fill up first and cause
you to generically "run out of temp space" even though the other spaces
still have room. Therefore, use temp spaces of equal size; anything extra
on a single temp space is wasted.
(Note, I my tests I have discredited the notion of creating two dbspaces on one physical disk to get more scan threads, yes you get more scan threads, but the read time does not decrease)
Balance
It is important to balance all that you do against how
many processors you have so that there is minimal task swapping between
processors. In the example here, the tables were laid out
over 9 and 11 dbspaces, which meant that 8 processors were
handling 9 scan threads in one case and 11 in the other. Each processor
probably wasted some time swapping from one scan thread to another. Alas,
business requirements do not always match what makes sense technically.
Model Changes
There are a few changes you can make to your onconfig
file to better support decision support systems. The following configurations
are from the Informix Masters Series training manual:
| BUFFERS | Low | 2000 |
| SHMVIRTSIZE | High | 75% of available memory |
| SHMADD | Whatever | 32000 |
| SHMTOTAL | Maximize | Set to available memory |
| RA_PAGES | Maximize | 128 |
| RA_THRESHOLD | Maximize | 120 |
| DS_TOTAL_MEMORY | Maximize | 90% of available memory |
I actually prefer BUFFERS to be 20000—to provide support
for OLTP activity as required, the tables are large enough to dwarf the
buffers in all cases, and there is still OLTP style work to be done to
the data.
Summary
Any DSS query similar to the example described in this article will have three principle time-consuming components. A read of Table 1, a read of Table 2, and a join step. If the query limits the selection statement with some sort of filter (a key between 100 and 200) then the read components become easy to handle with an index. For example, read Table 1 for these rows and probe Table 2 for the matching rows, where the actual join component occurs during the probe.
If the query is not limited with a filter, then a full join of both tables is required. This is very inefficient when performed with an indexed read. It is far better to scan both tables and perform the join in memory. A hash join is the best way to perform this. Tuning, therefore, becomes a matter of shortening the three stages of the query (scan1, scan2, and the join itself). With a light scan the database can read (in this case, 300 thousand rows per second) from the first table while building the hash table. With the second table, the database can read only marginally less quickly including the hash join. And with a high enough PDQPRIORITY the database has enough memory to perform the hash join in memory without expending time to swap the hash table out to temp disk. Even if the hash tables are pushed to temp disk, if the temp disks are laid out well, it’s not that expensive.
If you can spread the read load over multiple disks, reserve
enough memory, and set aside enough temp disks, you can bring any query
time down to within a 5 minute window to scan each table and, potentially,
another minute or two to wrap up the join. If you are getting a 40 hour,
or 40 day, query then one or more of these things isn’t happening.
About the Author
Jack Parker is a Database Administrator at Engage Technologies. He has been working with Informix products for 14 years and has specialized in Data Warehousing for the past 5 of these. He can be reached as jparker@netway.com.