Back to Current Newsletter Washington Area Informix Users Group


Tuning Informix DSS Queries
by: Jack Parker
jparker@netway.com

This article was originally written for IDS, I have made some additional comments for XPS.

Introduction

Over the past several months I have found myself repeatedly explaining the concept of tuning a decision support system (DSS) query for an Informixâ database. I figure that I had better jot this down so that I can just hand it out and avoid spending all the time explaining it over again.

Data warehouses and very large databases tend to be used differently than traditional OLTP database systems. In an OLTP environment, the goal is to execute lots of discrete transactions in a short time frame. OLTP administrators support hundreds or thousands of users, and focus on very specific portions of the data (i.e. what is important is being able to update item 4 of the customer’s order as well as the balance in their order master record). An OLTP transaction is most probably performed with one or more UPDATE statements WHERE cust_id/ord_no = ?.

In a DSS environment, you are more likely to have only a handful of users, each of whom are making queries against a large volume of data. In an OLTP model, users like this would be severely chastised. However, in the world of DSS, this sort of activity is encouraged. Typically, a DSS query will scan every row of one or more tables. Also typically, where the size of an OLTP system might be measured in terms of hundred’s of megabytes, a DSS system will be measured in terms of hundred’s of gigabytes.

DSS systems tend to be queried by front-end tools such as Business Objects or SAS, where the SQL to perform the query is generated by the tool; you frequently do not have much control over the query itself.

If you try to support a DSS query using the same data model, indexed reads, and engine tuning used in an OLTP system, it is likely that your response time to such a query could be measured in days. In a recent case, the response time for such a query was 40 days. In this particular case, using the tuning methods detailed in this article, I was able to bring the query result time down to about 15 minutes. All without changing the query or changing the schema. This article will examine specific tuning procedures for DSS queries in order to shorten their response times.
 
 

An Example Query

Imagine you have a query where you join two tables to return a count of last name by state, where name is in Table 1 and address (state) is in Table 2:

select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1,2

Table 1 has 29 million rows and Table 2 has 27 million rows. Table 1 is 4.5 GB and Table 2 is 6GB. Each table has been placed into a single dbspace. There is no index on either key in either table. The machine in use is an 8 processor Sun 6500 with 4GB of memory. A query like this run in an OLTP environment could take upwards of 70 or 80 hours to execute.

Before examining tuning procedures to optimize this DSS query, you need to understand how to utilize specific monitoring tools available to you in order to learn how the query will behave. There are five monitoring tools that are very helpful in examining a query.
 
 

Explain Plan

The first monitoring tool is the explain plan. This can be obtained by including a SET EXPLAIN ON; command at the start of your query. The optimizer will then create, or append to, a file called sqexplain.out in which it explains how it plans to resolve the query.

QUERY:

------

select a.state_cd, b.last_name, count(*)

from tab1 a, tab2 b

where a.key = b.key

group by 1,2

Estimated Cost: 121157

Estimated # of Rows Returned: 1

1) informix.b: SEQUENTIAL SCAN (Serial, fragments: ALL)

2) informix.a: SEQUENTIAL SCAN (Serial, fragments: ALL)

Temporary table required for group operation

DYNAMIC HASH JOIN (Build Outer)

Dynamic Hash Filters: informix.b.hh_id = informix.a.hh_id

Notice that the Estimated Cost is very low; somebody forgot to UPDATE STATISTICS.

The initial section of this EXPLAIN plan is the query itself followed by an estimated cost which is derived from the number of instructions the optimizer thinks will be required to resolve the query. The value given here is way too low for this query (a value in the billions would be more normal).

The estimated number of rows returned is an indication that the optimizer thinks that at least one of the tables is empty. A minimum of one row is always returned from a count(*) operation.

Each of the tables will be read using a sequential scan, probably because it costs nothing to read an empty table. The "Serial" is an indication that each fragment of disk that the table resides on will be read sequentially as opposed to the desired "Parallel". No fragments have been eliminated from the query (hence, the "fragments: ALL").

The Temporary table is, of course, required to perform the actual group by operation. A Hash Join will be used to join the two tables, again because the optimizer thinks one of the tables is empty and because here there are no indexes.

What we would really like to see is a higher estimated cost, a higher estimated number of rows (if UPDATE STATISTICS high had been run for state_cd and last_name, we might even get a correct number here). We like the Sequential scan, although we really want a Parallel, not Serial, scan. If we had been able to eliminate some of the fragments through a WHERE condition, we might also have seen only a list of the fragments to be scanned. Finally, we would also like to see a line like:

Maximum Threads: 21

We would actually like to see a higher value indicating more parallelism, but for the sake of this example we will say 21.

This would indicate that we are using PDQPRIORITY. The exact number of threads will vary with the number of dbspaces each table is spread across, and the number of threads assigned to each additional task. As can be seen in the onstat -g sql output below, we have 9 scan threads for one table, 11 scan threads for the second table and a single sqlexec thread to pull it all together. With a higher PDQPRIORITY setting we would see more specialized threads for grouping and sorting.

These key pieces of information can tell you that something is wrong with the way the optimizer has chosen to solve the query.
 
 

Xtree

While the explain plan can show you how the query will be answered, Xtree allows you to monitor the query as it is being executed. You can see how many threads have been initiated for each segment of the process as well as determine a run rate for the query.

Figure 1 below shows an Xtree screen shot for a typical hash join. At the bottom of this figure are two sets of scan threads, the one on the right has 9 threads, the one on the left is running with 11 threads. The table on the right was read into memory in the hash table, the table on the left is being scanned and matched against this hash table while being read.

The numbers to the right of each box indicate how many threads are in use to perform this portion of the query. Within the box, below the meter, is a number indicating how many rows have been processed by this portion of the query in the last second. At the top of the box is the number of rows processed so far. About 114,000 rows were read from the table on the left in the last second, with about 89,000 matches performed in that same second. When a hash table swap occurs, all of the process per second counters will drop to zero; by watching these you can figure out how often and how long the process is spending in a hash table swap.

The numbers next to the scan boxes should match the number of dbspaces the table is fragmented across. On this machine, I would expect to see on the order of 50,000 to 70,000 rows scanned per fragment per second. If this number drops off at the end then one or more of the fragments probably completed before the others; if this occurs in the last 5-10 seconds it’s no big deal, but if this drop off occurs half way through the query you might want to check the fragmentation scheme.

The filter box should be lowering the number of rows scanned if in fact you are using a filter (e.g. where state_cd = ‘MA’). The total number filtered should keep pace easily with the scan rate.
 
 

(Insert Figure 1: Xtree Hash Join)

The following Xtree screen shot shows a typical nested loop join (Figure 2). The table on the left is being sequentially scanned; a probe into the second table is then performed for each row from this first table. As you can see, during the second that this snapshot was taken, the query joined all of 1373 rows, which is not very good. Here, the join itself limits the scan rate. Also interesting, you can see that more rows have been joined than read, an indication of duplicate rows. (That’s ok, multiple people live at the same address).
 
 

(Insert Figure 2: Xtree Nested Loop Join)

onstat –g ses and onstat –g sql

onstat –g ses and onstat –g sql (below) reveal some key numbers—specifically, how many and what type of threads will be used to execute the query, as well as how much memory your query has been allocated.

session                                   #RSAM    total      used      

id       user     tty      pid   hostname threads  memory     memory    

320      informix 6        25979 foo      21       352256     278408    

 

tid      name     rstcb    flags    curstk   status

48671    sqlexec  e041518  ---P---  1984     e041518 ready

48672    scan_1.0 e03fb98  -------  984      e03fb98 sleeping(secs: 3)

48673    scan_1.1 e040c98  -------  984      e040c98 running

48674    scan_1.2 e045918  -------  984      e045918 running

48675    scan_1.3 e046e58  -------  984      e046e58 sleeping(secs: 3)

48676    scan_1.4 e0443d8  -------  984      e0443d8 ready

48677    scan_1.5 e044818  ----R--  984      e044818 running

48678    scan_1.6 e044c58  -------  984      e044c58 ready

48679    scan_1.7 e045098  Y------  936      e045098 cond wait(await_MC1)

48680    scan_1.8 e048398  Y------  936      e048398 cond wait(await_MC1)

48681    scan_2.0 e042618  Y------  936      e042618 cond wait(await_MC2)

48682    scan_2.1 e043718  Y------  936      e043718 cond wait(await_MC2)

48683    scan_2.2 e0410d8  Y------  936      e0410d8 cond wait(await_MC2)

48684    scan_2.3 e0421d8  Y------  936      e0421d8 cond wait(await_MC2)

48685    scan_2.4 e046a18  Y------  936      e046a18 cond wait(await_MC2)

48686    scan_2.5 e041d98  Y------  936      e041d98 cond wait(await_MC2)

48687    scan_2.6 e047298  Y------  936      e047298 cond wait(await_MC2)

48688    scan_2.7 e045d58  Y------  936      e045d58 cond wait(await_MC2)

48689    scan_2.8 e0454d8  Y------  936      e0454d8 cond wait(await_MC2)

48690    scan_2.9 e03eed8  Y------  936      e03eed8 cond wait(await_MC2)

48691    scan_2.1 e0487d8  Y------  936      e0487d8 cond wait(await_MC2)

 

Memory pools    count 1

name         class addr     totalsize freesize #allocfrag #freefrag 

320          V     e2de018  352256    73848    1420       14        

 

name           free       used           name           free       used      

overhead       0          120            scb            0          96        

opentable      0          31208          filetable      0          8384      

log            0          44856          temprec        0          17680     

ralloc         0          38560          gentcb         0          18728     

ostcb          0          2024           sqscb          0          7664      

rdahead        0          1600           xchg_desc      0          7392      

xchg_port      0          2312           xchg_packet    0          5280      

xchg_group     0          144            xchg_priv      0          1352      

hashfiletab    0          5880           osenv          0          1672      

buft_buffer    0          19224          sqtcb          0          24192     

fragman        0          36720          shmblklist     0          3320      

 

Sess  SQL            Current            Iso Lock       SQL  ISAM F.E.

Id    Stmt type      Database           Lvl Mode       ERR  ERR  Vers

320   SELECT         hercules           NL  Not Wait   0    0    7.24

 

Current statement name : slctcur

select a.state_cd, b.last_name, count(*)

  from tab1 a, tab2 b

 where a.key = b.key



Last parsed SQL statement :

select a.state_cd, b.last_name, count(*)

  from tab1 a, tab2 b

 where a.key = b.key

The output contains a lot of information, probably more than you need, but there are several key things to look for. The Current Statement name provides you details on the SQL of the query any session is actually running. You can see how many threads and what types were kicked off (e.g. scan_2.3 et al, we got 9 scan threads against one table, 11 against the other and a single sqlexec thread which will perform the hash join with a PDQPRIORITY=1 (since I started the query, I know this). You can also see how much memory has been allocated for the query and is being used under Memory pools and on the top line (Total Memory/Used Memory). This information will confirm what threads have been kicked off for a given query. If there are not multiple scan threads for each table (scan 1.0 through 1.8 is on one table), or if there are NO scan threads for a given table, you know the optimizer may not be doing what you have wanted.
 
 

onstat –g mgm

The onstat –g mgm (rgm under XPS) command will also show you memory, but it is especially important for multiple queries as it shows which query is waiting for resources. Memory (Resource) Grant Manager will show what queries are actually running and with what actual PDQ setting, as well as which queries have been "gated" or are waiting for resources (memory or PDQ for example) to run.

For this article I did not provide a snapshot of an onstat –g mgm, but give it a whirl and read the manual on it—it is fairly self evident. Of particular interest are the gating factors and what queries get held up and why (so you can go kick Bob for running everything at PDQPRIORITY=100).
 
 

onstat –g lsc

The onstat –g lsc (scn for XPS) command will show if there are any light scans. If there is no output from this command, there are no light scans. Light scans are frequently crucial to DSS queries as we shall see below

onstat –g xqp and xqs (XPS specific)

Under XPS there is no xtree tool. There is however much more detailed information available than from xtree. From the onstat –g rgm output you will see a ‘Plan’ number. Plug this into onstat –g xqp plan_no and out will come the plan that will be used to resolve the query:

XMP Query Plan



  oper       segid  brid   width  misc info 

  -----------------------------------------

  scan       4      0      16     tab1

  group      4      0      16     

  group      3      0      16     

  group      3      0      16     

  group      2      0      16     

  insert     1      0      32     



In this case the query is:

insert into tab_2 select col1, count(unique col2) from tab1 group by 1

This is running on an XPS 8.30.UC2 with 16 processors and 8 GB of DS memory.

Operations are performed in descending order based on segid. So the first operation is the scan/group. The group itself is reported twice, once for an input and once for an output. In this case, the ‘unique’ claused forced a second group. Finally the data is inserted into a final table.

Onstat –g xqs for this plan_id shows:

In the interests of brevity, I have deleted all but the first and last three lines of each segment.

XMP Query Statistics
 

XMP Query Statistics



 Cosvr_ID: 1     

 Plan_ID: 11961



  type  segid brid information

  ----  ----- ---- -----------

  scan   4    0    inst cosvr time   rows_prod  rows_scan

                   ---- ----- ----   ---------  ---------

                   0    1     1916   39433808   39433808  

                   1    1     1919   39471825   39471825  

                   2    1     1903   39422536   39422536  

……

                   13   4     1930   39410387   39410387  

                   14   4     1936   39364283   39364283  

                   15   4     1936   39375704   39375704  

                   --------------------------------------

16 630463773  630463773 

In this segment, there were 16 threads initiated. Each read 39.4 million rows in about 32 minutes (see the column ‘time’ it is in seconds). Not great. There was no filter, so the rows_prod is equal to the rows_scan. You might also note the cosvr column – if you know what it means fine, otherwise ignore it.
 
  group  4

        inst cosvr time   rows_prod  rows_cons   mem        ovfl

        ---- ----- ----   ---------  ---------   ---        ----

        0    1     1916   39433808   39433808    104        0   

        1    1     1919   39471825   39471825    104        0   

        2    1     1903   39422536   39422536    104        0   

…..

        13   4     1930   39410387   39410387    104        0   

        14   4     1936   39364283   39364283    104        0   

        15   4     1936   39375704   39375704    104        0   

        --------------------------------------------------------

        16                630463770  630463773   (2048)




In this segment we pushed rows into the first group (unique) operation. Again, 16 threads, same amount of time – which was concurrent with the scan time – number of rows produced is identical to the number of rows pushed in (rows_cons). Amount of memory for this was 2K (across 16 threads).
 
 
 

  group  3    

       inst cosvr time   rows_prod  rows_cons   mem        ovfl

       ---- ----- ----   ---------  ---------   ---        ----

       0    1     12639  39059408   39432594    49345600   199 

       1    2     12676  39071879   39442561    49328000   199 

       2    3     12203  39054649   39429975    49336000   199 

…..

       13   2     12672  38998788   39393276    49350400   199 

       14   3     12103  39003511   39390257    49345600   199 

       15   4     11942  39011955   39381624    49329600   199 

       --------------------------------------------------------

       16                624401945  630463770   (3958528)

This segment is the second half of the above group segment, and is where the unique rows come out. Notice that this phase used almost 4GB of memory. Notice also that it overflowed. The ovfl is not a count of how many pages overflowed, but of how many segments. It’s hard to tell from this whether the overflow was serious. However, given that the time for this segment took on the order of three hours, I would hazard that it was serious. Obviously this step hurt. This is probably what gated the scan.
 
  group  3    

       inst cosvr time   rows_prod  rows_cons   mem        ovfl

       ---- ----- ----   ---------  ---------   ---        ----

       0    1     12639  39059408   39059408    104        0   

       1    2     12676  39071879   39071879    104        0   

       2    3     12203  39054649   39054649    104        0   

…..

       13   2     12672  38998788   38998788    104        0   

       14   3     12103  39003511   39003511    104        0   

       15   4     11942  39011955   39011955    104        0   

       --------------------------------------------------------

       16                624401944  624401945   (2048)


This is the second group necessitated by the count(*)/group operation. Again, this is just pushing rows into group operation. All of this time is concurrent with previous processes.
 

  group  2    

       inst cosvr time   rows_prod  rows_cons   mem        ovfl

       ---- ----- ----   ---------  ---------   ---        ----

       0    1     13203  21639904   39058139    43422720   199 

       1    2     13136  20410129   39069022    40452896   199 

       2    3     13201  22846215   39073721    46132152   199 

…..

       13   2     13225  21678939   39022613    43424128   199 

       14   3     13180  21694541   39009538    43659528   199 

       15   4     13184  22908683   39014754    46380352   199 

       --------------------------------------------------------

       16                352937467  624401944   (3958528)

This is the second phase of the second group operation. Because of the way this query was crafted, the other 4GB of memory was used for this operation – once strategy to solve this performance problem would be to separate the two groups – perform a select unique into one table and then perform the count off of it.
 
 
  insert 1    0    inst cosvr time   it_count

                   ---- ----- ----   --------

                   0    1     2380   10979878  

                   1    2     2380   10979878  

                   2    3     2380   10979878  

…..

                   29   2     2380   10979878  

                   30   3     2380   10979878  

                   31   4     2380   10979878  

                   --------------------------

                   32                351356096


The final insert phase of the query, writing to the temp table. This phase of the operation was not started until the second group started issuing output, hence the time is much lower.
 
 

Running the Example Query

Frequently, tuning a DSS query is merely a matter of shifting from "OLTP-think" to "DSS-think" which is exactly what we will do in this case.

The optimizer should choose to solve the example query with a hash join. If it does not, then there may be another issue, like the optimizer thinks one of the tables is empty (need to run UPDATE STATISTICS) as is the case in our query here. Let’s assume that PDQPRIORITY is NOT set to anything, which is typical of an OLTP environment. This means that it is effectively off. What will happen when the query is executed?

With these settings, the engine first reads the smaller table (Table 2), applies any filters, and builds a hash table entry for each record. It builds this hash table in memory. It then reads the large table and matches it up against the hash table. As it finds matches it pushes the matched rows out the other side of the join.

Note: If the optimizer determines that a filter will make the result set from the larger table smaller, it may choose to read the larger table first. Of course, the optimizer would need some distribution information to make this determination more accurately, and this can only come by running UPDATE STATISTICS high or medium.

After 37 minutes into this query, Xtree shows both tables have been scanned and have moved into the hash join. Because insufficient memory was allocated, the hash join is stalled while it swaps hash table pages from the temp disk. As more and more memory is needed to build the hash table, more and more available memory is used up. Since the query is executing in an OLTP environment, the LRU buffers are filled with all the data from the table along with the associate buffer management overhead, spoiling everybody’s day (or rather week/month). Since the hash table memory requirements are so large, the query probably overflowed to the temp space(s) and, if these weren’t set up well, took away the temp space for everyone else as well. And, after running for several days, you would probably just kill the query anyway.
 
 

Basic Tuning Procedures

There are three basic time-consumptive elements to the example query. A table scan, a second table scan, and the join. When a nested loop is used (where the larger table is read using the index for every record in the smaller table) the join is part of the second read. The join from a hash join occurs while the second table is read. In most cases we want to use a hash join.

When tuning this query, it helps to approach it in two parts: the table scans and the join. To begin, let’s obtain a benchmark of how long it takes just to scan each table. This will not only give us the ability to estimate query time but also highlight when a query slows down, as well as ensure that we can read the tables themselves fairly quickly.

SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;

SELECT SUM(some_column) from tab1;

SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;

SELECT SUM(some_other_column) from tab2;

SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;

Results: 17 minutes for Table 1, 20 minutes for Table 2. Roughly 30,000 rows per second. These results are interpolated. The actual read time was 37 minutes for both tables.

The first tuning objective is to improve the read performance. With all of the data being in one dbspace, the engine is limited to a single thread to read the entire table. If the data is spread over multiple dbspaces the database can kick off one thread per dbspace, or fragment. Using ipload you can generate and run a job which will unload (and reload) each table. You can then drop each table, recreate them fragmented over several disks, and reload them. In this particular instance, the fragmentation expression is not that important, so you can just use round robin. If it takes 20 minutes to read a table in one dbspace, spreading the table across 4 dbspaces should bring the scan time down to about 5 minutes. Spreading the table across 8 dbspaces should lower the time to about 2.5 minutes.
 
 

PDQPRIORITY

Now that we have fixed the tables, the most obvious next step is PDQPRIORITY. With PDQPRIORITY off, Informix Dynamic Serverä is doing nothing in parallel (Note, with XPS PDQPRIORITY does not have as much control, parallelism is still used with it set to 0, PDQPRIORITY now becomes important for allocating memory). One thread is doing all of the reading. If you assume that it takes 5 minutes to read the data from one disk, and that there are 16 dbspaces per table, then the query is automatically starting off with 80 minutes of overhead per table. Just by setting PDQPRIORITY to 1 the engine can kick off one read thread per disk (don’t worry, it won’t try to read both tables at once).

Running the benchmark again with PDQPRIORITY set to 1, either from the operating system (EXPORT PDQPRIORITY=1) or from dbaccess (SET PDQPRIORITY 1), the length of time required to scan each table greatly improves.

Results: 3 minutes 21 seconds for Table 1, 3 minutes 52 seconds for Table 2. Roughly 150,000 rows per second.

Table scan times improve five-fold just by fragmenting the data. However, it can still be improved.
 
 

Light Scans

In an OLTP environment, you want to maximize the cache read and write rates. To do this, you want the engine to find the row it needs in the buffer cache and not have to read it from disk. OLTP buffer cache management has some minimal overhead. However, in a DSS environment, you want to scan the entire table; the odds of the data being in the cache are, therefore, dramatically reduced. The overhead associated with managing the DSS buffer cache becomes onerous. Enter the light scan, a critical element in DSS queries.

The light scan is designed for a DSS query. Rather than use the normal resident memory buffer cache, the light scan uses its own light scan buffers, for which there is much less overhead (primarily because you don’t have to worry about LRU queues). Each query gets its own set of buffer pools. This can have a dramatic affect on our read rate.

To force a light scan the trick is to ensure that the table being read is larger than the resident memory buffer size AND to have the ISOLATION mode set to DIRTY READ. You can also set an environment variable (export LIGHT_SCANS=FORCE).

The number of light scan buffers assigned is a factor of your RA (read ahead) settings. From the Informix Masters Series training manual:

Light_scan_buffers = roundup (( RA_PAGES + RA_THRESHOLD) / (60/2/))

Bumping the RA_PAGES and RA_THRESHOLD values to their maximum of 128 and 120, respectively, gives you more light scan buffer pools, which is a good thing.

In the previous benchmark runs, the engine did not use light scans (the was intentional to show the performance impact of light scans). By running UPDATE STATISTICS on the engine, the optimizer will recognize that the tables being read in our example query are larger than the resident memory buffer pool and will utilize light scans when the benchmark is run again.

UPDATE STATISTICS is of course key to getting the light scan to work, but to what degree? UPDATE STATISTICS medium or high will tend to run for hours—on a large data warehouse, potentially days. The results of this are very useful if you plan to do indexed joins or filtering, but for this hash join you are looking to scan the entire table anyway, so just run the quick UPDATE STATISTICS low and leave the rest for the hash join.

(Note. Setting the environment variable DBUPSPACE can dramatically reduce the time require to run UPDATE STATISTICS)

Results: 43 seconds for Table 1, 50 seconds for Table 2. Roughly 700,000 rows per second.

I have seen scan rates of up to 2 million rows per second on a table with 32 fragments. We could fragment the table over more dbspaces to improve this read rate even more, but is cutting another 20 to 40 seconds off the query time worth the added administrative headache? You decide. The number of fragments you can allocate may be limited by some other factors as well, like the business requirement for a particular fragmentation scheme. For the purposes of this example, these results here work just fine.

(Note. I am told that there is a limit of three scan threads per CPUVP. Hence the upper limit on number of dbspaces should be 3 x CPUVP).

The next tuning efforts focus on the join itself. There are several options available. Let’s first try running the query with a nested loop join using the indexes.
 
 

Indexes

By adding an index, you are trying to do one of two things: either setup a nested loop join or limit the size of the data read to just the index pages. In other words, you are trying to put into the index whatever the query needs to be successful.

The addition of an index gives the optimizer a different path it can travel to get at the data. If it feels that one of the tables is small enough, it will choose to read that table and, for every row, probe the large table (using the index) to perform the join. At issue is that the speed with which the engine can perform probes can be measured in thousands per second. 30 million rows at 3000 rows/second, which is optimistic for the machine we are using here, comes out to 10,000 seconds or almost 3 hours. This is way too slow. Even fragmenting the indexes and detaching them from the tables will not change the probe rate dramatically enough to make a nested loop faster.

So much for that idea. Let’s try a dynamic hash join.
 
 

Hash Join

To recap, in a hash join the smaller of the two tables is scanned and stored in a hash table in memory. If memory fills up, which will happen unless the two tables in question are much smaller, this hash table is swapped out to the temp disk. Once the first table has been read, the second table is read and matched up against the hash table. Periodically, the query will have to swap in the hash table entries that are out on temp disk. This temp to memory swap is expensive and is generally the slowest portion of a hash join.

First thing to note is that building the hash table actually slows down the scan rate to about 300,000 rows per second (results vary).

At 40 PDQ this query took 22 minutes 12 seconds, and the match rate was about 40,000 rows per second. Not only can more memory lower the amount of swapping done to and from temp disk, but it also increases the size of the hash table in memory and allows more matches per second.

Utilizing a hash join method, it turns out that the example query join runs at about 90,000 rows per second (at 80 PDQ) for a projected total of about 5.5 minutes. This would be great except that the entire hash table does not fit into memory, causing the query to have to swap portions of the table from disk which is why the actual results of this query are much different (see below). Therefore, the more memory you can assign to the hash join, the less swapping will occur.

PDQPRIORITY controls memory allocation. Your ultimate PDQPRIORITY setting (which is your personal PDQ_PRIORITY/100 * MAX_PDQPRIORITY/100) * DS_TOTALMEMORY = how much memory you get for your query. If you have a DS_TOTALMEMORY of 1.5GB and take 1% of that (PDQPRIORITY=1) you get 15MB of memory in which to build the hash table. 15MB/20 bytes (our hash table entry) = 786,000 or so entries. That’s a rough guess, but after that many entries the database will have to push the hash table entries out to temp disk. With 16 scan threads the database will probably be reading around 160,000 rows per second, so in 5 seconds it will fill the hash table and have to push to temp disk with the above memory allocation.

(Note, the actual formula for the size of the hash table is 32 + keysize + rowsize. The hash table size would therefore be 32 + 2 + 155 or 189 bytes per tab1 record. * 29 million = 5.4GB. A key improvement here would be to lower the size of the tab1 record to reduce the size of the hash table)

Actual results of this query: 8 minutes 40 seconds (PDQ = 80). During the match phase of the query the match rate was 90,000 rows/second. After 17 million matches the database started swapping. It performed 10 total swaps, each swap about 10 seconds long (100 seconds or 1 minute 40 seconds). Since the entire table did not fit into memory, the query also could not keep up the 90,000 rows/second rate, which is why (sans the swap time) the query runs in 7 minutes not 5.5 minutes (as first projected).

At times, the optimizer can, frustratingly, keep pushing your query down the nested loop path, despite your best intentions and even the use of optimizer hints. In these cases, you can trick the optimizer into a hash join by giving it something it won’t find in the index:

SELECT …

FROM tab1, tab2

WHERE tab1.key + 0 = tab2.key + 0

The "+ 0" is inexpensive and forces the query to perform as a hash join. This is because tab1.key is no longer recognizable as an index to the optimizer.
 
 

Temp Disks

Not surprising, temp disk is written to and read from in the same manner as normal tables: 1 thread per fragment. It is therefore important to have multiple temp disks set up to maximize the number of read and write threads. It is best to have multiple dbspaces for your temp disks as well. I recommend 2 temp dbspaces per CPU as was used in the query above, although as long as you keep the number at some multiple you can go as far as you like (within reason), 64 temp dbspaces per CPU is way beyond the point of vanishing returns. Even if you only have eight temp disks for eight CPUs, break them into two chunks and make each one it’s own dbspace. Since the temp spaces will be written to in round robin, the smallest temp space will fill up first and cause you to generically "run out of temp space" even though the other spaces still have room. Therefore, use temp spaces of equal size; anything extra on a single temp space is wasted.
 
 

(Note, I my tests I have discredited the notion of creating two dbspaces on one physical disk to get more scan threads, yes you get more scan threads, but the read time does not decrease)

Balance

It is important to balance all that you do against how many processors you have so that there is minimal task swapping between processors. In the example here, the tables were laid out over 9 and 11 dbspaces, which meant that 8 processors were handling 9 scan threads in one case and 11 in the other. Each processor probably wasted some time swapping from one scan thread to another. Alas, business requirements do not always match what makes sense technically.
 
 

Model Changes

There are a few changes you can make to your onconfig file to better support decision support systems. The following configurations are from the Informix Masters Series training manual:
 
BUFFERS Low 2000
SHMVIRTSIZE High 75% of available memory
SHMADD Whatever  32000
SHMTOTAL Maximize Set to available memory
RA_PAGES Maximize 128
RA_THRESHOLD Maximize 120
DS_TOTAL_MEMORY Maximize 90% of available memory

I actually prefer BUFFERS to be 20000—to provide support for OLTP activity as required, the tables are large enough to dwarf the buffers in all cases, and there is still OLTP style work to be done to the data.
 
 

Summary

Any DSS query similar to the example described in this article will have three principle time-consuming components. A read of Table 1, a read of Table 2, and a join step. If the query limits the selection statement with some sort of filter (a key between 100 and 200) then the read components become easy to handle with an index. For example, read Table 1 for these rows and probe Table 2 for the matching rows, where the actual join component occurs during the probe.

If the query is not limited with a filter, then a full join of both tables is required. This is very inefficient when performed with an indexed read. It is far better to scan both tables and perform the join in memory. A hash join is the best way to perform this. Tuning, therefore, becomes a matter of shortening the three stages of the query (scan1, scan2, and the join itself). With a light scan the database can read (in this case, 300 thousand rows per second) from the first table while building the hash table. With the second table, the database can read only marginally less quickly including the hash join. And with a high enough PDQPRIORITY the database has enough memory to perform the hash join in memory without expending time to swap the hash table out to temp disk. Even if the hash tables are pushed to temp disk, if the temp disks are laid out well, it’s not that expensive.

If you can spread the read load over multiple disks, reserve enough memory, and set aside enough temp disks, you can bring any query time down to within a 5 minute window to scan each table and, potentially, another minute or two to wrap up the join. If you are getting a 40 hour, or 40 day, query then one or more of these things isn’t happening.
 
 

About the Author

Jack Parker is a Database Administrator at Engage Technologies. He has been working with Informix products for 14 years and has specialized in Data Warehousing for the past 5 of these. He can be reached as jparker@netway.com.