Db2 SQL and Application Performance: Part 3, Using Subqueries to Reduce Calls to the Data Server

This is the third of several blog articles addressing Db2 performance. All the content of these articles will be based upon Db2 for z/OS. However, much of it can also be applied to Db2 for LUW. The articles will address mostly application performance regarding coding efficient SQL statements as well as efficient Db2 application coding techniques and settings. There is a saying in the industry, “it depends”, which is just another waying of saying “I don’t know”. While these articles will avoid using either of these terms, please keep in mind that your result can and will be different. With that in mind any design decision should be thoroughly tested prior to implementation. For testing references please see

https://www.db2expert.com/db2expert/testing-db2-design-decisions/

Call Only When Necessary

Yes, indeed the most efficient SQL statement is the one that is never executed. There are two things to remember about Db2. First, it’s a black box of sorts. That is, under the covers it is managing data in a vast number of physical datasets, maintaining the associated metadata that obscures the actual physical location and organization of the data. Thus, relieving the application developer of any knowledge of the physicality of the data they are accessing. All the developer needs to know is the tables and columns they need to access. That comes with an associated cost such that a database call will never be as cheap as a native file system call. Second, any call to Db2 from an application program (all calls to Db2 are from a program) is an external call from that application program. For example, if the call is in a CICS program or COBOL batch program to Db2 for z/OS located in the same z/OS LPAR it is considered a cross-memory call from one address space to another. If the call is coming from a remote application on another server, then that call is not only external from the Db2 distributed address space, but it involves traversing a network as well.

So, how can calls to the Db2 be reduced? There are a lot of things that can be done in this regard, and this is the single most important thing that an application developer can do to improve performance. You see, when a DBA sees an application with excessive “in Db2” times, they can investigate individual SQL statements that the application executes. If they see a few SQL statements using a lot of resources it can be an easy fix without application changes, such as adding an index or running statistics. However, if what they see is a whole lot of inexpensive statements executing then there is little a DBA can do to fix the issue. In this case it’s all on the application developer to reduce the number of calls.

Please see article 1 for call avoidance recommendations.

Please see article 2 for using SQL joins to avoid calls to the data server.

In this article I’m going to present using subqueries to avoid excessive calls to the data server! I’m also going to address subquery and join performance basics!

What is a subquery?

As in my previous article I demonstrated how you can use SQL join statements to reduce calls to the data server. As we know, many application programmers simply learn basic SQL and as a result code very basic single SQL statements that end up being wrapped by application logic that is more complex than it needs to be. The result of this often is an application that is more expensive from a database perspective as well as an application perspective. Like using joins to reduce calls to the data server you can also use subqueries in a similar fashion to joins. Subqueries are useful when you need information from one table dependent upon information from a second table, which may be a different table or the same table. The program doesn’t need the data from the second table, but that data is required to get the data from the first table. Let’s look at an example.

Looking at the Db2 sample database suppose you wanted to retrieve data about the employee with the highest salary in the company. You could code two separate statements that look like this:

SELECT MAX(EMP1.SALARY)
    FROM EMP EMP1;

SELECT EMP2.EMPNO, EMP2.LASTNAME
FROM EMP EMP2
WHERE EMP2.SALARY = 52750.00;

Of course, that is two trips to the data server. You can satisfy the same requirement in one trip to the data server with by utilizing a subquery. A subquery is a <subselect> as part of a predicate in another <subselect> or <fullselect> portion of a query. Here is the same result as the above two statements as a single statement utilizing a subquery.

SELECT EMP2.EMPNO, EMP2.LASTNAME
FROM EMP EMP2
WHERE EMP2.SALARY =
   (SELECT MAX(EMP1.SALARY)
    FROM EMP EMP1)

This saves a trip to the data server which theoretically saves 50% elapsed time if you are calling from a remote client.

Subquery Types

There are two forms of subqueries: correlated and non-correlated. Most query requirements that can be satisfied with a subquery can be written as either a correlated or non-correlated subquery. Imagine a requirement to return all employees that work in a department where the department name includes the word “SYSTEMS”. This subquery example is a non-correlated subquery.

SELECT WORKDEPT, EMPNO, LASTNAME
FROM   EMP EMP2
WHERE  WORKDEPT IN
   (SELECT DEPTNO
    FROM   DEPT DEPT
    WHERE  DEPTNAME LIKE '%SYSTEMS%')

This subquery example is correlated and returns the same result as the non-correlated example.

SELECT WORKDEPT, EMPNO, LASTNAME
FROM   EMP EMP2
WHERE  EXISTS
   (SELECT 1
    FROM   DEPT DEPT
    WHERE  DEPT.DEPTNO = EMP2.WORKDEPT
    AND    DEPT.DEPTNAME LIKE '%SYSTEMS%')

So, if these two statements return the same result, then why have two forms you can use? The answer is statement performance! A non-correlated query operates in a bottom-up fashion. Specifically, the outer (upper) portion of the query cannot execute until the subquery result is available. Therefore, from a performance perspective a non-correlated subquery may be the better performer in the subquery results produce relatively fewer rows than the outer query, do not have good index support, and the outer query has good index support. A query containing a correlated subquery operates in a top-down, top-down, top-down, etc. fashion. Specifically, the inner (subquery) portion of the query cannot execute until the outer (upper) portion of the query returns data. Therefore, from a performance perspective a correlated subquery may be the better performer if the outer portion returns relatively few rows than the subquery, and the inner subquery portion has good index support. Specifically, the first query may perform better if there is an index on the WORKDEPT column and the subquery returns relatively few rows, and the second one might perform better if the is an index on the DEPTNO column and the EMP table returns relatively few rows.

To reiterate, non-correlated subqueries are processed first then the outer query is processed.

A correlated subquery is executed after the outer query starts returning rows and is reprocessed for every qualifying row of the outer query.

Additional Performance Options

We can talk about performance for days, but the best thing to do is to try out all the options and make sure you perform accurate tests. Do an EXPLAIN on the statements, run them, and measure the performance using Db2 accounting information and not simply looking at a stopwatch. With that in mind, there is another option to the non-correlated versus correlated choices for subqueries, and that’s a join! Yes, you can also satisfy a query requirement that utilizes a subquery as a join. This join returns the same result as the fist two subqueries above in the subquery types section of this article.

SELECT EMP2.WORKDEPT, EMP2.EMPNO, EMP2.LASTNAME
FROM   EMP EMP2 
INNER JOIN
       DEPT DEPT
ON     DEPT.DEPTNO = EMP2.WORKDEPT
WHERE  DEPT.DEPTNAME LIKE '%SYSTEMS%'

Likewise, a join can also be correlated, which is also know as a sideways reference. These two joins return the same results.

SELECT EMPNO, SALARY, DEPTSAL.AVGSAL
FROM   EMP EMP
LEFT JOIN
      (SELECT WORKDEPT, AVG(SALARY) 
       FROM   EMP
       GROUP BY WORKDEPT) DEPTSAL(WORKDEPT, AVGSAL)
ON     EMP.WORKDEPT = DEPTSAL.WORKDEPT

SELECT E.EMPNO, E.SALARY, DEPTSAL.AVGSAL
FROM   EMP E
LEFT JOIN 
TABLE (SELECT F.WORKDEPT, AVG(F.SALARY) 
       FROM   EMP F
       WHERE  F.WORKDEPT = E.WORKDEPT
       GROUP BY WORKDEPT) DEPTSAL(WORKDEPT, AVGSAL)
ON     E.WORKDEPT = DEPTSAL.WORKDEPT

The same rules regarding the performance of subqueries apply to correlated versus non-correlated joins, but please be very careful with correlated joins as they can be a performance nightmare in some situations. Always test extensively and verify you are getting good index access.

Summary – Why Use Subqueries?

Remember, the number one reason to use subqueries is to reduce calls to the Db2 data server, especially when calling Db2 over a network. Once you do that you have already made a significant performance improvement to your application. If you want to take things further you can investigate using a correlated versus non-correlated subquery, but that will require a little more knowledge about the database objects involved, including the available indexes and amount of data you expect from each part of the query. This is where you can consult with your friendly neighborhood DBA for some advice and maybe some indexes to support your subqueries.

Leave Comment

Your email address will not be published. Required fields are marked *