Wednesday, July 29, 2015

Diagnose High-Latency I/O Operations Using SystemTap

Topic: this post is about some simple tools and techniques that can be used to drill down into high-latency I/O events using SystemTap probes.

The problem: Operations with high latency on a filesystem and/or a storage volume can sometimes be attributed to just a few disks 'misbehaving', possibly because they are suffering mechanical failures and/or are going to break completely in the near future.
High-latency I/Os on just a few disks can then appear as latency outliers when accessing volumes built on a large number of disks, and overall they affect the performance of the entire storage. I write this having in mind the example of a storage system built with (SATA) JBODs using Oracle ASM as volume manager/DB filesystem; however, the main ideas and tools described in this post apply to many other volume managers and file systems, including HDFS.

The standard tools: One way to find out whether one or more disks are serving I/O requests with high latency is to use standard Linux tools such as iostat, sar or collectl. Typically you would use these tools to spot anomalous values of average service time, average wait time and also of queue size.
A structured approach to doing this is described in Brendan Gregg's USE method and in the Linux tools that can be used to implement it.

SystemTap scripts: In this post we focus on a technique and a couple of simple scripts that use SystemTap probes on the block devices to identify disks serving I/O with high latency.
The script blockio_latency_outliers_per_device.stp measures some basic latency statistics per block device, including the number of I/Os and the average and maximum latency. The script also prints the details of all I/Os whose latency is above a configurable threshold (the default threshold is set at 500 microseconds).
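To give an idea of the measurement logic, here is a minimal sketch (not the actual script): it hooks onto the block_rq_issue and block_rq_complete kernel trace points, aggregates latency per device name and prints a warning for slow I/Os. Reading the device name via $rq->rq_disk->disk_name is an assumption that depends on the kernel version and on the availability of the tracepoint arguments.

# Minimal sketch: per-device block I/O latency statistics with a warning
# for I/Os slower than a threshold (500000 microseconds in this sketch)
global request_time, latency
global threshold = 500000

probe kernel.trace("block_rq_issue") {
    request_time[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
    t = gettimeofday_us()
    s = request_time[$rq]
    if (s > 0) {
        delta = t - s
        dev = kernel_string($rq->rq_disk->disk_name)   # assumption: field available on this kernel
        latency[dev] <<< delta
        if (delta > threshold)
            printf("latency warning, >%d microsec: device=%s, latency=%d\n", threshold, dev, delta)
        delete request_time[$rq]
    }
}

probe timer.sec(10) {
    printf("\n%-10s %10s %14s %14s %14s\n", "Disk name", "Num I/Os", "Min latency", "Avg latency", "Max latency")
    foreach (dev in latency)
        printf("%-10s %10d %14d %14d %14d\n", dev, @count(latency[dev]),
               @min(latency[dev]), @avg(latency[dev]), @max(latency[dev]))
    delete latency
}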
An example of the script's use (output edited for clarity) is shown here below. Note the latency warning messages and the very large maximum latency measured for the /dev/sdy block device:

[root@myhost] # stap -v blockio_latency_outliers_per_device.stp 10

Measuring block I/O latency and statistics
A warning will be printed for I/Os with latency higher than 500000 microseconds
Statistics will be printed every 10 seconds. Press CTRL-C to stop

...
latency warning, >500000 microsec: device=sdy, sector=166891231, latency=714984
latency warning, >500000 microsec: device=sdy, sector=165679327, latency=582708
latency warning, >500000 microsec: device=sdy, sector=167102975, latency=1162550
....

I/O latency basic statistics per device, measurement time: 10 seconds
Latency measured in microseconds
Disk name          Num I/Os    Min latency    Avg latency    Max latency
....
sdu                     219            106           6217          27117
sdz                     200            123           5995          27205
sdq                     211             71           6553          31120
sdh                     256            103           6643          22663
sds                     224            101           6610          29743
sdm                     238             92           7550          35571
sde                     243             90           8652          52029
sdt                     200            105           5997          25180
sdk                     200             94           5696          35057
sdi                     206             99           7849          30636
sdg                     269             74           6746          36746
sdy                     197            102          98298        1167977
sdr                     200             89           6559          27873
sdl                     200            140           8789          31996
sdw                     210             99           7009          37118
sdd                     217             94           7440          56071
sdn                     205             99           6628          41339
....


When candidate disks for high latency have been identified, the second step is to further drill down using the script blockio_rq_issue_filter_latencyhistogram.stp. This script gathers and displays I/O latency histograms for a subset of block devices that can be specified using filters in the script header. The default filters are:

# SystemTap variables used to define filters, edit as needed
global IO_size = 8192    # this will be used as a filter for the I/O request size
                         # the value 8192 targets 8KB operations for Oracle single-block I/O
                         # use the value -1 to disable this filter
global IO_operation = 0  # this will be used as a filter: only read operations
                         # a value of 0 considers only read operations (the value 1 is for write)
                         # use the value -1 to disable this filter
global IO_devmaj = -1    # this will be used as a filter: device major number (-1 means no filter)
                         # example: use the value 253 for device mapper block devices
global IO_devmin = -1    # device minor number (or -1 if no filter)

You can use blockio_rq_issue_filter_latencyhistogram.stp to drill down into the latency histograms of the disks that have shown high-latency I/O and also compare them with "good" disks. In the example above the candidate "trouble disk" is /dev/sdy, with major number 65 and minor number 128 (you can find the major and minor device numbers for "sdy" simply by running ls -l /dev/sdy).
Example:

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 128 (-1=disable filter)

Block I/O latency histogram, measurement time: 10 seconds, I/O count: 199
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
  value |-------------------------------------------------- count
     16 |                                                    0
     32 |                                                    0
     64 |@                                                   2
    128 |@@@@@@@                                            14
    256 |@@                                                  4
    512 |@                                                   2
   1024 |@@@                                                 7
   2048 |@@@@@@@@@                                          19
   4096 |@@@@@@@@@@@@@@@@@@                                 37
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      59
  16384 |@@@@@@@                                            14
  32768 |                                                    0
  65536 |@@@@@          HIGH                                10
 131072 |@@@@@          LATENCY                             11
 262144 |@@@@@          I/O                                 10
 524288 |@@@            OPERATIONS                           6
1048576 |@@                                                  4
2097152 |                                                    0


Note the presence of several I/O operations at high latency together with more "normal" I/O latency values: for example I/Os served from SATA disks at around 8 ms and I/Os served from the controller cache at sub-millisecond latency.
For comparison, here below is the latency histogram measured on another disk while the same workload was running. We can see that in this case the high-latency points found in the previous example are no longer present. Most of the I/O operations are at around 8 ms latency, with some operations served from cache at sub-millisecond latency. The I/Os reported for the /dev/sdy disk with latency of 64 ms and above are not present in this case.

stap -v blockio_rq_issue_filter_latencyhistogram.stp 10

Block I/O latency histograms from kernel trace points
Filters:
    IO_size = 8192
    IO_operation = 0 (0=read, 1=write, -1=disable filter)
    IO_devmaj = 65 (-1=disable filter)
    IO_devmin = 48 (-1=disable filter)


Block I/O latency histogram, measurement time: 10 seconds, I/O count: 196
Value = latency bucket (microseconds), count=I/O operations in 10 seconds
value |-------------------------------------------------- count
   32 |                                                    0
   64 |                                                    0
  128 |@@@@@@@@@                                          19
  256 |@@@@@@@@                                           16
  512 |                                                    1
 1024 |@@@@                                                9
 2048 |@@@@@@@@@@@@@@                                     28
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       56
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@                           49
16384 |@@@@@@@@@                                          18
32768 |                                                    0
65536 |                                                    0

The tests reported here have been performed on an old system running RHEL 5 (kernel 2.6.18-371) and SystemTap 1.8; however, the scripts have also been tested on OL and CentOS 7.1 (kernel 3.10.0-229) and SystemTap 2.8. If you are using SystemTap 2.6 or above you can use the scripts blockio_rq_issue_latencyhistogram_new.stp and blockio_rq_issue_filter_latencyhistogram_new.stp instead.

Conclusion: In this post we have shown a simple technique to diagnose I/O latency outliers, and in particular how to drill down on I/Os served at high latency because of one or more misbehaving disks in the storage volume/filesystem. The investigation has been done using SystemTap scripts, first to discover which disks were serving some of their I/Os with high latency and then to drill down on specific devices with the use of latency histograms. Identifying and later replacing the badly performing disks can be beneficial for the performance of the entire storage system.

Download: The tools discussed in this post can be downloaded from this webpage and from Github.

Acknowledgements and additional links: Brendan Gregg has published an extensive set of articles and tools on performance tuning, including storage latency investigations.
For more information and examples on how to use SystemTap see the SystemTap wiki.
For additional tools for Oracle ASM I/O investigations see also Bertrand Drouvot's asm_metrics.pl.
Kevin Closson's SLOB was used as workload generator for the examples discussed here.
Additional links on tools and techniques on Oracle I/O troubleshooting in this blog: Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Event Histogram Metric and Oracle 12c and Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Tuesday, July 14, 2015

Heat Map Visualization of Latency Histograms for NetApp C-Mode

Topic: This post is about collecting and visualizing I/O latency histograms for NetApp filers in C-mode.

Motivations: Drilling down into I/O latency is an important technique for troubleshooting and benchmarking storage. Average latency values can hide details of what is happening on the storage: think for example of storage systems with flash and spindles, each serving I/O at a different latency. Moreover, averaging the measured values over time can hide details in the case of varying workloads or of issues that appear sporadically.

Latency heat maps: Representing latency histograms over time is a 3-D visualization problem and can be nicely solved using heat maps. This has been discussed in detail by Brendan Gregg in ...

PyLatencyMap is a simple visualization tool that works on the command line: it takes latency histograms as input and outputs two latency heat maps using ANSI codes for colors.
The latency heat map at the top of the screen uses a blue palette and represents the number of operations per second as a function of the latency bucket and of time. The second map, in yellow-red, represents the time waited as a function of the latency bucket and of time.

How to collect and display I/O latency histograms for NetApp filers

For a quick start you can use Example10_NetApp_Cmode_reads.sh. Before running it you will need to edit the parameters: the password to connect to the storage, the junction point you want to measure, the cluster management node and, optionally, the time interval between data points. Then run the example script. Similarly, Example10b_NetApp_Cmode_writes.sh will capture and display heat maps for write operations if you prefer.

This is an example of the output

Figure1: Example output of PyLatencyMap measuring and displaying I/O read latency histograms using performance counters from the NetApp filer (C-mode). The top heat map (blue color) represents I/O operations latency over time. The lower heat map (yellow-red color) represents the time waited per bucket.

Additional details

The 'glue' between NetApp and PyLatencyMap is provided by two Perl scripts written by Ruben Gaspar: (1) NetApp_histogram_Cmode.pl connects to the filer and gets the latency histograms; (2) NetApp_latency_connector.pl transforms the histograms from the NetApp filer format into a format understood by PyLatencyMap (i.e. power-of-two histograms).

Note that data aggregation into buckets is done by NetApp_latency_connector.pl using the same convention as Oracle's wait event histograms (v$event_histogram and v$event_histogram_micro), that is, a value of N for bucket B means that there have been N I/O operations with latency between B/2 and B. For those of you who use DTrace 'quantize' or SystemTap 'hist_log', please note the difference, as in those tools the convention is that a value of N for bucket B means N operations with latency between B and 2*B.
Note also the following dependencies: the scripts discussed here for the integration between the NetApp performance counters and PyLatencyMap have been developed for NetApp Clustered Data ONTAP and in particular have been tested on versions 8.2 and 8.3 (notably they will not work against ONTAP version 7). Also Perl is needed, including the modules Net::OpenSSH and IO::Pty. PyLatencyMap is written in Python.

Conclusions and additional links

PyLatencyMap is a tool for collecting and displaying latency heat maps on the command line, intended as an aid for performance troubleshooting and benchmarking activities. PyLatencyMap has now been integrated with NetApp performance counters, which adds to the existing data sources: Oracle wait events, SystemTap and DTrace probes, among others.

Download PyLatencyMap from:
https://github.com/LucaCanali/PyLatencyMap or http://canali.web.cern.ch/canali/resources.htm 

Previous blog entries on PyLatencyMap:
http://externaltable.blogspot.com/2015/03/heat-map-visualization-for-systemtap.html
http://externaltable.blogspot.com/2013/09/getting-started-with-pylatencymap.html
http://externaltable.blogspot.com/2013/08/pylatencymap-performance-tool-to-drill.html

Monday, June 15, 2015

Event Histogram Metric and Oracle 12c

Topic: event histogram metric, a script and some related discussions on collecting and displaying wait event latency histograms for Oracle performance troubleshooting. Also discussed is the new 12c feature of event histograms with microsecond buckets.

Why: Latency histograms (and by extension wait event histograms) provide very useful information when troubleshooting performance for systems exhibiting response times with a multi-mode distribution. In such cases average wait values are often not sufficient to understand the behavior of the system under study, and histograms provide a finer level of detail. A common case is I/O performance investigations for many modern storage systems, which can serve I/O from HDD or flash, with completely different latency characteristics.

What: Event histogram metric is a command-line script for "near real-time" monitoring, based on the simple idea of taking two snapshots of the relevant V$ views and printing the wait event histogram of the delta values; see also Figure 1 for an example. You can find an example of a case where event histogram metric has helped with performance troubleshooting in a previous blog entry at this link.

What's new: The latest version of event histogram metric (v 1.1, June 2015) adds support for a small but very useful new feature of Oracle version 12.1.0.2, where histogram buckets go down to the microsecond level with V$EVENT_HISTOGRAM_MICRO. Another feature I have added in event histogram metric v 1.1 is the calculation of the estimated time waited per bucket. This gives an indication of the relative weight/importance of the time waited in each latency bucket and allows you, for example, to compare the relative weight of many low-latency waits vs. a few events with long latency.

Download the tool from http://cern.ch/canali/resources.htm or from https://github.com/lucacanali The relevant scripts for this post are:
  • ehm.sql - a script to collect and visualize wait event histograms for Oracle RAC, 11g and 12c.
  • ehm_local.sql - same as ehm.sql but collects data only for the local instance (V$ view data), 11g and 12c.
  • ehm_micro.sql - collects and displays wait event histograms with microsecond buckets for RAC, requires Oracle 12.1.0.2.
  • ehm_micro_local - same as ehm_micro but for the local instance (V$ view data), requires Oracle 12.1.0.2.

Example of usage

Figure 1: Example of usage of the event histogram metric script on Oracle 12c. You can see low-latency and high-latency I/O modes. Sub-millisecond I/Os come from the storage SSD/cache, while high-latency I/Os are served by reading from SATA disks. Note also that the average latency value, reported at the bottom of the figure, does not capture the details of the multiple modes and therefore does not represent the additional complexity of this workload.


Wait events of interest

Wait event 'db file sequential read' is used by the Oracle engine to provide a measure of the time spent on single-block read operations. This is an important source of data when troubleshooting the performance of many OLTP workloads, as it accounts for the time spent on storage access for random I/O (additional details and comments on the limitations of this approach are detailed in the next paragraph). This is the typical data access path when using indexes, for operations like primary key look-ups, index range scans, nested loops joins, etc., that are quite common in many transactional workloads. A short investigation of the internals of Oracle and its interaction with the OS (see also this blog post for details) shows that in most cases the time accounted by 'db file sequential read' is essentially the time taken to complete the underlying system call to read data from the storage. Therefore this event provides a good measure of the storage response time as seen from the database server. Moreover, in RAC it is easy to aggregate wait event data for all instances in one histogram using GV$ views, and this provides an easy source of data for the shared storage.
Note: in Exadata deployments instead of 'db file sequential read' you should expect to see the wait event 'cell single block physical read'.

Another important wait event for troubleshooting the performance of many Oracle databases with significant transactional workloads is 'log file sync'. This wait event measures the time a session will wait on commit operations. There are several components to the wait time accounted in log file sync (see the paragraph below for some additional hints and known pitfalls); one of the most important for the scope of this article is the time necessary to perform I/O on the redo log files. Studying the latency of this event can be a very good source of information for performance troubleshooting, as it may allow you to understand whether writes to redo log files are performed at low latency, so most likely into the storage DRAM or SSD cache, or rather with high latency, which may indicate the lack of caching, the cache going into write-through mode or other issues worth further investigation.

Limitations and additional comments

When investigating latency and storage response time for random I/O, the wait event 'db file sequential read' is a very important source of data; however, not all random I/O performed by Oracle is accounted under this wait event. When Oracle sees the opportunity to optimize storage access by using prefetching or by batching I/O, it will use asynchronous I/O (when available). Asynchronous I/O for random reads is accounted under the wait event 'db file parallel read'. Asynchronous I/O improves the utilization of the I/O subsystem by the Oracle processes, as it allows submitting and reaping many I/O operations in groups, as opposed, for example, to reading blocks one by one as in the case of pread system calls. However, this makes the task of correlating the wait event latency with storage response time much more difficult. This is the case for 'db file parallel read', but also for wait events associated with asynchronous I/O used for sequential I/O operations, such as 'direct path read'. More on this topic and the pitfalls of using the wait event interface with asynchronous I/O are detailed in this other blog post (see also Oracle Support Doc ID 7448407.8 "Bug 7448407 Resource Accounting and Tracing change for Direct Path Waits").

One known pitfall when interpreting the measurements of the 'log file sync' wait event is that the wait time also includes what can be a significant CPU workload. This has been discussed in detail by Kevin Closson. There are also many other details, including a new adaptive mechanism for the communication between the server process and the log writer introduced in 11.2.0.3 (see: adaptive log file sync) and more changes introduced in 12c with the redo writer worker processes. See also Oracle Support Doc ID 34592.1 "WAITEVENT: log file sync Reference Note" for a list of known issues and bugs in this area.

As an addendum to the discussion: this is the formula used in the event histogram metric script to compute the estimated wait time in each bucket: T_wait = 0.75 * bucket_max_val * wait_count. As an example of the use of the formula, consider 100 wait events in the Oracle bucket "2 ms". This means that the wait time has been between 1 ms * 100 events = 100 ms and 2 ms * 100 events = 200 ms. According to the formula above the estimated value is 0.75 * 2 ms * 100 = 150 ms, that is, the mean value between the maximum wait of 2 ms * 100 and the minimum wait of 1 ms * 100.

Related work and links

The study of latency histograms as a function of time leads to a 3D representation problem. A natural solution for it is to represent latency values in a heat map. If you are interested in more details see this blog post on OraLatencyMap, a tool to collect and display heat maps of wait events using SQL*plus and this series of posts on PyLatencyMap, a python-based tool integrated with many data sources, including Oracle wait events, SystemTap and DTrace.
Event histogram metric is part of a group of short command-line scripts I use for Oracle troubleshooting, including scripts for near real-time measurement of Oracle performance. The scripts are available for download; you can read more details by following this link.
When troubleshooting Oracle performance and investigating I/O latency details it can be useful to get graphs of historical trends from the AWR repository. PerfSheet4 is a tool that I use to extract and visualize AWR data.

Conclusions

Latency histograms of Oracle wait events are a powerful drill-down tool for performance investigations. Histograms are necessary when average values do not provide an accurate picture, for example when studying the I/O latency of storage with multi-mode response time. Two wait events are of particular interest for many Oracle workloads: db file sequential read and log file sync. In Oracle 12.1.0.2 the new view V$EVENT_HISTOGRAM_MICRO provides latency details with microsecond buckets, which is very useful when investigating SSDs and low-latency storage in general. A set of simple scripts for near real-time data collection and display of the histograms has been discussed in this post and is available for download.

Wednesday, March 18, 2015

Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap

Topic: This post is about PyLatencyMap v1.2, a tool for collecting and visualizing I/O latency data, and about its integration with SystemTap.

Introduction: When studying storage performance, latency drill-down can be very important. Measuring the average I/O latency is often not enough; latency histograms have proven to be more suitable for investigating modern storage systems. This is because the storage service time can present multiple "modes". A common case is storage systems where I/O requests can be served either by low-latency SSDs or by spindles with much higher service time. For such systems latency histograms provide an additional dimension to the performance investigation.
Latency heat maps come into play when we want to visualize a series of latency histograms over time. This is a 3-D visualization problem and heat maps provide a way to represent the third dimension as color (see as an example Figure 1 later in this article).
The main ideas behind the use of latency histograms for investigating storage performance and their visualization with heat maps have been previously detailed in the excellent paper by Brendan Gregg on ACM Queue and also in blog articles and presentations.

PyLatencyMap is a tool and a set of scripts for gathering and displaying heat maps using the command-line interface (supporting ANSI terminal graphics). The main use for which PyLatencyMap has been developed is to measure and display storage latency data; it can also be used to display heat maps from a generic source of latency data.
The latest update to PyLatencyMap, version 1.2, provides integration with SystemTap and a few example scripts for getting started. SystemTap can be used to collect storage latency data and PyLatencyMap to produce the associated heat maps. Many more data sources have already been integrated with PyLatencyMap and examples are available with the tool: Oracle wait event histograms, DTrace histograms and custom aggregations from trace files.
PyLatencyMap has a modular architecture, which follows the typical use of Unix/Linux CLI tools. Data is generated by a collector script (for example a SystemTap script), then it is piped through an optional filter and finally piped into the visualization engine:

latency data_source | <optional connector script> | python LatencyMap.py <options>

A series of examples and get-started scripts are provided with the tool. In this article we will focus on the scripts based on SystemTap. For a discussion of how to use PyLatencyMap with SQL*plus and DTrace see also this other blog entry. To get started, the following command line runs data collection with SystemTap and visualization with PyLatencyMap:

stap -v SystemTap/blockio_rq_issue_basic_pylatencymap.stp |python SystemTap/systemtap_connector.py |python LatencyMap.py

The script blockio_rq_issue_basic_pylatencymap.stp measures the latency of block I/O operations and can be used to collect latency histograms and display them as heat maps. The script hooks onto the kernel.trace("block_rq_issue") and kernel.trace("block_rq_complete") trace points.
There are three probes in the script: one that gathers information at each block I/O request, another that collects statistics at I/O completion, and a third probe that periodically prints the latency histogram. Note the use of the aggregation operator '<<<' that SystemTap uses to populate the histograms, which are later printed using 'hist_log'.

probe kernel.trace("block_rq_issue") {
        requestTime[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
   t = gettimeofday_us()
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (t-s)
       delete requestTime[$rq]
   }
}

probe timer.sec(3) {
   printf("\n<begin record>\n")
   printf("timestamp, microsec, %d, %s\n",gettimeofday_us(),tz_ctime(gettimeofday_s()))
   printf("label, Latency of block I/O requests measured with SystemTap\n")
   printf("latencyunit, microsec\n")
   printf("datasource, systemtap\n")
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   printf("\n<end record>\n")

}

SystemTap scripts for PyLatencyMap:

For production usage there are more advanced scripts than the one just discussed. The script blockio_rq_issue_pylatencymap.stp provides tracing based on block_rq_issue and block_rq_complete as in the example above, but contains additional filters to remove 'false positives'. For ease of use it is packaged into the launcher script Example9_SystemTap_blockIO_req.sh.

Another example script is Example9b_SystemTap_blockIO_filter_read_8KB.sh, based on blockio_rq_issue_filter_pylatencymap.stp, which provides the possibility to further filter on the block I/O being issued; in particular the default filters are for 8KB reads, which are of interest when testing OLTP applications in Oracle.

Example9c_SystemTap_pread.sh is a script that packages a SystemTap probe on pread calls. These are typically the underlying OS calls for the single-block reads that show up as 'db file sequential read' wait events in Oracle. Note however that Oracle can also do single-block reads using asynchronous I/O, therefore the probes discussed above on the block_io trace points are more general than this one.
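A rough sketch of what such a pread latency probe could look like is shown here (not the packaged script; depending on the SystemTap version the syscall alias may be named pread or pread64):

# Sketch: latency histogram of pread calls for one process, printed every 3 seconds
# Run as: stap -v this_sketch.stp -x <pid>
global start_time, pread_latency

probe syscall.pread {
    if (pid() == target())
        start_time[tid()] = gettimeofday_us()
}

probe syscall.pread.return {
    s = start_time[tid()]
    if (s > 0) {
        pread_latency <<< gettimeofday_us() - s
        delete start_time[tid()]
    }
}

probe timer.sec(3) {
    if (@count(pread_latency) > 0)
        println(@hist_log(pread_latency))
    delete pread_latency
}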

Examples 9e and 9f: the probes in these scripts measure block I/O by hooking onto different probe points and are provided for reference. Note that these scripts require kernel debuginfo symbols to be installed.

oracle_event_latency.stp: this script measures the latency of Oracle wait events and produces histograms by hooking directly into the Oracle executable. More details on how this works can be found at the blog entry "SystemTap into Oracle".

Example:

Figure 1 here below is an example of the output of PyLatencyMap run using the script Example9b_SystemTap_blockIO_filter_read_8KB.sh. The measured workload has been produced with SLOB (at increasing load). The top heat map represents the number of I/O operations per latency bucket as a function of time. On the horizontal axis you can find the time, on the vertical axis the latency buckets. The tool assumes a power-of-two scale. Color is used to represent the number of I/O operations per second. A blue palette is used for the frequency heat map.
The heat map on the lower half of the figure uses a yellow-red palette to represent the amount of time waited in each bucket. I refer to it as an intensity/importance heat map. It highlights the relative weight of latency time spent in each bucket and helps in comparing the impact of a large number of low-latency operations vs. a smaller number of high-latency operations.


Figure 1:  The output of PyLatencyMap contains the latency histogram of the latest measured point and  two heat maps: on top the heat map of the IOPS, on the lower graph the heat map of the waited time.

Note: for SystemTap (and DTrace) data sources, the latency buckets have to be interpreted as follows: a value of N in bucket B means N I/O operations with latency between B and 2*B.
Other latency histogram data sources (notably Oracle) use a different convention: with the same example, a value of N in latency bucket B for Oracle histograms means N waits with latency between B/2 and B. PyLatencyMap takes care of this difference when calculating the time-waited histograms (i.e. the heat map in the lower half of the output).
Note: the scripts discussed here for block I/O tracing use the trace point block_rq_issue; it is also possible to use the trace point block_rq_insert to measure I/O latency from the point of insertion into the queue instead.
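As a sketch of that variant (the only substantial change relative to the script shown above is the probe used to record the start time):

# Sketch: measure block I/O latency from insertion into the queue rather than from issue
global requestTime, latencyTimes

probe kernel.trace("block_rq_insert") {
        requestTime[$rq] = gettimeofday_us()
}

probe kernel.trace("block_rq_complete") {
   s = requestTime[$rq]
   if (s > 0) {
       latencyTimes <<< (gettimeofday_us() - s)
       delete requestTime[$rq]
   }
}

probe timer.sec(3) {
   if (@count(latencyTimes) > 0)
       println(@hist_log(latencyTimes))
   delete latencyTimes
}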

References and download links:

PyLatencyMap can be downloaded from this link or can be forked from Github. Additional information on PyLatencyMap is also available at this blog entry and also at this other blog entry.

Conclusions and acknowledgements

SystemTap is a very powerful tool for measuring I/O and producing latency histograms for storage performance investigations. PyLatencyMap provides an easy way to visualize histograms produced by SystemTap as latency heat maps. PyLatencyMap collects data and displays heat maps using the command line interface on character terminals with ANSI escape codes. This provides the combined benefit of a graphical output and the power of the command line. PyLatencyMap can also be used to visualize heat maps from latency histograms produced with DTrace, Oracle's SQL*plus and/or from latency data aggregated from text and trace files.

Many thanks go to Brendan Gregg for sharing in his blog many awesome ideas and tools on the topic of heat map visualization and on using DTrace and SystemTap that have provided a great source of inspiration for this work.

Note: this entry was originally posted in March 2015, with major updates in July 2015.

Wednesday, February 18, 2015

Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization

Topic: This post is about the latest updates to PerfSheet4 v3.7 (February 2015). PerfSheet4 is a tool aimed at DBAs and Oracle performance analysts. It provides a simplified interface to extract and visualize AWR time series data using Excel pivot charts.


Why PerfSheet4: PerfSheet4 is aimed at querying and displaying time-series data from AWR repository tables. This is a very rich source of information for analyzing database workloads and trends in the context of performance analysis or capacity planning. The tool automates several time-consuming tasks and provides a point-and-click user interface:
  • PerfSheet4 comes with a set of pre-defined queries to extract data from some of the most common AWR tables and in particular to compute deltas for the performance values where relevant.
  • PerfSheet4 can query the database directly or work in an offline mode (you run the SQL script from sqlplus and later load the csv file into PerfSheet4)
  • PerfSheet4 uses Excel Pivot Charts. Pivot Charts provide a powerful interface to display and navigate through AWR data.
  • PerfSheet4 comes with a set of pre-defined graphs for each of the pre-defined queries. This makes preparing some of the most common workload graphs a matter of a few clicks and also provides a good starting point into more advanced usages of the Pivot Charts.

What's new: If you have used previous versions of PerfSheet4, you will find in v3.7 (February 2015 update) a larger collection of pre-defined graphs. This is aimed at making PerfSheet4 easier to use for many of the standard tasks. Typically 4 graphs will be presented for each of the pre-defined queries. Another new feature is the choice of using Oracle ODBC drivers, as opposed to using the default Microsoft ODBC driver for Oracle. This feature has been introduced especially for Windows 8/8.1, where the Microsoft ODBC driver for Oracle does not seem to work correctly.


PerfSheet4 is a free tool and is available for download at this link. The tool has been developed and tested for Oracle 11gR2 and 12cR1. On the client side it has been tested for Excel 2010 and 2013 for Windows (7 and 8/8.1). The figures here below show two examples of the user interface.

Figure 1: The start page and user interface of PerfSheet4. This is where you run the query of interest to extract data from AWR or load csv data into Excel. From this page you can then launch the Pivot Chart plot.


Figure 2: A Pivot Chart generated by PerfSheet4. The pre-defined graph is just a starting point. Customize the chart from there by selecting the data of interest and/or changing any other feature of the graph. 


Additional details: Here below is the list of AWR queries and charts available in PerfSheet4, v3.7 (February 2015). The tool is open source and therefore fully configurable, in case you want to modify its default behavior. There are two versions packaged in the zip file of the tool: the main version is called PerfSheet4_v3.7.xlsm. A second version, called PerfSheet4_v3.7_noActivex_controls.xlsm, avoids the use of ActiveX controls and should be better suited for those cases where ActiveX controls don't work well and/or for older versions of Excel.

Query: Workload data

  • Extracts data from dba_hist_sysmetric_summary. System metrics provide details on the usage of system resources over time.
  • Pre-defined charts: "Average active sessions", "CPU usage and database time", "Physical reads and writes IOPS", "Physical reads and writes throughput".

Query: System statistics  

  • Extracts data from dba_hist_sysstat. Hundreds of system statistics provide valuable details on the workload. The query computes delta values between snapshots and rates (that is delta values divided over delta time).
  • Pre-defined charts: "Physical read and write IOPS", "Physical read and write throughput", "CPU and wait time", "CPU used per instance".

Query: Wait events  

  • Extracts data from dba_hist_system_event. This provides information on all the wait events. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read - wait event for single block reads", "db file scattered read - wait event for multiblock reads into the buffer cache", "log file sync - wait event for commit time".

Query: Wait events per class  

  • Extracts data from dba_hist_system_event.  This provides information on all the wait events, computes delta values and rates (delta value over delta time) aggregating over wait class.
  • Pre-defined chart: "Wait time grouped per class".

Query: IO stats details  

  • Extracts data from dba_hist_iostat_detail. This provides details on the I/O operations performed by Oracle, computes delta values and rates (delta value over delta time) aggregating over function and file type.
  • Pre-defined charts: "Small read IOPS", "Large read throughput", "Small write IOPS", "Large write throughput".

Query: IO wait histograms  

  • Extracts data from dba_hist_event_histogram for I/O related events. This provides drill-down information on latency. The query computes delta values between snapshots and rates (i.e. delta values divided over delta time).
  • Pre-defined charts: "db file sequential read latency histogram - number of wait events", "db file sequential read latency histogram - time waited (millisec/sec)", "commit time wait latency histogram - number of wait events per sec", "commit time wait latency histogram - time waited (millisec/sec)".

Query: Stats per service  

  • Extracts data from dba_hist_service_stat, computes delta values between snapshots and rates (i.e. delta values divided over delta time) aggregating by service name.
  • Pre-defined charts: "DB time", "CPU time", "User I/O wait time".


Query: Top 3 wait events and CPU  

  • Extracts data from dba_hist_system_event and dba_hist_sysstat. This query selects the top 3 non-idle wait events and the CPU usage for each instance. It computes delta values and rates (delta value over delta time).
  • Pre-defined charts: "Instance N.1 - top 3 wait and CPU", "Instance N.2 - top 3 wait and CPU", "All instances - top 3 wait and CPU".


Previous work and acknowledgments: PerfSheet4 is a spin off of previous original work by Tanel Poder, rewritten and integrated with additional functionality for AWR analysis and with important changes to the user interface with the goal of making it more 'point-and-click'. A previous post on PerfSheet4 can be found at this link. Many thanks go to Tanel for his original work on PerfSheet and for his support to this development.


Getting started video: http://youtu.be/sdvx4zB-fvo



Wednesday, December 3, 2014

Talks at UKOUG TECH 2014 with CERN Speakers

This is a short blog post with a list of the talks at the UKOUG TECH 2014 conference in Liverpool with CERN speakers. Come and say hi, we hope to see many of you there!


(*) slides can be downloaded from: http://cern.ch/canali

Tuesday, November 4, 2014

Life of an Oracle I/O: Tracing Logical and Physical I/O with Systemtap

Topic: This post is about tracing logical and physical reads in Oracle using systemtap. You will find a few examples illustrating the main mechanisms used by Oracle to do physical and logical I/O and you will learn how to build systemtap scripts to further explore and troubleshoot Oracle I/O.

Introduction: Performing input/output to and from the storage media, i.e. physical I/O, is one of the critical tasks underlying all database engines. When accessing data stored in RAM in the database cache we can talk of logical I/O. Physical I/O is performed by the database using system calls to the O/S kernel; the type of system calls used depends on database settings, the O/S and the type of storage. Logical I/O is performed by the database processes using Oracle kernel functions. Oracle instruments the timing of physical I/O calls with the wait event interface, while logical I/O time is generically accounted for as CPU time. Tracing the I/O operations performed by Oracle processes, together with the details provided by the Oracle wait event interface, is a powerful technique to investigate Oracle I/O internals. With the help of a few real-life examples we will investigate which mechanisms Oracle uses to access storage, how the I/O operations correlate with the wait event interface and what the wait event time means for the most common wait events associated with I/O.

Building the tracing tool: Investigations of what Oracle and the O/S do when an I/O operation is performed require specialized tools. Systemtap is a dynamic tracing tool specific to Linux that allows tracing of both the O/S and the Oracle functions (Linux kernel functions, system calls and Oracle database kernel functions). In the following you will read a description of a systemtap script that can be used to trace Oracle logical and physical I/O: it is composed of probes that can attach to the function calls of interest and can read function parameters, memory and CPU registers.
Systemtap requires relatively recent kernels to be able to probe userspace: either the utrace or uprobe_events functionality needs to be available. This is the case with RHEL/OEL 6.5 and higher but not with RHEL 5. Moreover, for tracing system calls the kernel debuginfo package needs to be installed.
More details on this and on how to build a lab environment can be found in this article. The script that we will use in this blog post to trace Oracle logical and physical I/O and the wait event interface is trace_oracle_logicalio_wait_events_physicalio_12102.stp (for Oracle 12.1.0.2). A version for 11.2.0.4 is also available: trace_oracle_logicalio_wait_events_physicalio_11204.stp. Before moving to the examples and studies of Oracle I/O, I would like to review the three main parts of the script: tracing physical I/O, logical I/O and wait events.

Tracing physical I/O: Oracle will ultimately issue system calls to execute I/O operations. The type of calls depends on database settings, the O/S type and the type of storage. In this article we will restrict our examples to Oracle on Linux using ASM storage on block devices. Notably, direct NFS, local storage and ASM with asmlib will be left out and possibly addressed in future studies. System calls can be traced using the strace(1) utility in Linux too; however, in this article I use systemtap probes to gather information on the system calls of interest: pread, pwrite, io_submit and io_getevents. Additional useful systemtap probes for the underlying block I/O layer are: ioblock.request, ioblock_trace.request and ioblock.end (note, these are not system calls, but rather tracepoints for function calls inside the Linux kernel).

The probes used and their selected parameters are:

  • syscall.pread / syscall.pwrite — synchronous I/O: read/write from/to a file descriptor. Selected parameters: fd (file descriptor number), offset, count; return value: bytes read/written.
  • syscall.io_submit — asynchronous I/O: submit blocks for processing. Selected parameters: nr (number of I/Os); for each I/O: file descriptor, offset, bytes, opcode (0 for read, 1 for write); return value: number of I/O operations submitted.
  • syscall.io_getevents — asynchronous I/O: read events from the completion queue. Selected parameters: min_nr (minimum number of I/Os to reap), timeout for event waiting; return value: number of I/Os reaped; for each I/O: file descriptor, offset, bytes.
  • ioblock.request, ioblock_trace.request — I/O requests sent to the block device interface kernel layer. Selected parameters: devname, sector, size, rw, block_io structure address.
  • ioblock.end — return from the block device interface kernel layer. Selected parameters: devname, sector, rw, block_io structure address.
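As an illustration of how these probe points can be used together, here is a minimal sketch (not the full tracing script); probe and variable names follow the SystemTap syscall and ioblock tapsets and may vary slightly between SystemTap versions (e.g. the pread/pwrite aliases may be named pread64/pwrite64):

# Sketch: print selected parameters of the physical I/O calls of a traced process
# Run as: stap -v this_sketch.stp -x <pid> (kernel debuginfo needed for the ioblock probes)
probe syscall.pread, syscall.pwrite {
    if (pid() != target()) next
    printf("%s: fd=%d, offset=%d, count=%d\n", name, fd, offset, count)
}

probe syscall.io_submit {
    if (pid() != target()) next
    printf("io_submit: nr=%d\n", nr)
}

probe syscall.io_getevents.return {
    if (pid() != target()) next
    printf("io_getevents -> %s\n", retstr)   # number of I/Os reaped
}

# block I/O layer tracepoints (not filtered by pid: completion may run in a different context)
probe ioblock_trace.request {
    printf("ioblock request: devname=%s, sector=%d, size=%d, rw=%d\n", devname, sector, size, rw)
}

probe ioblock.end {
    printf("ioblock end: devname=%s, sector=%d, rw=%d\n", devname, sector, rw)
}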


Tracing logical I/O: This is about providing the history and details of the logical I/O operations, which is when Oracle reads a data block from memory (buffer cache). Logical I/O is a super-set of physical I/O: if Oracle does not find data in memory (buffer cache) it will issue a physical read. You can find a few examples of this mechanism in action later in this post.
Oracle instrumentation provides extensive information on logical reads in several V$ views, such as V$SYSSTAT, V$SESSTAT and V$SYSMETRIC. Logical read data is also available from SQL*plus with "set autotrace on" and with tkprof reports of 10046 traces. Logical reads come in two main "flavors": consistent reads and current reads. Oracle will use consistent reads when reading data at a given system change number (SCN); current reads are used when performing DML operations on the block. Consistent reads can be traced in Oracle using the events 10200 and 10201. More details can be found in the presentation by Julian Dyke on logical I/O.
In this article we will trace logical I/O by attaching systemtap probes to a set of Oracle kernel functions. The role of the functions kcbgtcr and kcbgcur for tracing consistent and current reads has already been discussed by Alexander Anokhin in his excellent blog articles on DTrace LIO. More recently I have investigated this topic and found it useful to also trace the functions kcbzib, kcbzgb and kcbzvb.
A summary of the main findings on these functions is reported below (a minimal example probe is sketched after the list). Note: the expressions %rdi, %rsi, %r8 indicate the values of the corresponding CPU registers. With systemtap the register value can be read with the function register(), for example register("rdi"). The CPU registers are used to extract the function call arguments, as systemtap does not provide the function call arguments when there are no debug symbols available for the executable. Notably, the calling conventions for x86-64, as documented in www.x86-64.org/documentation/abi.pdf, state that the integer parameters of a function call are passed in the following registers (in order): %rdi, %rsi, %rdx, %rcx, %r8 and %r9.


  • kcbgtcr — Kernel Cache Buffers Get Consistent Read, used for consistent reads. Selected parameters: tbs#=user_int32(%rdi), rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF, block#=user_int32(%rdi+4) & 0x003FFFFF, data_object_id#=user_int32(%rdi+8), object_id#=user_int32(%rdi+12). Note: for bigfile tablespaces block#=user_int32(%rdi+4).
  • kcbgcur — Kernel Cache Buffers Current, used for current reads. Selected parameters: tbs#=user_int32(%rdi), rel file n#=user_int32(%rdi+4) >> 22 & 0x003FFFFF, block#=user_int32(%rdi+4) & 0x003FFFFF, data_object_id#=user_int32(%rdi+8), object_id#=user_int32(%rdi+12). Note: for bigfile tablespaces block#=user_int32(%rdi+4).
  • kcbzib — kcbZIB should stand for: Z (kcbz.o is a module for physical I/O helper functions), IB: Input Buffer. Oracle will perform physical read(s) into the buffer cache.
  • kcbzgb — the suffix GB in kcbZGB should stand for: Get (space for) Buffer. Oracle allocates space in the buffer cache for a given block (typically before I/O operations). Selected parameters: tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF, block#=%rdx & 0x003FFFFF, data_object_id#=%r8, object_id#=%r9. Note: for bigfile tablespaces block#=%rdx.
  • kcbzvb — invoked after Oracle has read a given block. Note: this function is used both for reads into the buffer cache and for direct reads. Selected parameters: tbs n#=%rsi, rel file n#=%rdx >> 22 & 0x003FFFFF, block#=%rdx & 0x003FFFFF, data_object_id=user_int32(%rdi+24). Note: for bigfile tablespaces block#=%rdx.
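Putting this together, a minimal probe on kcbgtcr could look like the following sketch. Note that /path/to/oracle is a placeholder for the full path of the Oracle executable ($ORACLE_HOME/bin/oracle), and the decoding shown assumes smallfile tablespaces:

# Sketch: decode the kcbgtcr (consistent read) arguments from CPU registers
# Attach to the session of interest with: stap -v this_sketch.stp -x <pid>
probe process("/path/to/oracle").function("kcbgtcr") {
    if (pid() != target()) next
    tbs   = user_int32(register("rdi"))        # first argument passed in rdi (x86-64 ABI)
    rdba  = user_int32(register("rdi") + 4)    # relative data block address
    file  = (rdba >> 22) & 0x003FFFFF          # relative file number (smallfile tablespaces)
    block = rdba & 0x003FFFFF                  # block number (smallfile tablespaces)
    obj   = user_int32(register("rdi") + 8)
    printf("kcbgtcr: tbs#=%d, rel file n#=%d, block#=%d, data_object_id#=%d\n", tbs, file, block, obj)
}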

Note: as I write this, only uprobes for kernel versions 3.10 or higher support return probes for userspace functions. Therefore this functionality is not available with RHEL or OEL 6, but it is available with the RHEL 7.0 kernel.

Tracing wait events: The wait event interface is probably one of the best features in Oracle performance tuning as it provides a quantitative method to identify performance issues and bottlenecks. Cary Millsap has published inspiring work on the subject.
For our purposes it suffices to say that wait events are instrumentation points in the Oracle kernel where selected actions are timed. In particular Oracle times the start and end calls of most physical I/O operations, therefore providing a good way to understand how much time is spent during physical I/O (some limitations of this mechanism will be discussed later in this post). There are also limitations to using the wait event interface to study I/O latency; as you will see in the examples section, there are some interesting challenges related to timing asynchronous I/O events.
Wait event tracing is typically activated in Oracle using the 10046 event or by calling DBMS_MONITOR.SESSION_TRACE_ENABLE. In this article we will get the wait event details by attaching systemtap probes to the relevant Oracle kernel functions, in particular kskthbwt and kskthewt, to gather information respectively at the start and at the end of each wait event. It's important to realize that enabling wait event tracing using event 10046 at level 8 (or other methods) is a way to externalize the wait event timing information for analysis (as well as other SQL processing details), not a way to enable wait timing itself: that is controlled by the database parameter TIMED_STATISTICS, which is, and should be, set to TRUE in modern Oracle databases. One key piece of information that the Oracle kernel functions kskthbwt and kskthewt give us is a pointer into the X$KSUSE fixed table (or rather to the underlying segmented array in the SGA), which is the table underlying V$SESSION and therefore opens the road to correlating the wait event with much useful information on the session executing the SQL. Resolving the pointer in register r13 to the base value of X$KSUSE and computing the offsets for the fields of interest requires additional work; this is described in a previous article (see also the script ksuse_find_offsets.sql).

The relevant functions and the selected parameters are:

  • kskthbwt — Kernel service Kompile thread begin wait. This function is called at the start of an Oracle wait event; kslwtbctx is its parent function call and marks the start of a wait event. Useful information when probing the function: register r13 -> points into the X$KSUSE (V$SESSION) SGA segmented array; register rsi -> timestamp of the beginning of the wait (in microseconds); register rdx -> wait event number.
  • kskthewt — Kernel service Kompile thread end wait. This function is called at the end of an Oracle wait event; kslwtectx is its parent function call, marking the end of a wait event. The suffix "ewt" most likely stands for "end wait". Useful information when probing the function: register r13 -> points into the X$KSUSE (V$SESSION) SGA segmented array; register rdi -> timestamp of the beginning of the wait (in microseconds); register rsi -> wait event number.
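As a minimal illustration (a sketch, not the full script; /path/to/oracle is again a placeholder for $ORACLE_HOME/bin/oracle, and the wait event numbers can be translated to names using v$event_name or the eventsname.sed file discussed below):

# Sketch: print begin/end markers of Oracle wait events by probing kskthbwt and kskthewt
probe process("/path/to/oracle").function("kskthbwt") {
    if (pid() != target()) next
    # wait event number in rdx, begin-of-wait timestamp (microseconds) in rsi
    printf("-> begin wait: event#=%d, timestamp=%d\n", register("rdx"), register("rsi"))
}

probe process("/path/to/oracle").function("kskthewt") {
    if (pid() != target()) next
    # wait event number in rsi, begin-of-wait timestamp (microseconds) in rdi
    printf("<- end wait:   event#=%d, begin timestamp=%d\n", register("rsi"), register("rdi"))
}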


Running the script:
Below you can find a few examples of traces for specific real-life Oracle I/O access patterns. The traces have been obtained by running stap -v trace_oracle_logicalio_wait_events_physicalio_12102.stp -x <pid>, where <pid> is the process id of the session being traced. The output is piped into sed -f eventsname.sed, which takes care of translating the event numbers into the associated event names. The file eventsname.sed is generated by running the script eventsname.sql. It's important to re-run the script to generate the event names before testing, because the wait event numbers can change without warning, for example after migrations to newer versions, but potentially also after applying a patch. The system used to collect the traces in the following examples is Oracle 12.1.0.2 running on OEL 7.0 with UEK3 under VirtualBox.


Example 1: single-block random reads

This example is about single-block (random) reads. This access path to data is very important for many OLTP workloads as it is used for table access via an index key. In Figure 1 here below you can see a snippet of a trace file of a query of the type: select <col_list> from table <tab> where <indexed_col>=<val>.

Figure 1: Oracle single-block random read and db file sequential read wait event

The key points are:
  • Oracle issues a logical read on one of the blocks needed to retrieve data for the SQL of this example
    • additional details are provided in the logical read call where we can see that Oracle wants to read block#=2505675 from tbs#=7 and belonging to obj#=32174 (which is the table segment in this example)
  • The block is not found in the cache, so a new block buffer is allocated and a physical read is initiated
  • The database marks the start of a wait event for 'db file sequential read'
  • The physical read request for a block becomes a pread call to the O/S
    • The IO request enters the kernel via a system call, any function encountered after the system call is a kernel function (such as ioblock.request), until the system call returns, and thus gets back to userland, which means the Oracle executable in this case.
  • This read call is passed on to the block I/O interface to read 8192 bytes (1 Oracle block) from the block device (/dev/sdl in the example of Figure 1)
  • The block interface and then the OS call return the read data to the Oracle kernel
  • The Oracle wait interface marks the end of the wait and computes the elapsed time
    • the wait event is db file sequential read and the parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks read (1 block).
    • From the timestamp values in the trace you can see that the duration of the db file sequential read wait is mostly accounted for by the time it took to do the pread call, with a small overhead (note also that the overhead could have been more significant if we had high CPU utilization on the DB server, for example because of workload from other Oracle sessions)
  • We can conclude that in this example the measured wait time for db file sequential read corresponds to physical read latency.

Example 2: sequential I/O into the buffer cache with synchronous I/O

This example is about multi-block reads into the cache. This is an access path that is used, for example, when Oracle performs a full table scan. There are multiple ways in which Oracle can access the storage to perform full scans; in this example you will see the case where Oracle uses synchronous I/O (i.e. the pread call) to perform O/S reads; examples 3 and 4 will cover sequential reads with asynchronous I/O. In Figure 2 here below you can see a snippet from tracing the workload of a query of the type: select <col_list> from table <tab>, where a full scan of <tab> is performed by Oracle.

Figure 2: Oracle sequential I/O and pread O/S calls

The key points of the trace in Figure 2:
  • Oracle issues a consistent read for tbs#=7 block#=2621443 and finds it in the cache
  • Oracle issues a consistent read for the next block and does not find it in the cache, so it prepares to read a list of blocks from storage in one physical read. The list is 124 blocks long and has been shortened in this example
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests become a single pread call to the OS for 1015808 bytes (124 blocks)
  • This request moves down the chain to the block I/O interface, which splits it into 2 reads from the underlying block device, as on the system used for the tests the maximum size of an I/O is 524288 bytes.
    • As investigated by Frits Hoogland: the submitted I/O request lands in the IO scheduler scattered/gather list, where it is broken down to the IO size as advertised by the block device.
  • The I/Os are completed at the block I/O interface
  • The pread call returns
  • The wait event interface times the end of the wait
    • The wait event is db file scattered read and the parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • From the timestamp values in the trace you can see that the duration of the db file scattered read wait is mostly accounted for by the time it took to do the pread call, with a small overhead 
  • We can conclude that in this example the wait time for db file scattered read corresponds to I/O latency.
  • From the calls to kcbzvb we get additional confirmation of the blocks that have been read
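The arithmetic of the split described above is: 124 blocks x 8192 bytes = 1015808 bytes, broken into one request of 524288 bytes plus one of 491520 bytes. The split can be observed directly at the block I/O layer, for example with the minimal sketch below (kernel debuginfo is needed for the ioblock tapset, and the filter on target() assumes stap is started with -x <oracle_pid>):

#!/usr/bin/stap
# Minimal sketch: print the size of each request submitted to the block I/O
# layer by the traced process, to observe how a large pread is split according
# to the maximum I/O size of the underlying block device.
# Run as: stap -x <oracle_server_pid> blockio_sizes.stp

probe ioblock.request {
    if (pid() == target())
        printf("ioblock request: device=%s, sector=%d, size=%d bytes\n",
               devname, sector, size)
}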

Example 3: sequential I/O into the buffer cache with asynchronous I/O O/S calls

In Figure 3 here below you can see a snippet from the trace of the SQL: select <col_list> from table <tab>, where Oracle performs a full scan of table <tab>. This is the same query that was used for Example 2 above; the difference is that the trace in Figure 3 has been taken at a later time in the execution of the query, when the Oracle process switched from using synchronous I/O with pread calls to using the asynchronous I/O interface. Asynchronous I/O is implemented in Linux with the io_submit and io_getevents calls, used to send the I/O requests and then to reap the results. Although Oracle uses the asynchronous interface to perform the read operations in this example, the final result is very similar to what is described in Example 2 (i.e. with synchronous I/O): the Oracle wait event is the same, "db file scattered read", the reads are also performed into the buffer cache, and the I/Os are reaped with blocking calls, therefore making the process effectively synchronous. A more aggressive use of the asynchronous I/O interface is found with Oracle direct reads, which are discussed in Example 4 below.

Figure 3: Oracle performing sequential I/O reads into the buffer cache using asynchronous I/O calls

The key points are:
  • Oracle issues a consistent read for tbs#=7 block#=2522760, does not find it in the cache and therefore prepares to read a list of blocks into the buffer cache. The list is 128 blocks long and has been shortened in this example
  • The database marks the start of a wait event for "db file scattered read"
  • The physical read requests are passed on to the OS as a call to io_submit with 2 requests for a total of 1 MB (that is 128 blocks): the first request is for a read (opcode 0) of 65536 bytes, the second for a read of 983040 bytes (that is 1 MB - 64 KB)
  • These requests are passed to the block I/O interface which splits the largest request into 2 reads from the underlying block device (/dev/sdi in this example). The read of 64KB is passed on unchanged in size to /dev/sdr
  • Oracle calls io_getevents to reap the results of the asynchronous I/O requests that have just been submitted and waits. The timeout is set to 600 seconds and 0 nanoseconds (tv_sec=600, tv_nsec=0), which indicates this is a blocking wait for at least min_nr=2 I/Os to finish (a minimal sketch to follow these io_submit and io_getevents calls is reported after this list).
  • The block device interface returns the data upstream.
  • The io_getevents call returns the 2 I/O reads.
  • The wait event interface marks the end of the wait. The wait event in this case is db file scattered read. 
    • The wait event parameters are: p1 is the file number, p2 is the block number, p3 is the number of blocks.
    • The wait event duration is basically the time it took to do the io_submit and io_getevents calls plus a small overhead.
    • Note that although Oracle has used the asynchronous interface, it has also waited for all the I/Os to complete therefore making the I/O essentially synchronous
  • The wait time for db file scattered read in this example is not a precise measure of the I/O latency, as two reads on two different devices were submitted in parallel and the total time was reported as if it were a single read.
  • From the calls to kcbzvb we get additional confirmation of the blocks that have been read.
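Here below is a minimal sketch to follow this sequence of io_submit and io_getevents calls for one Oracle process. The convenience variables nr and min_nr are an assumption based on the syscall tapset shipped with recent SystemTap versions; with kernel debuginfo installed the raw DWARF variables (e.g. $nr, $min_nr) can be used instead.

#!/usr/bin/stap
# Minimal sketch: follow the io_submit / io_getevents calls of one Oracle
# process. Run as: stap -x <oracle_server_pid> asyncio_calls.stp

probe syscall.io_submit {
    if (pid() == target())
        printf("%d: io_submit, nr_requests=%d\n", gettimeofday_us(), nr)
}

probe syscall.io_getevents {
    if (pid() == target())
        printf("%d: io_getevents called, min_nr=%d\n", gettimeofday_us(), min_nr)
}

probe syscall.io_getevents.return {
    if (pid() == target())
        printf("%d: io_getevents returned %d I/Os\n",
               gettimeofday_us(), returnval())
}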

Example 4: full scans with direct reads and asynchronous I/O

In this paragraph you will see two examples of Oracle I/O for the case of direct reads against ASM. The same query that was used in Examples 2 and 3 above will be used here too: it is a query that performs a full table scan. In this example, differently from Examples 2 and 3, Oracle uses serial direct reads to access data. Serial direct read is a feature introduced in 11gR2 for optimizing full segment scan operations by bypassing the buffer cache layer, similarly to what happens with parallel queries. Oracle chooses when to use serial direct reads vs. cached reads based on a few parameters. As a first approximation, direct reads are preferred when the table is "big" compared to the size of the buffer cache; the actual mechanism is more complex and can potentially change between Oracle versions. You can read more on this topic in this blog post by Tanel Poder.
For the particular setup I used to run these tests I had to force the use of serial direct reads before running the query, which was done with: alter session set "_serial_direct_read"=always;
Note that alter session set "_serial_direct_read"=never; can be used to disable serial direct reads and force Oracle to use cached reads (that is the type of I/O you have seen in Examples 2 and 3 above). The default value for the underscore parameter "_serial_direct_read" is "auto".
For the investigations in this example I will use a slightly different SystemTap script from the one used in the previous examples. I have removed the tracing of logical I/O and block I/O to reduce clutter and added a new probe on the userspace function io_getevents_0_4 of libaio (libaio is the Linux library that implements the asynchronous I/O interface). This is because serial direct reads use non-blocking calls to reap I/Os and these calls are best traced at the libaio interface, for reasons described by Frits Hoogland in this blog post. The SystemTap script that I will use for Figures 4a and 4b is: trace_oracle_wait_events_asyncio_libaio_12102.stp (a version for 11.2.0.4 is available as trace_oracle_wait_events_asyncio_libaio_11204.stp).
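Here below is a minimal sketch of such a userspace probe on libaio, reduced to its essence (the full logic is in the scripts linked above). The library path /usr/lib64/libaio.so.1 is an assumption and should be adjusted to match your system; probing by symbol name does not require libaio debuginfo, and returnval() reads the return value register.

#!/usr/bin/stap
# Minimal sketch: fire on calls to the libaio function io_getevents_0_4, used
# by Oracle for non-blocking reaps of asynchronous I/O during direct reads.
# The path to libaio.so.1 is an assumption, adjust it for your system.
# Run as: stap -x <oracle_server_pid> libaio_getevents.stp

probe process("/usr/lib64/libaio.so.1").function("io_getevents_0_4") {
    if (pid() == target())
        printf("%d: io_getevents_0_4 (libaio) called\n", gettimeofday_us())
}

probe process("/usr/lib64/libaio.so.1").function("io_getevents_0_4").return {
    if (pid() == target())
        printf("%d: io_getevents_0_4 (libaio) returned %d I/Os\n",
               gettimeofday_us(), returnval())
}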

Figure 4a: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O, this is what happens in the first phases of the full scan

The key points of Figure 4a are:
  • Oracle requests I/O in chunks of contiguous storage space of 1MB. This is to reduce the overhead of seek time and potentially make use of readahead when available. 
    • The value of 1 MB for the read request comes from the parameter db_file_multiblock_read_count=128 multiplied by the block size of 8 KB (128 x 8192 bytes = 1048576 bytes).
  • Oracle submits one I/O of 1MB with a call to io_submit and then calls io_getevents_0_4 in the libaio library in non-blocking mode to try to reap the I/O. This happens twice, with no results. Oracle then submits another I/O of 1MB and tries 4 times to reap I/O in non-blocking mode, with no results (the I/O operations have not yet finished).
  • The wait event timing starts just before the calls to io_getevents, which are used to reap the I/O results this time in blocking mode (with a timeout of 600 seconds)
    • The wait event in this case is: "direct path read"
    • Oracle reaps the 2 I/Os in two subsequent calls to io_getevents
  • After reaping the I/Os Oracle marks the end of wait for the event. 
  • The result is that 2 reads of 1 MB each have been performed 
  • The wait event interface does not report the correct amount of I/O performed:
    • the "direct path read" wait in this example reports one read of 128 blocks (1 MB), while 2 MB were read
  • The elapsed time for the wait event "direct path read" does not measure the I/O latency
    • Oracle is only timing the blocking calls to io_getevents, which start after the I/Os are submitted. 
    • Oracle is timing 2 reads together.
The use of serial direct reads in Oracle is more complex than the simple example of Figure 4a. Frits Hoogland has discussed the internals of direct reads in Oracle in more detail in this presentation. In particular Oracle uses an adaptive mechanism for direct reads with the goal of driving the storage system to high throughput when possible. This is implemented by increasing the number of outstanding I/Os when calling the asynchronous I/O interface. Internally this makes use of I/O slots. Note that from a comment on Oracle diagnostic event 10353 we have additional indirect information on this topic: "Slots are a unit of I/O and this factor controls the number of outstanding I/Os".
Another finding of Frits' is that there are I/O operations performed for direct reads that are not timed by the wait event interface. This happens because Oracle can issue io_getevents with timeout=0, that is reaping the I/Os without having to block. You can see these mechanisms in action in the trace of Figure 4b, which is for the same full scan as reported above in Figure 4a, just at a later time when the mechanism of adaptive serial direct reads has tuned Oracle I/O requests to be "more aggressive" in the use of storage resources.
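A simple way to separate the blocking reaps (which Oracle times) from the non-blocking ones (timeout set to 0, not timed) is to look at the timeout argument of io_getevents, as in the minimal sketch below. The tapset variables min_nr and timeout_uaddr are an assumption based on recent SystemTap versions, and tv_sec is read as the first field of the user-space timespec structure.

#!/usr/bin/stap
# Minimal sketch: classify io_getevents calls by their timeout value, to
# separate non-blocking reaps (timeout 0) from blocking waits.
# Run as: stap -x <oracle_server_pid> getevents_timeout.stp

probe syscall.io_getevents {
    if (pid() == target()) {
        # tv_sec is the first field of struct timespec; a NULL timeout pointer
        # means "block until min_nr events are available"
        tv_sec = timeout_uaddr ? user_long(timeout_uaddr) : -1
        printf("io_getevents: min_nr=%d, timeout tv_sec=%d -> %s\n",
               min_nr, tv_sec,
               tv_sec == 0 ? "non-blocking reap" : "blocking wait")
    }
}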

Figure 4b: Trace of the wait event and I/O calls of Oracle during a full table scan with serial direct read and asynchronous I/O, this is an example of what happens when the adaptive mechanism increases the number of outstanding I/Os for optimizing the throughput

The key points of Figure 4b are:
  • This is the same full scan as shown in Figure 4a, the difference being that the trace in Figure 4b has been taken at a later time in the full scan operation (the table has 1 million blocks), when adaptive serial direct reads have increased the number of I/O slots (outstanding asynchronous I/O operations).
  • You can see that several io_submit operations are performed before the wait event timing is started. 
  • Oracle is trying to keep several outstanding I/Os in the queue: up to 9 outstanding asynchronous I/O operations can be seen in Figure 4b, while only 2 were present in the simpler case of Figure 4a, that is at the beginning of the full scan.
  • Several attempts to reap I/Os with io_getevents operations are run by the Oracle process before the wait event timing is started
    • 2 reads of 1MB each are reaped by one io_getevents call before the wait event timing has started, therefore these reads will not be accounted for in the wait event interface.
    • It is important to note that the timeout for this particular io_getevents call (as well as for the io_getevents_0_4 calls in libaio which have not returned any I/O) is set to 0 by the Oracle process, which means that Oracle does not wait for I/O results to become available: it only reaps them if they are already available.
  • Between the moment at which Oracle starts timing the wait event "direct path read" and the end of the wait, 3 io_getevents operations are performed (with the timeout set to 600 seconds, therefore blocking until I/O reads are available) and a total of 9 reads of 1 MB each are reaped
    • The wait event interface does not report all the I/O that has been performed:
      • In the wait event parameters we can only find trace that 1 MB has been read (128 blocks), while in reality 9 MB have been read.
      • This is in addition to the 2 MB read outside of the wait event instrumentation.
    • The elapsed time for "direct path read" does not accurately reflect I/O latency but rather the wait time for reaping I/Os, as seen in Figure 4b.

Example 5: asynchronous I/O for random reads

Oracle will optimize single-block reads when it can do prefetching and/or batching of I/O. This is an optimization for random I/O that uses the asynchronous I/O interface: Oracle can group multiple I/O requests in a single io_submit call, as opposed to sending them one after the other as in the case of Example 1, where the O/S call pread was used. The example in Figure 5 comes from running Kevin Closson's SLOB on a 12c database. In the execution plan note step 3, where Oracle uses the operation "TABLE ACCESS BY INDEX ROWID BATCHED" to access table data.

Figure 5: execution plan for a SLOB query illustrating batch table access

Here below in Figure 6 you can see a snippet of the Oracle I/O tracing when running the SLOB workload. You can see that the asynchronous I/O interface is being used to perform random I/O. In the case of Figure 6, 126 requests are submitted in a single io_submit operation and subsequently reaped by io_getevents, which waits for all the requests to finish. The wait event in this case is "db file parallel read" and the calculated wait time is the time for the batch of 126 requests to finish. Therefore the wait time of the db file parallel read event does not provide a measurement of the I/O latency of single-block random reads (a probe sketch to quantify this batching is reported after Figure 6).

Figure 6: Oracle performing I/O for random reads with asynchronous I/Os, for the case of batched reads. The trace has been edited replacing a long list of similar trace lines (typically 126 lines only varying for the block#) with "..."
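A simple way to quantify this batching behavior is to histogram the number of requests accepted per io_submit call and the number of completions returned per io_getevents call, as in the minimal sketch below (again filtering on one process started with stap -x <pid>; this is an illustration, not one of the scripts used for the figures).

#!/usr/bin/stap
# Minimal sketch: histogram the batching of asynchronous I/O, i.e. how many
# iocbs are accepted per io_submit and how many completions are returned per
# io_getevents. Run as: stap -x <oracle_server_pid> asyncio_batching.stp

global submitted, reaped

probe syscall.io_submit.return {
    if (pid() == target() && returnval() > 0)
        submitted <<< returnval()    # number of iocbs accepted by the kernel
}

probe syscall.io_getevents.return {
    if (pid() == target() && returnval() > 0)
        reaped <<< returnval()       # number of completed I/Os reaped
}

probe end {
    if (@count(submitted)) {
        printf("requests per io_submit:\n")
        print(@hist_log(submitted))
    }
    if (@count(reaped)) {
        printf("completions per io_getevents:\n")
        print(@hist_log(reaped))
    }
}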

Example 6: tracing DML

In Figure 7 you can see a snippet of tracing Oracle logical I/O and wait events when inserting a row into a table followed by a commit. No physical I/O was performed in this operation (the table had been previously cached). You can see that most of the block reads have been done in current read mode, as expected for DML. Moreover the blocks involved belong both to the table being inserted into (you can identify them in Figure 7 as belonging to tbs#=11) and to rollback segments (you can identify them in Figure 7 as belonging to tbs#=2). Note that for simplicity the table in this example has no indexes.

Figure 7: Logical I/O trace of Oracle performing an insert and commit operation

Example 7: tracing log writer

In Figure 8 you can see the trace of the log writer during a commit operation. This is the same commit operation traced in Example 6 (see also Figure 7 above). The wait event is "log file parallel write" and its duration is timed by Oracle from the beginning to the end of the asynchronous I/O operations performed. Therefore the measured wait time of "log file parallel write" is not an accurate measure of block write latency, as its duration depends on the number of operations performed. In the case of Figure 8 you can see that log writer writes to two distinct block devices, which is consistent with the use of 1 logfile member per redo log group allocated on an ASM diskgroup with normal redundancy: log writer sends 2 write I/O requests simultaneously in order to write the primary and mirror extents of the logfile and waits for both of them to finish (a probe sketch to observe these writes at the block I/O layer is reported after Figure 8).

Figure 8: log writer tracing during a commit operation
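The write requests issued by log writer can also be followed at the block I/O layer, for example with the minimal sketch below (run as stap -x <lgwr_pid>; the bio_rw_num helper and the BIO_WRITE constant come from the ioblock tapset, which requires kernel debuginfo). This is an illustration of the idea, not the script used for Figure 8.

#!/usr/bin/stap
# Minimal sketch: print the write requests that the traced process (e.g. the
# log writer) submits to the block I/O layer, showing which block devices
# receive the redo writes. Run as: stap -x <lgwr_pid> lgwr_writes.stp

probe ioblock.request {
    if (pid() == target() && bio_rw_num(rw) == BIO_WRITE)
        printf("%d: write request, device=%s, sector=%d, size=%d bytes\n",
               gettimeofday_us(), devname, sector, size)
}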

Conclusions
Dynamic tracing, and in particular SystemTap, provides the opportunity to investigate Oracle physical and logical I/O and opens the way to drill-down investigations of Oracle's I/O across the various layers of the Oracle engine down to the Linux kernel. In this post you have seen methods, tools and examples of how these investigations can be performed.
The examples from real-world Oracle workloads highlight some of the most common cases of Oracle logical and physical I/O and the connections between the wait event interface and the O/S calls issued by Oracle processes. The Oracle wait event interface can provide useful data for measuring I/O latency and in general for troubleshooting I/O problems. This is true for random reads instrumented by db file sequential read. The information provided by wait events related to asynchronous I/O, however, requires additional attention: such events typically do not provide an accurate measure of the I/O latency. Moreover we have illustrated examples where some of the asynchronous I/Os for Oracle direct reads are not instrumented at all by the wait event interface.

I would like to thank Tanel Poder and Frits Hoogland for original ideas that have inspired this work. Special thanks go to Frits Hoogland for providing comments and additional ideas from his deep expertise on the subject.