External Table: Oracle Events' Latency Visualization and Heat Maps in SQL*plus

Topic: This post is about a technique for Oracle performance tuning, the use of heat maps to investigate wait event latency (and in particular I/O-related latency). This post also discusses a SQL*plus-based script/tool I have developed to help with this type of monitoring and performance drill-down (OraLatencyMap).

Context: Oracle exposes latency data for the wait event interface in V$EVENT_HISTOGRAM. This gives an additional dimension to drill down performance data for analysis and troubleshooting. In an older blog post I described an example of troubleshooting a storage issue for an OLTP (RAC) database by investigating the histogram data of the 'db file sequential read' wait event. In that context I had also developed and discussed ehm.sql, a simple PL/SQL script to collect and display data from GV$EVENT_HISTOGRAM.

What's new: An excellent article by Brendan Gregg, "Visualizing System Latency", Communications of the ACM, July 2010 has inspired me to develop an enhanced version of ehm.sql. The idea is to display in real time data of current and historical values of selected wait event histograms. In particular I/O-related events such as db file sequential read and log file sync make excellent candidates for this type of analysis. Moreover those events are relevant in a context that is familiar to me, that is drilling down issues with OLTP performance and access to storage.
As Brendan shows in his article, I/O latency data fits naturally to heat map representation, where time is on the horizontal axis, latency buckets are on the vertical axis and the quantity to display (for example number of waits or time waited) is displayed as color (hence the name heat map).

The tool: OraLatencyMap is a tool I have developed to help extract and represent event histogram data in a heat map. It is intended to be lightweight and 'DBA-friendly'. It's implemented in PL/SQL, it does not require the creation of any DB objects and runs under SQL*plus. OraLatencyMap requires a terminal supporting ANSI escape codes (for example PuTTY, MobaXterm, xterm, notably it does not run under windows' cmd.exe). Making SQL*plus behave like a monitoring tool, requires jumping through hoops. Credits to the experts who have shared their results in this area and thus made my job much easier here. In particular: Tanel Poder (for moats, sqlplus and color, snapper, etc..), Adrian Billington (moats) and Marcin Przepiorowski (topaas).

Example 1: study of db file sequential read
See here below a screen shot of a putty terminal where I ran SQL*plus and @OraLatencyMap. The script samples GV$EVENT_HISTOGRAM roughly every 3 seconds and displays 2 heat maps. The top heat map gives information on the number of waits per second on each latency bucket. The bottom heat map instead represents the estimated wait time per latency bucket. The two graphs represent the same type of information but with 2 different 'points of view'.

SQL> @OraLatencyMap

This type of data is useful when investigating single block read latency issues in OLTP systems for example. I'll leave for another time a discussion of the details and limitations of this approach. I'll just point out that among others, it's also important to make sure the system is not starving with CPU to make sense of the data (CPU data not shown here). When reading the heat map I typically focus on 3 areas: one is low-latency (1ms) where I get info on what is most likely reads from storage cache, the second is the area of latency around the 16 and 32 ms buckets, most likely representing physical reads from 'rotating disks'. The third very interesting area to watch is the 'high latency' (>100ms), that is the area of "IO latency outliers" and can be sign of problems with the storage for example. Note that OraLatencyMap is a drill-down tool based on Oracle instrumentation so the interpretation of the results, especially when extended to storage, will often need additional data from the specific context being investigated.

Example 2: study of log file sync
This example is about visualizing the latency of log file sync. This can be useful when drilling down commit latency issues. This is a critical area for many OLTP systems and the troubleshooting is often not easy: storage performance, server CPU starvation and unexpected behavior by LGWR, among others can all potentially cause problems in this area.
This screen shot was taken from a putty window, this time with a white background.

SQL> @OraLatencyMap_event 3 "log file sync"

Note: See also the blog entry on OraLatencyMap v1.1

Conclusions
Latency investigations of Oracle wait events give an additional and powerful dimension to performance tuning and troubleshooting. This is particularly useful for (synchronous) I/O-related wait events such db file sequential read and log file sync. Latency heat maps are a particularly suited to visualize IO latency (see also Brendan Gregg's article). The first version of a simple SQL*plus script (OraLatencyMap) to collect and visualize event histogram data as heat maps has been discussed.
OraLatencyMap is available for download from http://canali.web.cern.ch/canali/resources.htm and on GitHub

11 comments:

UnknownMay 9, 2013 at 11:47 PM
Crazy!! It works really well, just added to my favorite SQL scripts directory! :-)
Thank you for sharing this...

@ludovicocaldara
AnonymousMay 10, 2013 at 8:46 PM
wow - works just like just like that - no install, just run the SQL file.
Has to be the coolest graphical widget for Oracle performance that I've seen in ages.
Works great with https://fritshoogland.wordpress.com/2012/12/15/throttling-io-with-linux/ for playing around with the output.
- Kyle Hailey
http://dboptimizer.com
AnonymousMay 10, 2013 at 9:46 PM
Luca,

Very, very good post as usual... although your 16 to 32ms for mechanical drives is really scary ! Maybe lots of over utilized RAID 5 SATA 7200 RPMs :-) ?

Your point about CPU utilization and how it relates to LFS is spot on. If you don't mind I'd like to refer your readers to a post I have on that very topic:

http://kevinclosson.wordpress.com/2007/07/21/manly-men-only-use-solid-state-disk-for-redo-logging-lgwr-io-is-simple-but-not-lgwr-processing/

UnknownMay 10, 2013 at 10:27 PM
That is very cool stuff. Another extension would be to color using how unusual the outliers are relative to the entire GV$EVENT_HISTOGRAM which is taken as representing the latency distribution altogether (after sufficient time under normal workloads) This would be more of a monitoring than diagnostic use case.
Database SceneMay 11, 2013 at 6:47 PM
Well done guy...!
Sayan MalakshinovMay 12, 2013 at 12:17 AM
It also works on windows client with ansicon :)
AnonymousMay 13, 2013 at 12:44 PM
Luca, thanks for sharing, this is brilliant work.
AlbertoMay 31, 2013 at 9:13 AM
Hi Luca,
Excellent stuff....thanks a lot for sharing....
Alberto

Wednesday, May 8, 2013

Oracle Events' Latency Visualization and Heat Maps in SQL*plus

11 comments: