Chapter 4. Useful SystemTap Scripts

This chapter enumerates several SystemTap scripts you can use to monitor and investigate different subsystems. All of these scripts are available at /usr/share/systemtap/testsuite/systemtap.examples/ once you install the systemtap-testsuite RPM.

4.1. Network

The following sections showcase scripts that trace network-related functions and build a profile of network activity.

4.1.1. Network Profiling

This section describes how to profile network activity. nettop.stp provides a glimpse into how much network traffic each process is generating on a machine.

nettop.stp

#! /usr/bin/env stapglobal ifxmit, ifrecvglobal ifmergedprobe netdev.transmit{  ifxmit[pid(), dev_name, execname(), uid()] <<< length}probe netdev.receive{  ifrecv[pid(), dev_name, execname(), uid()] <<< length}function print_activity(){  printf("%5s %5s %-7s %7s %7s %7s %7s %-15s\n", "PID", "UID", "DEV", "XMIT_PK", "RECV_PK", "XMIT_KB", "RECV_KB", "COMMAND")  foreach ([pid, dev, exec, uid] in ifrecv) {  ifmerged[pid, dev, exec, uid] += @count(ifrecv[pid,dev,exec,uid]);  }  foreach ([pid, dev, exec, uid] in ifxmit) {  ifmerged[pid, dev, exec, uid] += @count(ifxmit[pid,dev,exec,uid]);  }  foreach ([pid, dev, exec, uid] in ifmerged-) { n_xmit = @count(ifxmit[pid, dev, exec, uid]) n_recv = @count(ifrecv[pid, dev, exec, uid]) printf("%5d %5d %-7s %7d %7d %7d %7d %-15s\n",   pid, uid, dev, n_xmit, n_recv,   n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0,   n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0,   exec)  }  print("\n")  delete ifxmit  delete ifrecv  delete ifmerged}probe timer.ms(5000), end, error{  print_activity()}

Note that function print_activity() uses the following expressions:

n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0

These expressions are if/else conditionals. The first statement is simply a more concise way of writing the following psuedo code:

if n_recv != 0 then  @sum(ifrecv[pid, dev, exec, uid])/1024else  0

nettop.stp tracks which processes are generating network traffic on the system, and provides the following information about each process:

PID - the ID of the listed process.
UID - user ID. A user ID of 0 refers to the root user.
DEV - which ethernet device the process used to send / receive data (e.g. eth0, eth1)
XMIT_PK - number of packets transmitted by the process
RECV_PK - number of packets received by the process
XMIT_KB - amount of data sent by the process, in kilobytes
RECV_KB - amount of data received by the service, in kilobytes

nettop.stp provides network profile sampling every 5 seconds. You can change this setting by editing probe timer.ms(5000) accordingly. Example 4.1, "nettop.stp Sample Output" contains an excerpt of the output from nettop.stp over a 20-second period:

Example 4.1. nettop.stp Sample Output

[...]  PID   UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0  0   5   0   0 swapper 11178 0 eth0  2   0   0   0 synergyc PID   UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 2886 4 eth0 79   0   5   0 cups-polld 11362 0 eth0  0  61   0   5 firefox 0 0 eth0  3  32   0   3 swapper 2886 4 lo 4   4   0   0 cups-polld 11178 0 eth0  3   0   0   0 synergyc PID   UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0  0   6   0   0 swapper 2886 4 lo 2   2   0   0 cups-polld 11178 0 eth0  3   0   0   0 synergyc 3611 0 eth0  0   1   0   0 Xorg PID   UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND 0 0 eth0  3  42   0   2 swapper 11178 0 eth0 43   1   3   0 synergyc   11362 0 eth0  0   7   0   0 firefox 3897 0 eth0  0   1   0   0 multiload-apple[...]

4.1.2. Tracing Functions Called in Network Socket Code

This section describes how to trace functions called from the kernel's net/socket.c file. This task helps you identify, in finer detail, how each process interacts with the network at the kernel level.

socket-trace.stp

#!/usr/bin/stapprobe kernel.function("*@net/socket.c").call {  printf ("%s -> %s\n", thread_indent(1), probefunc())}probe kernel.function("*@net/socket.c").return {  printf ("%s <- %s\n", thread_indent(-1), probefunc())}

socket-trace.stp is identical to Example 3.6, "thread_indent.stp", which was earlier used in SystemTap Functions to illustrate how thread_indent() works.

Example 4.2. socket-trace.stp Sample Output

[...]0 Xorg(3611): -> sock_poll3 Xorg(3611): <- sock_poll0 Xorg(3611): -> sock_poll3 Xorg(3611): <- sock_poll0 gnome-terminal(11106): -> sock_poll5 gnome-terminal(11106): <- sock_poll0 scim-bridge(3883): -> sock_poll3 scim-bridge(3883): <- sock_poll0 scim-bridge(3883): -> sys_socketcall4 scim-bridge(3883):  -> sys_recv8 scim-bridge(3883):   -> sys_recvfrom12 scim-bridge(3883):-> sock_from_file16 scim-bridge(3883):<- sock_from_file20 scim-bridge(3883):-> sock_recvmsg24 scim-bridge(3883):<- sock_recvmsg28 scim-bridge(3883):   <- sys_recvfrom31 scim-bridge(3883):  <- sys_recv35 scim-bridge(3883): <- sys_socketcall[...]

Example 4.2, "socket-trace.stp Sample Output" contains a 3-second excerpt of the output for socket-trace.stp. For more information about the output of this script as provided by thread_indent(), refer to SystemTap Functions Example 3.6, "thread_indent.stp".

4.1.3. Monitoring Incoming TCP Connections

This section illustrates how to monitor incoming TCP connections. This task is useful in identifying any unauthorized, suspicious, or otherwise unwanted network access requests in real time.

tcp_connections.stp

#! /usr/bin/env stapprobe begin {  printf("%6s %16s %6s %6s %16s\n", "UID", "CMD", "PID", "PORT", "IP_SOURCE")}probe kernel.function("tcp_accept").return?,  kernel.function("inet_csk_accept").return? {  sock = $return  if (sock != 0) printf("%6d %16s %6d %6d %16s\n", uid(), execname(), pid(),   inet_get_local_port(sock), inet_get_ip_source(sock))}

While tcp_connections.stp is running, it will print out the following information about any incoming TCP connections accepted by the system in real time:

Current UID
CMD - the command accepting the connection
PID of the command
Port used by the connection
IP address from which the TCP connection originated

Example 4.3. tcp_connections.stp Sample Output

UID CMD PID   PORT IP_SOURCE0 sshd   3165 22  10.64.0.2270 sshd   3165 22  10.64.0.227

4.1.4. Monitoring Network Packets Drops in Kernel

The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, kernel.trace("kfree_skb"), which easily tracks where packets are discarded. dropwatch.stp uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes which locations discard packets every five-second interval.

dropwatch.stp

#!/usr/bin/stap############################################################# Dropwatch.stp# Author: Neil Horman <[email protected]># An example script to mimic the behavior of the dropwatch utility# http://fedorahosted.org/dropwatch############################################################# Array to hold the list of drop points we findglobal locations# Note when we turn the monitor on and offprobe begin { printf("Monitoring for dropped packets\n") }probe end { printf("Stopping dropped packet monitor\n") }# increment a drop counter for every location we drop atprobe kernel.trace("kfree_skb") { locations[$location] <<< 1 }# Every 5 seconds report our drop locationsprobe timer.sec(5){printf("\n")foreach (l in locations-) {printf("%d packets dropped at location %p\n",   @count(locations[l]), l)}delete locations}

The kernel.trace("kfree_skb") traces which places in the kernel drop network packets. The kernel.trace("kfree_skb") has two arguments: a pointer to the buffer being freed ($skb) and the location in kernel code the buffer is being freed ($location).

Running the dropwatch.stp script 15 seconds would result in output similar in Example 4.4, "dropwatch.stp Sample Output". The output lists the number of misses for tracepoint address and the actual address.

Example 4.4. dropwatch.stp Sample Output

Monitoring for dropped packets51 packets dropped at location 0xffffffff8024cd0f2 packets dropped at location 0xffffffff8044b47251 packets dropped at location 0xffffffff8024cd0f1 packets dropped at location 0xffffffff8044b47297 packets dropped at location 0xffffffff8024cd0f1 packets dropped at location 0xffffffff8044b472Stopping dropped packet monitor

To make the location of packet drops more meaningful, refer to the /boot/System.map-`uname -r` file. This file lists the starting addresses for each function, allowing you to map the addresses in the output of Example 4.4, "dropwatch.stp Sample Output" to a specific function name. Given the following snippet of the /boot/System.map-`uname -r` file, the address 0xffffffff8024cd0f maps to the function unix_stream_recvmsg and the address 0xffffffff8044b472 maps to the function arp_rcv:

[...]ffffffff8024c5cd T unlock_new_inodeffffffff8024c5da t unix_stream_sendmsgffffffff8024c920 t unix_stream_recvmsgffffffff8024cea1 t udp_v4_lookup_longway[...]ffffffff8044addc t arp_processffffffff8044b360 t arp_rcvffffffff8044b487 t parp_redoffffffff8044b48c t arp_solicit[...]

4.2. Disk

The following sections showcase scripts that monitor disk and I/O activity.

4.2.1. Summarizing Disk Read/Write Traffic

This section describes how to identify which processes are performing the heaviest disk reads/writes to the system.

disktop.stp

#!/usr/bin/stap## Copyright (C) 2007 Oracle Corp.## Get the status of reading/writing disk every 5 seconds,# output top ten entries ## This is free software,GNU General Public License (GPL);# either version 2, or (at your option) any later version.## Usage:#  ./disktop.stp#global io_stat,deviceglobal read_bytes,write_bytesprobe vfs.read.return {  if ($return>0) { if (devname!="N/A") {/*skip read from cache*/  io_stat[pid(),execname(),uid(),ppid(),"R"] += $return  device[pid(),execname(),uid(),ppid(),"R"] = devname  read_bytes += $return }  }}probe vfs.write.return {  if ($return>0) { if (devname!="N/A") { /*skip update cache*/  io_stat[pid(),execname(),uid(),ppid(),"W"] += $return  device[pid(),execname(),uid(),ppid(),"W"] = devname  write_bytes += $return }  }}probe timer.ms(5000) {  /* skip non-read/write disk */  if (read_bytes+write_bytes) { printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",   ctime(gettimeofday_s()),   "Average:", ((read_bytes+write_bytes)/1024)/5,   "Read:",read_bytes/1024,   "Write:",write_bytes/1024) /* print header */ printf("%8s %8s %8s %25s %8s %4s %12s\n",   "UID","PID","PPID","CMD","DEVICE","T","BYTES")  }  /* print top ten I/O */  foreach ([process,cmd,userid,parent,action] in io_stat- limit 10) printf("%8d %8d %8d %25s %8s %4s %12d\n",   userid,process,parent,cmd,   device[process,cmd,userid,parent,action],   action,io_stat[process,cmd,userid,parent,action])  /* clear data */  delete io_stat  delete device  read_bytes = 0  write_bytes = 0  }probe end{  delete io_stat  delete device  delete read_bytes  delete write_bytes}

disktop.stp outputs the top ten processes responsible for the heaviest reads/writes to disk. Example 4.5, "disktop.stp Sample Output" displays a sample output for this script, and includes the following data per listed process:

UID - user ID. A user ID of 0 refers to the root user.
PID - the ID of the listed process.
PPID - the process ID of the listed process's parent process.
CMD - the name of the listed process.
DEVICE - which storage device the listed process is reading from or writing to.
T - the type of action performed by the listed process; W refers to write, while R refers to read.
BYTES - the amount of data read to or written from disk.

The time and date in the output of disktop.stp is returned by the functions ctime() and gettimeofday_s(). ctime() derives calendar time in terms of seconds passed since the Unix epoch (January 1, 1970). gettimeofday_s() counts the actual number of seconds since Unix epoch, which gives a fairly accurate human-readable timestamp for the output.

In this script, the $return is a local variable that stores the actual number of bytes each process reads or writes from the virtual file system. $return can only be used in return probes (e.g. vfs.read.return and vfs.read.return).

Example 4.5. disktop.stp Sample Output

[...]Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read: 7Kb, Write: 89KbUID  PID PPID   CMD   DEVICE T BYTES0 26319 26294   firefox sda5 W 902290 2758 2757   pam_timestamp_c sda5 R 80640 2885 1 cupsd sda5 W 1678Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read: 7Kb, Write: 1KbUID  PID PPID   CMD   DEVICE T BYTES0 2758 2757   pam_timestamp_c sda5 R 80640 2885 1 cupsd sda5 W 1678

4.2.2. Tracking I/O Time For Each File Read or Write

This section describes how to monitor the amount of time it takes for each process to read from or write to any file. This is useful if you wish to determine what files are slow to load on a given system.

iotime.stp

global startglobal entry_ioglobal fd_ioglobal time_iofunction timestamp:long() {  return gettimeofday_us() - start}function proc:string() {  return sprintf("%d (%s)", pid(), execname())}probe begin { start = gettimeofday_us()}global filenamesglobal filehandlesglobal filereadglobal filewriteprobe syscall.open {  filenames[pid()] = user_string($filename)} probe syscall.open.return {  if ($return != -1) { filehandles[pid(), $return] = filenames[pid()] fileread[pid(), $return] = 0 filewrite[pid(), $return] = 0  } else { printf("%d %s access %s fail\n", timestamp(), proc(), filenames[pid()])  }  delete filenames[pid()]}probe syscall.read {  if ($count > 0) { fileread[pid(), $fd] += $count  }  t = gettimeofday_us(); p = pid()  entry_io[p] = t  fd_io[p] = $fd}probe syscall.read.return {  t = gettimeofday_us(); p = pid()  fd = fd_io[p]  time_io[p,fd] <<< t - entry_io[p]}probe syscall.write {  if ($count > 0) { filewrite[pid(), $fd] += $count  }  t = gettimeofday_us(); p = pid()  entry_io[p] = t  fd_io[p] = $fd}probe syscall.write.return {  t = gettimeofday_us(); p = pid()  fd = fd_io[p]  time_io[p,fd] <<< t - entry_io[p]}probe syscall.close {  if (filehandles[pid(), $fd] != "") { printf("%d %s access %s read: %d write: %d\n",  timestamp(), proc(),   filehandles[pid(), $fd], fileread[pid(), $fd], filewrite[pid(), $fd]) if (@count(time_io[pid(), $fd]))  printf("%d %s iotime %s time: %d\n",  timestamp(), proc(), filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))   }  delete fileread[pid(), $fd]  delete filewrite[pid(), $fd]  delete filehandles[pid(), $fd]  delete fd_io[pid()]  delete entry_io[pid()]  delete time_io[pid(),$fd]}

iotime.stp tracks each time a system call opens, closes, reads from, and writes to a file. For each file any system call accesses, iotime.stp counts the number of microseconds it takes for any reads or writes to finish and tracks the amount of data (in bytes) read from or written to the file.

iotime.stp also uses the local variable $count to track the amount of data (in bytes) that any system call attempts to read or write. Note that $return (as used in disktop.stp from Section 4.2.1, "Summarizing Disk Read/Write Traffic") stores the actual amount of data read/written. $count can only be used on probes that track data reads or writes (e.g. syscall.read and syscall.write).

Example 4.6. iotime.stp Sample Output

[...]825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9[...]117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7[...]3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 03973744 2886 (sendmail) iotime /proc/loadavg time: 11[...]

Example 4.6, "iotime.stp Sample Output" prints out the following data:

A timestamp, in microseconds.
Process ID and process name.
An access or iotime flag.
The file accessed.

If a process was able to read or write any data, a pair of access and iotime lines should appear together. The access line's timestamp refers to the time that a given process started accessing a file; at the end of the line, it will show the amount of data read/written (in bytes). The iotime line will show the amount of time (in microseconds) that the process took in order to perform the read or write.

If an access line is not followed by an iotime line, it simply means that the process did not read or write any data.

4.2.3. Track Cumulative IO

This section describes how to track the cumulative amount of I/O to the system.

traceio.stp

#! /usr/bin/env stap# traceio.stp# Copyright (C) 2007 Red Hat, Inc., Eugene Teo <[email protected]># Copyright (C) 2009 Kai Meyer <[email protected]>#   Fixed a bug that allows this to run longer#   And added the humanreadable function## This program is free software; you can redistribute it and/or modify# it under the terms of the GNU General Public License version 2 as# published by the Free Software Foundation.#global reads, writes, total_ioprobe vfs.read.return {  reads[pid(),execname()] += $return  total_io[pid(),execname()] += $return}probe vfs.write.return {  writes[pid(),execname()] += $return  total_io[pid(),execname()] += $return}function humanreadable(bytes) {  if (bytes > 1024*1024*1024) { return sprintf("%d GiB", bytes/1024/1024/1024)  } else if (bytes > 1024*1024) { return sprintf("%d MiB", bytes/1024/1024)  } else if (bytes > 1024) { return sprintf("%d KiB", bytes/1024)  } else { return sprintf("%d   B", bytes)  }}probe timer.s(1) {  foreach([p,e] in total_io- limit 10) printf("%8d %15s r: %12s w: %12s\n",   p, e, humanreadable(reads[p,e]),   humanreadable(writes[p,e]))  printf("\n")  # Note we don't zero out reads, writes and total_io,  # so the values are cumulative since the script started.}

traceio.stp prints the top ten executables generating I/O traffic over time. In addition, it also tracks the cumulative amount of I/O reads and writes done by those ten executables. This information is tracked and printed out in 1-second intervals, and in descending order.

Note that traceio.stp also uses the local variable $return, which is also used by disktop.stp from Section 4.2.1, "Summarizing Disk Read/Write Traffic".

Example 4.7. traceio.stp Sample Output

[...]   Xorg r:   583401 KiB w: 0 KiB   floaters r:   96 KiB w: 7130 KiBmultiload-apple r:  538 KiB w:  537 KiB   sshd r:   71 KiB w:   72 KiBpam_timestamp_c r:  138 KiB w: 0 KiB staprun r:   51 KiB w:   51 KiB  snmpd r:   46 KiB w: 0 KiB  pcscd r:   28 KiB w: 0 KiB irqbalance r:   27 KiB w: 4 KiB  cupsd r: 4 KiB w:   18 KiB   Xorg r:   588140 KiB w: 0 KiB   floaters r:   97 KiB w: 7143 KiBmultiload-apple r:  543 KiB w:  542 KiB   sshd r:   72 KiB w:   72 KiBpam_timestamp_c r:  138 KiB w: 0 KiB staprun r:   51 KiB w:   51 KiB  snmpd r:   46 KiB w: 0 KiB  pcscd r:   28 KiB w: 0 KiB irqbalance r:   27 KiB w: 4 KiB  cupsd r: 4 KiB w:   18 KiB

4.2.4. I/O Monitoring (By Device)

This section describes how to monitor I/O activity on a specific device.

traceio2.stp

#! /usr/bin/env stapglobal device_of_interestprobe begin {  /* The following is not the most efficient way to do this.  One could directly put the result of usrdev2kerndev()  into device_of_interest.  However, want to test out  the other device functions */  dev = usrdev2kerndev($1)  device_of_interest = MKDEV(MAJOR(dev), MINOR(dev))}probe vfs.write, vfs.read{  if (dev == device_of_interest) printf ("%s(%d) %s 0x%x\n", execname(), pid(), probefunc(), dev)}

traceio2.stp takes 1 argument: the whole device number. To get this number, use stat -c "0x%D" directory, where directory is located in the device you wish to monitor.

The usrdev2kerndev() function converts the whole device number into the format understood by the kernel. The output produced by usrdev2kerndev() is used in conjunction with the MKDEV(), MINOR(), and MAJOR() functions to determine the major and minor numbers of a specific device.

The output of traceio2.stp includes the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), and the kernel device number.

The following example is an excerpt from the full output of stap traceio2.stp 0x805, where 0x805 is the whole device number of /home. /home resides in /dev/sda5, which is the device we wish to monitor.

Example 4.8. traceio2.stp Sample Output

[...]synergyc(3722) vfs_read 0x800005synergyc(3722) vfs_read 0x800005cupsd(2889) vfs_write 0x800005cupsd(2889) vfs_write 0x800005cupsd(2889) vfs_write 0x800005[...]

4.2.5. Monitoring Reads and Writes to a File

This section describes how to monitor reads from and writes to a file in real time.

inodewatch.stp

#! /usr/bin/env stapprobe vfs.write, vfs.read{  # dev and ino are defined by vfs.write and vfs.read  if (dev == MKDEV($1,$2) # major/minor device  && ino == $3) printf ("%s(%d) %s 0x%x/%u\n",  execname(), pid(), probefunc(), dev, ino)}

inodewatch.stp takes the following information about the file as arguments on the command line:

The file's major device number.
The file's minor device number.
The file's inode number.

To get this information, use stat -c '%D %i' filename, where filename is an absolute path.

For instance: if you wish to monitor /etc/crontab, run stat -c '%D %i' /etc/crontab first. This gives the following output:

805 1078319

805 is the base-16 (hexadecimal) device number. The lower two digits are the minor device number and the upper digits are the major number. 1078319 is the inode number. To start monitoring /etc/crontab, run stap inodewatch.stp 0x8 0x05 1078319 (The 0x prefixes indicate base-16 values).

The output of this command contains the name and ID of any process performing a read/write, the function it is performing (i.e. vfs_read or vfs_write), the device number (in hex format), and the inode number. Example 4.9, "inodewatch.stp Sample Output" contains the output of stap inodewatch.stp 0x8 0x05 1078319 (when cat /etc/crontab is executed while the script is running) :

Example 4.9. inodewatch.stp Sample Output

cat(16437) vfs_read 0x800005/1078319cat(16437) vfs_read 0x800005/1078319

4.2.6. Monitoring Changes to File Attributes

This section describes how to monitor if any processes are changing the attributes of a targeted file, in real time.

inodewatch2-simple.stp

global ATTR_MODE = 1probe kernel.function("inode_setattr") {  dev_nr = $inode->i_sb->s_dev  inode_nr = $inode->i_ino  if (dev_nr == ($1 << 20 | $2) # major/minor device  && inode_nr == $3  && $attr->ia_valid & ATTR_MODE) printf ("%s(%d) %s 0x%x/%u %o %d\n",  execname(), pid(), probefunc(), dev_nr, inode_nr, $attr->ia_mode, uid())}

Like inodewatch.stp from Section 4.2.5, "Monitoring Reads and Writes to a File", inodewatch2-simple.stp takes the targeted file's device number (in integer format) and inode number as arguments. For more information on how to retrieve this information, refer to Section 4.2.5, "Monitoring Reads and Writes to a File".

The output for inodewatch2-simple.stp is similar to that of inodewatch.stp, except that inodewatch2-simple.stp also contains the attribute changes to the monitored file, as well as the ID of the user responsible (uid()). Example 4.10, "inodewatch2-simple.stp Sample Output" shows the output of inodewatch2-simple.stp while monitoring /home/joe/bigfile when user joe executes chmod 777 /home/joe/bigfile and chmod 666 /home/joe/bigfile.

Example 4.10. inodewatch2-simple.stp Sample Output

chmod(17448) inode_setattr 0x800005/6011835 100777 500chmod(17449) inode_setattr 0x800005/6011835 100666 500

4.3. Profiling

The following sections showcase scripts that profile kernel activity by monitoring function calls.

4.3.1. Counting Function Calls Made

This section describes how to identify how many times the system called a specific kernel function in a 30-second sample. Depending on your use of wildcards, you can also use this script to target multiple kernel functions.

functioncallcount.stp

#! /usr/bin/env stap# The following line command will probe all the functions# in kernel's memory management code:## stap  functioncallcount.stp "*@mm/*.c"probe kernel.function(@1).call {  # probe functions listed on commandline  called[probefunc()] <<< 1  # add a count efficiently}global calledprobe end {  foreach (fn in called-)  # Sort by call count (in decreasing order)  #   (fn+ in called)  # Sort by function name printf("%s %d\n", fn, @count(called[fn]))  exit()}

functioncallcount.stp takes the targeted kernel function as an argument. The argument supports wildcards, which enables you to target multiple kernel functions up to a certain extent.

The output of functioncallcount.stp contains the name of the function called and how many times it was called during the sample time (in alphabetical order). Example 4.11, "functioncallcount.stp Sample Output" contains an excerpt from the output of stap functioncallcount.stp "*@mm/*.c":

Example 4.11. functioncallcount.stp Sample Output

[...]__vma_link 97__vma_link_file 66__vma_link_list 97__vma_link_rb 97__xchg 103add_page_to_active_list 102add_page_to_inactive_list 19add_to_page_cache 19add_to_page_cache_lru 7all_vm_events 6alloc_pages_node 4630alloc_slabmgmt 67anon_vma_alloc 62anon_vma_free 62anon_vma_lock 66anon_vma_prepare 98anon_vma_unlink 97anon_vma_unlock 66arch_get_unmapped_area_topdown 94arch_get_unmapped_exec_area 3arch_unmap_area_topdown 97atomic_add 2atomic_add_negative 97atomic_dec_and_test 5153atomic_inc 470atomic_inc_and_test 1[...]

4.3.2. Call Graph Tracing

This section describes how to trace incoming and outgoing function calls.

para-callgraph.stp

#! /usr/bin/env stapfunction trace(entry_p, extra) {  %( $# > 1 %? if (tid() in trace) %)  printf("%s%s%s %s\n", thread_indent (entry_p), (entry_p>0?"->":"<-"), probefunc (), extra)}%( $# > 1 %?global traceprobe $2.call {  trace[tid()] = 1}probe $2.return {  delete trace[tid()]}%)probe $1.call   { trace(1, $$parms) }probe $1.return { trace(-1, $$return) }

para-callgraph.stp takes two command-line arguments:

The function(s) whose entry/exit you'd like to trace ($1).
A second optional trigger function ($2), which enables or disables tracing on a per-thread basis. Tracing in each thread will continue as long as the trigger function has not exited yet.

para-callgraph.stp uses thread_indent(); as such, its output contains the timestamp, process name, and thread ID of $1 (i.e. the probe function you are tracing). For more information about thread_indent(), refer to its entry in SystemTap Functions.

The following example contains an excerpt from the output for stap para-callgraph.stp 'kernel.function("*@fs/*.c")' 'kernel.function("sys_read")':

Example 4.12. para-callgraph.stp Sample Output

[...]   267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5   269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5 0 gnome-terminal(2921):->fput file=0xffff880111eebbc0 2 gnome-terminal(2921):<-fput  0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54 3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980 0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48 4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000 7 gnome-terminal(2921): <-rw_verify_area return=0x1000 12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48 15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5 18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5 0 gnome-terminal(2921):->fput file=0xffff8801116ce980

4.3.3. Determining Time Spent in Kernel and User Space

This section illustrates how to determine the amount of time any given thread is spending in either kernel or user-space.

thread-times.stp

#! /usr/bin/env stapprobe perf.sw.cpu_clock!, timer.profile {  // NB: To avoid contention on SMP machines, no global scalars/arrays used,  // only contention-free statistics aggregates.  tid=tid(); e=execname()  if (!user_mode()) kticks[e,tid] <<< 1  else uticks[e,tid] <<< 1  ticks <<< 1  tids[e,tid] <<< 1}global uticks, kticks, ticksglobal tidsprobe timer.s(5), end {  allticks = @count(ticks)  printf ("%16s %5s %7s %7s (of %d ticks)\n",  "comm", "tid", "%user", "%kernel", allticks)  foreach ([e,tid] in tids- limit 20) { uscaled = @count(uticks[e,tid])*10000/allticks kscaled = @count(kticks[e,tid])*10000/allticks printf ("%16s %5d %3d.%02d%% %3d.%02d%%\n",  e, tid, uscaled/100, uscaled%100, kscaled/100, kscaled%100)  }  printf("\n")  delete uticks  delete kticks  delete ticks  delete tids}

thread-times.stp lists the top 20 processes currently taking up CPU time within a 5-second sample, along with the total number of CPU ticks made during the sample. The output of this script also notes the percentage of CPU time each process used, as well as whether that time was spent in kernel space or user space.

Example 4.13, "thread-times.stp Sample Output" contains a 5-second sample of the output for thread-times.stp:

Example 4.13. thread-times.stp Sample Output

  tid   %user %kernel (of 20002 ticks) 0   0.00%  87.88%32169   5.24%   0.03% 9815   3.33%   0.36% 9859   0.95%   0.00% 3611   0.56%   0.12% 9861   0.62%   0.01%11106   0.37%   0.02%32167   0.08%   0.08% 3897   0.01%   0.08% 3800   0.03%   0.00% 2886   0.02%   0.00% 3243   0.00%   0.01% 3862   0.01%   0.00% 3782   0.00%   0.00%21767   0.00%   0.00% 2522   0.00%   0.00% 3883   0.00%   0.00% 3775   0.00%   0.00% 3943   0.00%   0.00% 3873   0.00%   0.00%

4.3.4. Monitoring Polling Applications

This section describes how to identify and monitor which applications are polling. Doing so allows you to track unnecessary or excessive polling, which can help you pinpoint areas for improvement in terms of CPU usage and power savings.

timeout.stp

#! /usr/bin/env stap# Copyright (C) 2009 Red Hat, Inc.# Written by Ulrich Drepper <[email protected]># Modified by William Cohen <[email protected]>global process, timeout_count, toglobal poll_timeout, epoll_timeout, select_timeout, itimer_timeoutglobal nanosleep_timeout, futex_timeout, signal_timeoutprobe syscall.poll, syscall.epoll_wait {  if (timeout) to[pid()]=timeout}probe syscall.poll.return {  p = pid()  if ($return == 0 && to[p] > 0 ) { poll_timeout[p]++ timeout_count[p]++ process[p] = execname() delete to[p]  }}probe syscall.epoll_wait.return {  p = pid()  if ($return == 0 && to[p] > 0 ) { epoll_timeout[p]++ timeout_count[p]++ process[p] = execname() delete to[p]  }}probe syscall.select.return {  if ($return == 0) { p = pid() select_timeout[p]++ timeout_count[p]++ process[p] = execname()  }}probe syscall.futex.return {  if (errno_str($return) == "ETIMEDOUT") { p = pid() futex_timeout[p]++ timeout_count[p]++ process[p] = execname()  }}probe syscall.nanosleep.return {  if ($return == 0) { p = pid() nanosleep_timeout[p]++ timeout_count[p]++ process[p] = execname()  }}probe kernel.function("it_real_fn") {  p = pid()  itimer_timeout[p]++  timeout_count[p]++  process[p] = execname()}probe syscall.rt_sigtimedwait.return {  if (errno_str($return) == "EAGAIN") { p = pid() signal_timeout[p]++ timeout_count[p]++ process[p] = execname()  }}probe syscall.exit {  p = pid()  if (p in process) { delete process[p] delete timeout_count[p] delete poll_timeout[p] delete epoll_timeout[p] delete select_timeout[p] delete itimer_timeout[p] delete futex_timeout[p] delete nanosleep_timeout[p] delete signal_timeout[p]  }}probe timer.s(1) {  ansi_clear_screen()  printf ("  pid |   poll  select   epoll  itimer   futex nanosle  signal| process\n")  foreach (p in timeout_count- limit 20) { printf ("%5d |%7d %7d %7d %7d %7d %7d %7d| %-.38s\n", p,  poll_timeout[p], select_timeout[p],  epoll_timeout[p], itimer_timeout[p],  futex_timeout[p], nanosleep_timeout[p],  signal_timeout[p], process[p])  }}

timeout.stp tracks how many times each application used the following system calls over time:

poll
select
epoll
itimer
futex
nanosleep
signal

In some applications, these system calls are used excessively. As such, they are normally identified as "likely culprits" for polling applications. Note, however, that an application may be using a different system call to poll excessively; sometimes, it is useful to find out the top system calls used by the system (refer to Section 4.3.5, "Tracking Most Frequently Used System Calls" for instructions). Doing so can help you identify any additional suspects, which you can add to timeout.stp for tracking.

Example 4.14. timeout.stp Sample Output

  uid |   poll  select   epoll  itimer   futex nanosle  signal| process28937 | 148793   0   0 4727   37288   0   0| firefox22945 |  0   56949   0   1   0   0   0| scim-bridge 0 |  0   0   0   36414   0   0   0| swapper 4275 |  23140   0   0   1   0   0   0| mixer_applet2 4191 |  0   14405   0   0   0   0   0| scim-launcher22941 |   7908   1   0  62   0   0   0| gnome-terminal 4261 |  0   0   0   2   0 7622   0| escd 3695 |  0   0   0   0   0 7622   0| gdm-binary 3483 |  0 7206   0   0   0   0   0| dhcdbd 4189 |   6916   0   0   2   0   0   0| scim-panel-gtk 1863 |   5767   0   0   0   0   0   0| iscsid 2562 |  0 2881   0   1   0 1438   0| pcscd 4257 |   4255   0   0   1   0   0   0| gnome-power-man 4278 |   3876   0   0  60   0   0   0| multiload-apple 4083 |  0 1331   0 1728   0   0   0| Xorg 3921 |   1603   0   0   0   0   0   0| gam_server 4248 |   1591   0   0   0   0   0   0| nm-applet 3165 |  0 1441   0   0   0   0   0| xterm29548 |  0 1440   0   0   0   0   0| httpd 1862 |  0   0   0   0   0 1438   0| iscsid

You can increase the sample time by editing the timer in the second probe (timer.s()). The output of functioncallcount.stp contains the name and UID of the top 20 polling applications, along with how many times each application performed each polling system call (over time). Example 4.14, "timeout.stp Sample Output" contains an excerpt of the script:

4.3.5. Tracking Most Frequently Used System Calls

timeout.stp from Section 4.3.4, "Monitoring Polling Applications" helps you identify which applications are polling by pointing out which ones used the following system calls most frequently:

poll
select
epoll
itimer
futex
nanosleep
signal

However, in some systems, a different system call might be responsible for excessive polling. If you suspect that a polling application is using a different system call to poll, you need to identify first the top system calls used by the system. To do this, use topsys.stp.

topsys.stp

#! /usr/bin/env stap## This script continuously lists the top 20 systemcalls in the interval # 5 seconds#global syscalls_countprobe syscall.* {  syscalls_count[name]++}function print_systop () {  printf ("%25s %10s\n", "SYSCALL", "COUNT")  foreach (syscall in syscalls_count- limit 20) { printf("%25s %10d\n", syscall, syscalls_count[syscall])  }  delete syscalls_count}probe timer.s(5) {  print_systop ()  printf("--------------------------------------------------------------\n")}

topsys.stp lists the top 20 system calls used by the system per 5-second interval. It also lists how many times each system call was used during that period. Refer to Example 4.15, "topsys.stp Sample Output" for a sample output.

Example 4.15. topsys.stp Sample Output

--------------------------------------------------------------  SYSCALL  COUNT gettimeofday   1857 read   1821 ioctl   1568 poll   1033 close 638 open 503   select 455 write 391   writev 335 futex 303  recvmsg 251   socket 137 clock_gettime 124   rt_sigprocmask 121   sendto 120 setitimer 106 stat 90 time 81 sigreturn 72 fstat 66--------------------------------------------------------------

4.3.6. Tracking System Call Volume Per Process

This section illustrates how to determine which processes are performing the highest volume of system calls. In previous sections, we've described how to monitor the top system calls used by the system over time (Section 4.3.5, "Tracking Most Frequently Used System Calls"). We've also described how to identify which applications use a specific set of "polling suspect" system calls the most (Section 4.3.4, "Monitoring Polling Applications"). Monitoring the volume of system calls made by each process provides more data in investigating your system for polling processes and other resource hogs.

syscalls_by_proc.stp

#! /usr/bin/env stap# Copyright (C) 2006 IBM Corp.## This file is part of systemtap, and is free software.  You can# redistribute it and/or modify it under the terms of the GNU General# Public License (GPL); either version 2, or (at your option) any# later version.## Print the system call count by process name in descending order.#global syscallsprobe begin {  print ("Collecting data... Type Ctrl-C to exit and display results\n")}probe syscall.* {  syscalls[execname()]++}probe end {  printf ("%-10s %-s\n", "#SysCalls", "Process Name")  foreach (proc in syscalls-) printf("%-10d %-s\n", syscalls[proc], proc)}

syscalls_by_proc.stp lists the top 20 processes performing the highest number of system calls. It also lists how many system calls each process performed during the time period. Refer to Example 4.16, "topsys.stp Sample Output" for a sample output.

Example 4.16. topsys.stp Sample Output

Collecting data... Type Ctrl-C to exit and display results#SysCalls  Process Name1577   multiload-apple692 synergyc408 pcscd376 mixer_applet2299 gnome-terminal293 Xorg206 scim-panel-gtk95 gnome-power-man90 artsd85 dhcdbd84 scim-bridge78 gnome-screensav66 scim-launcher[...]

If you prefer the output to display the process IDs instead of the process names, use the following script instead.

syscalls_by_pid.stp

#! /usr/bin/env stap# Copyright (C) 2006 IBM Corp.## This file is part of systemtap, and is free software.  You can# redistribute it and/or modify it under the terms of the GNU General# Public License (GPL); either version 2, or (at your option) any# later version.## Print the system call count by process ID in descending order.#global syscallsprobe begin {  print ("Collecting data... Type Ctrl-C to exit and display results\n")}probe syscall.* {  syscalls[pid()]++}probe end {  printf ("%-10s %-s\n", "#SysCalls", "PID")  foreach (pid in syscalls-) printf("%-10d %-d\n", syscalls[pid], pid)}

As indicated in the output, you need to manually exit the script in order to display the results. You can add a timed expiration to either script by simply adding a timer.s() probe; for example, to instruct the script to expire after 5 seconds, add the following probe to the script:

probe timer.s(5){exit()}

4.4. Identifying Contended User-Space Locks

This section describes how to identify contended user-space locks throughout the system within a specific time period. The ability to identify contended user-space locks can help you investigate hangs that you suspect may be caused by futex contentions.

Simply put, a futex contention occurs when multiple processes are trying to access the same region of memory. In some cases, this can result in a deadlock between the processes in contention, thereby appearing as an application hang.

To do this, futexes.stp probes the futex system call.

futexes.stp

#! /usr/bin/env stap# This script tries to identify contended user-space locks by hooking# into the futex system call.global thread_thislock # shortglobal thread_blocktime # global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */global lock_waits # long-lived stats on (tid,lock) blockage elapsed timeglobal process_names # long-lived pid-to-execname mappingprobe syscall.futex { if (op != FUTEX_WAIT) next # don't care about WAKE event originator  t = tid ()  process_names[pid()] = execname()  thread_thislock[t] = $uaddr  thread_blocktime[t] = gettimeofday_us()}probe syscall.futex.return { t = tid()  ts = thread_blocktime[t]  if (ts) { elapsed = gettimeofday_us() - ts lock_waits[pid(), thread_thislock[t]] <<< elapsed delete thread_blocktime[t] delete thread_thislock[t]  }}probe end {  foreach ([pid+, lock] in lock_waits) printf ("%s[%d] lock %p contended %d times, %d avg us\n", process_names[pid], pid, lock, @count(lock_waits[pid,lock]), @avg(lock_waits[pid,lock]))}

futexes.stp needs to be manually stopped; upon exit, it prints the following information:

Name and ID of the process responsible for a contention
The region of memory it contested
How many times the region of memory was contended
Average time of contention throughout the probe

Example 4.17, "futexes.stp Sample Output" contains an excerpt from the output of futexes.stp upon exiting the script (after approximately 20 seconds).

Example 4.17. futexes.stp Sample Output

[...]automount[2825] lock 0x00bc7784 contended 18 times, 999931 avg ussynergyc[3686] lock 0x0861e96c contended 192 times, 101991 avg ussynergyc[3758] lock 0x08d98744 contended 192 times, 101990 avg ussynergyc[3938] lock 0x0982a8b4 contended 192 times, 101997 avg us[...]

SystemTap Beginners Guide