现代 Java 捉虫指南


gnu.hl@antgroup.com

2020-12-01


本文是作者 2021 年在蚂蚁内部分享时的 slides 经过脱敏之后的版本。

当我们排查问题的时候,我们关注什么?

横向来看:

  • context/critical path/caller/callee (tracing)
  • statistics/metrics (profiling)

纵向来看:

  • application
  • runtime
  • kernel

Arthas

Arthas 是基于字节码增强的调试工具.

功能:

  • 观察方法的入参/返回值/异常等数据
  • 观察内存对象的值
  • 跟踪方法的耗时和调用栈
  • 查看类加载来源/热更新类定义

安装

curl -L http://start.alibaba-inc.com/install.sh | sh

Arthas 使用方法

  • 观察方法入参、返回值、异常
watch <class_fqcn> <method_name> "{params,returnObj,throwExp}" [condition] [-f] [-b] [-xN]
  • 拦截指定线程执行的方法
watch <class_fqcn> <method_name> "{params,returnObj,throwExp}" \
  "@java.lang.Thread@currentThread().getName()=='<thread_name>'" [-f] [-b] [-xN]
  • 在观察入参同时打印方法栈
watch <class_fqcn> <method_name> "{params,returnObj,throwExp, \
@java.lang.Thread@currentThread().getStackTrace()}" [condition] [-f] [-b] [-xN]

Arthas 使用方法(cont.)

  • 耗时较长的方法
watch <class_fqcn> <method_name> "{params,returnObj,throwExp, @java.lang.Thread@currentThread().getStackTrace()}" \
'#cost>100 [&& other_cond]' [-f] [-b] [-xN]
  • classloading 问题
sc -d '<class_name>'
  • 构造对象
watch com.alipay.antq.broker.processor.SendMessageProcessor sendMessage \
"{new java.lang.String(params[1].body)}" -x2 -n10

Arthas 使用方法(cont.)

  • Projection & filtering
watch com.alipay.antqnamesrv.core.service.broker.BrokerFileService getBrokerFiles \
    "{returnObj['FILE_CLUSTER']['antq-eu95-3.gz00b.stable.alipay.net']\
    .topicQueues.values().{ #this.{? #this.queueId==32 } }.{? #this.size()!=0 }}" -x3
  • 观察方法执行路径
trace [--skipJDKMethod true|false] <class_fqcn> <method_name>

N/A


Arthas 使用方法(cont.)

  • 制作火焰图
# list all events
profiler list
# profile object allocation
profiler start --event alloc -d 10
# profile lock acquire
profiler start --event lock -d 10
  • 热更新代码
mc/redefine

N/A


Arthas 的缺陷

  • 启动期观测 -> jdb / btrace
  • 测量引入 overhead
  • 注意内存问题

Eclipse MAT / zprofiler

  • OQL select * from io.netty.channel.AbstractChannelhandlerContext$11

Native memory issues

RSS = xmx +  MaxDirectMemory + N * xss [ + gc + code cache + metaspace ]


sa-jdi / clhsdb

Off-process vs. in-process instrumentation

  • sa: serviceability agent
  • clhsdb: command-line hotspot debugger
  • sa-jdi 提供了获取 JVM 内部数据结构的编程接口
  • sa-jdi 不仅可以 attach 活着的进程, 也可以分析 coredump
  • jstack/jstat等工具均提供了基于 attach api(tools.jar) 和 sa-jdi(sa-jdi.jar)的实现
  • 借HSDB来探索HotSpot VM的运行时数据
ls $JAVA_HOME/lib
java -cp $JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB

sa-jdi / clhsdb(cont.)

使用 sa-jdi 分析反射生成的类
import sun.jvm.hotspot.oops.InstanceKlass;
import sun.jvm.hotspot.tools.jcore.ClassFilter;

public class MethodAccessorFilter implements ClassFilter {
    @Override
    public boolean canInclude(InstanceKlass instanceKlass) {
        return instanceKlass.getName().asString().startsWith("sun/reflect/Generated");
    }
}
java -classpath ".:$JAVA_HOME/lib/sa-jdi.jar" \
-Dsun.jvm.hotspot.tools.jcore.filter=MethodAccessorFilter sun.jvm.hotspot.tools.jcore.ClassDump <pid>

SA-JDI is deprecated since Java9 – JDK-8158050


JFR

  • 域内要求 JVM 版本 ^ajdk-8_5_10_fp2-b9
  • -XX:+EnableJFR -XX:+FlightRecorder -XX:FlightRecorderOptions=<opt>=<val> -XX:StartFlightRecording=<opt>=<val>
  • 可分析的事件类型:
    • Lock contention
    • Memory allocation
    • IO performance (Network/File)
    • Code execution performance
    • Garbage Collector
    • JIT performance

JFR (cont.)

使用步骤

  • 启动进程时增加开启 JFR 的参数
  • 开始 JFR 记录 jcmd <pid> JFR.start settings=profile name=<record_name>
  • dumpJFR 记录 jcmd <pid> JFR.dump filename=jfr.output name=<record_name>
  • 使用 JDK Mission Control 分析 JFR dump

JFR (cont.)

参考资料


Interesting Cases

  • N/A
  • N/A
  • N/A

One more thing

ComponentTool
Application on runtime(Java/Node/Ruby/PHP)Runtime debugger,eBPF
Application (native code)System debugger,eBPF
System libraries: /lib/*ltrace(1),eBPF
System call interfacestrace(1), perf(1),eBPF
Network Traffictcpdump(8),eBPF
Kernel: Scheduler, file systems, TCP, IP, etcftrace(1), perf(1),eBPF
Hardware: CPU internals, devicesperf, sar, eBPF

Perf

  • Why is the kernel on-CPU so much? What code-paths?
  • Which code-paths are causing CPU level 2 cache misses?
  • Are the CPUs stalled on memory I/O?
  • Which code-paths are allocating memory, and how much?
  • What is triggering TCP retransmits?
  • Is a certain kernel function being called, and how often?
  • What reasons are threads leaving the CPU?

– Brendan Greg


Perf (cont.)

perf_event:

  • Software events
  • Hardware events
  • Kernel static tracepoint events
  • USDT

sudo perf list to list all available events.


# sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# sudo perf script
        run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
        run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
        run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
    swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
    swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
  supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
  supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]
        run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]

                                                                            
              /---------------------------------------- #1  supervise: on-CPU cmd                                                             
             /       /--------------------------------- #2  30342: on-CPU cmd tid                                                       
            /       /     /---------------------------- #3  [000]: CPU running cmd                                                 
           /       /     /          /------------------ #4  2083346.746571: reltime                                        
          /       /     /          /                /-- #5  block:block_rq_complete: tracepoint  
         /       /     /          /                /                  /------------- #6  202,1 : storage major,minor number, ref lsblk
        /       /     /          /                /                  /    /--------- #7  IO type: W-Write,R-Read,A-ReadAheader,O-Sync,WS-Sync Write   
       /       /     /          /                /                  /    /  /------- #8  (): Block command details
      /       /     /          /                /                  /    /  /     /-- #9  12986336: storage device offset
     /       /     /          /                /                  /    /  /     /     /---- #10 +8: size of IO (in sectors)
    /       /     /          /                /                  /    /  /     /     /  /-- #11 [0]: if errors happened
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
# how can I get this format?  cat /sys/kernel/debug/tracing/events/{event_cat}/{event_name}/format

perf script | awk '{ gsub(/:/, "") } $5 ~ /issue/ { ts[$6, $10] = $4 }
    $5 ~ /complete/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,
    ($4 - l) * 1000000; ts[$6, $10] = 0 } }'
sudo yum install -y kernel-devels
sudo yum install -y kernel-headers
sudo mount -t debugfs debugfs /sys/kernel/debug

Using perf trace Java IO issues

sudo perf record -e block:block_rq_issue  -a -g  -p <pid> sleep 10
sudo perf report -G # No java stack info ?

sudo yum install -y gcc-c++ cmake
git clone https://github.com/jvm-profiling-tools/perf-map-agent
cd perf-map-agent && cmake . && make -j8
./bin/create-java-perf-map.sh <pid>
cd -
perf report -G # gotcha!

Flame Graph

git clone https://github.com/brendangregg/FlameGraph
sudo perf script | \
    <path-to>/stackcollapse-perf.pl | \
    <path-to>/flamegraph.pl > perf-kernel.svg

Reference


内存增长的另外一种思路

  • 通过LD_PRELOAD替换分配器为 jemalloc
  • 开启 jemalloc 的泄漏分析
  • 启动进程, 找到泄漏的 native 栈
  • perf record -e <tp> -ag 抓取分配的事件
  • create-java-perf-map.sh 生成符号表
  • perf report/script 查看访问线程

eBPF/XDP

eBPF does to Linux what JavaScript does to HTML. – Brendan Gregg


eBPF probes


eBPF vs. Perf


eBPF Hello world

#!/usr/bin/python
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello world\\n");
    return 0;
}
"""

b = BPF(text=prog)
#print(b.get_syscall_fnname("clone")) # platform-dependent syscall kprobe name
b.attach_kprobe(event="__x64_sys_clone", fn_name="hello")
#        ^~~~~~ this is a kprobe 
b.trace_print()

BCC (BPF Compiler Collection)

—eBPF made simple


Use BCC to explore JVM USDT

readelf -n $JAVA_HOME/jre/lib/amd64/server/libjvm.so | grep -A4 NT_STAPSDT
sudo <path-to>/trace.py -p <java_pid> 'u::thread__sleep__begin'

bpftrace

one-liner eBPF tools

# list tracepoints
bpftrace -l 'tracepoint:syscalls:sys_enter_*'
# trace file open
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
# bio size dist
bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }'

bpftrace Reference Guide


Using eBPF to debug Hotspot JVM

  • Compile OpenJDK with tracepoints enabled
  • Use BCC trace tools to trace JVM methods
  • More Info ref - Probing the JVM with BPF/BCC/[Enable USDT probes for eBPF tracing on Linux #1173

](https://github.com/AdoptOpenJDK/openjdk-build/issues/1173)

eBPF caveats and limitations

  • Kernel requirement: ^4.4
  • Byte code instruction size < 4096
  • No control loop

eBPF & XDP(eXpress Data Path)

Two ways toward kernel bypass

  • kernel in userspace
  • user code in kernel
    • XDP

XDP Operating modes

  • Native mode
    • Play with DMA buffer
    • NO SKB ALLOCATION
    • Least overhead
    • Need driver modification
  • SKB mode
    • From netif_receive_skb()
    • After SKB and DMA allocation
    • More instructions
    • Driver-Independent

XDP(cont.)

  • Flannel: L2(IP Overlay)
  • Calico: L3(BGP)
  • Cilium: L3/4/7 (eBPF/XDP)

     Container Network    


Cilium: eBPF & XDP based CNI plugin


总结

  • 勇于尝试新 kernel/runtime 和它们的新 feature .
  • 工具可以快速验证思路, 过度依赖工具可能导致一叶障目不见泰山, 核心还是建立对系统的理解.


Reference


Tribute To