Feb 12, 2021

现代 Java 捉虫指南

Lei HUANG

JavaProgramming

1873 1873 Words 8 Minutes, 30 Seconds

2021-02-12 06:01 -0800

gnu.hl@antgroup.com

2020-12-01

本文是作者 2021 年在蚂蚁内部分享时的 slides 经过脱敏之后的版本。

当我们排查问题的时候，我们关注什么？

横向来看:

context/critical path/caller/callee (tracing)
statistics/metrics (profiling)

纵向来看:

application
runtime
kernel

Arthas

Arthas 是基于字节码增强的调试工具.

功能:

观察方法的入参/返回值/异常等数据
观察内存对象的值
跟踪方法的耗时和调用栈
查看类加载来源/热更新类定义

安装

curl -L http://start.alibaba-inc.com/install.sh | sh

Arthas 使用方法

观察方法入参、返回值、异常

watch <class_fqcn> <method_name> "{params,returnObj,throwExp}" [condition] [-f] [-b] [-xN]

拦截指定线程执行的方法

watch <class_fqcn> <method_name> "{params,returnObj,throwExp}" \
  "@java.lang.Thread@currentThread().getName()=='<thread_name>'" [-f] [-b] [-xN]

在观察入参同时打印方法栈

watch <class_fqcn> <method_name> "{params,returnObj,throwExp, \
@java.lang.Thread@currentThread().getStackTrace()}" [condition] [-f] [-b] [-xN]

Arthas 使用方法(cont.)

耗时较长的方法

watch <class_fqcn> <method_name> "{params,returnObj,throwExp, @java.lang.Thread@currentThread().getStackTrace()}" \
'#cost>100 [&& other_cond]' [-f] [-b] [-xN]

classloading 问题

sc -d '<class_name>'

构造对象

watch com.alipay.antq.broker.processor.SendMessageProcessor sendMessage \
"{new java.lang.String(params[1].body)}" -x2 -n10

Arthas 使用方法(cont.)

Projection & filtering

watch com.alipay.antqnamesrv.core.service.broker.BrokerFileService getBrokerFiles \
    "{returnObj['FILE_CLUSTER']['antq-eu95-3.gz00b.stable.alipay.net']\
    .topicQueues.values().{ #this.{? #this.queueId==32 } }.{? #this.size()!=0 }}" -x3

观察方法执行路径

trace [--skipJDKMethod true|false] <class_fqcn> <method_name>

N/A

Arthas 使用方法(cont.)

制作火焰图

# list all events
profiler list
# profile object allocation
profiler start --event alloc -d 10
# profile lock acquire
profiler start --event lock -d 10

热更新代码

mc/redefine

N/A

Arthas 的缺陷

启动期观测 -> jdb / btrace
测量引入 overhead
注意内存问题

Eclipse MAT / zprofiler

OQL select * from io.netty.channel.AbstractChannelhandlerContext$11

Native memory issues

RSS = xmx + MaxDirectMemory + N * xss [ + gc + code cache + metaspace ]

case1) DirectByteBuffer 未释放->通过观察堆内的DirectMemory
case2) [G]ZIPInput[/Output]Stream/De[/In]flater native memory leak
case3) Netty<= 4.0.24 epoll native memory leak
case4) JDK-8164293:HotSpot leaking memory (fixed @8u152) )

sa-jdi / clhsdb

Off-process vs. in-process instrumentation

sa: serviceability agent
clhsdb: command-line hotspot debugger
sa-jdi 提供了获取 JVM 内部数据结构的编程接口
sa-jdi 不仅可以 attach 活着的进程, 也可以分析 coredump
jstack/jstat等工具均提供了基于 attach api(tools.jar) 和 sa-jdi(sa-jdi.jar)的实现
借HSDB来探索HotSpot VM的运行时数据

ls $JAVA_HOME/lib
java -cp $JAVA_HOME/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB

sa-jdi / clhsdb(cont.)

使用 sa-jdi 分析反射生成的类

import sun.jvm.hotspot.oops.InstanceKlass;
import sun.jvm.hotspot.tools.jcore.ClassFilter;

public class MethodAccessorFilter implements ClassFilter {
    @Override
    public boolean canInclude(InstanceKlass instanceKlass) {
        return instanceKlass.getName().asString().startsWith("sun/reflect/Generated");
    }
}

java -classpath ".:$JAVA_HOME/lib/sa-jdi.jar" \
-Dsun.jvm.hotspot.tools.jcore.filter=MethodAccessorFilter sun.jvm.hotspot.tools.jcore.ClassDump <pid>

SA-JDI is deprecated since Java9 – JDK-8158050

JFR

域内要求 JVM 版本 ^ajdk-8_5_10_fp2-b9
-XX:+EnableJFR -XX:+FlightRecorder -XX:FlightRecorderOptions=<opt>=<val> -XX:StartFlightRecording=<opt>=<val>
可分析的事件类型:
- Lock contention
- Memory allocation
- IO performance (Network/File)
- Code execution performance
- Garbage Collector
- JIT performance

JFR (cont.)

使用步骤

启动进程时增加开启 JFR 的参数
开始 JFR 记录 jcmd <pid> JFR.start settings=profile name=<record_name>
dumpJFR 记录 jcmd <pid> JFR.dump filename=jfr.output name=<record_name>
使用 JDK Mission Control 分析 JFR dump

JFR (cont.)

参考资料

Interesting Cases

One more thing

Component	Tool
Application on runtime(Java/Node/Ruby/PHP)	Runtime debugger,eBPF
Application (native code)	System debugger,eBPF
System libraries: /lib/*	ltrace(1),eBPF
System call interface	strace(1), perf(1),eBPF
Network Traffic	tcpdump(8),eBPF
Kernel: Scheduler, file systems, TCP, IP, etc	ftrace(1), perf(1),eBPF
Hardware: CPU internals, devices	perf, sar, eBPF

Perf

Why is the kernel on-CPU so much? What code-paths?
Which code-paths are causing CPU level 2 cache misses?
Are the CPUs stalled on memory I/O?
Which code-paths are allocating memory, and how much?
What is triggering TCP retransmits?
Is a certain kernel function being called, and how often?
What reasons are threads leaving the CPU?

– Brendan Greg

Perf (cont.)

perf_event:

Software events
Hardware events
Kernel static tracepoint events
USDT

sudo perf list to list all available events.

# sudo perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# sudo perf script
        run 30339 [000] 2083345.722767: block:block_rq_complete: 202,1 W () 12984648 + 8 [0]
        run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
        run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
    swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
    swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
  supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
  supervise 30342 [000] 2083346.746663: block:block_rq_complete: 202,1 W () 12986608 + 8 [0]
        run 30342 [000] 2083346.747003: block:block_rq_complete: 202,1 W () 12986832 + 8 [0]

                                                                            
              /---------------------------------------- #1  supervise: on-CPU cmd                                                             
             /       /--------------------------------- #2  30342: on-CPU cmd tid                                                       
            /       /     /---------------------------- #3  [000]: CPU running cmd                                                 
           /       /     /          /------------------ #4  2083346.746571: reltime                                        
          /       /     /          /                /-- #5  block:block_rq_complete: tracepoint  
         /       /     /          /                /                  /------------- #6  202,1 : storage major,minor number, ref lsblk
        /       /     /          /                /                  /    /--------- #7  IO type: W-Write,R-Read,A-ReadAheader,O-Sync,WS-Sync Write   
       /       /     /          /                /                  /    /  /------- #8  (): Block command details
      /       /     /          /                /                  /    /  /     /-- #9  12986336: storage device offset
     /       /     /          /                /                  /    /  /     /     /---- #10 +8: size of IO (in sectors)
    /       /     /          /                /                  /    /  /     /     /  /-- #11 [0]: if errors happened
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
# how can I get this format?  cat /sys/kernel/debug/tracing/events/{event_cat}/{event_name}/format

perf script | awk '{ gsub(/:/, "") } $5 ~ /issue/ { ts[$6, $10] = $4 }
    $5 ~ /complete/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,
    ($4 - l) * 1000000; ts[$6, $10] = 0 } }'

sudo yum install -y kernel-devels
sudo yum install -y kernel-headers
sudo mount -t debugfs debugfs /sys/kernel/debug

Using perf trace Java IO issues

sudo perf record -e block:block_rq_issue  -a -g  -p <pid> sleep 10
sudo perf report -G # No java stack info ?

sudo yum install -y gcc-c++ cmake
git clone https://github.com/jvm-profiling-tools/perf-map-agent
cd perf-map-agent && cmake . && make -j8
./bin/create-java-perf-map.sh <pid>
cd -
perf report -G # gotcha!

Flame Graph

git clone https://github.com/brendangregg/FlameGraph
sudo perf script | \
    <path-to>/stackcollapse-perf.pl | \
    <path-to>/flamegraph.pl > perf-kernel.svg

Reference

内存增长的另外一种思路

通过LD_PRELOAD替换分配器为 jemalloc
开启 jemalloc 的泄漏分析
启动进程, 找到泄漏的 native 栈
用 perf record -e <tp> -ag 抓取分配的事件
create-java-perf-map.sh 生成符号表
perf report/script 查看访问线程

eBPF/XDP

eBPF does to Linux what JavaScript does to HTML. – Brendan Gregg

eBPF probes

Probes (Dynamic, Using trap)
- k{,ret}probes
- u{,ret}probes
Tracepoints (Static, predefined)
- Kernel tracepoints
  - cat /sys/kernel/debug/tracing/available_events
- USDT
  - readelf -n X.so | grep -A4 NT_STAPSDT
  - tplist.py -p <running_process>

eBPF vs. Perf

eBPF Hello world

#!/usr/bin/python
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello world\\n");
    return 0;
}
"""

b = BPF(text=prog)
#print(b.get_syscall_fnname("clone")) # platform-dependent syscall kprobe name
b.attach_kprobe(event="__x64_sys_clone", fn_name="hello")
#        ^~~~~~ this is a kprobe 
b.trace_print()

BCC (BPF Compiler Collection)

—eBPF made simple

Use BCC to explore JVM USDT

readelf -n $JAVA_HOME/jre/lib/amd64/server/libjvm.so | grep -A4 NT_STAPSDT
sudo <path-to>/trace.py -p <java_pid> 'u::thread__sleep__begin'

`bpftrace`

one-liner eBPF tools

# list tracepoints
bpftrace -l 'tracepoint:syscalls:sys_enter_*'
# trace file open
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
# bio size dist
bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }'

bpftrace Reference Guide

Using eBPF to debug Hotspot JVM

Compile OpenJDK with tracepoints enabled
Use BCC trace tools to trace JVM methods
More Info ref - Probing the JVM with BPF/BCC/[Enable USDT probes for eBPF tracing on Linux #1173

](https://github.com/AdoptOpenJDK/openjdk-build/issues/1173)

eBPF caveats and limitations

Kernel requirement: ^4.4
Byte code instruction size < 4096
No control loop

eBPF & XDP(eXpress Data Path)

Two ways toward kernel bypass

kernel in userspace
- DPDK
- Netmap
user code in kernel
- XDP

XDP Operating modes

Native mode
- Play with DMA buffer
- NO SKB ALLOCATION
- Least overhead
- Need driver modification
SKB mode
- From netif_receive_skb()
- After SKB and DMA allocation
- More instructions
- Driver-Independent

XDP(cont.)

Flannel: L2(IP Overlay)
Calico: L3(BGP)
Cilium: L3/4/7 (eBPF/XDP)
- GKE2 data plane

Container Network

Cilium: eBPF & XDP based CNI plugin

总结

勇于尝试新 kernel/runtime 和它们的新 feature .
工具可以快速验证思路, 过度依赖工具可能导致一叶障目不见泰山, 核心还是建立对系统的理解.

当我们排查问题的时候，我们关注什么？

Arthas

安装

Arthas 使用方法

Arthas 使用方法(cont.)

Arthas 使用方法(cont.)

Arthas 使用方法(cont.)

Arthas 的缺陷

Eclipse MAT / zprofiler

Native memory issues

sa-jdi / clhsdb

sa-jdi / clhsdb(cont.)

使用 sa-jdi 分析反射生成的类

JFR

JFR (cont.)

JFR (cont.)

Interesting Cases

One more thing

Perf

Perf (cont.)

Using perf trace Java IO issues

Flame Graph

Reference

内存增长的另外一种思路

eBPF/XDP

eBPF probes

eBPF vs. Perf

eBPF Hello world

BCC (BPF Compiler Collection)

Use BCC to explore JVM USDT

bpftrace

Using eBPF to debug Hotspot JVM

eBPF caveats and limitations

eBPF & XDP(eXpress Data Path)

XDP(cont.)

Cilium: eBPF & XDP based CNI plugin

总结

Reference

Tribute To

`bpftrace`