Recently, on some of our servers, sending UDP at a high rate caused heavy packet loss. The drops were not happening on intermediate network devices, nor on the receiver, but inside the sender's kernel. How do we know the kernel was dropping them? Because the number of packets the user-space program counted as sent was far higher than what the kernel's statistics showed going out, so clearly, after user space handed the packets over, the kernel did not transmit all of them.
Looking at the kernel's SNMP counters, UDP's SndbufErrors had a very high value:
```
root@Server:~# grep "^Udp:" /proc/net/snmp | column -t
Udp:  InDatagrams  NoPorts  InErrors  OutDatagrams  RcvbufErrors  SndbufErrors  InCsumErrors  IgnoredMulti
Udp:  52452374     5805     2117      247991616     2117          6669891       0             1
```
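To make that comparison between the application's own send count and the kernel's counters concrete, a tiny probe like the one below (a hypothetical helper, not part of the original sender) can snapshot values from the Udp section of /proc/net/snmp before and after a send burst:

```c
/* udp_snmp_probe.c - minimal sketch (hypothetical helper, not the original
 * sender): read one value from the "Udp:" section of /proc/net/snmp so it can
 * be sampled around a send burst and compared with the app's own send count. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long long udp_counter(const char *name)
{
        FILE *fp = fopen("/proc/net/snmp", "r");
        char hdr[1024], val[1024];
        long long result = -1;

        if (!fp)
                return -1;

        /* The Udp section is a header line followed by a value line. */
        while (fgets(hdr, sizeof(hdr), fp)) {
                if (strncmp(hdr, "Udp:", 4) != 0)
                        continue;
                if (!fgets(val, sizeof(val), fp))
                        break;

                char *hs, *vs;
                char *h = strtok_r(hdr, " \n", &hs);    /* skip the "Udp:" tags */
                char *v = strtok_r(val, " \n", &vs);
                while ((h = strtok_r(NULL, " \n", &hs)) &&
                       (v = strtok_r(NULL, " \n", &vs))) {
                        if (strcmp(h, name) == 0) {
                                result = atoll(v);
                                break;
                        }
                }
                break;
        }
        fclose(fp);
        return result;
}

int main(void)
{
        printf("OutDatagrams=%lld SndbufErrors=%lld\n",
               udp_counter("OutDatagrams"), udp_counter("SndbufErrors"));
        return 0;
}
```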
Where, and under what conditions, does this counter get incremented? Digging through the kernel source, I found two places in net/ipv4/udp.c. One is in udp_sendmsg():
```c
...
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
        ...
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things). We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS(sock_net(sk),
                              UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;
        ...
}
...
```
The other is in udp_send_skb():
```c
...
static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
                        struct inet_cork *cork)
{
        ...
send:
        err = ip_send_skb(sock_net(sk), skb);
        if (err) {
                if (err == -ENOBUFS && !inet->recverr) {
                        UDP_INC_STATS(sock_net(sk),
                                      UDP_MIB_SNDBUFERRORS, is_udplite);
                        err = 0;
                }
        } else
                ...
        return err;
}
...
```
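Whether the udp_sendmsg() flow actually reaches udp_send_skb() is easy to confirm: in the common non-corked case, udp_sendmsg() builds the skb and hands it straight to udp_send_skb(). Abridged from the same file (the exact argument list varies between kernel versions):

```c
        /* Lockless fast path for the non-corking case (abridged). */
        if (!corkreq) {
                struct inet_cork cork;

                skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
                                  sizeof(struct udphdr), &ipc, &rt,
                                  &cork, msg->msg_flags);
                err = PTR_ERR(skb);
                if (!IS_ERR_OR_NULL(skb))
                        err = udp_send_skb(skb, fl4, &cork);
                goto out;
        }
```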
So the udp_sendmsg() flow does reach udp_send_skb(). In addition, ip_send_skb(), which udp_send_skb() calls, contains code that increments another counter, OutDiscards:
```c
...
int ip_send_skb(struct net *net, struct sk_buff *skb)
{
        int err;

        err = ip_local_out(net, skb->sk, skb);
        if (err) {
                if (err > 0)
                        err = net_xmit_errno(err);
                if (err)
                        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
        }

        return err;
}
...
```
Checking the IP counters in the same snmp file, OutDiscards is indeed very close to SndbufErrors:
```
root@server:~# grep -E "^Udp:|^Ip:" /proc/net/snmp | column -t
Ip:   Forwarding  DefaultTTL  InReceives  InHdrErrors  InAddrErrors  ForwDatagrams  InUnknownProtos  InDiscards  InDelivers  OutRequests  OutDiscards  OutNoRoutes  ReasmTimeout  ReasmReqds  ReasmOKs  ReasmFails  FragOKs  FragFails  FragCreates
Ip:   1           64          125551038   0            0             30477602       0                0           94936817    332998032    6671194      29           0             0           0         0           0        1          0
Udp:  InDatagrams  NoPorts  InErrors  OutDatagrams  RcvbufErrors  SndbufErrors  InCsumErrors  IgnoredMulti
Udp:  52453984     5808     2117      247992185     2117          6669891       0             1
```
So we can be confident that the counter is being incremented in udp_send_skb().
According to the comment in the code, ENOBUFS means "no kernel mem". First, we can rule out running out of send buffer: if the send buffer were full, the write call would either block or return EWOULDBLOCK. Second, the net_xmit_errno() macro converts every error other than NET_XMIT_CN into ENOBUFS:
```c
/* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
 * indicates that the device will soon be dropping packets, or already drops
 * some packets of the same priority; prompting us to send less aggressively.
 */
#define net_xmit_eval(e)        ((e) == NET_XMIT_CN ? 0 : (e))
#define net_xmit_errno(e)       ((e) != NET_XMIT_CN ? -ENOBUFS : 0)
```
So what ip_local_out() returns here is very likely not ENOBUFS itself, but some other positive code that ip_send_skb() then folds into -ENOBUFS.
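For reference, the positive codes a qdisc can return are defined in include/linux/netdevice.h (abridged here; values and comments may differ slightly across kernel versions). Anything other than NET_XMIT_CN gets turned into -ENOBUFS by net_xmit_errno(), which is how a drop further down the stack ends up counted as a "send buffer error":

```c
/* Abridged from include/linux/netdevice.h: qdisc ->enqueue() return codes. */
#define NET_XMIT_SUCCESS        0x00
#define NET_XMIT_DROP           0x01    /* skb dropped                  */
#define NET_XMIT_CN             0x02    /* congestion notification      */
#define NET_XMIT_MASK           0x0f    /* qdisc flags in net/sch_generic.h */
```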
Beyond ip_local_out() the path goes through function pointers, so it's not obvious which functions are actually called. I looked at the call stack with perf; quite a few functions are involved, and I didn't feel like digging through them one by one.
Instead, I reached for a more direct tool, dropwatch:
```
root@server:~# dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
...
917 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
129 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
2108 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
1499 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
4960 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
336 drops at kfree_skb_list+1d (0xffffffff923211ed) [software]
...
```
The output is still not entirely conclusive, but it does confirm that kfree_skb_list() is called when packets are dropped. Among the functions on the perf call stack, I readily found a call to it in __dev_xmit_skb(), in net/core/dev.c:
```c
...
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
{
        ...
        if (q->flags & TCQ_F_NOLOCK) {
                rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                qdisc_run(q);

                if (unlikely(to_free))
                        kfree_skb_list(to_free);
                return rc;
        }

        if ... {
                ...
        } else {
                rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
        }
        ...
        if (unlikely(to_free))
                kfree_skb_list(to_free);
        ...
}
...
```
At this point we've pretty much found where the problem lives: q->enqueue() goes into the qdisc machinery, so I immediately checked the qdisc statistics:
```
root@server:~# tc -s qdisc
...
qdisc fq 0: dev eno1 root refcnt 2 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit refill_delay 40.0ms
 Sent 365865434072 bytes 306319714 pkt (dropped 6674253, overlimits 0 requeues 19)
 backlog 0b 0p requeues 19
  flows 1121 (inactive 1119 throttled 0)
  gc 19508 highprio 22 throttled 1045756 latency 13.442us flows_plimit 6674253
...
```
The dropped counter in these statistics is very close to SndbufErrors. eno1's qdisc is fq, which hashes packets into per-flow queues (keyed by socket for locally generated traffic) and caps each flow at flow_limit queued packets. Its source is in net/sched/sch_fq.c:
```c
...
static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                      struct sk_buff **to_free)
{
        ...
        if (unlikely(sch->q.qlen >= sch->limit))
                return qdisc_drop(skb, sch, to_free);
        ...
        if (unlikely(f->qlen >= q->flow_plimit && f != &q->internal)) {
                q->stat_flows_plimit++;
                return qdisc_drop(skb, sch, to_free);
        }
        ...
}
...
```
If a flow exceeds the flow_limit cap, the flows_plimit counter is incremented, and in the statistics above flows_plimit is identical to dropped. So we can now be certain: a single flow queued more packets than flow_limit allows, and every packet that could not be enqueued was dropped.
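qdisc_drop() itself (roughly as defined in include/net/sch_generic.h; details vary by kernel version) ties the whole chain together: it puts the skb on the to_free list that __dev_xmit_skb() later hands to kfree_skb_list(), bumps the dropped statistic shown by tc -s qdisc, and returns NET_XMIT_DROP, the positive code that net_xmit_errno() converts to -ENOBUFS and that udp_send_skb() finally counts as SndbufErrors:

```c
/* Abridged from include/net/sch_generic.h. */
static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
                             struct sk_buff **to_free)
{
        __qdisc_drop(skb, to_free);     /* chain the skb onto to_free */
        qdisc_qstats_drop(sch);         /* the "dropped" counter in tc -s qdisc */

        return NET_XMIT_DROP;
}
```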
The earlier tc output showed that flow_limit is only 100. The send buffer configured on the system is 212992 bytes and the datagrams are around 1400 bytes (MTU-sized), so a single socket can have roughly 212992 / 1400 ≈ 152 packets queued at once, which does exceed 100. So I raised flow_limit to 200 (the command below also raises the overall limit to 20000):
```
root@server:~# tc qdisc add dev eno1 root fq limit 20000 flow_limit 200
```
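To double-check where the 212992 figure comes from on a given machine, a minimal sketch like this (a hypothetical standalone check, not part of the original sender) reads the effective SO_SNDBUF of a fresh UDP socket, which by default comes from net.core.wmem_default:

```c
/* sndbuf_check.c - print a UDP socket's effective send buffer (sketch). */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int sndbuf = 0;
        socklen_t len = sizeof(sndbuf);

        if (fd < 0 || getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0) {
                perror("SO_SNDBUF");
                return 1;
        }
        /* With ~1400-byte datagrams, one socket can have roughly
         * sndbuf / 1400 packets queued toward the qdisc at once,
         * which is what needs to stay under fq's flow_limit. */
        printf("SO_SNDBUF = %d (~%d packets of 1400 bytes)\n",
               sndbuf, sndbuf / 1400);
        close(fd);
        return 0;
}
```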
Running the UDP sending program again, the drop rate returned to normal, and the qdisc statistics showed the dropped counter staying at 0 the whole time. Problem solved!