TCP flow control is implemented mainly by two mechanisms: the sliding window, driven by the receiver's acknowledgments, and the sender's congestion control algorithms, driven by network conditions.
TCP Traffic Control
TCP flow control mainly involves the sliding window, congestion control algorithms, and the calculation of RTT (Round-Trip Time) and RTO (Retransmission Timeout).
The sliding window is usually viewed as flow and retransmission control driven by receiver feedback (the network is not considered); congestion control algorithms are usually viewed as the sender's flow control (network delay, packet loss, and so on are considered). The RTT estimate feeds the RTO calculation and also influences how some congestion algorithms behave; the RTO in turn determines how the retransmission timer is set.

Sliding Window
The sliding window is the mechanism by which the receiver advertises a window to the sender during data transfer, and the sender then decides how much data to send at a time according to that window. It is essentially an ARQ (Automatic Repeat reQuest) scheme generalizing the stop-and-wait protocol: on one hand, the sliding window uses the window size fed back in the receiver's acknowledgments to control how many packets are sent per round; on the other, it uses the acknowledgment numbers in those packets to decide which frames to retransmit. The sliding window performs flow control purely from receiver feedback and does not consider network factors such as bandwidth or delay.
ARQ comes in two main variants: continuous ARQ (Go-Back-N) and Selective Repeat. TCP uses Go-Back-N, with Selective Repeat (SACK) as an optional extension. The difference is that Go-Back-N keeps a single timer for the whole window of outstanding packets, and when a packet is lost, the lost packet and everything sent after it are retransmitted; Selective Repeat needs as many timers as outstanding packets and retransmits only the packet that was lost. Roughly speaking, Go-Back-N costs more network bandwidth, while Selective Repeat costs more computing resources.
The sending process of both ARQ sliding-window variants is animated at: https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
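To make the difference concrete, here is a minimal, illustrative Go-Back-N sender in plain C (not TCP code; the window size, frame count, and function names such as gbn_on_timeout are invented for the example). A single timer covers the whole window, so a timeout rewinds next_seq to base and resends everything outstanding:

#include <stdio.h>

#define WINDOW  4
#define NFRAMES 10

/* Illustrative Go-Back-N sender bookkeeping: one timer for the whole
 * window, so a timeout resends every frame from 'base' onward. */
static int base = 0;      /* oldest unacknowledged frame */
static int next_seq = 0;  /* next frame to send */

static void gbn_send_ready(void)
{
    while (next_seq < base + WINDOW && next_seq < NFRAMES)
        printf("send frame %d\n", next_seq++);
}

static void gbn_on_ack(int ack)   /* cumulative ACK, as in TCP */
{
    if (ack >= base)
        base = ack + 1;           /* slide the window right */
    gbn_send_ready();             /* new room: send more frames */
}

static void gbn_on_timeout(void)
{
    next_seq = base;              /* go back N: resend base..next_seq-1 */
    gbn_send_ready();
}

int main(void)
{
    gbn_send_ready();             /* sends frames 0..3 */
    gbn_on_ack(1);                /* frames 0,1 ACKed -> sends 4,5 */
    gbn_on_timeout();             /* resends frames 2..5 */
    return 0;
}

A Selective Repeat sender would instead keep one timer per outstanding frame and resend only the frame whose timer fired, which is exactly the extra state the paragraph above refers to.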
The TCP sliding-window mechanism can be summarized as follows: the TCP sender maintains a send window and transmits packets in sequence-number order; the receiver replies with acknowledgments for the packets it receives; and the sender, based on the advertised window carried in those acknowledgments, dynamically adjusts the send window size, slides the window to the right, and sends the subsequent data.
As described on the tcpipguide website, the TCP sender's buffer divides into four regions: sent and acknowledged; sent but not yet acknowledged; not yet sent but the receiver is ready for it; and not yet sent, receiver not ready. The receiver's buffer divides into three regions: received and acknowledged; not yet received but the sender is permitted to send it; and not yet received and possibly not yet sent.
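In sequence-number terms, the sender's four regions are delimited by three variables. The names below are borrowed from the kernel's struct tcp_sock, but the snippet is just illustrative arithmetic showing how the usable window (region 3) is derived:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Illustrative snapshot of sequence space (names borrowed from struct tcp_sock) */
    uint32_t snd_una = 1000;   /* first byte sent but not yet acknowledged */
    uint32_t snd_nxt = 1400;   /* next byte to send                        */
    uint32_t snd_wnd = 1000;   /* window advertised by the receiver        */

    uint32_t in_flight = snd_nxt - snd_una;           /* region 2: sent, unacked */
    uint32_t usable    = snd_una + snd_wnd - snd_nxt; /* region 3: may send now  */

    printf("in flight: %u bytes, usable window: %u bytes\n", in_flight, usable);
    /* Bytes below snd_una are region 1 (sent and acknowledged);
     * bytes beyond snd_una + snd_wnd are region 4 (receiver not ready). */
    return 0;
}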
The window-update code in the Linux 4.x kernel, which roughly matches the tcpipguide description:
/* Update our send window.
 *
 * Window update algorithm, described in RFC793/RFC1122 (used in linux-2.2
 * and in FreeBSD. NetBSD's one is even worse.) is wrong.
 */
static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32 ack,
                                 u32 ack_seq)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int flag = 0;
    u32 nwin = ntohs(tcp_hdr(skb)->window);

    if (likely(!tcp_hdr(skb)->syn))
        nwin <<= tp->rx_opt.snd_wscale;

    if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
        flag |= FLAG_WIN_UPDATE;
        tcp_update_wl(tp, ack_seq);

        if (tp->snd_wnd != nwin) {
            tp->snd_wnd = nwin;

            /* Note, it is the only place, where
             * fast path is recovered for sending TCP.
             */
            tp->pred_flags = 0;
            tcp_fast_path_check(sk);

            if (!tcp_write_queue_empty(sk))
                tcp_slow_start_after_idle_check(sk);

            if (nwin > tp->max_window) {
                tp->max_window = nwin;
                tcp_sync_mss(sk, inet_csk(sk)->icsk_pmtu_cookie);
            }
        }
    }

    tcp_snd_una_update(tp, ack);

    return flag;
}

RTT & RTO Calculation
RTT and RTO calculation is an important problem in TCP flow control. ARQ retransmits lost packets on timeout, and that timeout is the RTO. If the RTO is set too short (less than about 2 RTT), it may cause unnecessary retransmissions (the "lost" packet may still arrive shortly); if it is set too long, it hurts TCP throughput (the connection sits waiting for the lost packet to be retransmitted). The RTO calculation is based on the RTT, so the RTT estimate directly affects the RTO, and quite a few congestion algorithms depend on the RTT as well. Neither the end-to-end RTT nor the RTO of a TCP flow is a fixed value; both are running estimates over a window of time and change as time and network conditions change.
The network is like a high-speed rail system, congested at times and flowing freely at others; in network terms this shows up as RTT jitter. How RTT and RTO are computed directly affects retransmission and therefore throughput, and the calculation went through a long process of theory and practice.
The original approach (described in RFC 793, with Phil Karn and Craig Partridge later fixing the retransmission-ambiguity problem in the RTT samples) computes SRTT (smoothed RTT) and RTO as:

SRTT = α * SRTT + (1 - α) * RTT
RTO  = min[UBOUND, max[LBOUND, β * SRTT]]

where α is 0.8 to 0.9, β is 1.3 to 2.0, and UBOUND and LBOUND are upper and lower bounds on the timeout. Jacobson and Karels later refined this, eventually resulting in RFC 6298 (https://datatracker.ietf.org/doc/html/rfc6298), where SRTT, DevRTT (the RTT deviation), and RTO are computed as:
SRTT   = SRTT + α * (RTT - SRTT)
DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|
RTO    = μ * SRTT + δ * DevRTT

where α = 0.125, β = 0.25, μ = 1, and δ = 4.
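As a userspace sketch of these formulas (floating point for clarity; note that RFC 6298 also clamps the RTO to a 1-second minimum, and the kernel code below works in scaled integers instead):

#include <math.h>
#include <stdio.h>

/* Illustrative RFC 6298 estimator state; the kernel keeps scaled
 * integers (srtt_us is SRTT << 3) rather than doubles. */
struct rtt_est {
    double srtt;    /* smoothed RTT */
    double rttvar;  /* RTT variation (DevRTT) */
    int    init;    /* has the first sample arrived? */
};

static double rto_update(struct rtt_est *e, double rtt /* seconds */)
{
    const double alpha = 0.125, beta = 0.25;

    if (!e->init) {                  /* first measurement, RFC 6298 section 2.2 */
        e->srtt = rtt;
        e->rttvar = rtt / 2;
        e->init = 1;
    } else {                         /* subsequent measurements, section 2.3;   */
        e->rttvar = (1 - beta) * e->rttvar + beta * fabs(rtt - e->srtt);
        e->srtt = (1 - alpha) * e->srtt + alpha * rtt; /* rttvar uses old SRTT */
    }
    return e->srtt + 4 * e->rttvar;  /* RTO = mu * SRTT + delta * DevRTT */
}

int main(void)
{
    struct rtt_est e = {0};
    printf("RTO after 100ms sample: %.3fs\n", rto_update(&e, 0.100));
    printf("RTO after 120ms sample: %.3fs\n", rto_update(&e, 0.120));
    return 0;
}

The SRTT calculation code in the Linux 4.x kernel: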
/* Called to compute a smoothed rtt estimate. The data fed to this
 * routine either comes from timestamps, or from segments that were
 * known _not_ to have been retransmitted [see Karn/Partridge
 * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
 * piece by Van Jacobson.
 * NOTE: the next three routines used to be one big routine.
 * To save cycles in the RFC 1323 implementation it was better to break
 * it up into three procedures. -- erics
 */
static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
{
    struct tcp_sock *tp = tcp_sk(sk);
    long m = mrtt_us; /* RTT */
    u32 srtt = tp->srtt_us;

    /* The following amusing code comes from Jacobson's
     * article in SIGCOMM '88. Note that rtt and mdev
     * are scaled versions of rtt and mean deviation.
     * This is designed to be as fast as possible
     * m stands for "measurement".
     *
     * On a 1990 paper the rto value is changed to:
     * RTO = rtt + 4 * mdev
     *
     * Funny. This algorithm seems to be very broken.
     * These formulae increase RTO, when it should be decreased, increase
     * too slowly, when it should be increased quickly, decrease too quickly
     * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
     * does not matter how to _calculate_ it. Seems, it was trap
     * that VJ failed to avoid. 8)
     */
    if (srtt != 0) {
        m -= (srtt >> 3);    /* m is now error in rtt est */
        srtt += m;           /* rtt = 7/8 rtt + 1/8 new */
        if (m < 0) {
            m = -m;          /* m is now abs(error) */
            m -= (tp->mdev_us >> 2);   /* similar update on mdev */
            /* This is similar to one of Eifel findings.
             * Eifel blocks mdev updates when rtt decreases.
             * This solution is a bit different: we use finer gain
             * for mdev in this case (alpha*beta).
             * Like Eifel it also prevents growth of rto,
             * but also it limits too fast rto decreases,
             * happening in pure Eifel.
             */
            if (m > 0)
                m >>= 3;
        } else {
            m -= (tp->mdev_us >> 2);   /* similar update on mdev */
        }
        tp->mdev_us += m;              /* mdev = 3/4 mdev + 1/4 new */
        if (tp->mdev_us > tp->mdev_max_us) {
            tp->mdev_max_us = tp->mdev_us;
            if (tp->mdev_max_us > tp->rttvar_us)
                tp->rttvar_us = tp->mdev_max_us;
        }
        if (after(tp->snd_una, tp->rtt_seq)) {
            if (tp->mdev_max_us < tp->rttvar_us)
                tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
            tp->rtt_seq = tp->snd_nxt;
            tp->mdev_max_us = tcp_rto_min_us(sk);
        }
    } else {
        /* no previous measure. */
        srtt = m << 3;          /* take the measured time to be rtt */
        tp->mdev_us = m << 1;   /* make sure rto = 3*rtt */
        tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
        tp->mdev_max_us = tp->rttvar_us;
        tp->rtt_seq = tp->snd_nxt;
    }
    tp->srtt_us = max(1U, srtt);
}

The RTO calculation code in the Linux 4.x kernel:
/* Calculate rto without backoff. This is the second half of Van Jacobson's
 * routine referred to above.
 */
void tcp_set_rto(struct sock *sk)
{
    const struct tcp_sock *tp = tcp_sk(sk);
    /* Old crap is replaced with new one. 8)
     *
     * More seriously:
     * 1. If rtt variance happened to be less 50msec, it is hallucination.
     *    It cannot be less due to utterly erratic ACK generation made
     *    at least by solaris and freebsd. "Erratic ACKs" has _nothing_
     *    to do with delayed acks, because at cwnd>2 true delack timeout
     *    is invisible. Actually, Linux-2.4 also generates erratic
     *    ACKs in some circumstances.
     */
    inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);

    /* 2. Fixups made earlier cannot be right.
     *    If we do not estimate RTO correctly without them,
     *    all the algo is pure shit and should be replaced
     *    with correct one. It is exactly, which we pretend to do.
     */

    /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
     * guarantees that rto is higher.
     */
    tcp_bound_rto(sk);
}

static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
{
    return usecs_to_jiffies((tp->srtt_us >> 3) + tp->rttvar_us);
}

Congestion Control Algorithms
Because the sliding window considers only the receiver's capacity and not the network, while network bandwidth is shared and network delay fluctuates, ignoring the network would eventually collapse the whole system. It is like shipping freight between two railway stations while considering only the stations' capacity and not the tracks' carrying capacity: dispatch as many trains as you like, they will just pile up on the tracks and may paralyze the entire rail system.
A congestion control algorithm is a flow-control mechanism by which the sender adjusts its sending rate according to current network conditions. TCP congestion control algorithms have kept evolving, from the classic Reno to BBR, proposed by Google in recent years, with new algorithms appearing all the time; congestion control has been codified in RFC 5681 (https://datatracker.ietf.org/doc/html/rfc5681). Taking Linux as an example, recent kernels implement more than ten congestion control algorithms, including fairly new ones such as BBR. Early Linux kernels used Reno; from 2.6.8 the default became BIC; from 2.6.19 the default switched to CUBIC, and Linux 5.x still defaults to CUBIC. Windows likewise supports multiple congestion control algorithms, and Windows 11 also defaults to CUBIC.
How a congestion algorithm works (Reno as an example)
The overall framework of TCP congestion algorithms is essentially the same, so the classic TCP Reno algorithm (the version textbooks usually teach) serves as a brief introduction. TCP congestion control as specified in RFC 5681 comprises the slow start, congestion avoidance, fast retransmit, and fast recovery phases, which involve two basic parameters: cwnd, the congestion window size, which decides how much data is sent at a time, and ssthresh, the slow-start threshold.
Slow start. Slow start occurs in two situations. One is at the start of a flow, when cwnd < ssthresh: cwnd starts at 1 MSS (with hardware improvements, recent Linux kernels start at 10 MSS) and grows exponentially per RTT. The other is after a retransmission timeout, when cwnd is reset to its initial value and slow start is entered again. Slow start does not mean the sending rate grows slowly (the growth is exponential); my understanding is that it means probing the network's capacity from a low starting rate. To correct slow start's tendency to overshoot the network's carrying capacity, ABC (Appropriate Byte Counting) and limited slow start were proposed; see the two related RFCs: RFC 3465 (https://datatracker.ietf.org/doc/html/rfc3465) and RFC 3742 (https://datatracker.ietf.org/doc/html/rfc3742). Originally the initial ssthresh was set to a very large value, so the first slow start would only switch to congestion avoidance upon packet loss; later the HyStart optimization (hybrid slow start) appeared, which, for instance, estimates an ideal ssthresh so that the first slow start can enter congestion avoidance before losing packets (loss is too costly).
Congestion avoidance. When cwnd >= ssthresh, the flow enters the congestion avoidance phase, in which cwnd grows linearly per RTT. This is a more conservative probe for the maximum network capacity, designed not to collapse the network.
Fast retransmit. In congestion avoidance, receiving 3 duplicate ACKs suggests that a packet was probably lost and the network is mildly congested. The flow enters fast retransmit: ssthresh is set to 0.5 * cwnd, cwnd is set to ssthresh + 3 MSS, the lost packet is retransmitted, and the flow enters fast recovery.
Fast recovery. In fast recovery, if no ACK of new data arrives after the retransmission and the RTO expires, the network has not recovered: ssthresh is set to 0.5 * cwnd, cwnd is reset to its initial value, and slow start begins again. If an ACK of new data does arrive, the network has recovered: cwnd is set to ssthresh and the flow returns to congestion avoidance.
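The four phases can be condensed into a short illustrative sketch (plain C, window units of MSS, per-RTT granularity, simplified duplicate-ACK bookkeeping; a reading aid, not the kernel's implementation):

#include <stdio.h>

/* Simplified Reno responses, cwnd and ssthresh in units of MSS. */
struct reno { unsigned cwnd, ssthresh; };

static void on_ack(struct reno *c)            /* ACK of new data, once per RTT */
{
    if (c->cwnd < c->ssthresh)
        c->cwnd *= 2;                         /* slow start: exponential growth */
    else
        c->cwnd += 1;                         /* congestion avoidance: linear growth */
}

static void on_three_dup_acks(struct reno *c) /* fast retransmit */
{
    c->ssthresh = c->cwnd / 2;
    c->cwnd = c->ssthresh + 3;                /* inflate by the 3 duplicate ACKs */
    /* retransmit the lost segment, enter fast recovery */
}

static void on_recovery_ack(struct reno *c)   /* new data ACKed during recovery */
{
    c->cwnd = c->ssthresh;                    /* deflate, back to congestion avoidance */
}

static void on_rto(struct reno *c)            /* retransmission timeout */
{
    c->ssthresh = c->cwnd / 2;
    c->cwnd = 1;                              /* restart slow start */
}

int main(void)
{
    struct reno c = { 1, 64 };
    for (int rtt = 0; rtt < 8; rtt++)
        on_ack(&c);
    printf("after 8 RTTs: cwnd=%u\n", c.cwnd);
    on_three_dup_acks(&c);
    printf("after fast retransmit: cwnd=%u ssthresh=%u\n", c.cwnd, c.ssthresh);
    on_recovery_ack(&c);
    printf("after recovery: cwnd=%u\n", c.cwnd);
    on_rto(&c);
    printf("after RTO: cwnd=%u ssthresh=%u\n", c.cwnd, c.ssthresh);
    return 0;
}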
The Reno congestion-control state diagram is shown below.
Wireshark I/O graph of an iperf-generated flow using the Reno algorithm:
Comparison of congestion algorithms
Congestion control algorithms are mainly built on packet loss and delay (RTT) signals, so some algorithms are loss-sensitive, some are delay-sensitive, and some combine both. The main difference between algorithms lies in how, during congestion avoidance, they try to fit the ideal sending-rate curve without causing loss. See the comparison of algorithms at https://en.wikipedia.org/wiki/TCP_congestion_control:
Algorithm        | Feedback          | Required changes         | Use case / benefit                    | Fairness
(New) Reno       | Loss              | —                        | —                                     | Delay
Vegas            | Delay             | Sender                   | Less loss                             | Proportional
High Speed       | Loss              | Sender                   | High bandwidth                        |
BIC              | Loss              | Sender                   | High bandwidth                        |
CUBIC            | Loss              | Sender                   | High bandwidth                        |
C2TCP [10][11]   | Loss/Delay        | Sender                   | Ultra-low latency and high bandwidth  |
NATCP [12]       | Multi-bit signal  | Sender                   | Near optimal performance              |
Elastic-TCP      | Loss/Delay        | Sender                   | High bandwidth, short & long distance |
Agile-TCP        | Loss              | Sender                   | High bandwidth, short distance        |
H-TCP            | Loss              | Sender                   | High bandwidth                        |
FAST             | Delay             | Sender                   | High bandwidth                        | Proportional
Compound TCP     | Loss/Delay        | Sender                   | High bandwidth                        | Proportional
Westwood         | Loss/Delay        | Sender                   | L                                     |
Jersey           | Loss/Delay        | Sender                   | L                                     |
BBR [13]         | Delay             | Sender                   | BLVC, bufferbloat                     |
CLAMP            | Multi-bit signal  | Receiver, Router         | V                                     | Max-min
TFRC             | Loss              | Sender, Receiver         | No retransmission                     | Minimum delay
XCP              | Multi-bit signal  | Sender, Receiver, Router | BLFC                                  | Max-min
VCP              | 2-bit signal      | Sender, Receiver, Router | BLF                                   | Proportional
MaxNet           | Multi-bit signal  | Sender, Receiver, Router | BLFSC                                 | Max-min
JetMax           | Multi-bit signal  | Sender, Receiver, Router | High bandwidth                        | Max-min
RED              | Loss              | Router                   | Reduced delay                         |
ECN              | Single-bit signal | Sender, Receiver, Router | Reduced loss                          |

One more point worth raising here is fairness between TCP flows. Some congestion algorithms can grab bandwidth from others, and delay-based algorithms are especially prone to this, because they contend for bandwidth based on RTT: the smaller the RTT, the larger the share claimed. For example, of two TCP flows, the one with lower delay will grab more of the bandwidth, possibly even all of it, leaving the other flows unable to make progress. Both Reno and BIC have this grabbing problem; CUBIC probes for the ideal bandwidth with a cubic curve that is independent of RTT, so its fairness is better.
In the figure above, using the Reno algorithm with iperf generating traffic to two different servers, stream2, with its lower delay, simply grabbed most of the bandwidth.
In the figure above, using the CUBIC algorithm with iperf likewise generating traffic to two different servers, the bandwidth grab by the lower-delay stream2 is far less severe.
Congestion algorithms in the Linux kernel
For choosing a congestion algorithm on Linux 2.6.x kernels, see https://linuxgazette.net/135/pfeiffer.html; the latest 5.x kernels support several more, with brief descriptions in the kernel's Kconfig. Incidentally, the MPTCP kernel also ships several congestion algorithms for multipath scenarios.
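Besides the system-wide default, Linux also lets a program pick the algorithm per socket through the TCP_CONGESTION socket option, as long as the corresponding module is loaded and permitted; a minimal sketch:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>    /* TCP_CONGESTION */
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    char name[16] = "cubic";  /* any algorithm listed in tcp_allowed_congestion_control */
    socklen_t len = sizeof(name);

    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Read back the algorithm actually in effect for this socket. */
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) == 0)
        printf("socket congestion control: %s\n", name);

    close(fd);
    return 0;
}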
TCP congestion algorithms supported by the Linux 4.x kernel, with cubic as the default:
config DEFAULT_TCP_CONG
    string
    default "bic" if DEFAULT_BIC
    default "cubic" if DEFAULT_CUBIC
    default "htcp" if DEFAULT_HTCP
    default "hybla" if DEFAULT_HYBLA
    default "vegas" if DEFAULT_VEGAS
    default "westwood" if DEFAULT_WESTWOOD
    default "veno" if DEFAULT_VENO
    default "reno" if DEFAULT_RENO
    default "dctcp" if DEFAULT_DCTCP
    default "cdg" if DEFAULT_CDG
    default "bbr" if DEFAULT_BBR
    default "cubic"

MPTCP congestion algorithms supported by the Linux MPTCP v0.95 kernel:
config TCP_CONG_LIA
    tristate "MPTCP Linked Increase"
    depends on MPTCP
    default n
    ---help---
    MultiPath TCP Linked Increase Congestion Control
    To enable it, just put 'lia' in tcp_congestion_control

config TCP_CONG_OLIA
    tristate "MPTCP Opportunistic Linked Increase"
    depends on MPTCP
    default n
    ---help---
    MultiPath TCP Opportunistic Linked Increase Congestion Control
    To enable it, just put 'olia' in tcp_congestion_control

config TCP_CONG_WVEGAS
    tristate "MPTCP WVEGAS CONGESTION CONTROL"
    depends on MPTCP
    default n
    ---help---
    wVegas congestion control for MPTCP
    To enable it, just put 'wvegas' in tcp_congestion_control

config TCP_CONG_BALIA
    tristate "MPTCP BALIA CONGESTION CONTROL"
    depends on MPTCP
    default n
    ---help---
    Multipath TCP Balanced Linked Adaptation Congestion Control
    To enable it, just put 'balia' in tcp_congestion_control

config TCP_CONG_MCTCPDESYNC
    tristate "DESYNCHRONIZED MCTCP CONGESTION CONTROL (EXPERIMENTAL)"
    depends on MPTCP
    default n
    ---help---
    Desynchronized MultiChannel TCP Congestion Control. This is experimental
    code that only supports single path and must have set mptcp_ndiffports
    larger than one.
    To enable it, just put 'mctcpdesync' in tcp_congestion_control
    For further details see:
    http://ieeexplore.ieee.org/abstract/document/6911722/
    https://doi.org/10.1016/j.comcom.2015.07.010

When adding a congestion algorithm to the Linux 4.x kernel, the reference point is the struct below; an algorithm is implemented mainly by overriding its function pointers:
struct tcp_congestion_ops {
    struct list_head list;
    u32 key;
    u32 flags;

    /* initialize private data (optional) */
    void (*init)(struct sock *sk);
    /* cleanup private data  (optional) */
    void (*release)(struct sock *sk);

    /* return slow start threshold (required) */
    u32 (*ssthresh)(struct sock *sk);
    /* do new cwnd calculation (required) */
    void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
    /* call before changing ca_state (optional) */
    void (*set_state)(struct sock *sk, u8 new_state);
    /* call when cwnd event occurs (optional) */
    void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
    /* call when ack arrives (optional) */
    void (*in_ack_event)(struct sock *sk, u32 flags);
    /* new value of cwnd after loss (required) */
    u32 (*undo_cwnd)(struct sock *sk);
    /* hook for packet ack accounting (optional) */
    void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
    /* override sysctl_tcp_min_tso_segs */
    u32 (*min_tso_segs)(struct sock *sk);
    /* returns the multiplier used in tcp_sndbuf_expand (optional) */
    u32 (*sndbuf_expand)(struct sock *sk);
    /* call when packets are delivered to update cwnd and pacing rate,
     * after all the ca_state processing. (optional)
     */
    void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
    /* get info for inet_diag (optional) */
    size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
                       union tcp_cc_info *info);

    char name[TCP_CA_NAME_MAX];
    struct module *owner;
};
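As an illustration of how little is required, the following sketch registers a hypothetical "mini_reno" algorithm (the name and module are mine, for illustration only) that fills in just the three required hooks by reusing helpers the kernel already exports for Reno; a real algorithm would add private state and override more of the hooks above:

/* A minimal sketch of a congestion-control module, assuming Linux 4.x APIs. */
#include <linux/module.h>
#include <net/tcp.h>

static struct tcp_congestion_ops mini_reno __read_mostly = {
    .ssthresh   = tcp_reno_ssthresh,     /* required: halve cwnd on loss */
    .cong_avoid = tcp_reno_cong_avoid,   /* required: slow start + AIMD */
    .undo_cwnd  = tcp_reno_undo_cwnd,    /* required: restore cwnd on spurious loss */
    .owner      = THIS_MODULE,
    .name       = "mini_reno",
};

static int __init mini_reno_register(void)
{
    return tcp_register_congestion_control(&mini_reno);
}

static void __exit mini_reno_unregister(void)
{
    tcp_unregister_congestion_control(&mini_reno);
}

module_init(mini_reno_register);
module_exit(mini_reno_unregister);
MODULE_LICENSE("GPL");

The CUBIC algorithm in the Linux 4.x kernel overrides the hooks as follows: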
static struct tcp_congestion_ops cubictcp __read_mostly = {
    .init       = bictcp_init,
    .ssthresh   = bictcp_recalc_ssthresh,
    .cong_avoid = bictcp_cong_avoid,
    .set_state  = bictcp_state,
    .undo_cwnd  = tcp_reno_undo_cwnd,
    .cwnd_event = bictcp_cwnd_event,
    .pkts_acked = bictcp_acked,
    .owner      = THIS_MODULE,
    .name       = "cubic",
};

Key Linux system parameters for flow control
# Maximum ancillary buffer size allowed per socket
net.core.optmem_max = 20480
# Default receive-socket buffer size, in bytes
net.core.rmem_default = 229376
# Maximum receive-socket buffer size, in bytes
net.core.rmem_max = 16777216
# Upper limit on the socket listen() backlog
net.core.somaxconn = 128
# Default TCP congestion control algorithm
net.ipv4.tcp_congestion_control = cubic
# Whether TCP window scaling is enabled
net.ipv4.tcp_window_scaling = 1
# TCP memory limits, in pages (1 page = 4096 bytes); the 3 values are low, pressure, high
# low: below this many pages, TCP does not consider freeing memory
# pressure: above this value, TCP tries to stabilize its memory usage back below "low"
# high: the number of pages TCP may use to queue buffered data packets; beyond it, socket
#       allocation fails and "TCP: too many of orphaned sockets" appears
net.ipv4.tcp_mem = 381705 508942 763410
# TCP receive buffer, in bytes; the 3 values are min, default, max
net.ipv4.tcp_rmem = 10240 87380 16777216
# TCP send buffer, in bytes; the 3 values are min, default, max
net.ipv4.tcp_wmem = 10240 87380 16777216
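One practical consequence of these knobs: a receive buffer explicitly requested with SO_RCVBUF is capped by net.core.rmem_max (sockets that never set it are instead auto-tuned within net.ipv4.tcp_rmem), and the kernel stores double the requested value to cover bookkeeping overhead. A quick check:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int req = 32 * 1024 * 1024;   /* ask for 32 MB */
    int got;
    socklen_t len = sizeof(got);

    /* The effective value is min(2 * req, net.core.rmem_max-derived cap). */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("requested %d bytes, effective %d bytes\n", req, got);

    close(fd);
    return 0;
}

Reference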
https://tools.ietf.org/html/std7
https://datatracker.ietf.org/doc/html/rfc1122
https://datatracker.ietf.org/doc/html/rfc1323
https://datatracker.ietf.org/doc/html/rfc1379
https://datatracker.ietf.org/doc/html/rfc1948
https://datatracker.ietf.org/doc/html/rfc2018
https://datatracker.ietf.org/doc/html/rfc5681
https://datatracker.ietf.org/doc/html/rfc6247
https://datatracker.ietf.org/doc/html/rfc6298
https://datatracker.ietf.org/doc/html/rfc6824
https://datatracker.ietf.org/doc/html/rfc7323
https://datatracker.ietf.org/doc/html/rfc7414
https://en.wikipedia.org/wiki/Transmission_Control_Protocol
https://linuxgazette.net/135/pfeiffer.html
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech.htm
https://www2.tkn.tu-berlin.de/teaching/rn/animations/gbn_sr/
https://zhuanlan.zhihu.com/p/144273871
http://lxr.linux.no/linux+v3.2.8/Documentation/networking/ip-sysctl.txt#L464