Slow SCP through network

I don't want to bug Linode's staff too much with support tickets, so I hope someone can help me out here.

The main issue is that "scp" performance varies wildly: one transfer takes 8 seconds, the next takes 4 minutes. I am using "scp" to copy data back and forth between a node in Germany and Linode.

Here's an example. These two commands were executed within a minute of each other:

/srv# time scp -p -r -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test
file1                                                                                                                                                                      100% 2184KB   2.1MB/s   00:01

real    0m6.368s
user    0m0.078s
sys     0m0.033s

/srv# time scp -p -r -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test
file1                                                                                                                                                                      100% 2184KB 546.1KB/s   00:04

real    1m5.746s
user    0m0.062s
sys     0m0.048s

On the latter one, SCP would go from 0% to 100% within seconds, and then hang at 100% for minutes. Running SCP with -vvv gives me:

file1                                                                                                                                                                      100% 2184KB 728.1KB/s   00:03
debug3: Wrote 28960 bytes for a total of 156975
debug3: Wrote 28960 bytes for a total of 185935
debug3: Wrote 27512 bytes for a total of 213447
debug3: Wrote 28960 bytes for a total of 242407
debug2: channel 0: rcvd adjust 114688
debug3: Wrote 27512 bytes for a total of 269919
debug3: Wrote 24616 bytes for a total of 294535
debug3: Wrote 24616 bytes for a total of 319151
[...]
Transferred: sent 2241120, received 2712 bytes, in 59.2 seconds
Bytes per second: sent 37832.0, received 45.8
debug1: Exit status 0

real    1m1.056s
user    0m0.084s
sys     0m0.046s

for an instance where it takes long, and

file1                                                                                                                                                                      100% 2184KB   2.1MB/s   00:01
debug3: Wrote 94120 bytes for a total of 400239
debug3: Wrote 77848 bytes for a total of 478087
debug2: channel 0: rcvd adjust 114688
debug3: Wrote 131264 bytes for a total of 609351
debug3: Wrote 32816 bytes for a total of 642167
debug3: Wrote 32816 bytes for a total of 674983
debug2: channel 0: rcvd adjust 131072
debug3: Wrote 65632 bytes for a total of 740615
debug3: Wrote 32816 bytes for a total of 773431
[...]
Transferred: sent 2241120, received 2712 bytes, in 3.8 seconds
Bytes per second: sent 592957.3, received 717.5
debug1: Exit status 0

real    0m5.235s
user    0m0.086s
sys     0m0.028s

for an instance where it went fast. From that, I conclude that SCP can get the data ready very quickly, and spends most of its time waiting to be able to push the data out of the network interface.
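
As a rough cross-check of that conclusion, the summary lines of the two -vvv runs can be turned into average rates. This is only a sketch: "throughput" is a made-up helper name, and it simply divides the "sent" byte count by the elapsed seconds that ssh reported above.

```shell
# throughput BYTES SECONDS
# Prints the average transfer rate in KB/s (1 KB = 1024 bytes).
# "throughput" is a hypothetical helper name, not part of scp or ssh.
throughput() {
    LC_ALL=C awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f KB/s\n", bytes / secs / 1024 }'
}

throughput 2241120 59.2   # slow run: bytes sent and seconds from the -vvv summary
throughput 2241120 3.8    # fast run
```

For the slow run the average comes out far below the 728.1KB/s shown by the progress meter, which fits the picture of the meter reaching 100% long before the data has actually left.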

With that, I ran a couple of traceroutes, and I'm getting different routes every time:

/srv# traceroute node.in.europe
traceroute to node.in.europe (87.230.ooo.ooo), 30 hops max, 60 byte packets
 1  a1.7.1243.static.theplanet.com (67.18.7.161)  0.551 ms  0.658 ms  0.645 ms
 2  xe-2-0-0.car03.dllstx2.networklayer.com (67.18.7.89)  0.178 ms  0.206 ms  0.191 ms
 3  po101.dsr02.dllstx2.networklayer.com (70.87.254.77)  0.582 ms  0.661 ms  0.611 ms
 4  te4-3.dsr02.dllstx3.networklayer.com (70.87.255.129)  0.768 ms  0.760 ms te3-2.dsr02.dllstx3.networklayer.com (70.87.253.133)  0.812 ms
 5  ae17.bbr02.eq01.dal03.networklayer.com (173.192.18.230)  50.695 ms  50.724 ms ae17.bbr01.eq01.dal03.networklayer.com (173.192.18.226)  0.477 ms
 6  dls-bb1-link.telia.net (213.248.102.173)  0.490 ms  0.548 ms  0.534 ms
 7  ash-bb1-link.telia.net (213.155.133.178)  60.992 ms  60.413 ms ae2-20G.scr2.DAL1.gblx.net (67.16.141.237)  5.790 ms
 8  ldn-bb1-link.telia.net (80.91.246.69)  109.003 ms * po6.ar4.AMS2.gblx.net (67.17.107.174)  124.616 ms
 9  ldn-b5-link.telia.net (80.91.248.216)  109.141 ms  109.125 ms *
10  * * *
11  xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14)  130.149 ms xe-0-2-0.cr-merak.fra2.hosteurope.de (176.28.4.2)  123.437 ms xe-0-0-1.dr-master.r1.cgn3.hosteurope.de (176.28.4.14)  128.728 ms
12  xe-2-2-0.cr-pollux.cgn3.hosteurope.de (80.237.129.169)  128.345 ms  128.334 ms  128.287 ms

I can repeat the traceroute, and it'll show different hosts every time; however, the dropouts are usually close to the Germany node, in the telia.net network.

Now, a traceroute from the Germany node to Linode takes an entirely different route, avoiding telia.net altogether:

# traceroute node.in.the.us
traceroute to node.in.the.us (66.228.ooo.ooo), 30 hops max, 40 byte packets
 1  * * *
 2  xe-3-3-0.cr-pollux.cgn3.hosteurope.de (176.28.4.9)  0.232 ms  0.231 ms  0.213 ms
 3  xe-0-2-0.cr-antares.ams1.hosteurope.de (80.237.129.182)  4.509 ms  4.514 ms xe-0-3-0.cr-antares.ams1.hosteurope.de (80.237.129.118)  4.523 ms
 4  tengigabitethernet6-2.ar4.ams2.gblx.net (206.165.75.1)  4.854 ms  4.816 ms  4.809 ms
 5  ar4.scr4.AMS2.gblx.net (67.17.107.173)  4.656 ms  4.643 ms  4.634 ms
 6  ae13.scr4.NYC1.gblx.net (67.16.166.214)  83.645 ms  83.511 ms  83.483 ms
 7  e5-1-30G.ar9.NYC1.gblx.net (67.16.142.54)  82.889 ms  80.544 ms  89.375 ms
 8  softlayer-technologies-inc.ethernet11-3.ar9.nyc1.gblx.net (206.165.75.234)  79.368 ms  79.391 ms  79.359 ms
 9  ae7.bbr02.tl01.nyc01.networklayer.com (173.192.18.177)  86.988 ms  86.961 ms  86.947 ms
10  ae1.bbr01.eq01.chi01.networklayer.com (173.192.18.132)  106.135 ms  106.122 ms  106.080 ms
11  ae20.bbr01.eq01.dal03.networklayer.com (173.192.18.136)  125.737 ms  125.731 ms  125.687 ms
12  po31.dsr01.dllstx3.networklayer.com (173.192.18.225)  121.404 ms  118.953 ms  122.796 ms
13  te4-4.dsr02.dllstx2.networklayer.com (70.87.255.134)  125.490 ms * te2-1.dsr01.dllstx2.networklayer.com (70.87.255.66)  129.907 ms
14  po2.car01.dllstx2.networklayer.com (70.87.254.78)  126.373 ms po1.car01.dllstx2.networklayer.com (70.87.254.74)  125.014 ms po2.car01.dllstx2.networklayer.com (70.87.254.78)  168.917 ms
15  5a.7.1243.static.theplanet.com (67.18.7.90)  128.726 ms  125.728 ms  129.486 ms

I guess my question basically is - what can I do? Is routing into the telia.net network something any of these providers can influence? Or am I barking up the wrong tree altogether and this isn't the real reason I'm getting these differences in performance?

8 Replies

@stw:

With that, I ran a couple of traceroutes, and I'm getting different routes every time:

I can repeat the traceroute, and it'll show different hosts every time; however, the dropouts are usually close to the Germany node, in the telia.net network.

Looks like the varying routes already start in the networklayer.com network, as the traffic sometimes seems to go out through telia.net and sometimes through gblx.net? (Based on the varying hosts at the same hop count in the trace above.)

Speculation follows:

Seems like it's networklayer.com that for whatever reason switches back and forth between these two… which quite possibly is because the route to whichever one they prefer is flapping or something to that effect.

It's unclear whether your problem is related to which path is used, i.e. whether one is actually notably better than the other, or whether the problem is the switching back and forth itself.

@stw:

I don't want to bug Linode's staff too much with support tickets
And yet Linode's Accounting Dept bugs me every month wanting payment.

You pay for service - don't be afraid to use it. The worst that can happen is they say it's not a hardware/infrastructure problem, so you're on your own.

I would lean towards a network thing, as well. scp's progress indicator is based on when the network stack accepts the data, not necessarily when it is actually delivered. (Try scp'ing a file from home to somewhere else… it sits on 100% like it's a Windows OS installation.) This smells a lot like inconvenient packet loss.

Try laying down a ping while doing the scp, or maybe even mtr. Whatever is causing the routing to change is probably also dropping packets for a few seconds.
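
One way to line the ping up with a single transfer is sketched below. It is a minimal example under a few assumptions: "scp_with_ping" is a made-up helper name, ping is the Linux iputils version (-c for count, -W for a timeout in seconds), and the login is key-based, since an scp started in the background cannot reliably prompt for a password.

```shell
# scp_with_ping HOST SCP_ARGS...
# Hypothetical helper: starts scp in the background and sends one ping per
# second to HOST until scp exits, then prints the loss seen during the copy.
# Assumes Linux iputils ping (-c count, -W timeout in seconds).
scp_with_ping() {
    host="$1"; shift
    sent=0; received=0

    scp "$@" &
    scp_pid=$!

    # Probe once per second for as long as the scp process is alive.
    while kill -0 "$scp_pid" 2>/dev/null; do
        sent=$((sent + 1))
        if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
            received=$((received + 1))
        fi
        sleep 1
    done

    wait "$scp_pid"
    scp_status=$?

    LC_ALL=C awk -v s="$sent" -v r="$received" 'BEGIN {
        loss = (s > 0) ? 100 * (s - r) / s : 0
        printf "%d sent, %d received, %.1f%% loss\n", s, r, loss
    }'
    return "$scp_status"
}
```

It would be called with the host to ping followed by the usual scp arguments, for example `scp_with_ping node.in.europe -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test`. For a per-hop view, `mtr --report --report-cycles 60 node.in.europe` is an alternative, assuming mtr is installed.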

First of all, thanks everyone for your insights.

@hawk7000:

Looks like the varying routes already start in the networklayer.com network, as the traffic sometimes seems to go out through telia.net and sometimes through gblx.net? (Based on the varying hosts at the same hop count in the trace above.)

That's true - I noticed some of the replies came from different hosts, but I didn't see a pattern in there. I agree with you, networklayer.com is probably switching routes between telia.net and gblx.net - which makes it hard to tell which network (telia.net or gblx.net) makes SCP take this long - or whether it's the switching itself that causes it.

It still puzzles me that sometimes SCP finishes within seconds, and sometimes only after minutes have passed - yet the routes change within a traceroute. I'd assume that at some point, the speed of SCP would pick up if the route changes, or at least that it remains consistently low if route flapping itself is the issue - but instead the speed is either consistently slow, or consistently fast during a transfer.

@hoopycat:

This smells a lot like inconvenient packet loss.

Try laying down a ping while doing the scp, or maybe even mtr. Whatever is causing the routing to change is probably also dropping packets for a few seconds.

That's a good idea - it's funny how I use traceroute and all, and then forget to use one of the most basic tools. I guess I neglected that because I assumed that commercial connections could not possibly have packet loss, and that it'd be a problem confined to poor ADSL lines.

Anyway, you were correct; I ran a ping during the scp, and it would have a packet loss of around 20%.

For a 25 second SCP, I got

--- node.in.europe ping statistics ---
26 packets transmitted, 20 received, 23% packet loss, time 25026ms
rtt min/avg/max/mdev = 127.911/129.334/131.692/0.947 ms

and for a 2 minute SCP I got

--- node.in.europe ping statistics ---
140 packets transmitted, 114 received, 18% packet loss, time 139113ms
rtt min/avg/max/mdev = 126.246/128.959/133.576/1.009 ms

None of the replies were delivered out of order. During the "fast" scp (8 seconds) I had no packet loss. I tried running SCP over a different port (1235) to see whether port 1234 was being throttled, but I got the same figures.

So, thanks to you I realized that the problem is (pretty significant?) packet loss, even though I am not sure, and probably can't find out, whether it's caused by congestion or by the route switching. With that, is there anything I can do? Is this something Linode (or the German hoster) has any influence over? As in, could (and would) Linode choose a different route, or is it out of their hands anyway (in which case I wouldn't bother asking), because it's no longer in their network?

Telia.net is in Stockholm; networklayer.com's whois information is proxied (Domains By Proxy), which strikes me as a bit odd, seeing as hiding contact information is something I would only expect individuals to do. Still, who would have more influence over the route - Linode or the provider in Germany?

Oh wow. TCP performance tends to decay pretty hard beyond about 5% packet loss, so the fact that it is working at all ought to be a pleasant surprise. Keep in mind that a lot of this is hidden from you because of a very large transmit buffer (see the Send-Q column on netstat -nt)… what looks smooth and constant from scp's perspective is very likely fits-and-starts under the hood. (scp really crams a lot into the sendq.)
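
To put a rough number on that, the Mathis et al. approximation for steady-state TCP throughput can be applied to the figures in this thread. Treat it as a back-of-the-envelope sketch: "mathis_bps" is a made-up helper name, the 1448-byte MSS is inferred from the debug3 write sizes (28960, 27512 and 24616 are all multiples of 1448), and the RTT of about 129 ms and loss of about 20% come from the ping runs above. The model is meant for low, random loss and ignores retransmission timeouts, so at 20% it is an order-of-magnitude guide at best.

```shell
# mathis_bps MSS_BYTES RTT_SECONDS LOSS_FRACTION
# Prints the Mathis et al. estimate in bytes per second:
#   rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(loss)
# "mathis_bps" is a hypothetical helper name.
mathis_bps() {
    LC_ALL=C awk -v mss="$1" -v rtt="$2" -v p="$3" \
        'BEGIN { printf "%.0f\n", (mss / rtt) * sqrt(1.5) / sqrt(p) }'
}

# MSS inferred from the debug3 write sizes; RTT and loss from the ping runs.
mathis_bps 1448 0.129 0.20
```

The estimate lands in the tens of kilobytes per second, the same order of magnitude as the 37832 bytes per second that ssh reported for the slow run.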

This is probably worth tickets from both ends. In general, contact the party/parties with whom you have a business relationship. Neither Softlayer (theplanet.com, networklayer.com) nor Global Crossing nor Telia will deal with you directly, so start from the ends. (This is also handy because, in all but the most trivial cases, the return path will be totally different than the forward path, and packet loss could occur on either with similar effect. Did the packet get lost on the way there, or did the acknowledgement get lost on the way here?)

@hoopycat:

… (This is also handy because, in all but the most trivial cases, the return path will be totally different than the forward path, and packet loss could occur on either with similar effect. Did the packet get lost on the way there, or did the acknowledgement get lost on the way here?)
…Use 2ping and find out!

@hoopycat:

(scp really crams a lot into the sendq.)
If you use the -l (lower-case L) option, it seems to prevent scp from doing that. I've found it to be useful for getting more honest status reports from scp when transferring small files (which scp would normally just report 100% completion on immediately, as it's dumped the entire file into the send queue).
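
For reference, scp's -l takes its limit in Kbit/s, while the progress meter reports KB/s, so a small wrapper can do the conversion. This is a sketch only; "scp_limited" is a made-up helper name, and the rate in the usage note is an arbitrary example.

```shell
# scp_limited KBYTES_PER_SEC SCP_ARGS...
# Hypothetical helper: converts a rate in KB/s (the unit scp's progress meter
# shows) into the Kbit/s value that scp -l expects, then runs scp with it.
scp_limited() {
    kbytes_per_sec="$1"; shift
    scp -l "$((kbytes_per_sec * 8))" "$@"
}
```

For example, `scp_limited 500 -P 1234 -i /root/a.pem file1 node.in.europe:/tmp/test` would run scp with `-l 4000`.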

Hi folks, I have found a solution for this, but only for the Windows app WinSCP. Still, maybe it will point you in a direction for the Linux scp command (although I did not find a matching option to tweak there, or for the ssh command):

I found this - https://winscp.net/forum/viewtopic.php?t=25705, meaning the issue is at the SCP/SSH level.
So I disabled "Connection -> Optimize connection buffer size" in the WinSCP GUI, under Site Manager > select the needed site > Edit > Advanced > Connection pane.

With this change, my download speed (from the Linode server to my PC) peaked at 50 Mbps and averaged about 35-40 Mbps; without it, it was about 12-13 Mbps.
https://winscp.net/eng/docs/ui_login_connection

FYI.
