Can we have a way to turn off passive health checks on NodeBalancers? They pose a pretty severe security vulnerability.
The problem is that although a 500 response can indicate that a server is misbehaving and needs to be replaced, it often just means there's a bug in your code. In that case, every request that doesn't hit the specific buggy endpoint would still be served normally, so removing the node from circulation is counterproductive at best and disastrous at worst.
In the worst case, this turns inconsequential errors into denial-of-service (DoS) attack vectors. Suppose I have a bug in my webapp such that someone navigating to, say, /foo/bar/ will cause a 500 error. The rest of the app works fine, and in fact /foo/bar/ may not even be a path that I expect any user to hit. But if someone discovers this, and knows that I'm using a Linode NodeBalancer, they can just send a stream of requests for /foo/bar/. Each of these requests will cause whichever node it reaches to return a 500 and be removed from circulation; within a few dozen requests, all my nodes will be marked as unhealthy and my entire application will be offline.
We tested this against a toy application with a deliberate bug, and confirmed that the NodeBalancer would even remove the last surviving node from circulation.
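To make the escalation concrete, here is a toy model of a round-robin balancer with passive health checks. It is a simplified sketch, not Linode's implementation: it assumes that a single 5xx ejects a node immediately and that nothing returns the node to rotation during the attack. `ToyBalancer`, `backend`, and `requests_to_take_down` are hypothetical names. The key point it captures is that every node runs the same code, so every node fails the same way on the same path.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

BUGGY_PATH = "/foo/bar/"  # hypothetical endpoint with a bug


def backend(path: str) -> int:
    """Every node runs identical code, so every node 500s on the same path."""
    return 500 if path == BUGGY_PATH else 200


@dataclass
class ToyBalancer:
    """Round-robin balancer with a simplified passive health check."""

    nodes: list[str]
    healthy: set[str] = field(init=False)
    _cursor: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.healthy = set(self.nodes)

    def _pick(self) -> Optional[str]:
        # Walk the ring at most once, skipping nodes already ejected.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._cursor % len(self.nodes)]
            self._cursor += 1
            if node in self.healthy:
                return node
        return None

    def handle(self, path: str) -> Optional[int]:
        node = self._pick()
        if node is None:
            return None  # nothing left in rotation: the whole app is down
        status = backend(path)
        if 500 <= status <= 599:
            # Assumed passive check: any 5xx ejects the node that served it.
            self.healthy.discard(node)
        return status


def requests_to_take_down(balancer: ToyBalancer) -> int:
    """Send requests to the buggy path until no healthy node remains."""
    sent = 0
    while balancer.healthy:
        balancer.handle(BUGGY_PATH)
        sent += 1
    return sent


if __name__ == "__main__":
    lb = ToyBalancer(["node-a", "node-b", "node-c"])
    print(requests_to_take_down(lb))  # 3
    print(lb.handle("/"))  # None: even healthy paths are now unreachable
```

With three nodes, three requests to `/foo/bar/` empty the rotation in this model, after which even `/` gets no response. The real number of requests depends on thresholds, retries, and recovery behavior that this model leaves out.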
This doesn't even require malice - you have a minor bug, several users stumble across it at the same time, some of them reload a few times to see if the problem goes away, and instead your entire application goes away. Or, if you DO have malicious users, they wouldn't need a botnet or even a real computer - they could make enough requests to DoS you from a 1990s cell phone.
We brought this up in a support ticket with Linode a while ago, but were told that we should just refrain from using 5XX status codes, which is nonsensical advice - obviously we never intend for our software to fail, but bugs are an inevitable part of software development, and 5XX codes are precisely what you should respond with when there's a server-side error.
Right now we're dealing with it by having nginx replace all 500s with the otherwise-unused 418 and adapting our client-side code to treat 418s as 500s, but A) this is a gross hack that we shouldn't have to do, and B) this is a very fragile solution - every time we change our nginx configuration we have to make sure there are no cases where a 5XX slips through, because even one would be a vulnerability.
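The workaround above does the 500-to-418 remapping in nginx. As an illustration of the same idea (not the setup described above), here is a minimal sketch of doing it inside a Python WSGI app instead, plus the client-side check that treats 418 as a server error. `remap_5xx`, `is_server_error`, and `MASKED_STATUS` are hypothetical names.

```python
import sys

# The "otherwise-unused" status from the workaround; the reason phrase is our choice.
MASKED_STATUS = "418 I'm a teapot"


def remap_5xx(app):
    """Hypothetical WSGI middleware: report every 5xx from `app` as 418."""

    def wrapped(environ, start_response):
        def masking_start_response(status, headers, exc_info=None):
            # Rewrite any 5xx status line before it reaches the server.
            if status.startswith("5"):
                status = MASKED_STATUS
            return start_response(status, headers, exc_info)

        try:
            return app(environ, masking_start_response)
        except Exception:
            # Without this, an unhandled exception would become a 500 generated
            # by the WSGI server itself, outside this middleware's control.
            # Exceptions raised later, while iterating the body, are NOT caught.
            start_response(
                MASKED_STATUS, [("Content-Type", "text/plain")], sys.exc_info()
            )
            return [b"Internal error"]

    return wrapped


def is_server_error(status_code: int) -> bool:
    """Client-side counterpart: treat the masked 418 like any 5xx."""
    return status_code == 418 or 500 <= status_code <= 599
```

An in-process version like this also shows why the approach is fragile: it cannot see 5xx responses that nginx or the app server generate on their own (for example, a 502 when the upstream is unreachable), and it does not catch exceptions raised while a streaming response body is being iterated. Each of those would still reach the NodeBalancer as a 5xx.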
You can see some notes from our previous discussion here:
https://gist.github.com/rduplain/8be86b4df8cbfd7d1830

So could we please get a way to disable these checks in the NodeBalancer configuration?
Thanks,
Dan