network_outage_logbook [2016/04/29 02:58]
iwilcox 2016-04-28 update
-BW=Bristol Wireless. ​ PoP=point of presence. 
-==== 2016-04-28 ==== 
-Not really a new outage so much as the old one continuing, but we have an explanation and a workaround: 
-Every time we send an ARP "who-has ''​''" request, two OpenMesh boxes reply claiming to have it: ''AC:86:74:57:3C:92'' (the correct one) and ''AC:86:74:13:A6:F2'' (the wrong one).  Typically the fastest reply wins, and from the router's position in the network the fastest is more often than not the correct one (though it varies a lot, without much obvious explanation).  The winner dictates our outbound route until the ARP entry is revalidated.
-If we don't make ARP requests and instead add a private permanent entry to our table, mapping ''​''​ to ''​AC:​86:​74:​57:​3C:​92'',​ we reliably get routed out via Spectrum, and that's our workaround for now.  Tarim noted the effect can be achieved by waiting for the correct ARP entry then constantly refreshing it without revalidating,​ by leaving an ''​arping''​ running. 
-This explains: 
-  * long-lived (tens of seconds and above) connections dropping 
-  * the shorter delays observed in replies to our DHCP requests (the first request goes to and might use the wrong MAC; since ''​AC:​86:​74:​13:​A6:​F2''​ is not always reachable, the first may go unanswered; in that case the second always goes to the limited broadcast address so will always be answered by ''​AC:​86:​74:​57:​3C:​92''​) 
-It does not explain the [[#​2016-04-19 |total DHCP outage 2016-04-19]]. 
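The duplicate-reply symptom and the permanent-entry workaround can be sketched as follows. This is only an illustration under assumptions: the gateway address ''192.0.2.1'' is a documentation-range placeholder (the real address is not given here), ''eth0'' is a hypothetical interface name, and the ''ip neigh'' command assumes a Linux router with iproute2. The command is only built, not run.

```python
from collections import defaultdict


def conflicting_claims(replies):
    """Return {ip: set_of_macs} for every IP claimed by more than one MAC."""
    claims = defaultdict(set)
    for ip, mac in replies:
        claims[ip].add(mac.upper())
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}


def pin_command(ip, mac, dev):
    """Build (but do not run) the argv for a permanent neighbour entry."""
    return ["ip", "neigh", "replace", ip, "lladdr", mac,
            "nud", "permanent", "dev", dev]


# Placeholder address from the documentation range, NOT the real gateway.
GATEWAY = "192.0.2.1"

# The two MACs are the ones recorded in the log entry above.
replies = [
    (GATEWAY, "AC:86:74:57:3C:92"),  # the correct box
    (GATEWAY, "AC:86:74:13:A6:F2"),  # the wrong box
]

conflicts = conflicting_claims(replies)
print(conflicts)

# "eth0" is a hypothetical interface name.
command = pin_command(GATEWAY, "AC:86:74:57:3C:92", "eth0")
print(" ".join(command))
```

Any IP that shows up in ''conflicts'' has more than one box answering for it, which is the condition that makes the outbound route depend on which reply arrives first.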
-==== 2016-04-26 ==== 
-From ~12:00 to at least 2016-04-27 13:00, same symptoms as previous disruption. 
-Again mailed the list asking someone present to power-cycle our G11 mesh unit.  Again unknown whether anyone did.  With the route changing back and forth between Zen and Spectrum again, perhaps it'd need power-cycling so often that it's unworkable even as a temporary fix. 
-With what little connectivity there was, I think I determined that something upstream (BW, or one of BW's ISPs?) is intercepting our DNS requests. ​ Lookups of ''​myip.opendns.com''​ specifically directed at ''​resolver1.opendns.com''​ come back ''​NXDOMAIN'',​ strongly suggesting they'​re in fact being answered by an intermediary which is trying to look them up using ''​auth1.opendns.com''​ instead. ​ This would mean we're not as independent of BW's infrastructure as we'd like to be.  (New graph below confirms the G11 mesh unit fails on ~10% of lookups during these disruptions,​ which seems worth bypassing.) 
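The kind of check described above (send a lookup straight to one named resolver and look at the response code) can be sketched with nothing but the standard library. This is a rough sketch, not the tool actually used for the test: the function names are made up for illustration, it only reads the RCODE from the reply header, and it does no retries or answer parsing.

```python
import socket
import struct

# A few RCODE values from RFC 1035.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qid=0x1234, qtype=1):
    """Encode a one-question DNS query (type A by default, class IN)."""
    # Flags 0x0100: standard query with "recursion desired" set.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def rcode(response):
    """Return the response code name from the low 4 bits of the flags word."""
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(server_ip, name, timeout=3.0):
    """Send the query over UDP directly to server_ip, bypassing the system resolver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server_ip, 53))
        response, _ = sock.recvfrom(512)
    return rcode(response)
```

Calling ''ask()'' with the address of ''resolver1.opendns.com'' and the name ''myip.opendns.com'' would be expected to return ''NOERROR'' on an untampered path; an ''NXDOMAIN'' there is the symptom described above.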
-There was also a crazy load spike overnight which needs investigating,​ but doesn'​t look like the cause of anything. 
-==== 2016-04-22 ==== 
-From 04:50 the BW kit was misbehaving:​ 
-  * Upstream connectivity was patchy at best from 04:30 up until about 10:30; from 10:30 barely any packets got through at all.  Specifically,​ DHCP requests (made every five minutes owing to BW granting very short leases) ceased to be answered in a timely manner, requiring multiple attempts. 
-  * Over the same period, BW's DNS was failing even when there was a connection. ​ This may be related to a routing change at BW; up until at least 04:30 our PoP was only ever Spectrum Internet (''​''​);​ between 04:30 and 14:20 it changed back and forth between Spectrum (but ''​'',​ not ''​.75''​) and Zen (''​''​). ​ It seems likely that the BW mesh unit in G11 ("​NEW1"​) failed to pick up on this change, and was trying to forward requests to Spectrum over Zen's connection, and being ignored. ​ Perhaps that's also somehow responsible for the DHCP lag. 
-For DNS, I've changed the router to use OpenDNS and Google'​s DNS for now which (when there'​s an upstream connection at all) should work around the BW mesh box's failure to adapt. 
-Mailed the list asking someone present to power-cycle our G11 mesh unit (without interrupting the BT router), in the hope it comes back up in a more useful state.  Unknown whether anyone did.
-From about 14:20 DHCP requests started being answered in a timely manner again, and the route stabilised as Spectrum-only (returning to ''​''​). 
-Total disruption time was ~10h. 
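As a quick check of the ~10h figure, the arithmetic can be done from the timestamps in this entry. The sketch assumes the disruption ran from 04:30 (when connectivity became patchy) to 14:20 (when DHCP replies and the route stabilised); starting from 04:50 instead would give about 9.5 hours.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def hours_between(start, end):
    """Elapsed time between two 'YYYY-MM-DD HH:MM' timestamps, in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


# Start and end times taken from the entry above (04:30 start is an assumption).
disruption = hours_between("2016-04-22 04:30", "2016-04-22 14:20")
print(f"{disruption:.1f} h")  # 9.8 h
```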
-==== 2016-04-19 ==== 
-Most logs were lost, but what we know is: 
-  * The problem started at lunchtime Tuesday, and the BT router got rebooted by folks present, unfortunately wiping the logs. 
-  * The router came back up just fine when rebooted (the suspected issue with the boot config turned out not to exist: it was correctly configured all along, and certainly wasn't the cause).
-  * After the reboot, and very probably before, the router wasn't getting a DHCP lease from the BW mesh box in G11. 
-  * Upstream connectivity didn't come back until 03:30 Thursday, when a DHCP lease was finally granted. 
-  * When connectivity did come back, the router resumed all services without any intervention. 
-Total outage was ~40 hours. 
-==== Ancient ==== 
-^ when       ^ what       ^ how fixed          ^
-| 3/1/2013   | internet link up but black router doesn't do DNS resolution  | overwrote /etc/resolv.conf on hackspace desktop with public DNS: |
-| 21/1/2013  | internet link up but black router doesn't do DNS resolution  | overwrote /etc/resolv.conf on hackspace desktop with public DNS: |
-| 22/1/2013  | internet link up but black router doesn't do DNS resolution  | overwrote /etc/resolv.conf on hackspace desktop with public DNS: |
-| 1/2/2013   | g11a only routing through mesh, not through wired (black router) connection | reset g11a and black router (Tarim) |
-| 19/2/2013  | g11a not giving DNS (apparently for a while but no bugger reported it) | reset black router and unchecked DNS relay in router (Tarim/Matt) |