November 17, 2017
5 min Read
In 2015 we launched Awex, the first IPv6-only project powered by Hostrangers. It’s fully automated starting from application deployments to network gear. Since that moment we have been automating almost everything what’s possible to avoid manual work, and increase efficiency.
As we wrote in previous posts we use Chef and Ansible for most of the tasks.
Due to huge demand for premium shared hosting we have been shipping an increased amount of new servers in the last few months.
Before having it automated, we had to manually setup servers. This means creating hardware RAIDs, setting IP addresses, installing OS, etc. Fortunately, we decided to stop here, and create a streamline process. We employed PXE for this kind of job.
- It allows us to fully converge our infrastructure with minimal costs;
- We use phpIPAM to store our inventory information like Chef run lists, environments, IP addresses, MAC addresses, etc. Hence we ended up with a simple solution: single PXE virtual machine per data center serving kickstart configuration files from phpIPAM.
The process itself is trivial:
- Put all the necessary information about the new server into phpIPAM (IP/MAC addresses/hostname/etc.);
- Generate kickstart config (which is also stored in phpIPAM along with the server as additional metadata field);
- Install OS (custom microservice (written with Ruby/Sinatra) downloads kickstart configuration from phpIPAM dynamically for given MAC address);
- After OS is installed, we run full Ansible playbook to converge all roles;
- Later we do not run full converges because it takes much time to complete.
We use incremental converges where we just involved versioning control for roles to speed up the process 10x.
We also considered using FAI, Foreman but this tool looked oover-engineeredfor our simple needs.
This year we are delivering more and more L3 networks across all of our data centers (DC).
We started the new data center in Netherlands which will bring new opportunities for clients sourcing from Europe. This DC is fully L3-only. This means no fucking stretched VLANs between racks, rooms and so on. On that occasion, I want to say that we switched Singapore’s data center over L3 as well. It’s just the right architecture in place. Yeah, I worked in three ISPs over my career and already found that most interesting networking stuff lies in data centers architectures.
We use Cumulus Linux and switched to 40G between leaf and spines. Everything is automated with Ansible. We connect every commodity server to a 10G network using fiber, somewhere copper. The project without issues cannot be successful in the first phase:
- We dealt with some of the hardware issues bootstrapping new data center;
- 40G adapters were not recognized properly – one end expects copper, while other – fiber;
- 10G over QSFP+ using breakout cable;
- Quagga –> FRRouting;
- Migration from L2 to L3 without any downtime.
40G is Something New
root@nl-srv-spine1:~# cat /cumulus/switchd/config/interface/swp1/interface_mode CR4 root@nl-srv-leaf1:~# cat /cumulus/switchd/config/interface/swp50/interface_mode XLAUI
It is really weird how it works while both sides have different interface XLAUI (fiber) and CR4 (copper). After few minutes the link is down. Why is the hell CR4 used at all?
I tried to override cable settings with
/etc/cumulus/portwd.conf by setting in both cable sides:
[ports] [cables] 00:17:6a,QSFP-BD-40G=40G-CR4
But no joy.
I did not find any documentation about
portwd daemon at all, thus decided to take a look at the source code.
@classmethod def decode_ethcodes_base(self, ethcodes, hacks_dict, quirks_dict, eth_map_dict, ext): ... if ethcodes in quirks_dict.keys(): quirk = quirks_dict[ethcodes] log_throttle('enabling ethcode quirk: 0x%02x -> %s' % (ethcodes, quirk), level="debug") # ethcodes with 0 belongs to this test, hence returning 40g-cr4 as default return (quirk,) @classmethod def decode_ethcodes(self, eeprom): ... # this returns 0 ethcodes = struct.unpack('B', data) if ethcodes == 0: ouipn = '%s,%s' % (self.decode_vendor_oui(eeprom), self.decode_vendor_pn(eeprom)) # skipping this test because certain ouipn is not listed if ouipn in self.ETH_CODE_ZERO_OVERRIDES.keys(): override = self.ETH_CODE_ZERO_OVERRIDES[ouipn] log_throttle('enabling zero ethcode override: %s -> %s' % (ouipn, override), once=True, level="info") return (override,) base = self.decode_ethcodes_base(ethcodes, self.ETH_CODE_SANCTIONED_HACKS, self.ETH_CODE_QUIRKS, self.ETH_CODE_MAP, 0) ... @property def ethcodes(self): if self._ethcode_override_map is not None: if self._ethcode_override_map.has_key('port'): override = self._ethcode_override_map['port'] log_throttle('enabling port ethcode override: %s' % (override), once=True, level="info") return override # we tried to override by vendorpn which is nothing else but '00:17:6a,QSFP-BD-40G' elif self._ethcode_override_map.has_key(self.vendorpn): override = self._ethcode_override_map[self.vendorpn] log_throttle('enabling vendorpn ethcode override: %s -> %s' % (self.vendorpn, override), once=True, level="info") return override
The default interface is
40G-CR4 (cannot decode interface information from ASIC), hence we have it. We ordered another kind of transceivers recommended by Cumulus – and it works! CR4 becomes SR4.
Another one interesting problem faced with QSFP+ was using an adapter to connect both ends with 10G. This was needed because our DC provider didn’t have 40G ports to connect to. We use EdgeCore 6712-32X-O-AC-F in our spines layer which doesn’t have 10G ports, hence we moved to 10G over QSFP+ adapter (breakout cable).
2 –> 3
Most interesting part was migrating existing L2 network with almost full three racks in Singapore to L3. I picked using private AS per rack numbering from 65031, 65032, … This architecture allows keeping eBGP rules regardless if it’s an upstream neighbor or just a ToR switch.
Existing infrastructure was built using separate sessions for IPv4 and IPv6 thus this option allowed us to migrate smoothly without any downtime by testing one protocol before switching to another. Before that, we had single exit upstream where we announced our prefixes. After setting up another one (for migration only) we needed to make sure traffic flows as expected to avoid asymmetric traffic using next-hop manipulation, AS prepends and so on.
This is an example of Ansible’s YAML hash to configure peers:
upstreams: nme: port: swp48 ip: 10.13.128.46/30 ipv6: <ipv6>/126 peers: nme4: as: 38001 local_as: 47583 as_prepend: 47583 47583 ip: 10.13.128.45 announce: - 22.214.171.124/20 - 126.96.36.199/24 nme6: as: 38001 local_as: 47583 as_prepend: 47583 47583 ip: <upstream-ipv6> announce: - 2a02:4780:3::/48 hostinger: port: swp1 ip: 10.0.31.1/30 ipv6: <ipv6>/64 peers: hostinger6: default_route: true as: 65032 ip: <ipv6> hostinger-r1: port: swp2 ip: 10.0.32.1/30 ipv6: <ipv6>/64 peers: hostinger6-r1: as: 65033 ip: <ipv6> hostinger-r2: port: swp3 speed: 1000 ip: 10.0.33.1/30 ipv6: <ipv6>/64 peers: hostinger6-r2: as: 65034 ip: <ipv6> ...
This snippet is taken from one of the Singapore’s spine switches. You should notice that we use IPv6-only BGP sessions between internal peers for both protocols and different AS numbers for different racks.
For new deployments, we started using the newest version of Cumulus which replaced Quagga with FRRouting daemon for routing protocols. In general, FRRouting is just a community fork of Quagga. Hence the configuration wasn’t so much different, basically, only paths had to be changed for configuration files. Before provisioning new version in production we test it in the kitchen with the same playbooks and Cumulus-VX. If you push stuff that’s untested you are just blocking your fire escape.
- Just like a power company shouldn’t change its voltage every year, we believe that our clients should be able to trust our service to be stable;
- Cumulus Linux still remains the best OS for networking;
- Do NOT design network topology according to application needs. Today applications are smart enough to be network agnostic and adaptive to any infrastructure. If you care about your infrastructure do not turn it into unfrastructure.
Hence we are trying to do less manual work while spending our time on improving the quality of the service.