Mclag-HLD document #325

Merged
merged 33 commits into from Feb 21, 2020

Conversation

shine4chen
Contributor

add mclag hld document

@shine4chen
Contributor Author

Sorry for submitting this so long after the community review meeting. Please review it. Nephos plans to submit code soon to catch up with the 201903 branch. @lguohan

@lguohan
Contributor

lguohan commented Feb 28, 2019

please describe why ebfilter is used?

@shine4chen
Contributor Author

> please describe why ebfilter is used?

We use ebfilter to isolate the MCLAG peer link from the MCLAG member ports in the Linux kernel. In the ASIC, we use the ACL mechanism to do the same.

…ffic from peer-link to mclag enable LAG. No SAI changed is required for MCLAG support.

2. mclag docker can start on demand.
3. clarify that redundancy peer link can be used for peer-link broken scenario
4. add diagram number to improve reader experience
5. update document version to 0.4
@shine4chen
Contributor Author

@lguohan I have revised the MCLAG HLD document per the feedback from the community review meeting. Please help review and approve it if appropriate.

shine added 3 commits May 6, 2019 01:40
Signed-off-by: shine <shine.chen@nephosinc.com>
Signed-off-by: shine.chen <shine.chen@nephosinc.com>
Signed-off-by: shine.chen <shine.chen@nephosinc.com>
Signed-off-by: shine <shine.chen@nephosinc.com>
@boralt

boralt commented Oct 8, 2019

Which community SONiC release is this feature targeting? I didn't see it in the October 2019 release planning schedule.

@shine4chen
Contributor Author

@boralt It will be listed in the 201910 release soon.

1. Add ND sync-up description
2. Add command  'mclagdctl config loglevel -l <level>'
Collaborator

@stephenxs stephenxs left a comment

just post some questions i'm curious about.

- In the above diagram, PortChannel0001 and PortChannel0002 are MC-LAG enabled interfaces, and their status is up.
- The data flow path is shown by the red line.
- The data flow path from PA to CE1: When the traffic reaches PEER1, it matches the direct route, such as 10.1.1.0/24, and is forwarded through PortChannel0001.
- The data flow path from CE1 to PA: CE1 may send the traffic to PEER1 or PEER2. PEER2 must have a route entry that can reach PA. This route entry is installed by a routing protocol.
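
The route match in the PA-to-CE1 path above can be pictured as a longest-prefix lookup. The following is only a rough sketch under assumptions: `lookup` and the route table are hypothetical, the `0.0.0.0/0` entry and `Ethernet0` are made up for illustration, and a real switch performs this lookup in the ASIC rather than in Python.

```python
import ipaddress


def lookup(routes, dst):
    """Return the output interface of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    best = None  # (network, interface) of the best match so far
    for prefix, out_if in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, out_if)
    return best[1] if best else None


# Hypothetical table on PEER1: the direct route from the text plus a default route.
routes = {
    "10.1.1.0/24": "PortChannel0001",
    "0.0.0.0/0": "Ethernet0",  # assumed uplink, not from the HLD
}
```

With this table, `lookup(routes, "10.1.1.2")` selects `PortChannel0001` because the /24 direct route is more specific than the default route.
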
Collaborator

How can CE1 connect to the RG?
Using BGP isn't an option, because it would require both switches in the RG to act as one switch in terms of BGP from the CE's perspective, which is not the case.
Is it static routing, directly connected routing, or a protocol like VRRP?

Contributor

In general, there is no need to establish a BGP neighbor relationship between the CE and the PE. If L3 forwarding is adopted, a direct connection or static routing is sufficient.

- An MC-LAG domain consists of only two systems.
- Each system joins only one MC-LAG domain.
- Supports known unicast and BUM traffic.
- An L3 interface on MC-LAG ports will have a vMAC generated by the VRRP algorithm, using the same IP address assigned to the L3 LIF (logical interface). (Not supported currently.)
Collaborator

If VRRP runs on one interface, can it sync information (like the vMAC here) for another interface?
For example, in diagram 6.2 VRRP is likely to run on the peer link, but it would have to negotiate the vMAC for other interfaces such as portchannel0001.

Contributor

If VRRP is used, in diagram 6.2, VRRP needs to run on portchannel0001.

Collaborator

According to diagram 6.2, portchannel0001 is a routed interface rather than a switching interface (which belongs to a VLAN). In this case, which interface can VRRP on PEER1/PEER2 communicate on?

Contributor

Sorry, the wording was not accurate enough. We may use the VRRP algorithm to generate the vMAC rather than running the VRRP protocol.
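
For reference, the virtual MAC format defined for VRRP (RFC 5798) is 00-00-5E-00-01-{VRID} for IPv4 and 00-00-5E-00-02-{VRID} for IPv6. A minimal sketch of that format follows. `vrrp_virtual_mac` is a hypothetical helper, and how a VRID would be derived from the IP address of the L3 LIF is not stated in this thread, so the VRID is simply taken as an input here.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Build the VRRP virtual MAC for a given VRID (format per RFC 5798)."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    # Fifth octet selects the address family: 0x01 for IPv4, 0x02 for IPv6.
    family = 0x02 if ipv6 else 0x01
    return "00:00:5e:00:%02x:%02x" % (family, vrid)
```

Because the result depends only on the VRID, both peers would compute the same vMAC without exchanging any protocol messages, which matches the idea of using the algorithm without the protocol.
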

### 7.1.4. ARP and ND sync-up between MC-LAG peers

- If one peer learns an ARP entry, it sends the entry to the other peer via ICCP. For example, when PEER1 learns the ARP entry of CE1 from PortChannel0001, it sends this entry to PEER2 via ICCP. PEER2 receives the entry and installs it into the Linux kernel, with PortChannel0001 as the learned interface. This requires that the MC-LAG enabled PortChannel interface have the same name on both peer devices.
- ICCP doesn't flood ARP entries to the peer periodically. To prevent an ARP entry from aging out, ICCP uses a Netlink socket to monitor ARP replies received by the Linux kernel. For example, when an ARP entry on PEER2 ages, the Linux kernel sends an ARP request via PortChannel0001. CE1 receives the ARP request and sends back an ARP reply. Because CE1 views PEER1 and PEER2 as the same device, the ARP reply may be sent to either PEER2 or PEER1. If PEER2 receives the ARP reply, the ARP entry is learned again and updated in the kernel, and PEER2 notifies PEER1 via an ICCP sync message. If PEER1 receives the ARP reply, the entry already exists in its kernel, so the kernel uses Netlink to pass the ARP packet to its applications; ICCP collects the ARP information from the reply and sends it to PEER2, so that PEER2 can update the ARP entry in its Linux kernel.
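
The sync-up behaviour in the two bullets above can be modelled roughly as follows. This is an illustrative sketch only, not the ICCP daemon's code: `Peer`, `learn_arp`, and `on_sync` are hypothetical names, and the ICCP channel and the Netlink monitoring are replaced by direct method calls.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Peer:
    """Hypothetical model of one MC-LAG peer and its kernel ARP table."""
    name: str
    arp_table: dict = field(default_factory=dict)  # ip -> (mac, interface name)
    remote: Optional["Peer"] = None

    def learn_arp(self, ip, mac, ifname):
        """Called when this peer sees an ARP reply locally (stand-in for Netlink)."""
        self.arp_table[ip] = (mac, ifname)
        if self.remote is not None:
            # Stand-in for an ICCP sync message: sent per event, never periodically.
            self.remote.on_sync(ip, mac, ifname)

    def on_sync(self, ip, mac, ifname):
        """Install an entry received from the other peer under the same interface name."""
        self.arp_table[ip] = (mac, ifname)


peer1, peer2 = Peer("PEER1"), Peer("PEER2")
peer1.remote, peer2.remote = peer2, peer1
# Whichever peer receives CE1's ARP reply, both tables end up with the entry.
peer1.learn_arp("10.1.1.2", "00:11:22:33:44:55", "PortChannel0001")
```

The synced entry is keyed by interface name, which is why the PortChannel must be named identically on both peers. The IP and MAC values are made-up examples.
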
Collaborator

If a CE sends traffic to PEER1 while receiving traffic from PEER2, is it possible that its ARP entry ages out while downstream traffic is still being sent? In that case, the downstream traffic would be lost once the ARP entry has aged.

Contributor

Before an ARP entry ages out, the Linux kernel sends an ARP request, and the CE responds with an ARP reply after receiving it. Whichever PE receives the ARP reply, it is synchronized to the other peer, so ARP entries do not age out.

Collaborator

If the traffic is forwarded by the ASIC rather than the kernel protocol stack, then from the kernel's perspective it doesn't receive any packets from the host and treats the ARP entry as stale. In this case, will the kernel still send an ARP request before aging the entry out?

Contributor

Traffic on the switch is forwarded by the ASIC, and ARP entries in the Linux kernel may age periodically. In SONiC, the default ARP aging time is 1800s. Even if no packets are received from the host, the initial state of the ARP entry is reachable, and the kernel sends an ARP request before aging it out. If an ARP reply is received, the state changes from stale to reachable.


- If one peer learns a MAC entry from an MC-LAG enabled PortChannel, it sends this MAC to the other peer via ICCP. For example, when PEER1 learns the MAC entry of CE1 from PortChannel0001, it sends this MAC to PEER2 via ICCP. PEER2 receives the MAC and installs it into the Linux kernel, also with PortChannel0001 as the learned interface. This means the MC-LAG enabled PortChannel interface must have the same name on both peer devices.
- If one peer learns a MAC entry from an orphan port, it also sends this MAC to the other peer via ICCP. For example, when PEER1 learns the MAC entry of CE2 from Eth4, it sends this MAC to PEER2 via ICCP. PEER2 receives the MAC and installs it into the Linux kernel, with the peer link interface PortChannel0002 as the learned interface.
- ICCP doesn't flood MAC entries to the peer periodically. To prevent a MAC entry from aging out, ICCP defines two flags for each MAC entry, MAC_AGE_LOCAL and MAC_AGE_PEER. MAC_AGE_LOCAL indicates that the MAC entry has aged on the local device, and MAC_AGE_PEER indicates that the same MAC entry has aged on the peer device. The MAC entry is deleted from the local FDB only when both flags are set for it. For example, if the MAC of CE1 ages out on PEER2, PEER2 sets MAC_AGE_LOCAL on the entry. If MAC_AGE_PEER is not set on this entry at the same time (because the entry on PEER1 hasn't aged, so PEER1 hasn't told PEER2 to set the flag), the entry is installed back into the ASIC. PEER2 then notifies PEER1 of the MAC age event, and PEER1 sets MAC_AGE_PEER for the same MAC.
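
The two-flag aging rule in the last bullet can be sketched as below. This is a simplified model under assumptions, not the actual implementation: `MacSync` and its methods are hypothetical, the flag values are arbitrary, and ASIC reinstallation and ICCP notification are recorded in plain lists. It does not model how a flag would be cleared when the MAC is hit again.

```python
MAC_AGE_LOCAL = 0x1  # the entry has aged on this device (value is an assumption)
MAC_AGE_PEER = 0x2   # the peer reported that the entry aged there
BOTH_AGED = MAC_AGE_LOCAL | MAC_AGE_PEER


class MacSync:
    """Hypothetical per-device view of the synchronized MAC table."""

    def __init__(self):
        self.flags = {}        # (mac, vlan) -> age flags
        self.reinstalled = []  # keys written back to the ASIC
        self.notified = []     # keys whose age event was sent to the peer

    def add(self, mac, vlan):
        self.flags[(mac, vlan)] = 0

    def _delete_if_both_aged(self, key):
        if (self.flags[key] & BOTH_AGED) == BOTH_AGED:
            del self.flags[key]
            return True
        return False

    def on_local_age(self, key):
        """The ASIC aged the entry locally."""
        if key not in self.flags:
            return
        self.flags[key] |= MAC_AGE_LOCAL
        if not self._delete_if_both_aged(key):
            self.reinstalled.append(key)  # peer still holds it: install it back
        self.notified.append(key)         # peer will set MAC_AGE_PEER on its side

    def on_peer_age(self, key):
        """The peer told us the entry aged on its side."""
        if key not in self.flags:
            return
        self.flags[key] |= MAC_AGE_PEER
        self._delete_if_both_aged(key)
```

For example, after `on_local_age` alone the entry survives and is reinstalled; only a subsequent `on_peer_age` for the same key removes it.
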
Collaborator

The flow for handling a locally aged MAC entry is to reinstall it into the ASIC if it has not aged on the peer. Will this cause traffic to be broadcast during the time between the MAC aging and being reinstalled?
It seems that MAC_AGE_LOCAL and MAC_AGE_PEER alone aren't enough. Consider the following flow:

  1. The MAC ages on the local device, and MAC_AGE_LOCAL is set.
  2. The MAC is hit locally. Is there a way to clear the MAC_AGE_LOCAL flag? Since the MAC is already in the ASIC FDB, the ASIC won't notify the software when the MAC is hit.
  3. The MAC ages on the remote device, and MAC_AGE_PEER is set. If the MAC_AGE_LOCAL flag wasn't cleared in step 2, the MAC will be aged out, which is not correct.

Contributor

Yes, in this scenario the MAC will be deleted, but the ASIC will relearn it immediately (within milliseconds).

### 7.2.5. Peer link MAC learning

- When the MC-LAG enabled interface is up, the peer link is the backup link for data traffic. MAC learning must be disabled on the peer link to prevent data traffic from being forwarded over it. If learning were enabled, the same MAC (e.g. the MAC of CE1) could be learned via either the MC-LAG port or the peer link, and the output port of this MAC would keep toggling.
- When all local member links of an MC-LAG interface on one peer are down, MAC learning remains disabled on the peer link. Dynamic MAC entries are installed into the FDB pointing to the peer link as the next hop, so traffic destined to those MACs takes the peer link path.
Collaborator

What if multiple MC-LAGs share one peer link? Consider the following sequence:

  1. portchannel0001 and portchannel0002 share portchannel0003 as the peer link.
  2. Members of portchannel0001 and portchannel0002 on PEER1 are all down, so on PEER1 the MAC entries are reprogrammed with portchannel0003 as the nexthop.
  3. Then member(s) of portchannel0001 on PEER1 come up. Should the MAC entries originally belonging to portchannel0001 be reprogrammed with portchannel0001 as the nexthop on PEER1, while the MAC entries originally belonging to portchannel0002 remain untouched? How are these two kinds of MAC addresses distinguished?
    Or is the MAC just refreshed by the regular MAC learning mechanism?

Contributor

FDB table entries include the MAC address, VLAN, and port. Using the port, you can distinguish MACs learned on portchannel0001 from those learned on portchannel0002.
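
One way to picture the sequence from the question is the sketch below. It rests on an assumption that is not confirmed in this thread: that the software table keeps the port a MAC was originally learned on (`origin_port`) separately from the port currently programmed as the output (`out_port`). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FdbEntry:
    mac: str
    vlan: int
    origin_port: str  # MC-LAG PortChannel the MAC was learned on (assumed to be kept)
    out_port: str     # port currently programmed as the output


def on_portchannel_down(fdb, portchannel, peer_link):
    """All local members are down: point that PortChannel's MACs at the peer link."""
    for entry in fdb:
        if entry.origin_port == portchannel:
            entry.out_port = peer_link


def on_portchannel_up(fdb, portchannel):
    """A member came back up: restore only the MACs that belong to this PortChannel."""
    for entry in fdb:
        if entry.origin_port == portchannel:
            entry.out_port = portchannel


fdb = [
    FdbEntry("00:00:00:00:00:01", 10, "portchannel0001", "portchannel0001"),
    FdbEntry("00:00:00:00:00:02", 10, "portchannel0002", "portchannel0002"),
]
on_portchannel_down(fdb, "portchannel0001", "portchannel0003")
on_portchannel_down(fdb, "portchannel0002", "portchannel0003")
on_portchannel_up(fdb, "portchannel0001")
```

After these calls the first entry points back at portchannel0001 while the second still points at the peer link portchannel0003. The MAC addresses and VLAN are made-up examples.
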

- In this scenario, peers may be directly connected, or use other tools such as BFD to detect the status of the peer link (not supported currently).
- If the peer link and the peer keepalive link are the same link, peer link down may cause peer connection down. For the case where the keepalive connection is down, please see the above section. Users should not design the network in this way.
- When the peer link is down, as shown above, all the MACs that point to the peer link are removed on both peers. Data forwarding for the CE continues as usual. If the ICCP connection uses this peer link interface, the action is the same as described in "peer connection down". If the ICCP connection doesn't use this peer link interface, this is not a split-brain scenario because the state can still be synchronized over the keepalive link. If one MC-LAG enabled port is down, data traffic may be lost since the peer link, as the backup path, is down.

Collaborator

Will it be an issue that PEER1 and PEER2 share the same IP address?

Contributor

If the Linux kernel receives a message with the same IP or MAC address as its own, it may print a warning message. Apart from that, no other problems were found.

shine4chen and others added 2 commits February 3, 2020 10:04
Signed-off-by: shine.chen <shine.chen@mediatek.com>
@rlhui rlhui merged commit 806c906 into sonic-net:master Feb 21, 2020
8 participants