IPTV: Power, Tune, then Crash

Let me share with you a story of the IPTV stream that wanted to.

Problem Description

End users at a newly opened facility began complaining about intermittent television freezing once business hours started. A common theme is that the TV turns on, tunes into its last channel, then after about 30 seconds or so “crashes”. The problem is most noticeable during business hours, but beyond that there doesn’t seem to be much of a pattern.

The problem has been going on for a couple of weeks, though it is uncertain if the problem was previously observable.

Problem Impact

The inability to maintain a steady stream makes for bored folks waiting in the lobby, and it takes away a distraction from patients undergoing procedures.

Initial Diagnostics

The audio/visual team responsible for configuring the television receivers verified their configurations were correctly applied and that IGMPv3 was configured to support the source-specific multicast groups that have been in use all along.

Televisions were moved between floors and cables were swapped among switches. Still, no definitive pattern lending to the root cause could be determined. A different third-party receiver was also tested, and while it could receive the channel list, it was unable to tune into any streams at all.

Pre-Dispatch Impressions

IPTV at this particular institution is part of long-running infrastructure that provides all the television needs of the institution on its main campus and within its satellite networks. It employs SSM for the channel sources and ASM for channel beacons and ProCentric information propagation.

The problem itself was isolated to the one site, which led to a couple of places to look off the top:

  • There was a problem with the site distribution layer.
  • There was a problem with the site access layer.
  • The end equipment was not correctly configured.

Unfortunately, one of the pitfalls of the access layer at this location is that it is made up entirely of Arista 755 campus switches. Unlike their Cisco 9407 counterparts, they offered no ERSPAN capability I could use, which necessitated a site visit. (Arista’s implementation of ERSPAN is different and not one I was set up to use at the time of this issue.)

Spoiler Alert: The problem was with none of the things I originally suspected.

Let’s Go For A Drive

One of the things I find most enjoyable about my hybrid position is that when I do find myself in the office, I generally have a reason to be there. The downside: Travel to the South Shore from Central Massachusetts isn’t ever really a good drive. Some parts of the day just suck less.

In any event, once I got to the site, the first step in troubleshooting was to jack into the switch and see if I could bring up any of the programs in VLC. Even though I was able to receive all my SAP beacons, I was unable to tune in anything. Time to look at the access layer.

Access Layer Configuration

The multicast configuration on the access layer matched the organization standard. That standard is a routed access layer, which the Arista fit with the configuration below:

ip access-list standard SSM-RANGE
   10 permit 232.0.0.0/8
   20 permit 239.128.0.0/9
!
router multicast
   ipv4
      routing
!
router pim sparse-mode
   ipv4
      ssm range SSM-RANGE
      rp address 192.168.1.10
!
interface Ethernet1/1/1
   ...
   pim ipv4 sparse-mode
   pim ipv4 bfd
!
interface Ethernet2/1/1
   ...
   pim ipv4 sparse-mode
   pim ipv4 bfd
!
interface Vlan103
   pim ipv4 sparse-mode
!

I also had my PIM neighbors, and was receiving the bootstrap router information as I expected.

ACCRT1(s1)#sh ip pim nei
PIM Neighbor Table for default VRF
Neighbor Address  Interface      Uptime   Expires   Mode    Transport  
10.21.61.16       Ethernet1/1/1  180d16h  00:01:24  sparse  datagram   
10.21.61.18       Ethernet2/1/1  180d16h  00:01:18  sparse  datagram
ACCRT1(s1)#sh ip pim bsr
  Zone : 224.0.0.0/4 ( Global )
    BSR address: 192.168.1.10
    Uptime:      180d16h, BSR Priority: 224, Hash mask length: 0
    Next bootstrap message in 0:01:16
ACCRT1(s1)#

An improper configuration should have resulted in no service whatsoever, so this didn’t fit the bill of an intermittent outage. This wasn’t it, but it’s good to be thorough.

Distribution Layer Configuration

The distribution layer was a much more familiar Cisco IOS-XE device which enjoys multiple deployments. Just to be sure, I also verified the multicast configuration there.

ip multicast-routing
!
ip access-list standard SSM-RANGE
 10 permit 232.0.0.0 0.255.255.255
 20 permit 239.128.0.0 0.127.255.255
!
ip pim accept-rp 192.168.1.10
ip pim ssm range SSM-RANGE
!
! Access Downstream Interface
!
interface TenGigabitEthernet1/0/1
 ...
 ip pim sparse-mode
 ip pim bfd
 ip igmp version 3
 ...
!
! Uplink interface
!
interface TenGigabitEthernet1/0/15
 ...
 ip pim sparse-mode
 ip pim bfd
 ...
!
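
The two SSM-RANGE access lists are written in different notations: prefix lengths on the Arista and wildcard masks on the Cisco. As a quick sanity check that they describe the same ranges, here is a minimal Python sketch. The helper names and the sample group addresses are mine for illustration, not taken from the actual channel lineup.

```python
import ipaddress

# Entries copied from the two SSM-RANGE access lists.
# The Cisco entries use wildcard (host) masks, which the ipaddress
# module accepts after the slash.
ARISTA_SSM = ["232.0.0.0/8", "239.128.0.0/9"]
CISCO_SSM = ["232.0.0.0/0.255.255.255", "239.128.0.0/0.127.255.255"]


def to_networks(entries):
    """Parse ACL entries into IPv4Network objects."""
    return [ipaddress.ip_network(entry) for entry in entries]


def in_ssm_range(group, networks):
    """Return True if the group address falls inside any listed range."""
    addr = ipaddress.ip_address(group)
    return any(addr in net for net in networks)


arista = to_networks(ARISTA_SSM)
cisco = to_networks(CISCO_SSM)

print(arista == cisco)                     # same ranges on both platforms?
print(in_ssm_range("239.200.1.1", cisco))  # hypothetical SSM channel group
print(in_ssm_range("239.1.1.1", cisco))    # hypothetical ASM group
```

Any group inside these ranges is treated as SSM by both layers, so a join for it only works when it arrives as an IGMPv3 report naming a source.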

Par for the course, PIM was reporting all the required information here as well:

DISTRT1#sh ip pim nei
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable,
      L - DR Load-balancing Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.21.61.1        TenGigabitEthernet1/0/1  30w3d/00:01:42    v2    1 / DR G
...
10.0.0.192        TenGigabitEthernet1/0/15 21w0d/00:01:34    v2    1 / S P G
...
DISTRT1#sh ip pim bsr
PIMv2 Bootstrap information
  BSR address: 192.168.1.10 (mprr1)
  Uptime:      25w5d, BSR Priority: 224, Hash mask length: 0
  Expires:     00:01:47
DISTRT1#

Since this solution has always worked in Cisco land, I configured a routed port to use as a test comparison.

interface TenGigabitEthernet1/0/9
 description L3A-DIAG
 no switchport
 ip address 10.21.56.129 255.255.255.128
 ...
 ip pim sparse-mode
 ip igmp version 3
 ...
end

Apples to Apples

With a few interfaces at my disposal, I resumed troubleshooting again, and made the following observations:

  • While connected to the routed port on distribution, I was able to stream programs without interruption.
  • When connected to access switch 1 in the IDF, I could tune in shortly after I connected, but could do nothing further after that. I noticed that I had to tune in quickly before my Mac lost the plot.
  • When connected to access switch 2 in the IDF, tuning was hit or miss.

Enter Wireshark

When something works on one port and fails on another, it’s time to break out the packet analyzer.

On the distribution, a packet trace focused on IGMP looks good:

However, on the access switch, the packet trace gets weird.

(NOTE: I thought I saved the capture … but I did not. Go figure. I’ll recreate this soon.)

The capture condition was repeatable. I would start out with a fresh link and start running IGMPv3. Every time I received an IGMPv2 general query, I shortly stopped receiving the program I intended to tune into.

The behavior I was seeing was clear. All it took was a single host on the VLAN sending out an IGMPv2 general query, and my Mac went and downgraded all its other responses. Effectively, this meant that it stopped sending out IGMPv3 membership reports.

Well … fuck me sideways. It was getting really late at this point, so this is pretty much when I packed up on site.

Let’s Check The RFC

IGMPv3 is defined in RFC 3376. This standard was published more than 20 years ago, so IGMPv3 support should be a part of pretty much anything that is currently manufactured or in use today.

And the reason this whole thing went to shit. Section 7.2.1 states:

In order to be compatible with older version routers, IGMPv3 hosts MUST operate in version 1 and version 2 compatibility modes. IGMPv3 hosts MUST keep state per local interface regarding the compatibility mode of each attached network. A host’s compatibility mode is determined from the Host Compatibility Mode variable which can be in one of three states: IGMPv1, IGMPv2 or IGMPv3. This variable is kept per interface and is dependent on the version of General Queries heard on that interface as well as the Older Version Querier Present timers for the interface.

https://datatracker.ietf.org/doc/html/rfc3376#section-7.2.1

So really, all it took to sink IGMPv3 was a host on the network either accidentally or by default still sending out an IGMPv2 general query. The local interface more or less dumbed down the IGMP responses sent. Effectively, the router stopped hearing IGMPv3 membership reports because of this.
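
How does a host decide that a query is “IGMPv2” in the first place? RFC 3376 section 7.1 keys it off the message length and the Max Resp Code field. The sketch below is my own illustration of that rule, not code from any receiver; the function name and sample payloads are made up, and the checksum bytes are left at zero because the check does not look at them.

```python
import struct

MEMBERSHIP_QUERY = 0x11  # IGMP type shared by v1, v2 and v3 queries


def query_version(igmp_payload: bytes):
    """Classify a Membership Query per RFC 3376 section 7.1.

    Returns 1, 2 or 3, or None if the payload is not a query or has a
    length the RFC says to silently ignore.
    """
    if len(igmp_payload) < 8:
        return None
    msg_type, max_resp_code = struct.unpack_from("!BB", igmp_payload)
    if msg_type != MEMBERSHIP_QUERY:
        return None
    if len(igmp_payload) >= 12:
        return 3
    if len(igmp_payload) == 8:
        # v1 queries carry a zero here, v2 queries a non-zero value
        return 1 if max_resp_code == 0 else 2
    return None  # 9 to 11 octets: ignored


# Illustrative general queries (group 0.0.0.0, checksum not computed)
v2_query = bytes([0x11, 100, 0, 0]) + bytes(4)
v3_query = bytes([0x11, 100, 0, 0]) + bytes(4) + bytes([2, 125, 0, 0])

print(query_version(v2_query))
print(query_version(v3_query))
```

In other words, an 8-octet query with a non-zero response time is enough for a compliant host to count an IGMPv2 querier as present on the interface.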

According to the RFC, the compatibility mode on the host will switch back after a timeout has passed, but that recovery wasn’t always consistent, which would explain the intermittent behavior.
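
To put a number on that timeout: RFC 3376 section 8.12 defines the Older Version Querier Present Timeout as the Robustness Variable times the Query Interval, plus one Query Response Interval. With the RFC defaults (2, 125 seconds, 10 seconds) that works out to 260 seconds. The sketch below models the per-interface state under those defaults. The class and method names are hypothetical, and the values actually in play at the site were not measured.

```python
# RFC 3376 defaults (sections 8.1 to 8.3); assumed, not measured on site
ROBUSTNESS = 2
QUERY_INTERVAL = 125.0           # seconds
QUERY_RESPONSE_INTERVAL = 10.0   # seconds


def older_querier_timeout():
    """Older Version Querier Present Timeout, RFC 3376 section 8.12."""
    return ROBUSTNESS * QUERY_INTERVAL + QUERY_RESPONSE_INTERVAL


class HostInterface:
    """Host Compatibility Mode for one interface (section 7.2.1)."""

    def __init__(self):
        self.v1_expires = 0.0  # IGMPv1 Querier Present timer
        self.v2_expires = 0.0  # IGMPv2 Querier Present timer

    def hear_general_query(self, version, now):
        """Restart the matching timer when an older query is heard."""
        if version == 1:
            self.v1_expires = now + older_querier_timeout()
        elif version == 2:
            self.v2_expires = now + older_querier_timeout()

    def mode(self, now):
        """Return the IGMP version the host will use for its reports."""
        if now < self.v1_expires:
            return 1
        if now < self.v2_expires:
            return 2
        return 3


iface = HostInterface()
print(older_querier_timeout())             # 260.0
print(iface.mode(0.0))                     # fresh link: IGMPv3
iface.hear_general_query(2, now=5.0)
print(iface.mode(6.0))                     # downgraded to IGMPv2
iface.hear_general_query(2, now=130.0)     # another v2 query restarts timer
print(iface.mode(300.0))                   # still IGMPv2
print(iface.mode(391.0))                   # timer expired: back to IGMPv3
```

The takeaway from this model is that every IGMPv2 general query restarts the clock, so a device that keeps sending them more often than once per timeout can hold the host in IGMPv2 mode indefinitely.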

The next section in the RFC states that the suppression of IGMPv3 Membership Reports in the presence of IGMPv1 or IGMPv2 is entirely optional:

An IGMPv3 host may be placed on a network where there are hosts that have not yet been upgraded to IGMPv3. A host MAY allow its IGMPv3 Membership Record to be suppressed by either a Version 1 Membership Report, or a Version 2 Membership Report.

https://datatracker.ietf.org/doc/html/rfc3376#section-7.2.2

The router itself was unaffected by the other general query, as it starts tracking timers per group:

In order to switch gracefully between versions of IGMP, routers keep an IGMPv1 Host Present timer and an IGMPv2 Host Present timer per group record.

https://datatracker.ietf.org/doc/html/rfc3376#section-7.3.2

How I’d Modify The RFC

I don’t even know where to begin with the process, but if I had the time and knew what to do, and could stand the bureaucracy of doing so, I’d modify this section of the RFC so that IGMPv3 hosts could disable compatibility mode. This would probably be the simplest approach to take.

Another approach, which might be more ideal from a compatibility perspective, would be to avoid dumbing down the interface to IGMPv2 entirely by changing section 7.2.2 so that IGMPv3-capable hosts do not suppress their IGMPv3 membership reports.

The Full Root Cause

Now that we know, the full root cause analysis can explain the observed problem with the TVs:

  • Television is turned on.
  • Ethernet link comes up.
  • TV receives IGMPv3 General Membership Query from the switch.
  • TV sends IGMPv3 SSM Join in a membership report for the last program it was tuned to.
  • TV receives an IGMPv2 General Membership Query on the VLAN from another host.
  • TV dumbs down IGMP version per RFC to IGMPv2 compatibility mode.
  • TV sends IGMPv2 Membership Report and suppresses IGMPv3 membership reports, which doesn’t work with SSM ranges.
  • Switch times out multicast stream after no IGMPv3 membership reports are heard.
  • Programming stops and the user reports a crash.

Thus, the whole set of symptoms is explained.

Let’s dig a little deeper…

So … what device in 2024 isn’t working with IGMP properly? Looks like at least one HP printer.

Looking at the printer itself, it was indeed configured for IPv4 Multicast, mDNS and Bonjour printing. I suspect this part is a default setting.

However, this does raise another question. Since there are multiple printers on this subnet all talking with IGMPv2 general queries … does that mean that the IGMPv2 Host Present timer referenced in RFC section 7.3.2 will never expire? This is food for another post, perhaps.

… And Effect

The resolution of the problem was to deploy a separate SVI just for television services, effectively isolating them from any hosts or devices that would send out an IGMPv2 general membership query. At the end of the day, the business of the network is to support the business.

Not A New Problem

This is purely speculative, but I suspect that in most environments that are either using SSM or are just ASM with an RP, there hasn’t been a need or a compatibility issue to raise the flag. Virtually everything that is produced or configured today doesn’t have to deal with compatibility issues that were required when the standard was first launched. It’s just unfortunate that it takes one piece of errant equipment to dumb down a whole subnet for a time.

I also expect in most environments where multicast is being used and deployed, that special care is taken to ensure all hosts that are participating are explicitly configured and not part of a general user network. When I was working with multicast for television distribution in cable, all our IPTV receivers were IGMPv3 configured and no other devices existed on their assigned SVIs.

Final Thoughts

In some respects, this deep dive has raised more questions than it answers, even though the reported problem has finally been resolved.

If more than one host suppresses IGMPv3 because it heard an IGMPv2 query, does the whole subnet stay stuck there until all of the devices fall asleep? Does everything on the subnet dumb itself down, or are there other hosts that keep sending their IGMPv3 membership reports regardless? Will things go back to IGMPv3 if all the IGMPv2 speakers are port-shut and then re-enabled, therefore resetting the IGMP compatibility state for all the IGMPv2 devices? Finally, is it possible to find the real source of the problem and request IGMPv3 to be the vendor default?

It’s always the weird stuff … but at least it’s the weekend and it’s time to relax!

Acknowledgements

Thanks to the following folks for cursory review:

This post originally appeared on Underwood HQ.