Files
element-call-Github/doc/mid-based-signalling.md
David Baker 6a76873cee Typo
Co-authored-by: Daniel Abramov <inetcrack2@gmail.com>
2023-03-08 11:14:24 +00:00

10 KiB

MID Based Signalling

We intend to migrate away from refering to tracks by WebRTC track ID since this has known problems with ID consistency on either side of the WebRTC connection (we currently parse the SDP to work around this).

We also propose switch the structure of MSC3401 call events to be compatible with extensible events. Legacy calls would have to remain the same for backwards compatibility. It is probably impractical to have any call events support both formats because this would mean the SDP, which can be quite large,would have to be duplicated in the JSON. Therefore, there's a significant advantage to taking to opportunity to migrate to an extensible event format now.

We use the same event types for call invites and answers, so it is important that these events are only send via to device message directly to client known to support them. They must not be sent in rooms not supporting extensible events.

This outlines how a switch to mid + media UUID based signalling would work in more detail.

Note: I've made this mixin about describing the tracks that are being sent on the transceivers and the keys are the mids rather than the media UUIDs. Hoping that sketching this out will give us an idea of whether it could work or not.

m.call.invite:

{
    "m.call.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "invitee": "@bob:example.org",
        "lifetime": 60000,
        "party_id": "123456",
        "description": {
            "type": "offer",
            "sdp": "[...]",
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false,
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa",
            "media_group_uuid": "1234-1234-123456-1234-1234", // rather than 'track group ID' to match media UUID?
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio",
        },
    },
}

Note that the SDP content is now in a section called m.negotiate. This is a mixin block common to all negotiation events (invite, answer, negotiate).

The m.call.describe block is a 'mixin' block in externsible events terms and describes the media being sent on each transceiver by the sending user. It always refers only to the media being sent by the device that sends the event.

The same is sent either to a focus or a peer client.

A focus will gather the various streams that it's receiving by the conf_id from the m.negotiate section. In future when stream are advertised via the room state, this could be unnecessary and foci need not have any knowledge of what group calls are happening (assuming we don't need them to enforce viewship based on this).

The peer or focus then answers, again over to-device message. A peer will start sending media tracks automatically and therefore describe them in the answer:

m.call.answer (full mesh)

{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "party_id": "678910",
        "description": {
            "type": "answer",
            "sdp": "[...]",
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false,
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "cccc-cccc-cccccc-cccc-cccc",
            "media_group_uuid": "2345-2345-234567-2345-2345",
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio",
        },
    },
}

A focus, however, will not send any tracks by default and therefore does not include an m.call.describe block. Instead, it includes an m.track.advertise block advertising what tracks are available for that conf_id.

m.call.answer (focus)

{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "party_id": "678910",
        "description": {
            "type": "answer",
            "sdp": "[...]",
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false,
    },
    "m.call.advertise": {
        "alice:example.org": { // user ID
            "88888888": { // device ID
                "2345-2345-234567-2345-2345": [{ // media group uuid
                    "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa":
                    "purpose": "m.usermedia",
                    "kind": "video",
                }, {
                    "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb":
                    "purpose": "m.usermedia",
                    "kind": "audio",
                },
            },
        }
    },
}

XXX: How flat vs deep do we want the structure to be here? I've done it quite deep here, organised by user ID / device ID / media group UUID, but they could also just be a flat list of tracks. It would be more duplication but maybe less effort to read.

The expected behaviour here would be for foci to essentially maintain a structure with all tracks being pushed to it. This structure would probably have call IDs as a top level index, then look very similar to the structure of the m.call.advertise event. It could either keep a reference to the transceiver it was receiving media on in the structure itself alongside the media UUID, or maintain a separate map of media UUID to transceiver / peer connection such that the first structure could be marshalled to JSON and sent to clients as-is.

On the client side, the client will essentially take the m.call.advertise data and save it almost as-is. It would probably cross-reference it against the call member state events to ensure that it wasn't showing feeds for any users that did not have state events indicating that they were in the call, although if we trust the focus and assume that conf IDs are unique enough to be unguessable, this may be unnecessary.

In the simplest case, the client will simply iterate through this structure and add every media UUID it finds to an m.call.subscribe message.

The most complex part is that when the m.call.describe message arrives back from the focus, it will have to search through the data from the m.call.advertise message to map the tracks it is now receving to the right user IDs (XXX: it could build them into a map to make this lookup efficient, although if we make the advertise message indexed by media UUID then it already has a map indexed by the correct thing...)

The select_answer is also tweaked to be more extensible-event like although is essentially the same:

m.select_answer

"m.select_answer": {
    "version": "1",
    "conf_id": "1674732106391mz4ygIc84Q2Z6mJ5",
    "call_id": "35657a5b793ce",
    "party_id": "123456",
    "selected_party_id": "678910",
}

Once the client receives this, it decides what tracks it wants to receive and then sends a subscribe message over the data channel:

m.call.subscribe

"m.call.subscribe": {
    "seq": 1,
    "media_uuids": {
        "aaaa-aaaa-aaaaaa-aaaa-aaaa": {
            "width": 1024,
            "height": 576,
        },
        "bbbb-bbbb-bbbbbb-bbbb-bbbb": {},
    },
},

This has also been rearranged a little to make the media UUIDs the keys and remove the unsubscribe section which is unnecessary if we always send the complete set of tracks we want to receive (we unsubscribe by just removing the media UUID from the dict).

This also now contains a sequence number. This is a monotonically increasing integer, starting at 0 and scoped to the lifetime of the peer connection. The focus will send a reply containing this sequence number to acknowledge that it has processed the message. This can be a positive ack:

m.call.ack

"m.call.ack": {
    "seq": 1,
    "result": "success",
}

...or an error:

m.call.ack

"m.call.ack": {
    "seq": 1,
    "result": "error",
    "errcode": "M_UNKNOWN",
    "error": "Internal server error",
}

This may give some indication as to why some tracks were not available (should it have errors per media UUID, perhaps?)

If the focus needs to renegotiate to send the tracks, it does so, describing the media UUIDs it intends to send on the transceivers once the negotiation is complete:

m.call.negotiate

{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "lifetime": 60000,
        "party_id": "123456",
        "description": {
            "type": "offer",
            "sdp": "[...]",
        }
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa",
            "media_group_uuid": "1234-1234-123456-1234-1234", // rather than 'track group ID' to match media UUID?
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio",
        },
    },
}

Note that the content of this event is practically identical to the invite sent in a full mesh call. The purpose is the same: to describe what tracks are being sent on each transceiver. In this case, the purpose and kind fields are redundant since the client already knows them: they're included for symmetry. User IDs and device IDs are omitted, howerver, as the client equally already knows what user IDs the media UUIDs correspond to, and this keeps it the same as a full mesh track description.

Or it may already have enough spare transceivers and not need to negotiate, in which case it simply sends the same track description block without a negotiation (and with event type m.call.describe.