# MID Based Signalling

We intend to migrate away from referring to tracks by WebRTC track ID, since this has known
problems with ID consistency on either side of the WebRTC connection (we currently parse the
SDP to work around this).

We also propose switching the structure of MSC3401 call events to be compatible with extensible
events. Legacy calls would have to remain the same for backwards compatibility. It is probably
impractical to have any call events support both formats because this would mean the SDP, which
can be quite large, would have to be duplicated in the JSON. Therefore, there's a significant
advantage to taking the opportunity to migrate to an extensible event format now.

We use the same event types for call invites and answers, so it is important that these events
are only sent via to-device messages, directly to clients known to support them. They must not be
sent in rooms not supporting extensible events.

This outlines how a switch to mid + media UUID based signalling would work in more detail.

**Note: I've made this mixin about describing the tracks that are being sent on the transceivers
and the keys are the mids rather than the media UUIDs. Hoping that sketching this out will give
us an idea of whether it could work or not.**

m.call.invite:
```
{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "invitee": "@bob:example.org",
        "lifetime": 60000,
        "party_id": "123456",
        "description": {
            "type": "offer",
            "sdp": "[...]"
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa",
            "media_group_uuid": "1234-1234-123456-1234-1234", // rather than 'track group ID' to match media UUID?
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio"
        }
    }
}
```

Note that the SDP content is now in a section called `m.negotiate`. This is a mixin block common to all negotiation events
(invite, answer, negotiate).

The `m.call.describe` block is a 'mixin' block in extensible events terms and describes the media being sent on each
transceiver by the sending user. It *always* refers only to the media being sent by the device that sends the event.

The same event is sent to either a focus or a peer client.

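As a rough illustration of how a sender might assemble the `m.call.describe` block, here is a minimal Python sketch. The `LocalTrack` type and `build_describe` function are hypothetical names invented for this example and are not part of any SDK; only the field names come from the event above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalTrack:
    """One outgoing track (hypothetical type, for illustration only)."""

    mid: str               # mid of the transceiver the track is sent on
    media_uuid: str
    media_group_uuid: str
    purpose: str           # e.g. "m.usermedia"
    kind: str              # "audio" or "video"


def build_describe(tracks: list[LocalTrack]) -> dict[str, dict[str, str]]:
    """Build the content of an `m.call.describe` block, keyed by transceiver mid."""
    describe: dict[str, dict[str, str]] = {}
    for track in tracks:
        # Each mid can carry only one described track in this sketch.
        if track.mid in describe:
            raise ValueError(f"mid {track.mid!r} described twice")
        describe[track.mid] = {
            "media_uuid": track.media_uuid,
            "media_group_uuid": track.media_group_uuid,
            "purpose": track.purpose,
            "kind": track.kind,
        }
    return describe
```

Because the block only ever describes what the sending device itself sends, the same function would serve for invites, answers and negotiate events alike.
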
A focus will gather the various streams that it's receiving by the `conf_id` from the `m.negotiate` section. In future,
when streams are advertised via the room state, this could be unnecessary and foci need not have any knowledge of what
group calls are happening (assuming we don't need them to enforce viewership based on this).

The peer or focus then answers, again over to-device message. A peer will start sending media tracks automatically and
therefore describes them in the answer:

m.call.answer (full mesh)
```
{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "party_id": "678910",
        "description": {
            "type": "answer",
            "sdp": "[...]"
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "cccc-cccc-cccccc-cccc-cccc",
            "media_group_uuid": "2345-2345-234567-2345-2345",
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio"
        }
    }
}
```

A focus, however, will not send any tracks by default and therefore does not include an
`m.call.describe` block. Instead, it includes an `m.call.advertise` block advertising
what tracks are available for that `conf_id`.

m.call.answer (focus)
```
{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "party_id": "678910",
        "description": {
            "type": "answer",
            "sdp": "[...]"
        }
    },
    "m.call.capabilities": {
        "m.call.transferee": false,
        "m.call.dtmf": false
    },
    "m.call.advertise": {
        "@alice:example.org": { // user ID
            "88888888": { // device ID
                "2345-2345-234567-2345-2345": [{ // media group uuid
                    "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa",
                    "purpose": "m.usermedia",
                    "kind": "video"
                }, {
                    "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
                    "purpose": "m.usermedia",
                    "kind": "audio"
                }]
            }
        }
    }
}
```

XXX: How flat vs deep do we want the structure to be here? I've done it quite deep here,
organised by user ID / device ID / media group UUID, but it could also just be a flat
list of tracks. It would be more duplication but maybe less effort to read.

The expected behaviour here would be for each focus to essentially maintain a structure with all
tracks being pushed to it. This structure would probably have call IDs as a top level index,
then look very similar to the structure of the `m.call.advertise` block. It could keep
a reference to the transceiver it was receiving media on in the structure itself alongside
the media UUID, or maintain a separate map of media UUID to transceiver / peer connection such
that the first structure could be marshalled to JSON and sent to clients as-is. These are just
potential implementations though; all that is important is that the focus maintains sufficient
information about each track being sent to each client.

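To make the second option concrete (a JSON-safe track structure plus a separate media UUID to transceiver map), here is a minimal Python sketch. `TrackRegistry` and its methods are hypothetical names. Using the `conf_id` as the top-level key is an assumption made here because the advertise block is described as being per `conf_id`; a real focus might index differently.

```python
import json
from typing import Any


class TrackRegistry:
    """Hypothetical focus-side store of every track being pushed to the focus."""

    def __init__(self) -> None:
        # conf ID -> user ID -> device ID -> media group UUID -> list of tracks.
        # Only plain JSON types live here, so it can be marshalled as-is.
        self._tracks: dict[str, Any] = {}
        # media UUID -> transceiver handle, kept separately from the JSON data.
        self._transceivers: dict[str, Any] = {}

    def add_track(self, conf_id: str, user_id: str, device_id: str,
                  media_group_uuid: str, media_uuid: str, purpose: str,
                  kind: str, transceiver: Any) -> None:
        """Record a track the focus is receiving on the given transceiver."""
        groups = (
            self._tracks.setdefault(conf_id, {})
            .setdefault(user_id, {})
            .setdefault(device_id, {})
        )
        groups.setdefault(media_group_uuid, []).append(
            {"media_uuid": media_uuid, "purpose": purpose, "kind": kind}
        )
        self._transceivers[media_uuid] = transceiver

    def advertise_json(self, conf_id: str) -> str:
        """Serialise the tracks for one conference as an `m.call.advertise` block."""
        return json.dumps({"m.call.advertise": self._tracks.get(conf_id, {})})

    def transceiver_for(self, media_uuid: str) -> Any:
        """Look up the transceiver a media UUID is being received on."""
        return self._transceivers[media_uuid]
```

Keeping the transceiver handles out of the nested structure is what allows `advertise_json` to dump it without any filtering step.
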
On the client side, the client will essentially take the `m.call.advertise` data and save it
almost as-is. It would probably cross-reference it against the call member state events to
ensure that it wasn't showing feeds for any users that did not have state events indicating
that they were in the call, although if we trust the focus and assume that conf IDs are unique
enough to be unguessable, this may be unnecessary.

In the simplest case, the client will simply iterate through this structure and add every
media UUID it finds to an `m.call.subscribe` message.

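A minimal Python sketch of that simplest case follows, assuming the deep user ID / device ID / media group UUID layout of `m.call.advertise` shown above and the `m.call.subscribe` shape given later in this document. The function names are hypothetical.

```python
from typing import Any, Iterator


def iter_media_uuids(advertise: dict[str, Any]) -> Iterator[str]:
    """Yield every media UUID in an `m.call.advertise` block, in document order."""
    for devices in advertise.values():          # per user ID
        for groups in devices.values():         # per device ID
            for tracks in groups.values():      # per media group UUID
                for track in tracks:
                    yield track["media_uuid"]


def build_subscribe(advertise: dict[str, Any], seq: int) -> dict[str, Any]:
    """Subscribe to everything advertised, with no per-track constraints."""
    return {
        "m.call.subscribe": {
            "seq": seq,
            "media_uuids": {uuid: {} for uuid in iter_media_uuids(advertise)},
        }
    }
```

A client that wants only some tracks, or a particular resolution, would filter or fill in the per-UUID dicts instead of leaving them empty.
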
The most complex part is that when the `m.call.describe` message arrives back from the focus,
the client will have to search through the data from the `m.call.advertise` message to map the tracks
it is now receiving to the right user IDs (XXX: it could build them into a map to make this lookup
efficient, although if we make the advertise message indexed by media UUID then it already has
a map indexed by the correct thing...).

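The map suggested in the XXX above could look something like the following Python sketch. It assumes the deep advertise layout, and the names `index_by_media_uuid` and `resolve_describe` are hypothetical. Each mid in an incoming `m.call.describe` block is resolved to a (user ID, device ID) pair, or `None` if the media UUID was never advertised.

```python
from typing import Any, Optional


def index_by_media_uuid(advertise: dict[str, Any]) -> dict[str, tuple[str, str]]:
    """Flatten an `m.call.advertise` block into media UUID -> (user ID, device ID)."""
    index: dict[str, tuple[str, str]] = {}
    for user_id, devices in advertise.items():
        for device_id, groups in devices.items():
            for tracks in groups.values():
                for track in tracks:
                    index[track["media_uuid"]] = (user_id, device_id)
    return index


def resolve_describe(
    describe: dict[str, dict[str, str]],
    index: dict[str, tuple[str, str]],
) -> dict[str, Optional[tuple[str, str]]]:
    """Map each transceiver mid in `m.call.describe` to the sender of its track."""
    return {mid: index.get(entry["media_uuid"]) for mid, entry in describe.items()}
```

If the advertise block were instead keyed by media UUID, `index_by_media_uuid` would become unnecessary, which is the trade-off the XXX note points at.
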
The `select_answer` is also tweaked to be more extensible-event-like, although it is essentially
the same:

m.select\_answer
```
"m.select_answer": {
    "version": "1",
    "conf_id": "1674732106391mz4ygIc84Q2Z6mJ5",
    "call_id": "35657a5b793ce",
    "party_id": "123456",
    "selected_party_id": "678910"
}
```

Once the client receives this, it decides what tracks it wants to receive and then sends
a subscribe message over the data channel:

m.call.subscribe
```
"m.call.subscribe": {
    "seq": 1,
    "media_uuids": {
        "aaaa-aaaa-aaaaaa-aaaa-aaaa": {
            "width": 1024,
            "height": 576
        },
        "bbbb-bbbb-bbbbbb-bbbb-bbbb": {}
    }
}
```

This has also been rearranged a little to make the media UUIDs the keys and remove the
unsubscribe section, which is unnecessary if we always send the complete set of tracks we
want to receive (we unsubscribe by just removing the media UUID from the dict).

This also now contains a sequence number. This is a monotonically increasing integer, starting
at 0 and scoped to the lifetime of the peer connection. The focus will send a reply containing
this sequence number to acknowledge that it has processed the message. This can be a positive ack:

m.call.ack
```
"m.call.ack": {
    "seq": 1,
    "result": "success"
}
```

...or an error:

m.call.ack
```
"m.call.ack": {
    "seq": 1,
    "result": "error",
    "errcode": "M_UNKNOWN",
    "error": "Internal server error"
}
```

This may give some indication as to why some tracks were not available (should it have errors per
media UUID, perhaps?).

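One way a client might keep track of sequence numbers and acks is sketched below in Python. `SubscribeTracker` is a hypothetical helper, and treating an ack for an unknown sequence number as an error is an assumption of this sketch rather than something the proposal specifies.

```python
from typing import Any


class SubscribeTracker:
    """Hypothetical client-side bookkeeping for `m.call.subscribe` and `m.call.ack`."""

    def __init__(self) -> None:
        # Starts at 0 and only ever increases; one tracker per peer connection.
        self._next_seq = 0
        # seq -> media UUIDs requested in that message, until it is acked.
        self._pending: dict[int, set[str]] = {}

    def make_subscribe(self, wanted: dict[str, dict[str, int]]) -> dict[str, Any]:
        """Build a subscribe message carrying the complete set of wanted tracks."""
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = set(wanted)
        return {"m.call.subscribe": {"seq": seq, "media_uuids": wanted}}

    def handle_ack(self, message: dict[str, Any]) -> bool:
        """Consume an ack; return True on success and False on an error result."""
        ack = message["m.call.ack"]
        seq = ack["seq"]
        if seq not in self._pending:
            raise KeyError(f"ack for unknown seq {seq}")
        del self._pending[seq]
        return ack["result"] == "success"

    def pending(self) -> list[int]:
        """Sequence numbers that have been sent but not yet acknowledged."""
        return sorted(self._pending)
```

Since every subscribe message carries the full wanted set, unsubscribing is just a call to `make_subscribe` with that media UUID left out.
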
If the focus needs to renegotiate to send the tracks, it does so, describing the media UUIDs it intends to send on the
transceivers once the negotiation is complete:

m.call.negotiate
```
{
    "m.negotiate": {
        "version": 1,
        "call_id": "35657a5b793ce",
        "conf_id": "bbe53499f82e3",
        "lifetime": 60000,
        "party_id": "123456",
        "description": {
            "type": "offer",
            "sdp": "[...]"
        }
    },
    "m.call.describe": {
        "1": { // transceiver mid 1
            "media_uuid": "aaaa-aaaa-aaaaaa-aaaa-aaaa",
            "media_group_uuid": "1234-1234-123456-1234-1234", // rather than 'track group ID' to match media UUID?
            "purpose": "m.usermedia",
            "kind": "video"
        },
        "2": {
            "media_uuid": "bbbb-bbbb-bbbbbb-bbbb-bbbb",
            "media_group_uuid": "1234-1234-123456-1234-1234",
            "purpose": "m.usermedia",
            "kind": "audio"
        }
    }
}
```

Note that the content of this event is practically *identical* to the invite sent in a full mesh call. The purpose
is the same: to describe what tracks are being sent on each transceiver. In this case, the `purpose` and `kind` fields
are redundant since the client already knows them: they're included for symmetry. User IDs and device IDs are omitted,
however, as the client equally already knows what user IDs the media UUIDs correspond to, and this keeps it the same as
a full mesh track description.

Or it may already have enough spare transceivers and not need to negotiate, in which case it simply sends the same
track description block without a negotiation (and with event type `m.call.describe`).