Recovering voice from captured RTP streams

Written by Maxim Klimanov

We have published several articles on hardware-based capture of network traffic. When the capture is done on a SOHO router, the administrator can neither store a large amount of data nor capture at a high enough speed. This may make the approach look obsolete in modern networks, but today we present a scenario in which the capabilities of existing equipment are sufficient: capturing voice traffic for later analysis and recovering the audio data with Wireshark.

Besides the hardware capture methods already described on our site, there is the conventional approach of capturing with Wireshark itself. This, however, requires that all streams of interest be mirrored to the computer running the sniffer. One way is to connect all nodes to a hub, but this degrades the performance of the whole network, so we consider it unacceptable. Another way is to mirror the passing traffic to a dedicated port. For example, Cisco Systems Catalyst 2950 switches allow you to assign a special interface to which all streams of interest are copied. In the configuration below, traffic passing through port Fa0/5 is copied to port Fa0/10.

Switch(config)# monitor session 1 source interface fastEthernet0/5
Switch(config)# monitor session 1 destination interface fastEthernet0/10

The drawback of this scheme is that the host on the destination port cannot perform its usual network operations. In other words, a node connected to interface Fa0/10 only receives the mirrored data and cannot send or receive frames of its own.

The wired Linksys RVS4000 router does not have the problem described above, but it has another one: the administrator can only capture data coming into a port, not data leaving it. One could, of course, mirror all ports. However, IPSec on the WAN port complicates the task significantly when only one of the parties is behind the RVS4000 and the telephone traffic is encrypted with IPSec for transfer over the internet. We have contacted Linksys about this problem several times, but received neither a clear answer nor a promise to fix it in future firmware. Mirroring of incoming traffic is configured in the Port Mirroring item of the L2 Switch menu.


The ASUS RX3042H router lets you specify which streams from which ports to mirror. Unfortunately, the WAN port is not among those interfaces. On this model the option is configured in the Port Mirroring item of the Router Setup menu.

Let’s assume that we have somehow managed to capture the precious packets and save them to a file, which we now open in Wireshark. If the RTP packets are lost among a large amount of other data, you can isolate them by entering rtp in the display filter field.
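To make it clearer what the sniffer looks at when it classifies a packet as RTP, here is a minimal sketch that parses the fixed 12-byte RTP header (as defined in RFC 3550) from a UDP payload. The names RtpHeader and parse_rtp_header are hypothetical and chosen for illustration only; this is not how Wireshark's dissector is implemented, and a version check alone is not enough to identify RTP reliably.

```python
import struct
from typing import NamedTuple, Optional


class RtpHeader(NamedTuple):
    """Fields of the fixed RTP header (illustrative container)."""
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(udp_payload: bytes) -> Optional[RtpHeader]:
    """Return the fixed RTP header, or None if the bytes cannot be RTP version 2."""
    if len(udp_payload) < 12:
        return None
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source identifier.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", udp_payload[:12])
    version = b0 >> 6
    if version != 2:
        return None
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

Payload type 0 in the parsed header corresponds to G.711 µ-law audio, and the sequence number and timestamp are the fields a stream analysis would rely on to detect loss and jitter.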

We then select one RTP packet from the conversation and open the Telephony > RTP > Stream Analysis menu item.

We are not going to study the statistics of the captured voice stream today, so we move straight on to saving the audio data. To do this, press the Save payload... button.

In the Save Payload window you specify the file name, location and format, and choose which channel direction to save. Unfortunately, saving audio files in Wireshark is not flawless. To save both directions in one file, you need the latest version, or at least a version later than 1.2.4, in which bug 4120 is fixed.
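If the payload has been saved as raw bytes, it may still need converting before an ordinary audio player will open it. The sketch below assumes the stream used G.711 µ-law at 8000 Hz mono (RTP payload type 0); other codecs need a different decoder. The function names ulaw_to_pcm16 and ulaw_payload_to_wav are hypothetical.

```python
import struct
import wave


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    t = ((u & 0x0F) << 3) + 0x84          # mantissa plus bias
    t <<= (u & 0x70) >> 4                 # apply the segment (exponent)
    return (0x84 - t) if (u & 0x80) else (t - 0x84)


def ulaw_payload_to_wav(payload: bytes, destination, sample_rate: int = 8000) -> None:
    """Write mu-law payload bytes as a 16-bit mono WAV.

    destination may be a file path or a writable binary file object.
    The 8000 Hz default is an assumption that matches G.711.
    """
    samples = [ulaw_to_pcm16(b) for b in payload]
    with wave.open(destination, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)               # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Called as ulaw_payload_to_wav(open("call.raw", "rb").read(), "call.wav"), where both file names are placeholders, it would produce a WAV file with one sample per payload byte.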

Besides saving the captured telephone conversation directly, you may want to listen to it first. To do this, open the Telephony > VoIP Calls menu item, where all captured conversations are listed.

Then select the required conversation and press the Player button. In the window that appears, press the Decode button. The Use RTP timestamp option makes the player use the timestamps carried in the RTP packets instead of the packet arrival times. Changing the Jitter buffer [ms] parameter lets you emulate playback of the voice stream on a real device with a fixed-size input buffer.
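The effect of the jitter buffer size can be illustrated with a simplified model. The sketch below is not Wireshark's actual player algorithm: it assumes an 8000 Hz RTP clock, no timestamp wraparound, playout of the first packet delayed by exactly the buffer size, and that any packet arriving after its scheduled playout time is discarded. The function name simulate_jitter_buffer is hypothetical.

```python
from typing import List, Tuple


def simulate_jitter_buffer(
    packets: List[Tuple[float, int]],
    buffer_ms: float,
    clock_hz: int = 8000,
) -> List[bool]:
    """Decide for each packet whether it arrives in time to be played.

    packets holds (arrival_time_in_seconds, rtp_timestamp) pairs in arrival
    order. Returns one flag per packet: True if played, False if too late.
    """
    if not packets:
        return []
    first_arrival, first_ts = packets[0]
    buffer_s = buffer_ms / 1000.0
    played = []
    for arrival, ts in packets:
        # Position of this packet in media time, taken from RTP timestamps.
        media_offset = (ts - first_ts) / clock_hz
        # The first packet is held for buffer_s; later ones follow media time.
        playout_time = first_arrival + buffer_s + media_offset
        played.append(arrival <= playout_time)
    return played


if __name__ == "__main__":
    # 20 ms packets (160 timestamp units each); the third one is delayed.
    capture = [(0.000, 0), (0.020, 160), (0.100, 320), (0.105, 480)]
    for size in (50, 100):
        print(size, simulate_jitter_buffer(capture, size))
```

With the sample data, a 50 ms buffer drops the delayed third packet while a 100 ms buffer plays all four, which is the trade-off between added latency and loss that the Jitter buffer [ms] setting lets you explore.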

As soon as decoding is finished, you can listen to the captured audio stream using the Play, Pause, and Stop buttons.

This concludes our brief overview of Wireshark's voice capabilities. We should add that office systems for recording telephone conversations are all built on the same principle: the entire data stream from a SIP or H.323 server is captured together with the traffic from the telephones themselves for further detailed analysis. The task becomes significantly simpler when a dedicated VLAN is used for telephony.
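As a rough illustration of why a dedicated VLAN helps, the sketch below reads the 802.1Q tag of a captured Ethernet frame so that voice frames can be selected by VLAN ID alone. It assumes the capture retains the 802.1Q tags, which depends on how the mirroring port and the capturing network card are configured. The function name vlan_id and the VLAN number in the comment are hypothetical.

```python
import struct
from typing import Optional


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID of an Ethernet frame, or None if untagged."""
    if len(frame) < 16:
        return None
    # Bytes 0-11 are the destination and source MAC addresses; a tagged
    # frame carries TPID 0x8100 followed by the 16-bit tag control field.
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != 0x8100:
        return None
    return tci & 0x0FFF                   # low 12 bits hold the VLAN ID


# Example selection, assuming voice uses the hypothetical VLAN 100:
# voice_frames = [f for f in frames if vlan_id(f) == 100]
```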
