Opened 4 years ago

Last modified 2 years ago

#9307 new defect

Decoding of Opus audio with missing packets can produce noise spikes

Reported by: Misaki
Owned by:
Priority: normal
Component: undetermined
Version: unspecified
Keywords: opus
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description (last modified by Misaki)

Summary of the bug:
OLD: With some files, seeking to the start produces a noise spike.

UPDATED: Output levels from Opus decoding can vary greatly if some packets at the start are missing, either because they weren't included in the stream or because ffplay seeks to a video keyframe that comes after those packets.

In the past, I found that a slight change in volume when encoding the audio decided whether this happened. For example, the filter "volume=0.8734" would produce a 'bugged' file that would cause this noise spike, while "volume=0.8733" would not. I waited to report it until I could test with a more recent version of ffmpeg and ffplay.

How to reproduce (see below for interpretation):

$  /usr/bin/ffplay \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -af astats
ffplay version 4.3.2-0+deb11u1ubuntu1 Copyright (c) 2003-2021 the FFmpeg developers
  built with gcc 10 (Ubuntu 10.2.1-20ubuntu1)
  configuration: --prefix=/usr --extra-version=0+deb11u1ubuntu1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100
Input #0, matroska,webm, from '[pow at start][crop_1080]屏東潮州六姐妹in新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm':
  Metadata:
    ENCODER         : Lavf57.83.100
  Duration: 00:00:01.02, start: -0.007000, bitrate: 1804 kb/s
    Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv), 1280x720, SAR 32:27 DAR 512:243, 24 fps, 24 tbr, 1k tbn, 1k tbc (default)
    Metadata:
      DURATION        : 00:00:01.020000000
    Stream #0:1(eng): Audio: opus, 48000 Hz, mono, fltp (default)
    Metadata:
      ENCODER         : Lavc57.107.100 libopus
      DURATION        : 00:00:01.001000000
[Parsed_astats_0 @ 0x7f68b40151c0] Channel: 1
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level: -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Crest factor: 1.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Dynamic range: inf
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings rate: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Overall
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level: -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: 3082.547156
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of samples: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0.000000

[end of initial filter output before playback starts]

[Parsed_astats_0 @ 0x7f689c004580] Channel: 1
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Crest factor: 3.824099
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Dynamic range: 318.880510
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings: 2403
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings rate: 0.050390
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c004580] Overall
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Number of samples: 47688
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0.000000

[end of first playback]

[Parsed_astats_0 @ 0x7f689c0429c0] Channel: 1
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Crest factor: 16.015339
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Dynamic range: 336.671657
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings: 2399
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings rate: 0.050999
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Overall
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Number of samples: 47040
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0.000000
[end of second playback]

To produce the above output, I run the command. It plays the 1-second video and then stops. I press the left arrow key to seek to the start. This causes the astats filter to finish processing, so it prints the block that includes 'Peak level dB: -3.583367'. The video plays again, this time with the noise spike at the start. I press Q to quit, and the astats filter finishes for this second playback, printing the block that includes 'Peak level dB: 14.207783'.
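As a sanity check on the two 'Peak level dB' figures, the sketch below converts the largest absolute sample value reported by astats into dB. It assumes the usual full-scale convention, 20*log10(|peak|), which has not been checked against the astats source; peak_db is a name made up for this illustration.

```python
import math

def peak_db(min_level: float, max_level: float) -> float:
    """Peak level in dB relative to full scale (1.0), assuming 20*log10(|peak|)."""
    peak = max(abs(min_level), abs(max_level))
    return 20.0 * math.log10(peak)

# Min/Max level values copied from the astats output above.
first = peak_db(-0.661960, 0.650380)   # first playback
second = peak_db(-1.093806, 5.133211)  # second playback, after seeking

print(f"first playback:  {first:.2f} dB")
print(f"second playback: {second:.2f} dB")
```

Under that assumption the results agree with the reported -3.58 dB and 14.21 dB, so the second playback peaks at more than five times full scale.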

I'm not sure if this is somehow caused by opus. Specifying '-acodec libopus' gives output that sounds the same; for some reason it seems to result in format s16 as the audio input for the filter chain, compared to format 'fltp' for the default codec of 'opus', as seen with -v verbose or the ashowinfo filter. This changes the output from astats, which then reports a peak of 0 dB but a peak count of 184.

When using option '-vn' for no video, the noise spike does not happen when seeking to the start of the file.

It's possible this isn't a bug, though the result I had, where a slight change in volume led to the noise spike, suggests it is a bug. If it isn't a bug, I'm guessing it's somehow caused by concatenating opus packets in the wrong way. I'll describe a problem I had when doing that, in case it helps with diagnosing this bug: I was trying to make a video which I encoded in segments. Each segment had H.264 video and Opus audio. When I joined all the segments with the 'concat' demuxer and '-c copy' for stream copy, in some places it seemed to work fine, but between some segments there was a noise spike.

That is, 'astats' would report a spike to something like 7 dB at the start of a segment, even though the original audio did not have this spike and '-c copy' was used. I tried uploading the joined audio to YouTube in case it was a problem with my decoding software, and the problem was there too. I can only guess that Opus keeps some kind of state, and that packets depend on the state from previous packets. (You can kind of see this if you try to force a DC bias into an audio stream; output visualization with ffplay or something similar shows it quickly going to zero each time you seek to a new point in the file.) So something like this could be the cause of the current bug. I can't explain why a minuscule change in volume while encoding would lead to greatly diverging results during playback, or why there's no noise spike with '-vn', though.

I do note that in this output, there are fewer audio samples in the playback with the noise spike (47040 instead of 47688), and I think this might actually be the key to fixing this bug. With -vn, the second playback gives 48000 samples but the same peak dB; the second playback sounds slightly different, but that is probably just from my pulseaudio starting later or something.
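To put the sample counts in time units, here is a small sketch that converts the difference to milliseconds. It assumes the 48000 Hz rate shown in the stream info applies to the astats sample counts; samples_to_ms is a helper name invented for this example.

```python
SAMPLE_RATE = 48000  # Hz, from the stream info: "opus, 48000 Hz, mono"

def samples_to_ms(samples: int, rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count to milliseconds at the given sample rate."""
    return samples * 1000 / rate

first_playback = 47688   # 'Number of samples' for the first playback
second_playback = 47040  # 'Number of samples' after seeking to the start

missing = first_playback - second_playback
print(f"{missing} samples missing, about {samples_to_ms(missing)} ms")
```

So the playback with the spike is 648 samples shorter, which is 13.5 ms of audio at 48 kHz.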

So I think what is happening here is that, since the first video frame has a presentation timestamp of 0.021 (due to opus audio becoming 0.007 seconds earlier each time you copy it, which might be another bug which I'm not reporting here), ffplay seeks to the audio packet that matches the start of the video. When it starts decoding from this slightly later packet, there is a noise spike.

If this explanation is correct, the questions are:
1) Is ffmpeg/ffplay following the decoding specification for Opus?
2) Can the problem be fixed even if it's due to following the spec?

Attachments (3)

[pow at start][crop_1080]屏東潮州六姐妹in新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm (224.7 KB ) - added by Misaki 4 years ago.
copy.webm (11.6 KB ) - added by Misaki 4 years ago.
Audio starting from 0.01
copy.mp4 (11.7 KB ) - added by Misaki 4 years ago.
Noise spike in mp4 container with opus audio


Change History (10)

by Misaki, 4 years ago

Attachment: copy.webm added

Audio starting from 0.01

comment:1 by Misaki, 4 years ago

The 'seeking starts from a later packet' theory is correct.

ffmpeg -i \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -c copy -vn -ss 0.01 copy.webm

Attached.

 ffmpeg -i copy.webm -af astats -f null -
[...]
  Duration: 00:00:00.99, start: 0.004000, bitrate: 95 kb/s
[...]
[Parsed_astats_0 @ 0x55c9a77f8100] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x55c9a77f8100] RMS level dB: -9.854039
[Parsed_astats_0 @ 0x55c9a77f8100] RMS peak dB: -13.131624
[...]
[Parsed_astats_0 @ 0x55c9a77f8100] Number of samples: 46728

(Command in original ticket has unneeded space in filename when copying from terminal: '屏 東')

In case it's helpful, results from silencedetect filter:

 ffmpeg -i \[pow* -t 1 -vn -af silencedetect=d=0.005 -f null -
[silencedetect @ 0x55cdbd13a040] silence_start: 0.007
[silencedetect @ 0x55cdbd13a040] silence_end: 0.0365 | silence_duration: 0.0295

 ffmpeg -i copy.webm -t 1 -vn -af silencedetect=d=0.005 -f null -
[silencedetect @ 0x556a286865c0] silence_start: 0.007
[silencedetect @ 0x556a286865c0] silence_end: 0.0134792 | silence_duration: 0.00

Note that ffmpeg is applying the start offset to these times due to not using '-seek_timestamp 1'.

Mp4 container has same result; attached.

ffmpeg -i [pow* -c copy -vn -ss 0.01 -strict -2 copy.mp4


comment:2 by Misaki, 4 years ago

Description: modified (diff)

by Misaki, 4 years ago

Attachment: copy.mp4 added

Noise spike in mp4 container with opus audio

comment:3 by Misaki, 4 years ago

Description: modified (diff)
Summary: changed from "Seeking to the start of some files produces noise spike" to "Decoding of Opus audio with missing packets produces produces noise spike"

comment:4 by Misaki, 4 years ago

Summary: changed from "Decoding of Opus audio with missing packets produces produces noise spike" to "Decoding of Opus audio with missing packets can produce noise spikes"

comment:5 by Misaki, 4 years ago

When I play these three files (the original, plus copy.webm and copy.mp4, which contain just the audio starting from 0.01) in Firefox on Linux, I get the audio spike with the two copies, but not when seeking to the start of the original. Testing shows that if the video is significantly delayed, for example by 0.5 sec, seeking to the start of the file goes to its actual start, so the original file plays from 0 (or possibly -0.007, though ashowinfo says the start time is 0), not from 0.02, which is the first video frame.

So I'm not sure whether it would have a noise spike if playback started from 0.01 or 0.02, or whether it's fading in the first 0.01 seconds of audio. Trying to seek to around 1% of the playback bar doesn't lead to any such spike.

The point here is that normal files should not be missing packets. Maybe ffmpeg or ffplay could seek to a few audio packets before the playback point and discard the decoded output up to that point.
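A minimal sketch of that idea, in Python rather than ffmpeg's own code: pick a decode start point some pre-roll before the requested position, and count how many decoded samples to throw away. The 80 ms default is the pre-roll that RFC 7845 recommends for Ogg Opus; whether ffplay does or should use the same value here is an assumption, and plan_seek, SeekPlan and ms_to_samples are hypothetical names.

```python
from dataclasses import dataclass

SAMPLE_RATE = 48000  # Opus decodes at 48 kHz

@dataclass
class SeekPlan:
    decode_from: int  # sample position where decoding starts
    discard: int      # decoded samples to drop before output begins

def ms_to_samples(ms: int) -> int:
    """Convert whole milliseconds to samples using integer arithmetic."""
    return SAMPLE_RATE * ms // 1000

def plan_seek(target: int, stream_start: int = 0, preroll_ms: int = 80) -> SeekPlan:
    """Start decoding preroll_ms before target, clamped to the stream start."""
    target = max(target, stream_start)
    decode_from = max(stream_start, target - ms_to_samples(preroll_ms))
    return SeekPlan(decode_from=decode_from, discard=target - decode_from)

# Seeking to 21 ms, roughly where the first video frame is: decoding starts
# at sample 0, so no packets are skipped and the decoder state is built up
# from the real start of the stream.
print(plan_seek(ms_to_samples(21)))

# Seeking to 500 ms: decoding starts 80 ms earlier and that audio is dropped.
print(plan_seek(ms_to_samples(500)))
```

With a plan like this, a seek to the start of the file would never begin decoding at a packet later than the first one, which is the situation that produces the spike in the attached copies.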


comment:6 by Carl Eugen Hoyos, 2 years ago

Keywords: ffplay libopus removed

comment:7 by Carl Eugen Hoyos, 2 years ago

Is the issue still reproducible with current FFmpeg git head?
