Opened 4 years ago
Last modified 2 years ago
#9307 new defect
Decoding of Opus audio with missing packets can produce noise spikes
| Reported by: | Misaki | Owned by: | |
|---|---|---|---|
| Priority: | normal | Component: | undetermined |
| Version: | unspecified | Keywords: | opus |
| Cc: | | Blocked By: | |
| Blocking: | | Reproduced by developer: | no |
| Analyzed by developer: | no | | |
Description (last modified by )
Summary of the bug:
OLD: With some files, seeking to the start produces a noise spike.
UPDATED: Output levels from Opus decoding can vary greatly if some packets at the start are missing, either because they weren't included in the stream or because ffplay seeks to a video keyframe that comes after those packets.
In the past, I found that whether this happened could depend on a very slight change in volume when encoding the audio. For example, the filter "volume=0.8734" would produce a 'bugged' file that caused this noise spike, while "volume=0.8733" would not. I waited to report it until I could test with a more recent version of ffmpeg and ffplay.
How to reproduce (see below for interpretation):
$ /usr/bin/ffplay \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -af astats
ffplay version 4.3.2-0+deb11u1ubuntu1 Copyright (c) 2003-2021 the FFmpeg developers
built with gcc 10 (Ubuntu 10.2.1-20ubuntu1)
configuration: --prefix=/usr --extra-version=0+deb11u1ubuntu1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
libavutil 56. 51.100 / 56. 51.100
libavcodec 58. 91.100 / 58. 91.100
libavformat 58. 45.100 / 58. 45.100
libavdevice 58. 10.100 / 58. 10.100
libavfilter 7. 85.100 / 7. 85.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 7.100 / 5. 7.100
libswresample 3. 7.100 / 3. 7.100
libpostproc 55. 7.100 / 55. 7.100
Input #0, matroska,webm, from '[pow at start][crop_1080]屏東潮州六姐妹in新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm':
Metadata:
ENCODER : Lavf57.83.100
Duration: 00:00:01.02, start: -0.007000, bitrate: 1804 kb/s
Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv), 1280x720, SAR 32:27 DAR 512:243, 24 fps, 24 tbr, 1k tbn, 1k tbc (default)
Metadata:
DURATION : 00:00:01.020000000
Stream #0:1(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Metadata:
ENCODER : Lavc57.107.100 libopus
DURATION : 00:00:01.001000000
[Parsed_astats_0 @ 0x7f68b40151c0] Channel: 165KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level: -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Crest factor: 1.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Dynamic range: inf
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings rate: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Overall
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level: -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: 3082.547156
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of samples: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0.000000
[end of initial filter output before playback starts]
[Parsed_astats_0 @ 0x7f689c004580] Channel: 1 0KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Crest factor: 3.824099
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Dynamic range: 318.880510
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings: 2403
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings rate: 0.050390
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c004580] Overall
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Number of samples: 47688
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0.000000
[end of first playback]
[Parsed_astats_0 @ 0x7f689c0429c0] Channel: 1 0KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Crest factor: 16.015339
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Dynamic range: 336.671657
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings: 2399
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings rate: 0.050999
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Overall
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Number of samples: 47040
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0.000000
[end of second playback]
For the above output, I enter the command. It plays the 1-second video and then stops. I press the left arrow to seek to the start. This causes the astats filter to finish processing, so it produces the output that includes 'Peak level dB: -3.583367'. The video plays again, with the noise spike at the start. I press Q to quit, and the astats filter finishes for this second playback, producing the output that includes 'Peak level dB: 14.207783'.
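To put the two peak values in perspective, the astats 'Min level' and 'Max level' figures can be converted to dB by hand. The following is only a small sketch of that arithmetic, assuming that astats reports linear levels where 1.0 is full scale and that 'Peak level dB' is 20*log10 of the largest absolute sample. The function name `level_to_db` is made up for this illustration.

```python
import math


def level_to_db(level: float) -> float:
    """Convert a linear sample level (assumed 1.0 = full scale) to dB."""
    return 20.0 * math.log10(abs(level))


# Largest absolute sample value from each astats report quoted in the ticket.
first_playback_peak = max(abs(-0.661960), abs(0.650380))
second_playback_peak = max(abs(-1.093806), abs(5.133211))

print("first playback :", round(level_to_db(first_playback_peak), 3), "dB")
print("second playback:", round(level_to_db(second_playback_peak), 3), "dB")
```

The results should agree closely with the reported 'Peak level dB' lines. A positive value means the decoded samples exceed full scale (here a maximum of about 5.13), which would clip when converted to an integer sample format.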
I'm not sure if this is somehow caused by opus. Specifying '-acodec libopus' gives output that sounds the same; for some reason it results in format s16 as the audio input for the filter chain, compared to format 'fltp' for the default codec of 'opus', as seen with -v verbose or the ashowinfo filter. This changes the output from astats, which then reports a peak of 0 dB but a peak count of 184.
When using option '-vn' for no video, the noise spike does not happen when seeking to the start of the file.
It's possible this isn't a bug, though the result I had with a slight change in volume leading to the noise spike suggests it is a bug. If it isn't a bug, I'm guessing it's somehow caused by concatenating opus packets in the wrong way. I'll describe a problem I had when doing that, in case it helps with diagnosing this bug: I was trying to make a video which I encoded in segments. Each segment had H.264 video and Opus audio. When I joined all the segments with the 'concat' demuxer and '-c copy' for stream copy, in some places it seemed to work fine, but between some segments there was a noise spike.
That is, 'astats' would report a spike to something like 7 dB at the start of a segment, even though the original audio did not have this spike and '-c copy' was used. I tried uploading the joined audio to YouTube in case it was a problem with my decoding software, and the problem was there too. I can only guess that Opus keeps some kind of internal state, and packets depend on the state from previous packets. (You can kind of see this if you try to force a DC bias into an audio stream; output visualization with ffplay or something similar shows it quickly going to zero each time you seek to a new point in the file.) So something like this could be the cause of the current bug. I can't explain why a minuscule change in volume while encoding would lead to greatly diverging results during playback, or why there's no noise spike with '-vn', though.
I do note that in this output, there are fewer audio samples in the playback with the noise spike (47040 instead of 47688), and I think this might actually be the key to fixing this bug. With -vn, the second playback gives 48000 samples but the same peak dB as the first playback; the second playback sounds slightly different, but probably just from my pulseaudio starting later or something.
So I think what is happening here is that, since the first video frame has a presentation timestamp of 0.021 (due to opus audio becoming 0.007 seconds earlier each time you copy it, which might be another bug which I'm not reporting here), ffplay seeks to the audio packet that matches the start of video. When it starts from this slightly later packet, there is a noise spike.
If this explanation is correct, the questions are
1) is ffmpeg/ffplay following the decoding specifications for opus?
2) can the problem be fixed even if it's due to following the spec?
Attachments (3)
Change History (10)
by , 4 years ago
by , 4 years ago
comment:1 by , 4 years ago
The 'seeking starts from a later packet' theory is correct.
ffmpeg -i \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -c copy -vn -ss 0.01 copy.webm
Attached.
ffmpeg -i copy.webm -af astats -f null -
[...]
Duration: 00:00:00.99, start: 0.004000, bitrate: 95 kb/s
[...]
[Parsed_astats_0 @ 0x55c9a77f8100] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x55c9a77f8100] RMS level dB: -9.854039
[Parsed_astats_0 @ 0x55c9a77f8100] RMS peak dB: -13.131624
[...]
[Parsed_astats_0 @ 0x55c9a77f8100] Number of samples: 46728
(Command in original ticket has unneeded space in filename when copying from terminal: '屏 東')
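The sample counts reported so far can be compared directly. This is a rough sketch rather than a diagnosis: it assumes the audio was encoded with 20 ms Opus packets (the usual libopus default, but not confirmed for this file), which at 48000 Hz is 960 samples per packet. The dictionary keys and the helper `missing_packets` are names invented for this illustration.

```python
SAMPLE_RATE = 48000      # Hz, from the stream info in the ticket
ASSUMED_PACKET_MS = 20   # assumption: default libopus frame duration

samples_per_packet = SAMPLE_RATE * ASSUMED_PACKET_MS // 1000

# "Number of samples" values quoted in the ticket and in this comment.
reported = {
    "original, first playback": 47688,
    "original, after seek with video": 47040,
    "original, after seek with -vn": 48000,
    "copy.webm (-ss 0.01)": 46728,
}


def missing_packets(full: int, short: int) -> float:
    """Difference between two sample counts, in assumed 20 ms packets."""
    return (full - short) / samples_per_packet


pairs = [
    ("original, after seek with -vn", "original, after seek with video"),
    ("original, first playback", "copy.webm (-ss 0.01)"),
]
for full_key, short_key in pairs:
    n = missing_packets(reported[full_key], reported[short_key])
    print(f"{full_key} vs {short_key}: {n} packet(s)")
```

Under that assumption, both comparisons come out to exactly one packet (960 samples), which would be consistent with playback starting from the second audio packet in the cases that show the spike.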
In case it's helpful, results from silencedetect filter:
ffmpeg -i \[pow* -t 1 -vn -af silencedetect=d=0.005 -f null -
[silencedetect @ 0x55cdbd13a040] silence_start: 0.007
[silencedetect @ 0x55cdbd13a040] silence_end: 0.0365 | silence_duration: 0.0295
ffmpeg -i copy.webm -t 1 -vn -af silencedetect=d=0.005 -f null -
[silencedetect @ 0x556a286865c0] silence_start: 0.007
[silencedetect @ 0x556a286865c0] silence_end: 0.0134792 | silence_duration: 0.00
Note that ffmpeg is applying the start offset to these times due to not using '-seek_timestamp 1'.
Mp4 container has same result; attached.
ffmpeg -i [pow* -c copy -vn -ss 0.01 -strict -2 copy.mp4
comment:2 by , 4 years ago
Description: | modified (diff) |
---|
comment:3 by , 4 years ago
Description: | modified (diff) |
---|---|
Summary: | Seeking to the start of some files produces noise spike → Decoding of Opus audio with missing packets produces produces noise spike |
comment:4 by , 4 years ago
Summary: | Decoding of Opus audio with missing packets produces produces noise spike → Decoding of Opus audio with missing packets can produce noise spikes |
---|
comment:5 by , 4 years ago
When I play these three files (original, copy.webm and copy.mp4 with just audio starting from 0.01) in Firefox on Linux, I get the audio spike with the two copies, but not when seeking to the start of the original. Testing shows that if video is significantly delayed, like by 0.5 sec, seeking to the start of the file goes to its actual start, so in the original file it plays from 0 (or possibly -0.007; but ashowinfo says the start time is 0), not from 0.02, which is the first video frame.
So I'm not sure whether Firefox would have a noise spike if playback started from 0.01 or 0.02, or whether it's fading in the first 0.01 seconds of audio. Trying to seek to around 1% of the playback bar doesn't lead to any such spike.
The point here is that normal files should not be missing packets. Maybe ffmpeg or ffplay could seek to a few audio packets before the playback point and discard the decoded output up to that point.
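The 'decode a bit earlier and discard' idea can be sketched as follows. This is not how ffmpeg or ffplay is implemented; it is only an illustration of the arithmetic. The 80 ms figure (3840 samples at 48 kHz) is taken from the seeking recommendation in RFC 7845, which covers Opus in Ogg; whether the same amount is appropriate for this WebM file is an assumption. `SeekPlan` and `plan_seek` are hypothetical names.

```python
from dataclasses import dataclass

SAMPLE_RATE = 48000
# Assumption: 80 ms of pre-roll, as RFC 7845 recommends for Ogg Opus seeking.
PREROLL_SAMPLES = 3840


@dataclass
class SeekPlan:
    decode_from: int  # sample position where decoding should begin
    discard: int      # decoded samples to drop before producing output


def plan_seek(target: int, stream_start: int = 0,
              preroll: int = PREROLL_SAMPLES) -> SeekPlan:
    """Start decoding `preroll` samples before `target`, clamped to the
    start of the stream, and discard everything decoded before `target`."""
    decode_from = max(stream_start, target - preroll)
    return SeekPlan(decode_from=decode_from, discard=target - decode_from)


# Seeking to 1.0 s: decode from 0.92 s and drop the first 3840 samples.
print(plan_seek(48000))
# Seeking to the second 20 ms packet: decode from the very first sample.
print(plan_seek(960))
```

With a plan like this, a seek that lands just after the first packet would still feed that first packet to the decoder, so its state should have converged by the time audible output starts.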
The fun part here is that if you go to the original video, https://www.youtube.com/watch?v=vUY-EH3gTRU, it does have the spike. This is the case for both format 251 opus and format 140 aac (and format 22, h264+aac, which can sometimes be different). But it might be a different-sounding spike, from a different source (audio level normalization during recording); it seems to give a peak dB of around 0 for both the webm/opus and mp4/aac, compared to the 14 dB peak when starting from the second audio packet of the attached file.
comment:6 by , 2 years ago
Keywords: | ffplay libopus removed |
---|
Audio starting from 0.01