Opened 4 years ago

Closed 19 months ago

#8694 closed enhancement (wontfix)

FFV1 decoding needs a huge number of threads for optimal performance

Reported by: zorr
Owned by:
Priority: normal
Component: avcodec
Version: git-master
Keywords: ffv1
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

I noticed that when decoding FFV1 (especially version 1) you can get much higher performance by raising the number of decoding threads far above the default (and recommended) number. On my system (Ryzen 3900X) the default is 16 threads. With an 8-bit ffv1 v1 SD (720x576) source, 16 threads gives 163 fps but 384 threads gives 1181 fps (a 7.2x speed-up). Other lossless codecs (huffyuv, magicyuv, utvideo) don't behave this way: their best performance is achieved with 48 threads, but 24 threads is almost as good, which makes sense because that's the number of logical cores on the test machine.

I ran the tests using the null encoder and without audio. The test source is over 30 minutes long (44058 frames). I measured the wall clock time and calculated the fps, taking the best of three runs. The test command was the following, varying only the -threads parameter:

ffmpeg -threads 384 -i src.avi -an -f null -
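
The measurement procedure described above (best of three wall-clock runs, fps derived from the 44058-frame source) could be automated roughly as follows. This is only a sketch: the helper names (build_command, fps, best_of) are made up for illustration, the source file name is taken from the command above, and it is not the script that was actually used for these numbers.

```python
import subprocess
import time

FRAME_COUNT = 44058  # frame count of the test source, as stated in the ticket


def build_command(threads, source="src.avi"):
    """Return the ffmpeg argv from the ticket for a given thread count."""
    return ["ffmpeg", "-threads", str(threads), "-i", source,
            "-an", "-f", "null", "-"]


def fps(frames, elapsed_ms):
    """Frames per second from a frame count and wall-clock milliseconds."""
    return frames / (elapsed_ms / 1000.0)


def best_of(threads, runs=3, source="src.avi"):
    """Run the decode `runs` times and return the fastest wall-clock time in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(build_command(threads, source), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append((time.perf_counter() - start) * 1000.0)
    return min(timings)


if __name__ == "__main__":
    # Thread counts taken from the ffv1 v1 table in this ticket.
    for threads in (16, 24, 48, 96, 128, 192, 256, 384, 512, 768):
        elapsed = best_of(threads)
        print(f"{threads}\t{elapsed:.0f}\t{fps(FRAME_COUNT, elapsed):.0f}")
```

As a sanity check on the arithmetic, fps(44058, 269650) rounds to the 163 fps reported for 16 threads.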

More detailed results below:

ffv1 v1, null encoder
threads		time (ms)	fps	
16		269650		163
24		216301		204
48		130619		337
96		72483		608
128		57245		770
192		46769		942
256		38337		1149
384		37304		1181
512		37352		1180
768		37458		1176
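
To make the scaling in the table above easier to read, here is a small sketch that computes the speed-up relative to the 16-thread run and the thread count at which throughput levels off. The fps values are copied from the table; the 1% tolerance used to define "levels off" is my own arbitrary choice, not something measured.

```python
# (threads, fps) pairs copied from the "ffv1 v1, null encoder" table.
FFV1_V1 = [(16, 163), (24, 204), (48, 337), (96, 608), (128, 770),
           (192, 942), (256, 1149), (384, 1181), (512, 1180), (768, 1176)]


def speedup(results, baseline_threads=16):
    """Map each thread count to its fps relative to the baseline run."""
    base = dict(results)[baseline_threads]
    return {threads: rate / base for threads, rate in results}


def plateau(results, tolerance=0.01):
    """Smallest thread count whose fps is within `tolerance` of the best fps."""
    best = max(rate for _, rate in results)
    return min(t for t, rate in results if rate >= best * (1 - tolerance))


if __name__ == "__main__":
    for threads, factor in speedup(FFV1_V1).items():
        print(f"{threads}\t{factor:.1f}x")
    print("plateau at", plateau(FFV1_V1), "threads")
```

With the numbers above this gives a 7.2x speed-up at 384 threads, which is also the first thread count within 1% of the best result.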

I also ran a more real-world scenario of converting the source to huffyuv. In this case best performance was achieved with 512 threads but 256 is almost as good. Detailed results below.

ffv1 v1 -> huffyuv
threads		time (ms)	fps		
16		279524		158	
24		224079		197	
48		133244		331	
96		75631		583	
128		60817		724		
192		49113		897		
256		41690		1057	
384		41644		1058	
512		41628		1058	
768		41722		1056	

FFV1 v3 doesn't need quite as many threads: the optimum was 128 threads (and even 96 is almost as good).

ffv1 v3 null encoder
threads		time (ms)	fps
16		91734		480
24		72105		611
48		50835		867
64		40670		1083
80		39819		1106
96		37766		1167
128		37621		1171
192		37661		1170

And here are the results for utvideo, magicyuv and huffyuv.

utvideo, null encoder
threads		time (ms)	fps
6		19033		2315
8		14329		3075
12		9785		4503
16		7703		5720
24		5463		8065
48		5436		8105
96		5497		8015

magicyuv, null encoder 
threads		time (ms)	fps
6		30525		1443
8		22947		1920
12		15902		2771
16		12687		3473
24		8956		4919
48		8923		4938
96		8944		4926

huffyuv, null encoder
threads		time (ms)	fps
6		22630		1947
8		17048		2584
12		12210		3608
16		10034		4391
24		7214		6107
48		7189		6129
96		7263		6066

These benchmarks were run with the git build 20200525-6268034 (May 25, 2020 10:44). I have also tested version 4.2.2 and version 3.4.2. The performance is very similar in all of them. User furq on the #ffmpeg channel also confirmed that on his Ryzen 2600 (6 cores, 12 logical cores) the best performance was with 128 threads.

I made a couple of charts to better visualize the scaling behaviour of the codecs, see here: https://i.postimg.cc/VNTxgWdw/ffv1-performance.png.

Whenever more than 16 threads are requested, ffmpeg displays the warning "Using a thread count greater than 16 is not recommended." When I asked about this on the #ffmpeg IRC channel, users furq and Compn found that the warning is probably related to H.264 slice threading, which seems to be buggy with more than 16 threads: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/pthread_internal.h#L24-L26. The actual warning message code is here: https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/pthread.c#L64-L67. I have also confirmed that there are no errors in the resulting video even when using 512 threads to decode ffv1; the hashes are equal.
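
For anyone who wants to repeat the hash comparison, one possible way is ffmpeg's md5 muxer, which prints a single "MD5=..." line for the decoded output. The sketch below assumes that approach and the same src.avi file as above; the helper names are made up, and this is not necessarily how the hashes mentioned above were produced.

```python
import subprocess


def md5_command(threads, source="src.avi"):
    """ffmpeg argv that decodes `source` and prints an MD5 of the decoded video."""
    return ["ffmpeg", "-threads", str(threads), "-i", source,
            "-an", "-f", "md5", "-"]


def decoded_md5(threads, source="src.avi"):
    """Run the decode and return the 'MD5=...' line written to stdout."""
    result = subprocess.run(md5_command(threads, source), check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def same_output(thread_counts, source="src.avi"):
    """True if every thread count yields the same decoded-video hash."""
    return len({decoded_md5(t, source) for t in thread_counts}) == 1
```

For example, same_output([16, 512]) would be expected to return True if decoding with 512 threads is bit-exact with the default.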

So I think one way to improve things would be to customize the warning message based on the codec in use, and perhaps even to adjust the default number of threads based on the codec and the number of available cores. Users are probably not aware that a 7-fold speed-up is possible just by adjusting the number of threads.
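
To illustrate what a codec-dependent warning might look like, here is a rough sketch of the decision logic. It is purely hypothetical: the set of affected codecs is an assumption based on the IRC discussion (H.264 slice threading), the function name is made up, and none of this reflects how libavcodec actually decides to print the warning.

```python
RECOMMENDED_MAX = 16  # limit named in ffmpeg's current warning text

# Hypothetical: codecs for which the 16-thread caution is assumed to matter.
CODECS_WITH_LIMIT = {"h264"}


def thread_warning(codec, threads):
    """Return a warning string, or None when no warning applies."""
    if threads > RECOMMENDED_MAX and codec in CODECS_WITH_LIMIT:
        return (f"Using a thread count greater than {RECOMMENDED_MAX} "
                f"is not recommended for {codec}.")
    return None
```

Under this policy, thread_warning("ffv1", 384) returns None, while thread_warning("h264", 32) still produces the warning.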

And I think it's worth taking a look at why ffv1 needs so many threads in the first place. Perhaps it is by design but it could also be a symptom of a hidden design flaw or a simple coding error.

Change History (3)

comment:1 by Carl Eugen Hoyos, 4 years ago

Keywords: decoding performance removed

comment:2 by Elon Musk, 2 years ago

Is this still reproducible?

comment:3 by Elon Musk, 19 months ago

Resolution: wontfix
Status: new → closed