Performance of Elementary Functions

by Pavel Holoborodko on April 27, 2016

UPDATE (April 22, 2017): Timings for Mathematica 11.1 have been added to the table, thanks to a test script contributed by Bor Plestenjak. I suggest taking a look at his excellent toolbox for multiparameter eigenvalue problems, MultiParEig.



UPDATE (November 17, 2016): All timings in the table have been updated to reflect speed improvements in the new version of the toolbox (4.3.0). The toolbox now computes elementary functions using multi-core parallelism. We have also included timings for the latest version of MATLAB, R2016b.



UPDATE (June 1, 2016): The initial version of this post stated that the newest version of MATLAB, R2016a, uses the MAPLE engine for variable precision arithmetic (instead of MuPAD, as in previous versions). More detailed checks showed that this is not true. As it turned out, MAPLE 2016 silently replaced the VPA functionality of MATLAB during installation. Thus we unknowingly tested the MAPLE Toolbox for MATLAB instead of the MathWorks Symbolic Math Toolbox. We apologize for the misinformation. The post now provides correct comparison results for Symbolic Math Toolbox/VPA.

Thanks to Nick Higham, Massimiliano Fasi and Samuel Relton for their help in finding this mistake!



From the very beginning, we have focused on improving the performance of matrix computations, linear algebra, solvers and other high-level algorithms (e.g. 3.8.0 release notes).

Over time, as the advanced algorithms became faster, elementary functions started to appear near the top of the list of hot spots more frequently. For example, the main bottleneck of the multiquadric collocation method in extended precision was the element-wise power function (.^).

Thus we decided to polish our library for computing elementary functions. Here we present intermediate results of this work, along with our traditional comparison against the latest MATLAB R2016b (Symbolic Math Toolbox/Variable Precision Arithmetic), MAPLE 2016 and Wolfram Mathematica 11.1.0.0.

Timing of logarithmic and power functions in 3.9.4.10481:

>> mp.Digits(34);
>> A = mp(rand(2000)-0.5);
>> B = mp(rand(2000)-0.5);
 
>> tic; C = A.^B; toc;
Elapsed time is 67.199782 seconds.
 
>> tic; C = log(A); toc;
Elapsed time is 22.570701 seconds.

Speed of the same functions after optimization, in 4.3.0.12057:

>> mp.Digits(34);
>> A = mp(rand(2000)-0.5);
>> B = mp(rand(2000)-0.5);
 
>> tic; C = A.^B; toc;             % 130 times faster
Elapsed time is 0.514553 seconds.
 
>> tic; C = log(A); toc;           % 95 times faster
Elapsed time is 0.238416 seconds.

The toolbox now computes 4 million logarithms in quadruple precision (including those of negative arguments) in less than a second!
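The "times faster" figures in the comments are simply the ratios of the elapsed times from the two sessions. A minimal sketch of that arithmetic in plain MATLAB (no toolbox required); the variable names are chosen here for illustration:

```matlab
% Timings copied from the two sessions above (seconds).
t_old = [67.199782, 22.570701];   % A.^B and log(A) in 3.9.4.10481
t_new = [0.514553, 0.238416];     % the same calls in 4.3.0.12057

speedup = t_old ./ t_new;         % element-wise ratio: old time / new time
fprintf('A.^B  : %.1f times faster\n', speedup(1));
fprintf('log(A): %.1f times faster\n', speedup(2));

% Throughput for log: a 2000-by-2000 matrix holds 4 million elements.
n_elements = 2000^2;
rate = n_elements / t_new(2);     % logarithms per second
fprintf('log throughput: %.1f million per second\n', rate / 1e6);
```

The ratios come out at roughly 130.6 and 94.7, in line with the rounded 130 and 95 quoted in the session comments.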

Inspired by this result, we applied the same ideas to speed up other elementary functions. The summary table lists timings and a comparison against MATLAB R2016b (VPA), MAPLE 2016 and Wolfram Mathematica 11.1.0.0 on a Core i7 990x / Windows 7 64-bit:

Computation of elementary functions using quadruple precision
(argument is real pseudo-random 2Kx2K matrix)

                              Timing (sec)                              Speed-up (times)
Function      MATLAB (VPA)     Maple  Mathematica  Advanpix    Over VPA  Over Maple  Over Mathematica

Power & exponential:
EXP                 107.34    756.14         4.54      0.12      886.34     6243.90             37.49
LOG                1161.18    593.98         6.61      0.23     5133.40     2625.91             29.21
LOG10              1438.91    639.46        11.13      0.24     5958.23     2647.88             46.09
LOG2               1442.71    643.17        11.08      0.25     5789.35     2580.94             44.48
SQRT                 28.75    427.40         2.60      0.27      105.74     1571.90              9.55

Trigonometric:
SIN                  85.28    736.89         6.07      0.15      570.80     4932.33             40.62
COS                  78.96    513.73         6.10      0.15      516.44     3359.92             39.89
TAN                1261.92    844.05         8.91      0.17     7277.51     4867.64             51.37
ASIN                105.12   1181.83        12.39      0.39      266.40     2995.01             31.39
ACOS                100.49   1330.99        23.10      0.39      257.55     3411.03             59.19
ATAN                131.92   1039.55         5.71      0.14      974.28     7677.64             42.17
SEC                1466.09    778.14         8.00      0.18     8199.59     4352.01             44.76
CSC                1503.75    793.87         8.35      0.18     8490.95     4482.60             47.13
COT                1511.67   1014.76        10.46      0.20     7728.36     5187.95             53.48
ASEC               1610.29   1962.87        18.45      0.28     5815.44     7088.72             66.62
ACSC               1648.31   1720.76        21.96      0.28     5965.65     6227.86             79.47
ACOT                140.37   1179.84        16.61      0.16      867.58     7291.96            102.63
SINH                117.85    781.78         6.88      0.13      910.78     6041.59             53.17
COSH                117.73    795.34         7.00      0.13      924.06     6242.87             54.92
TANH                121.37    976.78         9.20      0.10     1198.14     9642.45             90.78
ASINH                92.55    778.46        13.51      0.14      656.38     5521.02             95.81
ACOSH               103.78   1349.79        20.51      0.31      332.10     4319.31             65.65
ATANH               121.46   2287.94        11.60      0.32      378.49     7129.76             36.14
SECH               1922.54    978.91         9.10      0.17    11602.53     5907.73             54.93
CSCH               1947.11    960.78         8.96      0.17    11652.35     5749.72             53.63
COTH               1958.51   1268.98        10.90      0.12    16266.72    10539.72            90.502
ASECH              2378.24   2921.78        18.75      0.43     5476.04     6727.56             43.18
ACSCH              2087.72   1188.18        17.78      0.16    12831.71     7302.87            109.26
ACOTH              2117.19   2335.23        19.77      0.26     8083.95     8916.49             75.47

Selected special:
gamma              2491.81   7734.53       228.35      0.76     3266.23    13018.78            299.31
erf                 104.11    321.20       125.88      0.16      669.96     2163.26            810.02
bessely(0,x)       7855.70  14923.53       250.38      0.83     9482.98    18014.89            302.25
bessely(1,x)       7302.29  14964.26       267.94      0.83     8786.29    18005.36            322.39
besselj(0,x)       7273.29   9998.60        90.54      0.75     9684.81    13313.72            120.55
besselj(1,x)       5987.67  10153.13        91.89      0.74     8077.25    13696.38            123.96
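As a rough cross-check of how a speed-up column aggregates, the sketch below averages the "Over VPA" values from the table. It assumes a plain arithmetic mean over all 35 rows, which the post does not state explicitly; only the VPA column is used, and the variable names are illustrative.

```matlab
% "Over VPA" speed-up column, copied row by row from the table above.
over_vpa = [ ...
    886.34, 5133.40, 5958.23, 5789.35, 105.74, ...                % EXP .. SQRT
    570.80, 516.44, 7277.51, 266.40, 257.55, 974.28, ...          % SIN .. ATAN
    8199.59, 8490.95, 7728.36, 5815.44, 5965.65, 867.58, ...      % SEC .. ACOT
    910.78, 924.06, 1198.14, 656.38, 332.10, 378.49, ...          % SINH .. ATANH
    11602.53, 11652.35, 16266.72, 5476.04, 12831.71, 8083.95, ... % SECH .. ACOTH
    3266.23, 669.96, 9482.98, 8786.29, 9684.81, 8077.25];         % gamma .. besselj(1,x)

avg = mean(over_vpa);             % assumption: unweighted arithmetic mean
fprintf('%d functions, mean speed-up over VPA: %.0f times\n', ...
        numel(over_vpa), avg);
fprintf('range: %.2f (min) to %.2f (max)\n', min(over_vpa), max(over_vpa));
```

Under that assumption the mean is close to 5000. The spread is wide, from about 106 times for SQRT to over 16000 times for COTH, so the average says little about any single function.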

On average, the Advanpix toolbox is about 5000 times faster than MATLAB/VPA, 6766 times faster than MAPLE and 100 times faster than Wolfram Mathematica. Test scripts are available for download:

Run timing_elementary_advanpix to test the Advanpix toolbox, and timing_elementary_vpa to test VPA. Don’t forget to add the toolbox directory to the MATLAB search path before running the toolbox tests!
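For readers without the scripts at hand, a timing loop in the same spirit can be sketched as follows. This is not the contents of timing_elementary_advanpix: the function list, the matrix size and the variable names are assumptions made for illustration. The loop runs on plain doubles as written; the commented lines show where the mp conversion used earlier in this post would go.

```matlab
% Illustrative timing loop (hypothetical, not the downloadable script).
n = 200;                      % the post uses 2000; smaller here for a quick run
X = rand(n) - 0.5;            % same argument distribution as in the post

% With the Advanpix toolbox on the search path, convert to quadruple first:
%   mp.Digits(34);
%   X = mp(X);

funcs = {@exp, @log, @sqrt, @sin, @cos, @tan};   % assumed subset of the table
t = zeros(size(funcs));       % elapsed time per function, in seconds

for k = 1:numel(funcs)
    tic;
    Y = funcs{k}(X);          % element-wise evaluation over the whole matrix
    t(k) = toc;
    fprintf('%-6s %10.6f sec\n', func2str(funcs{k}), t(k));
end
```

A single tic/toc pair per function mirrors the sessions shown earlier. For the short timings plain doubles produce, repeating each call and taking the minimum would give steadier numbers.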

***
The toolbox’s timings are higher on GNU Linux and Apple Mac OS X. We can do deeper performance optimization on Windows, since we have a full license for the Intel developer tools on that platform.
