Pa3PyX's Band Replication Tool v1.1 =================================== I. Introduction Given that the majority of PC users only understand audio recording and compression methods on the user level, it comes as no surprise that a lot of audio material circulating around the Internet is encoded or captured poorly. One of the most common symptoms of poorly encoded/captured/mastered audio, besides high frequency noise and artifacts, is the absence of high frequencies altogether. Improperly chosen low pass filters, lousy frequency response of the underlying analog equipment, and low encoding bit rate are among the most frequent causes. In the audio material processed in such way, practically all high frequency information beyond a certain sharp cutoff (e.g. 12 kHz) has been completely obliterated, and the audio sounds characteristically "muffled" (or "dull"). The conventional methods of restoring high frequency content, such as equalizers and harmonic exciters, prove to be largely ineffective with such material -- equalizers do exactly squat because they only amplify certain frequency bands, and in such material there is nothing to amplify, while harmonic exciters fail because they introduce harmonics of certain low frequencies, but are not typically aware of which frequencies are already present, and which have been trimmed by low pass filters and need restored. Moreover, harmonic exciters do not typically allow amplification of generated harmonics beyond a certain point -- the amplitude is fixed and is determined based on the frequencies the harmonics are generated from. Therefore, applying a harmonic exciter to such material results in a spectrum that is still not even remotely resembles the original (before the application of low pass filters). Not only do they not completely restore the amplitudes of the deleted high frequencies, but they also needlessly amplify the content below the filter cut off. See Figure 1 for more info. A ^ | |----\ | --------\ | ------------\ Figure 1.1: Fourier spectrum | ----------\ of the original signal | | (before application of the | | low pass filter) | | | | | | +----------------------------------------->F ^ Nyquist A ^ | |----\ | --------\ | ------------\ Figure 1.2: Fourier spectrum | ---\ of the low pass filtered | | signal (high frequency | | portion is trimmed) | | | | | | +----------------------------------------->F ^ Filter cut off frequency A ^ | |----\ /---\ | --------\ /--- ---\ | ---- | Figure 1.3: Fourier spectrum | | of the low pass filtered | | signal processed with a | | harmonic exciter | \--------\ | | | | +----------------------------------------->F ^ ^ Harmonic generation cut off frequency Filter cut off frequency Thus, different techniques are required to treat such audio material, but personally I found no suitable tools even among some commercial VST plugins, let alone free software. Thus, given a fairly simple idea borrowed from modern audio compression techniques, this tool was developed. II. Principles of operation Somehow, the high frequency content must be restored based on the remaining lower frequency spectral lines. High frequencies have the nice property that they are, by and large, harmonics (overtones) of lower frequency signals. When one gets into 10-16 kHz range, these harmonics are so abundant that the individual tones are no longer perceived separately by the ear, but a continuous "hiss" accompanying the lower base frequencies is heard instead (consider a cymbal crash, for instance). Thus, since we don't have to worry about tonality, the most straightforward approach for regenerating these frequencies is to take the frequency band just below the filter cut off and copy it beyond the cut off. Limited Fourier window size (piecewise transform) will ensure temporal synchronicity and appropriate intensity of the frequency band replicated in this way. This is the main technique used by this tool. A similar but more sophisticated technique called spectral band replication (SBR) is used in low bitrate lossy audio encoding schemes, such as MP3PRO and AAC+; this technique allows encoders to drop frequencies above certain cut off (and save bits) at encoding time and restore them from lower frequencies at decoding time. The difference is that SBR enabled audio encoders store information in the bitstream that allows the decoders to determine how the high frequencies can be optimally regenerated, but of course no such information is present in the plain low pass filtered audio. So unlike the audio decoders, we have to replicate the bands blindly, which leads us to some issues. The first one is interference. If we just copy the complex FT values from one position to another, this is likely to result in the regenerated frequencies interfering (in a periodic variable pattern) with the ones we replicated from, resulting in nasty "phasing" artifacts. Thus, the phase of the replicated lines has to be adjusted. Next, to play with spectral lines, obviously we have to take the Fourier transform of the signal. We then toy with the signal in the frequency domain, and then take the inverse Fourier transform to obtain processed signal. If the operation we wanted to perform was a time invariant filter (e.g. an equalizer), this could of course have been done more efficiently using FIR techniques (determine the desired frequency response, take its inverse Fourier transform and obtain the convolution function, which can then be applied in the time domain). But the operation we are interested in is not (technically) a filter, since it cannot be expressed using multiplication in frequency domain (nor the equivalent convolution in time domain), not to mention that it's not time invariant. So we are stuck with FT/process/iFT cycle. The problem is that Fourier and other sinusoidal transforms obliterate all temporal information. If we take the FT of the entire piece of audio, process it in the way above, and then take the inverse FT, we will find that our regenerated high frequiencies, instead of being where we need them in time, are "smudged" all over the place. Thus, we have to take the FT piecewise, where the window has to be large enough to sufficiently define the frequency band we wish to replicate, but small enough not to introduce time artifacts in the form of "smudginess" (echo/pre-echo). The next problem that we run into is that after processing, we introduce discontinuities at window edges. So the transform windows have to overlap and for each sample, a weighted average has to be computed from overlapping windows, with more weight assigned to samples closer to the center of the window, and less weight assigned to samples closer to the edges. In this implementation, windows overlap by half and the raised cosine (Hanning) weighting function is used by default. But again we run into an interference problem -- the amplitude of the regenerated frequency band tends to fluctuate with the period equal to window size, this time due to overlapping windows. Now if the signal across all successive windows was constant, then we could have simply adjusted the phase -- when window is shifted by half, the phase of the lowest (non-zero) frequency changes by Pi, and the phases of all other frequency lines change by n*Pi, where n is the number of the frequency line. The trouble, of course, is that the signal is not constant, and phase of the lines that we try to replicate does not change by exactly n*Pi. So we are faced with a choice: to compute the phase of the regenerated spectral lines based on the phase of the regenerated lines in the prior block (by adding n*Pi) and thus avoiding interference between blocks, or to compute phases of the replicated lines based on the phases of the lines they are replicated from (by multiplying them by the ratio of the regenerated frequency and the original frequency -- we don't know what the initial phase of the original line is supposed to be, so we assume it 0), and thus avoiding interference within blocks. This implementation chooses the latter approach; subjectively it sounds better. Trying to avoid both kinds of interference at present seems like trying to balance a four legged chair with one leg shorter than the others -- no matter what you do, it won't stand firm. I'd appreciate some advice on how to deal with this problem properly from someone who is more knowledgeable on the subject -- right now it's one of the unresolved issues in this implementation. So right now, to avoid interference, one has to use small window sizes -- the default is chosen to be 5 ms, which is comparable to the temporal resolution of the human ear, but still provides decent frequency resolution for the task. The other issue is the intensity of the replicated band. In the frequency profile of the typical sound (and this will be slightly different for speech, music, background noise, etc), generally, the higher the frequency, the lower the amplitude. Thus replicating a lower frequency band into higher frequencies will create a sharp spike in the average spectrum. This is to be avoided, since it causes ringing and sounds unnaturally; the transition from the original to the replicated band in the spectral profile should look smooth. What seems a straightforward solution is to fit a log/log scale line through the array of magnitudes of the spectral lines in the original band. Then compute the values of this line at the start and end of the original frequency band; the end of the original band, in this implementation, becomes the start of the regenerated band. Thus, the difference between these values is the factor (on the log scale) that the regenerated band magnitudes are to be scaled by. But the trouble is, the least squared difference line that we fit in such a way may not fit significantly (there is too much variance within the original spectral band), resulting in wildly fluctuating amplitudes in the regenerated band, which is preceived as "wheezing" or "snoring". To avoid this, we average the log scaling factor over several successive windows, and apply linear interploation to the average values when we need to determine the attenuation factor at each particular block. The number of windows (i.e. the amount of time) to average over again has to be chosen experimantally; too low results in "snoring" due to wildly fluctuating magnitudes, too high -- in "ringing" due to sharp peaks in the frequency spectrum. The default is chosen to be 100 ms. Finally, a slight overlap of the original and replicated bands (of 2 spectral lines, where the values are averaged) is allowed to further reduce ringing. Additionally, the line may be fitted through a wider frequency band than is replicated (more data points), resulting in it being more stable as well. III. Command line parameters The following is the command line syntax of the program. Required parameters are denoted by <>, optional -- by []. bandr.exe [sfmt] [start] [lst] [end] [wsize] [wfac] [wfunc] inf: The input file to be processed. This must be 8, 16, 24, or 32 bit PCM, or 32 or 64 bit floating point. Any number of channels is allowed; each channel will be processed separately and result in the corresponding channel in the output file. Processing an audio file is done in two passes. The first pass calculates the attenuation factors of the replicated band, and the second pass applies band replication with appropriate scaling. outf: The output file where the processed signal will be written to. This will be overwritten if exists. sfmt: Output file sample format. One of the following: uint8: 8 bit PCM int16: 16 bit PCM int24: 24 bit PCM int32: 32 bit PCM fp32: 32 bit floating point fp64: 64 bit floating point auto: Same as input file (default) Note that although 64 bit floating point (double precision) output is possible, it is wasteful since internally, the program operates with 32 bit floating point numbers (single precision only). start: Start frequency of the original band band to replicate, in Hz. Default is 8000. This will be rounded to the nearest frequency line depending on the FT window size. With this parameter, as well as [lst], [end], [wsize], and [wfac], specify 0 to use the default value. Both [start] and the subsequent [lst] must be less than [end] by at least 2 spectral lines, and both must correspond to at least spectral line 1 (the first non-zero frequency bin in DFT). The values are automatically adjusted if they happen to lie out of range. lst: Start frequency of the reference band to fit a line through in order to determine the intensity of the regenerated band that provides a smooth spectral transition between the original and replicated bands. Default is the same as [start], 8000 Hz, so that the original and reference bands are one and the same. Setting this value lower (so that the reference band is wider than the original band to replicate) will help reduce unpleasant fluctuations in the amplitudes of the replicated lines, but if the original spectrum in the range [lst-end] is not linear, this will result in a poor fit, and consequently a sharp peak at the junction of the original and replicated bands. end: The end of the original and reference bands and start of the replicated band. Beyond this, existing frequency lines (up to the Nyquist frequency) will be truncated. The original band, starting at [start] and ending at (nfreq - ([end] - [start]) will then be copied (with a few tweaks) and placed starting at this frequency less 2 spectral lines (nfreq is the Nyquist frequency of the audio file, which is half the sampling rate). The end frequency line itself and two preceding lines define the overlap zone; the original spectral values and the replicated ones are blended linearly here to reduce the "ringing" artifacts resulting from sharp low pass filtering of the original signal. Thus, given the start of the original band as [start], and the end of it as [end], the end of the replicated band is approximately [end] + ([end] - [start]) -- given that the original signal is low pass filtered at [end], which is how [end] value should be determined for a given audio file. The default for [end] is 12000 Hz, so the default end of the replicated band is 12000 + (12000 - 8000) = 16000 Hz, minus a few spectral lines in blending. The maximum value for [end] is the Nyquist frequency. wsize: Size of the Fourier transform window, in samples. Larger windows provie better frequency resolution (allowing the user to specify the frequencies above more exactly), but also introduce time artifacts in the form of echo/pre-echo and interference patterns. The default is calculated for the target length of 5 ms based on the input file's sampling rate and rounded to the nearest power of 2 for performance reasons. We should also note that larger window setups take longer to process (FT is not exactly linear time, not even FFT), and memory requirements for these get nasty fast. Use -1 to set FT window size equal to the number of samples in the entire file, but use this option with caution. Minimum window size is 6, but because there are only 4 (6 / 2 + 1) spectral lines in this case (0 .. 3), wfac: The number of windows over which to average the replicated band equalization profile. Larger values will tend to smooth out sharp variations in the high frequency content where they result from too much variation *within* FT windows and thus reduce "snoring", but will also slow the response to genuine changes of high frequency levels *between* FT windows, resulting in long rise/fall times and thus may add too much "breathing". Smaller values tend to reduce "breathing". The default is calculated for the target length of 100 ms based on the window size and the input file's sampling rate. When setting your own value, remember that FT windows overlap by half, thus the temporal length of the averaging period will be approximately twice as short as window size in milliseconds times [wfac]. Use -1 to average over the entire file, that is, to compute and use the constant equalization profile. wfunc: The function to modulate overlapping inverse FT samples by before averaging them. This is not a window function in the strict sense (even though Hanning function is used by default here); window function is normally applied *before* taking Fourier transform to reduce spectral leakage and improve the accuracy of the spectral analyzer. But since we need to process the FT and take the inverse transform, applying the window function before the transform would have butchered the output signal. Hence, rectangular window is used in all cases. [wfunc] is applied *after* the inverse FT is taken, as a weighting factor which assigns more weight to the center of the sample than to its edges in order to get rid of discontinuities introduced by processing. Each sample in the window is multiplied by the weighting factor given by [wfunc], added with the sample from the other window occupying the same position in the stream, also multiplied by its own weighting factor, and the sum is divided by the sum of the weighting factors (which, rest assured, is never 0). The following functions can be used; the support of each is [0..1]. pow4: { 16 * x^4 if x <= 0.5 { 16 * (x - 1)^4 otherwise pow4i: -16 * (x - 0.5)^4 + 1 lin: { x if x <= 0.5 { 1 - x otherwise rcos: 0.5 + 0.5 * cos(2 * Pi * (x - 0.5)) (default) const: 1 Using functions that assign more weight to the center of the window (pow4) tends to reduce discontinuities, while using those that assign a little more weight to the edges (pow4i, lin) will result in less annoying interference patterns with large window sizes. rcos is a good compromise between the two; use const (assigning equal weight to every sample in the window) if you would like to get rid of alternating interference patterns entirely, but either don't mind discontinuities (audible as clicks at regular intervals) or are using -1 as [wsize]. Examples: bandr.exe in.wav out.wav fp32 0 0 12500 Processes the file in.wav using the original band cut off of 12500 Hz, with default original and reference band starts, default window size, modulation function, and averaging factor, writing the output to out.wav as 32 bit floating point. bandr.exe in.wav out.wav auto 0 0 0 1024 40 lin Processes the file in.wav using the default original and reference bands, FT window size of 1024 samples, averaging factor of 40 windows, and triangle modulation function, writing the output to out.wav using the same sample format as the input. IV. Determining operation parameters To process your particular audio material properly, you will need to determine the cut off frequency of the low pass filter that was applied to your audio data. This tool will not do it for you; use some spectrum analyzer tool; one is available in most equalizer and audio editing software. The cutoff is visible as an abrupt drop in the average spectral profile. This is your cutoff frequency ([end] parameter); if your audio sounds muffled, this will likely be below 16 kHz. If, however, you do not see an abrupt cutoff below 16 kHz but your audio still sounds muffled, do not use this tool -- use an equalizer instead; the high frequency content is there, it just needs amplified and possibly noise reduced (using some spectral subtraction tool) afterwards (yes, in my experience it's better to apply noise reduction *after* equalization -- this results in less "musical" noise introduced by spectral subtraction). If you have determined the cutoff, next, you need to determine how much of the high frequency content you want to restore; typically you will want your restored spectrum to cut off at no lower than 14 kHz (16 kHz is good, if you can achieve that; frequencies above that are largely inaudible -- I can't hear anything past 19 kHz, no matter how loud). Next, subtract the actual cut off from the desired cut off; suppose your actual cut off was 13 kHz, then 16000 - 13000 = 3000. This is the width of the original band you need to replicate. We already know that the original band ends at [end]; so to determine [start], we simply subtract the width value we obtained above from [end]. In the example above, [start] of 13000 - 3000 = 10000 Hz will achieve the goal. Note, however, that it is not desirable to use [start] frequency below 8 kHz as this begins to introduce tonal distortions; instead, try running the tool multiple times to replicate the bands multiple times. But your mileage may vary depending on the kind of audio. V. Files in this distribution BANDR.EXE The main program executable FFTW3.DLL The FFTW library README.TXT This manual GPL.TXT GNU General Public License SOURCE\BREP.PRJ LCC-Win32 project file SOURCE\BANDR.C Main program code SOURCE\WAVE.C WAVE read/write support routines SOURCE\WAVE.H WAVE routines header SOURCE\FFTW3.H FFTW library header SOURCE\LCC\FFTW3.LIB Imports of FFTW3.DLL built with LCC-Win32 VI. Version history Version 1.1 ----------- 1) The original existing content beyond [end] is now replicated; that is, the width of the original band to replicate was changed from [end] - [start] to (nfreq - [end]). If the value for [end] is chosen correctly, the end of the replicated spectrum should be about the same, since the original file will not contain any significant data beyond [end]. However, simply cutting off the frequency response at [end] + ([end] - [start]), as was done in the previous version, used to introduce ringing. This version preserves the original cutoff shape instead. 2) Changed blending filter from 2 to 3 spectral lines to further reduce ringing. 3) A few cosmetic adjustments in screen output. Version 1.0 ----------- Initial release. VII. Credits This tool: Pa3PyX (pa3pyx@come.to, http://pa3pyx.come.to/) FFTW library: Matteo Frigo & Steven G. Johnson (source code and binaries for various platforms are available at http://www.fftw.org/) FFTW is distributed under GPL; hence, so is this tool (GPL Version 2 or any later version, to be specific). See the file GPL.TXT to view complete terms. Copyright (C) 2006 Pa3PyX This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA