Apparatus and method for calculating a fingerprint of an audio signal, apparatus and method for synchronizing and apparatus and method for characterizing a test audio signal
8634946 Apparatus and method for calculating a fingerprint of an audio signal, apparatus and method for synchronizing and apparatus and method for characterizing a test audio signal
Inventor: Scharrer, et al.
Date Issued: January 21, 2014
Application:
Filed:
Inventors:
Assignee:
Primary Examiner: Elbin; Jesse
Assistant Examiner:
Attorney Or Agent: Glenn; Michael A.; Perkins Coie LLP
U.S. Class: 700/94; 341/101; 341/142; 380/253
Field Of Search: 700/94; 341/101; 341/120; 341/142; 341/151; 341/899; 380/253; 386/331
International Class: G06F 17/00; H03M 1/00; H03M 9/00; H04K 1/02
U.S Patent Documents:
Foreign Patent Documents: 102004046746; 1760693; 2431837; 200765659; 2007171933; WO-2006018747; WO-2006034825; WO 2006/102991
Other References: Herre, J. et al.; "Spatial Audio Coding: Next-generation efficient and compatible coding of multi-channel audio"; Oct. 2004; Preprint 6186, presented at the 117th Convention, AES, San Francisco, CA, 13 pages. cited by applicant.
Doets, P.J.O. et al.; "On the comparison of audio fingerprints for extracting quality parameters of compressed audio"; Feb. 2006; Proceedings of SPIE, vol. 6072, pp. 228-239, San Jose, CA. cited by applicant.
English Translation of the International Preliminary Report on Patentability dated Oct. 29, 2010 in parallel application PCT/EP2009/000917, 7 pages. cited by applicant.
International Search Report mailed Jul. 31, 2009 in parallel application PCT/EP2009/000917, 3 pages. cited by applicant.

Abstract: For calculating a fingerprint of an audio signal, the audio signal is divided into subsequent blocks of samples. For the subsequent blocks, one fingerprint value each is calculated, wherein fingerprint samples of subsequent blocks are compared. Based on whether the fingerprint value of a block is higher than the fingerprint value of a subsequent block or not, a binary value is assigned, wherein information about a sequence of binary values is output as fingerprint for the audio signal.
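
The abstract can be illustrated with a minimal sketch. It assumes adjacent, non-overlapping blocks, and it uses the sum of squared samples as the per-block fingerprint value (one of the options named in the claims). The function names, the block size and the sample data are hypothetical and chosen only for illustration.

```python
def block_energies(samples, block_size):
    """Sum of squared samples for each complete, adjacent block (assumed feature)."""
    n_blocks = len(samples) // block_size
    return [
        sum(x * x for x in samples[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def fingerprint_bits(values):
    """One bit per pair of successive blocks: 1 if the earlier value is higher, else 0."""
    return [1 if a > b else 0 for a, b in zip(values, values[1:])]


# Illustrative data: four blocks of two samples each.
samples = [0.0, 1.0, 0.5, 0.5, 2.0, 0.0, 0.1, 0.1]
energies = block_energies(samples, block_size=2)
bits = fingerprint_bits(energies)
print(bits)
```

In this sketch, N blocks yield N - 1 bits, because each bit describes the relation between a block and its successor.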
Claim: The invention claimed is:

1. An apparatus for synchronizing multichannel extension data with an audio signal, wherein the multichannel extension data are associated with reference audio signal fingerprint information, comprising: a fingerprint calculator for calculating a fingerprint of the audio signal comprising: a divider for dividing the audio signal into subsequent blocks of samples; a calculator for calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; a comparator for comparing the first fingerprint value with the second fingerprint value; an assigner for assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and an outputter for outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal; a fingerprint extractor for extracting a sequence of reference audio signal fingerprints from the reference audio signal fingerprint information associated with the multichannel extension data; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, a fingerprint correlator for correlating the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints, wherein the fingerprint correlator is implemented to combine a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, to further combine a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and to select that offset value as the correlation result for which the largest correlation value has resulted; and a compensator for reducing or eliminating a time offset between the multichannel extension data and the audio signal based on the correlation result.

2. The apparatus according to claim 1, wherein the assigner is implemented to take a binary value that is complementary to the first binary value as a second different value.

3. The apparatus according to claim 2, wherein the first binary value and the second binary value are exactly one bit.

4. The apparatus according to claim 3, wherein the assigner is implemented to assign a first bit value as first binary value and a second bit value complementary to the first value as second different value.

5. The apparatus according to claim 1, wherein the outputter is implemented to output a sequence of bits as the sequence of test audio signal fingerprints.

6. The apparatus according to claim 1, wherein the comparator is implemented to calculate a difference between the first fingerprint value and the second fingerprint value; and wherein the assigner is implemented to assign the first binary value when the difference is more than 0 and to assign the second binary value when the difference is less than 0.

7. The apparatus according to claim 1, wherein the divider is implemented to provide adjacent or overlapping blocks as subsequent blocks.

8. The apparatus according to claim 1, wherein the calculator is implemented to calculate an energy or power-dependent amount of the block as first or second fingerprint value.

9. The apparatus according to claim 1, wherein the calculator is implemented to square and sum up time samples per block in order to acquire the first or second fingerprint value for the block.

10. The apparatus according to claim 1, wherein the calculator is implemented to calculate a crest factor of a power spectrum of the block as first or second fingerprint value.

11. An apparatus for characterizing a test audio signal, comprising: a calculator for calculating a test fingerprint of the test audio signal comprising: a divider for dividing the audio signal into subsequent blocks of samples; a calculator for calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; a comparator for comparing the first fingerprint value with the second fingerprint value; an assigner for assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and an outputter for outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal; a correlator for correlating the information about the sequence of binary values with different reference fingerprints in a reference database, wherein the reference database comprises information about an audio signal for every reference fingerprint, which is associated to the reference fingerprint; and wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, wherein the correlator is implemented to combine a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, to further combine a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and to select that offset value as the correlation result for which the largest correlation value has resulted; and a provider for providing information about the test audio signal based on the correlation result.

12. A method for synchronizing multichannel extension data with an audio signal, wherein the multichannel extension data are associated with the reference audio signal fingerprint information, comprising: calculating a fingerprint of an audio signal, comprising: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal; extracting a sequence of reference audio signal fingerprints from the reference audio signal fingerprint information associated with the multichannel extension data; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints, the correlating comprising combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and reducing or eliminating a time offset between the multichannel extension data and the audio signal based on the correlation result.

13. A method for characterizing a test audio signal, comprising: calculating a test fingerprint of an audio signal, comprising: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal, wherein a sequence of binary values is acquired as test fingerprint; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the information about a sequence of binary values with different reference fingerprints in a reference database, wherein the reference database comprises, for every reference fingerprint, information about an audio signal associated with the reference fingerprint, the correlating comprising: combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and providing information about the test audio signal based on the correlation result.

14. A computer program comprising a program code for performing the method for synchronizing multichannel extension data with an audio signal, wherein the multichannel extension data are associated with the reference audio signal fingerprint information, the method comprising: calculating a fingerprint of an audio signal, comprising: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal; extracting a sequence of reference audio signal fingerprints from the reference audio signal fingerprint information associated with the multichannel extension data; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints, the correlating comprising combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and reducing or eliminating a time offset between the multichannel extension data and the audio signal based on the correlation result, when the program runs on a computer.

15. A non-transitory computer readable medium with computer program encoded thereon, the computer program comprising a program code for performing the method for characterizing a test audio signal, the method comprising: calculating a test fingerprint of an audio signal, comprising: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as a sequence of test audio signal fingerprints for the audio signal, wherein a sequence of binary values is acquired as test fingerprint; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the information about a sequence of binary values with different reference fingerprints in a reference database, wherein the reference database comprises, for every reference fingerprint, information about an audio signal associated with the reference fingerprint, the correlating comprising: combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up acquired bit results in order to acquire a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and providing information about the test audio signal based on the correlation result, when the program runs on a computer.
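
The correlation step recited in the claims can be sketched as follows. A summed bit-by-bit XOR counts the positions where two bit sequences differ, so this sketch makes an interpretive assumption: it scores each offset by the number of agreeing bits (window length minus the XOR sum), so that the largest score marks the best alignment. The function names and the example bit sequences are hypothetical.

```python
def xor_sum(a, b):
    """Bit-by-bit XOR of two equal-length bit lists, summed (count of differing bits)."""
    return sum(x ^ y for x, y in zip(a, b))


def best_offset(test_bits, ref_bits):
    """Slide the test sequence over the reference and return (offset, score).

    Assumption: score = window length minus XOR sum, i.e. the number of
    agreeing bits, so that the largest score indicates the best match.
    """
    n = len(test_bits)
    best = (0, -1)
    for offset in range(len(ref_bits) - n + 1):
        window = ref_bits[offset:offset + n]
        score = n - xor_sum(test_bits, window)
        if score > best[1]:
            best = (offset, score)
    return best


# Illustrative data: the test sequence is a cut-out of the reference at offset 3.
ref_bits = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
test_bits = ref_bits[3:9]
offset, score = best_offset(test_bits, ref_bits)
print(offset, score)
```

Because every bit corresponds to one block of samples, the offset found this way is expressed in blocks and has to be multiplied by the block length to obtain an offset in samples.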
Description: CROSS-REFERENCE TO RELATED APPLICATION

This application is a U.S. National Phase entry of PCT/EP2009/000917 filed Feb. 10, 2009, and claims priority to German Patent Application No. 102008009025.5 filed Feb. 14, 2008, each of which is incorporated herein by reference thereto.

BACKGROUND OF THE INVENTION

The present invention relates to fingerprint technology for audio signals and in particular to calculating a fingerprint, using a fingerprint for synchronizing multichannel extension data with an audio signal and characterizing an audio signal with the fingerprint.

Currently developed technologies allow an ever more efficient transmission of audio signals by data reduction, but also an increase of audio enjoyment by extensions, such as by the usage of multichannel technology.

Examples of such an extension of common transmission techniques have become known under the names "Binaural Cue Coding" (BCC) and "Spatial Audio Coding". In this regard, reference is made, by way of example, to J. Herre, C. Faller, S. Disch, C. Ertel, J. Hilpert, A. Hoelzer, K. Linzmeier, C. Spenger, P. Kroon: "Spatial Audio Coding: Next-Generation Efficient and Compatible Coding of Multi-Channel Audio", 117th AES Convention, San Francisco 2004, Preprint 6186.

In a sequentially operating transmission system, such as radio or Internet, such methods separate the audio program to be transmitted into audio base data or an audio signal, which can be a mono or also a stereo downmix audio signal, and into extension data that can also be referred to as multichannel additional information or multichannel extension data. The multichannel extension data can be broadcast together with the audio signal, i.e. in a combined manner, or the multichannel extension data can also be broadcast separately from the audio signal. As an alternative to broadcasting a radio program, the multichannel extension data can also be transmitted separately, for example to a version of the downmix channel already existing on the user side. In this case, transmission of the audio signal, for example in the form of an Internet download or a purchase of a compact disc or DVD, takes place spatially and temporally separate from the transmission of the multichannel extension data, which can be provided, for example, from a multichannel extension data server.

Basically, the separation of a multichannel audio signal into an audio signal and multichannel extension data has the following advantages. A "classic" receiver is able to receive and replay audio base data, i.e. the audio signal, at any time, independent of content and version of the multichannel additional data. This characteristic is referred to as reverse compatibility. In addition to that, a receiver of the newer generation can evaluate the transmitted multichannel additional data and combine the same with the audio base data, i.e. the audio signal, in such a manner that the complete extension, i.e. the multichannel sound, can be provided to the user.

In an exemplary application scenario in digital radio, with the help of these multichannel extension data, the previously broadcast stereo audio signal can be extended to the multichannel format 5.1 with little additional transmission effort. The multichannel format 5.1 comprises five replay channels, i.e. a left channel L, a right channel R, a central channel C, a left rear channel LS (left surround) and a right rear channel RS (right surround). For this, the program provider generates the multichannel additional information on the transmitter side from multichannel sound sources, such as they are found, for example, on a DVD/audio/video. Subsequently, this multichannel additional information can be transmitted in parallel to the audio stereo signal broadcast as before, which now includes a stereo downmix of the multichannel signal.

One advantage of this method is the compatibility with the existing digital radio transmission system. A classical receiver that cannot evaluate this additional information will be able to receive and replay the two-channel sound signal as before without any limitations regarding quality.

A receiver of novel design, however, can evaluate and decode the multichannel information and reconstruct the original 5.1 multichannel signal from the same, in addition to the stereo sound signal received so far.

For allowing simultaneous transmission of the multichannel additional information as a supplement to the stereo sound signal used so far, two solutions are possible for compatible broadcast via a digital radio system.

The first solution is to combine the multichannel additional information with the coded downmix audio signal such that it can be added to the data stream generated by an audio encoder as a suitable and compatible extension. In this case, the receiver only sees one (valid) audio data stream and can again, synchronously to the associated audio data block, extract and decode the multichannel additional information by means of a correspondingly preceding data distributor and output the same as a 5.1 multichannel sound.

This solution necessitates the extension of the existing infrastructure/data paths, such that they can now transport the data signals consisting of downmix signals and extension instead of merely the stereo audio signals as before. This is, for example, possible without additional effort, or unproblematic, when it is a data-reduced representation, i.e. a bit stream transmitting the downmix signals. A field for the extension information can then be inserted into this bit stream.

A second possible solution is not to couple the multichannel additional information to the audio coding system used. In this case, the multichannel extension data are not coupled into the actual audio data stream. Instead, transmission is performed via a specific but not necessarily temporally synchronized additional channel, which can, for example, be a parallel digital additional channel. Such a situation occurs, for example, when the downmix data, i.e. the audio signal, are routed through a common audio distribution infrastructure existing in studios in unreduced form, e.g. as PCM data according to the AES/EBU data format. These infrastructures are aimed at distributing audio signals digitally between various sources ("crossbars") and/or processing them, for example by means of sound regulation, dynamic compression, etc.

In the second possible solution described above, the problem of time offset of the downmix audio signal and multichannel additional information in the receiver can occur, since both signals pass through different, non-synchronized data paths. A time offset between downmix signal and additional information, however, causes deterioration of the sound quality of the reconstructed multichannel signal, since then an audio signal with multichannel extension data, which actually do not belong to the current audio signal but to an earlier or later portion or block of the audio signal, is processed on the replay side.

Since the order of magnitude of the time offset can no longer be determined from the received audio signal and the additional information, a time-correct reconstruction and association of the multichannel signal in the receiver is not ensured, which will result in quality losses.

A further example of this situation arises when an already running 2-channel transmission system is to be extended to multichannel transmission, for example when considering a receiver for digital radio. Here, decoding of the downmix signal frequently takes place by means of an audio decoder already existing in the receiver, for example a stereo audio decoder according to the MPEG-4 standard. The delay time of this audio decoder is not known or cannot be predicted exactly, due to the data compression of audio signals inherent in the system. Hence, the delay time of such an audio decoder cannot be compensated reliably.

In the extreme case, the audio signal can also reach the multichannel audio decoder via a transmission chain including analog parts. Here, digital/analog conversion takes place at a certain point in the transmission, which is followed again by analog/digital conversion after a further storage/transmission. Here also, no indications are available as to how a suitable delay compensation of the downmix signal in relation to the multichannel additional data can be performed. When the sampling frequencies for the analog/digital conversion and the digital/analog conversion differ slightly, even a slow time drift of the necessary compensation delay results, according to the ratio of the two sampling rates to each other.
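
The drift caused by mismatched converter clocks can be made concrete with a short calculation. The sketch assumes that the re-digitized samples are interpreted at the nominal digital/analog rate; the two sampling rates and the elapsed time are illustrative values and are not taken from the patent.

```python
def drift_seconds(da_rate_hz, ad_rate_hz, elapsed_seconds):
    """Accumulated time offset after elapsed_seconds of real time.

    Assumption: audio is played out by a D/A converter at da_rate_hz,
    captured by an A/D converter at ad_rate_hz, and the captured samples
    are then treated as if they had the nominal rate da_rate_hz.
    """
    return elapsed_seconds * (ad_rate_hz - da_rate_hz) / da_rate_hz


# Illustrative values: a 1 Hz clock mismatch at 48 kHz over one hour.
drift = drift_seconds(48000, 48001, 3600)
print(f"{drift * 1000:.1f} ms")
```

Under these assumed numbers the offset grows by 75 ms per hour, so a compensation delay determined once at the start of the transmission would no longer fit later on.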

German patent DE 10 2004 046 746 B4 discloses a method and an apparatus for synchronizing additional data and base data. A user provides a fingerprint based on his stereo data. An extension data server identifies the stereo signal based on the obtained fingerprint and accesses a database for retrieving the extension data for this stereo signal. In particular, the server identifies an ideal stereo signal corresponding to the stereo signal existing at the user and generates two test fingerprints of the ideal audio signal belonging to the extension data. These two test fingerprints are then provided to the client who determines a compression/expansion factor and a reference offset therefrom, wherein, based on the reference offset, the additional channels are expanded/compressed and cut off at the beginning and the end. Thereupon, a multichannel file can be generated by using the base data and the extension data.

Generally speaking, fingerprints have to be characteristic for an audio signal. On the other hand, they should also be as highly compressed a representation of the audio signal as possible. This means that the fingerprint should use significantly less memory space than the audio signal itself, since otherwise generating and using a fingerprint would be pointless.

On the other hand, a fingerprint should reproduce the time curve of an audio signal in order to be suitable, on the one hand, for synchronization purposes and, on the other hand, also for identification purposes. In particular with regard to identification or characterization purposes, there is frequently the situation that an audio signal, such as a radio transmission, does not fully replay an audio piece, but starts transmitting at a certain time in the piece and possibly even stops transmitting before the piece has ended. However, the fingerprint does not need to be decompressible since fingerprint generation can be considered as a particularly lossy compression.

Since fingerprint information is additional information, it should, as mentioned above, be a representation that is as compressed as possible but nevertheless characteristic. A further advantage of a compressed representation is that the more compressed it is, the faster and easier any correlations can be performed, i.e. calculations in which a fingerprint is involved, e.g. for synchronizing or characterizing an audio signal.
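
To give a feeling for the degree of compression, the following sketch compares the data rate of a fingerprint that spends one bit per block with the data rate of uncompressed PCM audio. The sampling rate, the word length, the block length and the use of adjacent (non-overlapping) blocks are assumptions made for illustration only.

```python
def fingerprint_bitrate(sample_rate_hz, block_size):
    """Bits per second when one fingerprint bit is produced per adjacent block."""
    return sample_rate_hz / block_size


# Assumed values: 48 kHz mono PCM with 16 bits per sample, 1024 samples per block.
pcm_bitrate = 48000 * 16
fp_bitrate = fingerprint_bitrate(48000, 1024)
ratio = pcm_bitrate / fp_bitrate
print(fp_bitrate, ratio)
```

With these assumed parameters, the fingerprint needs fewer than 50 bits per second of audio, which also keeps the bit sequences that have to be correlated short.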

SUMMARY

According to an embodiment, an apparatus for synchronizing multichannel extension data with an audio signal, wherein the multichannel extension data are associated with reference audio signal fingerprint information, may have: a fingerprint calculator for calculating a fingerprint of the audio signal having: a means for dividing the audio signal into subsequent blocks of samples; a means for calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; a means for comparing the first fingerprint value with the second fingerprint value; a means for assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and a means for outputting information about a sequence of binary values as fingerprint for the audio signal; a fingerprint extractor for extracting a sequence of reference audio signal fingerprints from the reference audio signal fingerprint information associated with the multichannel extension data; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, a fingerprint correlator for correlating the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints, wherein the fingerprint correlator is implemented to combine a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up obtained bit results in order to obtain a first correlation value, to further combine a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up obtained bit results in order to obtain a second correlation value, and to select that offset value as the correlation result for which the largest correlation value has resulted; and a compensator for reducing or eliminating a time offset between the multichannel extension data and the audio signal based on the correlation result.
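
The compensator named at the end of the paragraph above can be pictured with a small offline sketch. It assumes that the audio signal and the extension data are both available as per-block lists and that the correlation result is an offset in blocks; the sign convention and all names are hypothetical. A streaming implementation would more likely use a delay buffer instead of discarding blocks.

```python
def align(audio_blocks, extension_frames, offset_blocks):
    """Pair each audio block with the extension frame that belongs to it.

    Assumed convention: offset_blocks > 0 means extension frame k belongs
    to audio block k + offset_blocks; offset_blocks < 0 means the reverse.
    """
    if offset_blocks >= 0:
        audio_blocks = audio_blocks[offset_blocks:]
    else:
        extension_frames = extension_frames[-offset_blocks:]
    return list(zip(audio_blocks, extension_frames))


# Illustrative data: the extension data start two blocks into the audio signal.
audio = ["a0", "a1", "a2", "a3"]
extension = ["e2", "e3"]
pairs = align(audio, extension, offset_blocks=2)
print(pairs)
```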

According to another embodiment, an apparatus for characterizing a test audio signal may have: a means for calculating a test fingerprint of the test audio signal having: a means for dividing the audio signal into subsequent blocks of samples; a means for calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; a means for comparing the first fingerprint value with the second fingerprint value; a means for assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and a means for outputting information about a sequence of binary values as fingerprint for the audio signal; a means for correlating the information about the sequence of binary values with different reference fingerprints in a reference database, wherein the reference database includes information about an audio signal for every reference fingerprint, which is associated to the reference fingerprint; and wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, wherein the means for correlating is implemented to combine a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and to sum up obtained bit results in order to obtain a first correlation value, to further combine a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and to sum up obtained bit results in order to obtain a second correlation value, and to select that offset value as the correlation result for which the largest correlation value has resulted, a means for providing information about the test audio signal based on the correlation result.

According to another embodiment, a method for synchronizing multichannel extension data with an audio signal, wherein the multichannel extension data are associated with the reference audio signal fingerprint information, may have the steps of: calculating a fingerprint of an audio signal, having: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as fingerprint for the audio signal; extracting a sequence of reference audio signal fingerprints from the reference audio signal fingerprint information associated with the multichannel extension data; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints, the correlating having: combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and summing up obtained bit results in order to obtain a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and summing up obtained bit results in order to obtain a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and reducing or eliminating a time offset between the multichannel extension data and the audio signal based on the correlation result.

According to another embodiment, a method for characterizing a test audio signal may have the steps of: calculating a test fingerprint of an audio signal, having: dividing the audio signal into subsequent blocks of samples; calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks; comparing the first fingerprint value with the second fingerprint value; assigning a first binary value when the first fingerprint value is higher than the second fingerprint value, or a second different binary value when the first fingerprint value is smaller than the second fingerprint value; and outputting information about a sequence of binary values as fingerprint for the audio signal, wherein a sequence of binary values is obtained as test fingerprint; wherein the sequence of test audio signal fingerprints and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each is associated with one block of audio samples, correlating the information about a sequence of binary values with different reference fingerprints in a reference database, wherein the reference database includes, for every reference fingerprint, information about an audio signal associated with the reference fingerprint, the correlating having: combining a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation, and summing up obtained bit results in order to obtain a first correlation value, combining a bit sequence of the sequence of test audio signal fingerprints or the reference audio signal fingerprints shifted by an offset value with a respectively different sequence by a bit-by-bit XOR operation, and summing up obtained bit results in order to obtain a second correlation value, and selecting that offset value as the correlation result for which the largest correlation value has resulted; and providing information about the test audio signal based on the correlation result.

Another embodiment may have a computer program having a program code for performing the inventive method for synchronizing multichannel extension data with an audio signal and the inventive method for characterizing a test audio signal, when the program runs on a computer.

The present invention is based on the knowledge that a well-compressed fingerprint is obtained by block processing an audio signal, i.e. that one fingerprint value is derived per block of the audio signal. Further, it has been found that the course of this fingerprint value from block to block is particularly characteristic for the audio signal. Hence, in the sense of differential coding, a comparison of subsequent fingerprint values is performed for subsequent blocks in order to then characterize the change merely in binary form. If the first fingerprint value is higher than the second fingerprint value, a first binary value will be assigned, while if the second fingerprint value is higher than the first fingerprint value, another second binary value will be assigned. This sequence of binary values is output as a fingerprint for the audio signal. The change is thus quantized by merely one single bit. By this 1-bit quantization, merely one single bit of fingerprint information is provided per block of the audio signal, and the audio signal is represented by a simple bit sequence, by which a fast, efficient and surprisingly exact correlation with a corresponding test bit sequence can be performed.
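
By way of illustration only, the 1-bit quantization described above can be sketched as follows. The function names, the choice of block energy as the fingerprint value, the mapping of "higher" to the bit value 1, and the treatment of equal values are assumptions made for this sketch and are not prescribed by the text.

```python
from typing import List, Sequence


def block_energies(samples: Sequence[float], block_len: int) -> List[float]:
    """One energy value per non-overlapping block (a trailing partial block is dropped)."""
    n_blocks = len(samples) // block_len
    return [
        sum(x * x for x in samples[b * block_len:(b + 1) * block_len])
        for b in range(n_blocks)
    ]


def one_bit_fingerprint(values: Sequence[float]) -> List[int]:
    """Compare each fingerprint value with the next one and keep a single bit.

    Assumption: 1 means "higher than the following block", 0 means otherwise
    (including the tie case, which the text leaves open).
    """
    return [1 if first > second else 0 for first, second in zip(values, values[1:])]


samples = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 1.0, 1.0]
energies = block_energies(samples, block_len=2)  # [2.0, 8.0, 18.0, 2.0]
bits = one_bit_fingerprint(energies)             # [0, 0, 1]
```

Because each bit describes the transition between two neighboring blocks, a run of N blocks yields N-1 bits in this sketch.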

Audio signals have the property that the characteristics do not change so much from block to block, so that a full, e.g., 8-bit quantization or 16-bit quantization of the fingerprint value is not absolutely necessitated. Further, audio signals have the property that a change of the fingerprint value from one block to the next is very expressive for the audio signal. By the 1-bit quantization, this change from one block to the next is strongly emphasized. In this way, audio signals have in particular the characteristic that the fingerprint value does not change very much from one block to the next. However, the characterization information for the audio signal that is particularly necessitated for fingerprint processing purposes, which is effectively used by the inventive 1-bit quantization, is embedded within this little change.

In particular when the fingerprint value is an energy-dependent or power-dependent value, changes from one block to the next are relatively small, wherein, however, particularly when blocks are formed in the range of less than 5,000 samples and in particular of less than 2,000 samples and blocks of more than 500 samples, the change of the energy-dependent or power-dependent value from one block to the next is particularly characteristic for the audio signal.

The inventive fingerprint can be used in a particularly favorable manner for the synchronization of multichannel extension data with an audio signal, wherein synchronization is achieved efficiently and reliably by means of a block-based fingerprint technology.

It has been found that fingerprints calculated block-by-block represent a good and efficient characteristic for an audio signal. However, in order to bring the synchronization to a resolution that is finer than one block length, it is advantageous to provide the audio signal with block division information that is detected during synchronization and that can be used for fingerprint calculation.

The audio signal comprises block division information that can be used at the time of synchronization. Thereby, it is ensured that the fingerprints derived from the audio signal during synchronization are based on the same block division or block rasterization as the fingerprints of the audio signal associated with the multichannel extension data. In particular, the multichannel extension data comprise a sequence of reference audio signal fingerprint information. This reference audio signal fingerprint information provides an association, inherent in the multichannel extension stream, between a block of multichannel extension data and the portion or block of the audio signal to which the multichannel extension data belong.

For synchronization, the reference audio signal fingerprints are extracted from the multichannel extension data and correlated with the test audio signal fingerprints calculated by the synchronizer. The correlator merely has to achieve block correlation, since, due to using block division information, the block rasterization on which the two sequences of fingerprints are based is already identical.

Thereby, despite the fact that merely fingerprint sequences have to be correlated on block level, an almost sample-exact synchronization of the multichannel extension data with the audio signal can be obtained.

The block division information included in the audio signal can be stated as explicit side information, e.g. in a header of the audio signal. Alternatively, even when a digital but uncompressed transmission exists, this block division information can also be included in a sample which was, for example, the first sample of a block that was formed for calculating the reference audio signal fingerprints contained in the multichannel extension data. Alternatively or additionally, the block division information can also be introduced directly into the audio signal itself, e.g. by means of watermark embedding. A pseudo noise sequence is particularly suited for this; however, different ways of watermark embedding can be used for introducing block division information into the audio signal. An advantage of this watermark implementation is that any analog/digital or digital/analog conversions are uncritical. Further, watermarks that are robust against data compression exist, which will even withstand compression/decompression or even tandem/coding stages and which can be used as reliable block division information for synchronization purposes.

In addition to that, it is advantageous to embed the reference audio signal fingerprint information directly block by block into the data stream of the multichannel extension data. In this embodiment, finding an appropriate time offset is achieved by using a fingerprint with a data fingerprint not stored separately from the multichannel extension data. Instead, for every block of the multichannel extension data, the fingerprint is embedded in this block itself. Alternatively, however, the reference audio signal fingerprint information can be associated with the multichannel extension data but originate from a separate source.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:

FIG. 1 is a block diagram of an apparatus for processing the audio signal for providing a synchronizable output signal with multichannel extension data, according to an embodiment of the invention;

FIG. 2 is a detailed illustration of the fingerprint calculator of FIG. 1;

FIG. 3a is a block diagram of an apparatus for synchronizing according to an embodiment of the invention;

FIG. 3b is a detailed representation of the compensator of FIG. 3a;

FIG. 4a is a schematic illustration of an audio signal with block division information;

FIG. 4b is a schematic illustration of multichannel extension data with block-wise embedded fingerprints;

FIG. 5 is a schematic illustration of a watermark embedder for generating an audio signal with a watermark;

FIG. 6 is a schematic illustration of a watermark extractor for extracting block division information;

FIG. 7 is a schematic illustration of a result diagram as it appears after correlation across, e.g., 30 blocks of the test block division;

FIG. 8 is a flow diagram for illustrating different fingerprint calculation options;

FIG. 9 is a multichannel encoder scenario with an inventive apparatus for processing;

FIG. 10 is a multichannel decoder scenario with an inventive synchronizer;

FIG. 11a is a detailed illustration of the multichannel extension data calculator of FIG. 9; and

FIG. 11b is a detailed illustration of a block with multichannel extension data as can be generated by the arrangement shown in FIG. 11a.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 shows a schematic diagram of an apparatus for processing an audio signal, wherein the audio signal is shown at 100 with block division information, while the audio signal at 102 may comprise no block division information. The apparatus for processing an audio signal of FIG. 1, which can be used in an encoder scenario, which will be detailed with regard to FIG. 9, comprises a fingerprint calculator 104 for calculating one fingerprint per block of the audio signal for a plurality of subsequent blocks for obtaining a sequence of reference audio signal fingerprint information. The fingerprint calculator is implemented to use predetermined block division information 106. The predetermined block division information 106 can, for example, be detected by a block detector 108 from the audio signal 100 with block division information. As soon as the block division information 106 has been detected, the fingerprint calculator 104 is able to calculate the sequence of reference fingerprints from the audio signal 100.

If the fingerprint calculator 104 obtains an audio signal 102 without block division information, the fingerprint calculator will select an arbitrary block division and first perform block division. This block division is signalized to a block division information embedder 112 via block division information 110, which is implemented to embed the block division information 110 into the audio signal 102 without block division information. On the output side, the block division information embedder provides an audio signal 114 with block division information, wherein this audio signal can be output via an output interface 116, or can be stored separately or output via a different path independent from the output via the output interface 116, as is, for example, illustrated schematically at 118.

The fingerprint calculator 104 is implemented to calculate a sequence of reference audio signal fingerprint information 120. This sequence of reference audio signal fingerprint information is supplied to a fingerprint information embedder 122. The fingerprint information embedder embeds the reference audio signal fingerprint information 120 into multichannel extension data 124, which can be provided separately, or which can also be calculated directly by a multichannel extension data calculator 126, which receives a multichannel audio signal 128 on the input side. On the output side, the fingerprint information embedder 122 provides multichannel extension data with associated reference audio signal fingerprint information, wherein these data are designated by 130. The fingerprint information embedder 122 is implemented to embed the reference audio signal fingerprint information directly into the multichannel extension data, quasi at block level. Alternatively or additionally, the fingerprint information embedder 122 will also store or provide the sequence of reference audio signal fingerprint information based on the association with a block of multichannel extension data, wherein this block of multichannel extension data together with a block of the audio signal represents a fairly good approximation of a multichannel audio signal or the multichannel audio signal 128.

The output interface 116 is implemented to output an output signal 132 which comprises the sequence of reference audio signal fingerprint information and the multichannel extension data in unique association, such as within an embedded data stream. Alternatively, the output signal can also be a sequence of blocks of multichannel extension data without reference audio signal fingerprint information. The fingerprint information is then provided in a separate sequence of fingerprint information, wherein, for example, every fingerprint is "connected" to a block of multichannel extension data by means of a serial block number. Alternative associations of fingerprint data with blocks, such as via implicit signalization of a sequence, etc., can also be applied.

Further, the output signal 132 can also comprise an audio signal with block division information. In specific cases of application, such as in broadcasting, the audio signal with block division information will run along a separate path 118.

FIG. 2 shows a detailed illustration of the fingerprint calculator 104. In the embodiment shown in FIG. 2, the fingerprint calculator 104 comprises a block-forming means 104a, a downstream fingerprint calculator 104b and a fingerprint post-processor 104c for providing a sequence of reference audio signal fingerprint information 120. The block-forming means 104a is implemented to provide the block division information to storage/embedding 110 when the same actually performs first block formation. If, however, the audio signal already has block division information, the block-forming means 104a will be controllable to perform block formation in dependence on the predetermined block division information 106.

Independent of the usage of block division information, a particularly good, characteristic and efficient fingerprint is obtained by an apparatus for calculating a fingerprint of an audio signal as, for example, illustrated in FIG. 2. The block-forming means 104a represents a means for dividing the audio signal into subsequent blocks of samples. Further, the fingerprint value calculation 104b is effective as a means for calculating a first fingerprint value for a first block of the subsequent blocks and a second fingerprint value for a second block of the subsequent blocks.

The fingerprint correlator 312 of FIG. 3a represents a means for comparing, as illustrated at 806 in FIG. 8, wherein the first fingerprint value is compared to the second fingerprint value. An implementation of the means 806 for comparing consists in difference formation, as will be described based on FIG. 8, since then, based on the sign of the difference result, it can be determined whether the first fingerprint value was higher or smaller than the second fingerprint value.

The fingerprint post-processor 104c of FIG. 2 is implemented according to the invention to perform a one-bit quantization 814, or generally to assign a first binary value, when the first fingerprint value is higher than the second fingerprint value, or to assign a second different binary value when the first fingerprint value is smaller than the second fingerprint value.

Finally, the inventive apparatus for calculating a fingerprint comprises a means for outputting information about a sequence of binary values as fingerprint for the audio signal, wherein the means can be implemented, for example, in the form of the output interface 116 of FIG. 1, or can operate as any other data stream or bit stream writer.

The two binary values, i.e. the first binary value and the second binary value, are complementary to each other. In the 1-bit quantization example (block 108, 114) shown in FIG. 8, the first binary value is, for example, 0 or 1, and the second binary value is also 0 or 1, wherein the second value is complementary to the first value. 1-bit quantization is performed, wherein exactly one bit is generated per block of the audio signal.

The sequence of bits as generated by block 814 is then the test fingerprint or the reference fingerprint.

The block dividing means 104a of FIG. 2 is implemented to either form successive adjacent blocks that are non-overlapping or to form blocks that are overlapping, which have, for example, 50% overlap. Further, the block-forming means 104a is implemented to provide blocks of the audio signal with time samples having at least 500 samples or more, and whose length is less than 5000 samples. Particularly advantageously, blocks in the range between 1000 and 2500 samples are used, wherein in particular when frequency-based measures are used for fingerprint value calculation, e.g. 1024 samples or 2048 samples are advantageous. The longer the blocks are selected, the lower the bit requirements of fingerprint information per audio signal will become. However, with increasing block length, the significance of the fingerprint is reduced, which is why the above described block lengths are advantageous, which can relate to an audio sampling frequency of, e.g. 44.1 kHz, wherein, however, respective block lengths for different sample rates also provide reasonable results as long as one block includes a time period of the audio signal of approx. 10 ms to approx. 100 ms.
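
The relation between block length, block duration and fingerprint bit rate stated above can be checked with a short calculation. The following is merely a sketch; the function names and the `overlap` parameter are assumptions introduced for illustration, and one fingerprint bit per block is assumed as described for the 1-bit quantization.

```python
def block_duration_ms(block_len: int, sample_rate_hz: float) -> float:
    """Duration of one block of `block_len` samples in milliseconds."""
    return 1000.0 * block_len / sample_rate_hz


def fingerprint_bits_per_second(block_len: int, sample_rate_hz: float,
                                overlap: float = 0.0) -> float:
    """Fingerprint bit rate for one bit per block.

    `overlap` is the fraction by which successive blocks overlap
    (0.0 for adjacent blocks, 0.5 for 50% overlap).
    """
    hop = block_len * (1.0 - overlap)  # samples between successive block starts
    return sample_rate_hz / hop


for block_len in (1024, 2048):
    duration = block_duration_ms(block_len, 44100.0)
    rate = fingerprint_bits_per_second(block_len, 44100.0)
    print(f"{block_len} samples: {duration:.2f} ms per block, {rate:.2f} bit/s")
```

At 44.1 kHz, a block of 1024 samples lasts roughly 23 ms and a block of 2048 samples roughly 46 ms, so both lie inside the range of approx. 10 ms to approx. 100 ms, and doubling the block length halves the fingerprint bit rate.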

The inventive fingerprint can be used for synchronization, as has been described based on FIG. 3, wherein an accuracy in the order of magnitude of one block length is obtained already without block division information, which can be increased to the range of one sample by adding the block division information. In cases of application where block-accurate synchronization is sufficient, a satisfying result can already be obtained without block division information. Also, with fingerprint applications for characterizing or identifying an audio signal, respectively, a sample-accurate synchronization between test fingerprint and reference fingerprint does not necessarily have to be obtained.

In one embodiment of the present invention, the audio signal is provided with a watermark, as is shown in FIG. 4a. In particular, FIG. 4a shows an audio signal having a sequence of samples, wherein a block division into blocks i, i+1, i+2 is indicated schematically. However, even in the embodiment shown in FIG. 4a, the audio signal itself does not include such an explicit block division. Instead, a watermark 400 is embedded in the audio signal such that every audio sample comprises a portion of the watermark. This portion of the watermark is automatically indicated at 404 for a sample 402. In particular, the watermark 400 is embedded such that the block structure can be detected based on the watermark. For this purpose, the watermark is, for example, a known periodic pseudo noise sequence, as is shown in FIG. 5 at 500. This known pseudo noise sequence has a period length equal to the block length or larger than a block length, wherein, however, a period length equal to the block length or in the order of magnitude of the block length is advantageous.

For watermark embedding, first, as is shown in FIG. 5, a block formation 502 of the audio signal is performed. Then, a block of the audio signal is converted to the frequency domain by means of a time/frequency conversion 504. Analogously, the known pseudo noise sequence 500 is transformed to the frequency domain by means of a time/frequency conversion 506. Thereupon, a psychoacoustic module 508 calculates the psychoacoustic masking threshold of the audio signal block, wherein, as known in psychoacoustics, a signal in a band will then be masked in the audio signal, i.e. the same is inaudible, when the energy of the signal in the band is below the value of the masking threshold for this band. Based on this information, a spectral weighting 510 for the spectral illustration of the pseudo noise sequence is performed. Then, prior to a combiner 512, the spectrally weighted pseudo noise sequence has a spectrum, which has a course corresponding to the psychoacoustic masking threshold. This signal is then combined, spectral value by spectral value, with the spectrum of the audio signal in the combiner 512. Hence, at the output of the combiner 512, an audio signal block with an introduced watermark exists, wherein, however, the watermark is masked by the audio signal. By a frequency/time converter 514, the block of the audio signal is converted back to the time domain and the audio signal shown in FIG. 4a exists, which now, however, has a watermark illustrating block division information.

It should be noted that many different watermark-embedding strategies exist. Hence, the spectral weighting 510 can be performed, for example, by a dual operation in the time domain, such that time/frequency conversion 506 is not necessitated.

Further, the spectrally weighted watermark could also be transformed into the time domain prior to its combination with the audio signal, such that the combination 512 takes place in the time domain, wherein in this case time/frequency conversion 504 would not absolutely be necessitated, as long as the masking threshold can be calculated without transformation. Obviously, calculation of the masking threshold used independently of the audio signal or of a transformation length of the audio signal could also be performed.

The length of the known pseudo noise sequence is equal to the length of one block. Then, correlation for watermark extraction works particularly efficiently and clearly. However, longer pseudo noise sequences could be used, as long as a period length of the pseudo noise sequence is equal to or longer than the block length. Further, a watermark having no white spectrum can be used, which is merely implemented such that it comprises spectral portions in certain frequency bands, such as the lower spectral band or a central spectral band. Thereby, it can be controlled that the watermark is not, for example, introduced only in the upper bands which are eliminated or parameterized, for example, by a "spectral band replication" technique, as known from MPEG 4 standard, in a data rate-saving transmission.
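
The use of a periodic pseudo noise sequence as block division information, and its extraction by correlation, can be illustrated with the following strongly simplified sketch. It is not the embedder of FIG. 5: the psychoacoustic weighting is omitted, the sequence is simply added in the time domain with a fixed gain that is exaggerated for a short toy signal, and all names (`make_pn_sequence`, `embed_block_marker`, `detect_block_start`) as well as the test signal are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def make_pn_sequence(length: int, seed: int = 1) -> List[float]:
    """Known pseudo noise sequence of +1/-1 values; one period equals one block."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed_block_marker(audio: Sequence[float], pn: Sequence[float],
                       first_block_start: int, gain: float) -> List[float]:
    """Add the periodic sequence so that each period starts at a block boundary."""
    n = len(pn)
    return [x + gain * pn[(i - first_block_start) % n] for i, x in enumerate(audio)]


def detect_block_start(signal: Sequence[float], pn: Sequence[float]) -> int:
    """Correlate the signal with the known sequence at every lag within one period
    and return the lag with the highest correlation, i.e. the block raster."""
    n = len(pn)
    scores = [
        sum(s * pn[(i - lag) % n] for i, s in enumerate(signal))
        for lag in range(n)
    ]
    return max(range(n), key=scores.__getitem__)


block_len = 64
pn = make_pn_sequence(block_len)
audio = [math.sin(2.0 * math.pi * 0.1 * i) for i in range(100 * block_len)]
marked = embed_block_marker(audio, pn, first_block_start=17, gain=0.1)
detected = detect_block_start(marked, pn)  # position of the block raster modulo block_len
```

Correlating over many periods lets the contribution of the known sequence accumulate while the audio contribution stays bounded, which is why a low embedding level can still be detected.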

As an alternative to using a watermark, block division can also be performed when, for example, a digital channel exists, where every block of the audio signal of FIG. 4 can be marked such that, for example, the first sample value of a block obtains a flag. Alternatively, for example, block division can be signalized in a header of an audio signal, which is used for the calculation of the fingerprint and which has also been used for calculating the multichannel extension data from the original multichannel audio channels.

For illustrating the scenario of calculating the multichannel extension data, reference will be made below to FIG. 9. FIG. 9 shows an encoder-side scenario, as it is used for reducing the data rate of multichannel audio signals. A 5.1 scenario is shown exemplarily, wherein, however, a 7.1, 3.0 or an alternative scenario can be used. For the spatial audio object coding, which is also known and where audio objects are coded instead of audio channels, where the multichannel extension data are actually data with which objects can be reconstructed, a basically binary structure, indicated in FIG. 9, is used. The multichannel audio signal having the several audio channels or audio objects is supplied to a downmixer 900 providing a downmix audio signal, wherein the audio signal is, for example, a mono downmix or a stereo downmix. Further, multichannel extension data calculation is performed in a respective multichannel extension data calculator 902. There, the multichannel extension data are calculated, e.g. according to the BCC technique or according to the standard known under the name MPEG surround. Extension data calculation for audio objects, which are also referred to as multichannel extension data, can also take place in the audio signal 102. The apparatus for processing the audio signal shown in FIG. 1 is downstream of these known two blocks 900, 902, wherein the apparatus 904 for processing shown in FIG. 9 receives, according to FIG. 1, for example an audio signal 102 without block division information as mono downmix or stereo downmix, and further receives the multichannel extension data via the line 124. Hence, the multichannel extension data calculator 126 of FIG. 1 will correspond to the multichannel extension data calculator 902 of FIG. 9. On the output side, the apparatus 904 for processing provides, for example, an audio signal 118 having embedded block division information as well as a data stream having multichannel extension data together with associated or embedded reference audio signal fingerprint information as illustrated in FIG. 1 at 132.

FIG. 11a shows a detailed illustration of the multichannel extension data calculator 902. In particular, first, block formation in respective block-forming means 910 is performed for obtaining a block for the original channel of the multichannel audio signal. Thereupon, time/frequency conversion in a time/frequency converter 912 is performed per block. The time/frequency converter can be a filter bank for performing sub-band filtering, a general transformation or in particular a transformation in the form of an FFT. Alternative transformations, such as the MDCT, are also known. Thereupon, an individual correlation parameter between the channel and the reference channel indicated by ICC is calculated in the multichannel extension data calculator per band, block and, for example, also per channel. Further, an individual energy parameter ICLD is calculated per band and block and channel, wherein this is performed in a parameter calculator 914. It should be noted that the block-forming means 910 uses block division information 106, when such block division information already exists. Alternatively, the block-forming means 910 can also determine block division information itself when the first block division is performed and then output the same and use it to control, for example, the fingerprint calculator of FIG. 1. Analogously to the designation in FIG. 1, the output block division information is also designated by 110. Generally, it is ensured that the block formation for calculating the multichannel extension data is performed in synchronization with the block formation for calculating the fingerprints of FIG. 1. Thereby it is ensured that a sample-exact synchronization of multichannel extension data to the audio signal is obtainable.

The parameter data calculated by the parameter calculator 914 are supplied to a data stream formatter 916, which can be implemented equal to the fingerprint information embedder 122 of FIG. 1. Further, the data stream formatter 916 receives a fingerprint per block of the downmix signal as indicated at 918. Then, with the fingerprint and the received parameter data 915, the data stream formatter generates multichannel extension data 130 with embedded fingerprint information, one block of which is illustrated schematically in FIG. 11b. In particular, the fingerprint information for this block is entered after an optionally present synchronization word 950 at 960. Then, after the fingerprint information 960, the parameters 915 follow which the parameter calculator 914 has calculated, namely, for example, in the sequence shown in FIG. 11b where first the ICLD parameters per channel and band occur, which are then followed by the ICC parameters per channel and band. The channel is in particular signalized by the index of "ICLD", wherein an index "1" stands, for example, for the left channel, an index "2" stands for the central channel, an index "3" stands for the right channel, an index "4" stands for the left rear channel (LS), and an index "5" stands for the right rear channel (RS).
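
The ordering of one block of multichannel extension data described above (optional synchronization word, then fingerprint information, then the ICLD parameters, then the ICC parameters) can be modeled as a simple data structure. This is only an illustrative sketch: the class name `ExtensionBlock`, the method `stream_order`, the label format and the in-memory representation are assumptions, and no actual bit stream syntax is implied.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExtensionBlock:
    """One block of multichannel extension data with embedded fingerprint.

    Channel index 1..5 follows the text: left, center, right,
    left rear (LS), right rear (RS).
    """
    fingerprint: int                 # fingerprint information 960
    icld: List[List[float]]          # icld[channel - 1][band]
    icc: List[List[float]]           # icc[channel - 1][band]
    sync_word: Optional[int] = None  # optionally present synchronization word 950

    def stream_order(self) -> List[Tuple[str, float]]:
        """Return (label, value) pairs in the order in which they would be written."""
        fields: List[Tuple[str, float]] = []
        if self.sync_word is not None:
            fields.append(("sync", self.sync_word))
        fields.append(("fingerprint", self.fingerprint))
        # First all ICLD parameters per channel and band, then all ICC parameters.
        for name, table in (("ICLD", self.icld), ("ICC", self.icc)):
            for channel, bands in enumerate(table, start=1):
                for band, value in enumerate(bands):
                    fields.append((f"{name}{channel}[{band}]", value))
        return fields


block = ExtensionBlock(
    fingerprint=1,
    icld=[[0.5, -1.0], [2.0, 0.0]],
    icc=[[0.9, 0.8], [0.7, 0.6]],
)
labels = [label for label, _ in block.stream_order()]
```

Keeping the fingerprint at a fixed position at the start of each block means a synchronizer can read it without parsing the parameters that follow.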

Generally this results in a data stream with multichannel extension data as illustrated in FIG. 4b, wherein the fingerprint of the audio signal, i.e. the stereo downmix signal or the mono downmix signal or generally the downmix signal, precedes the multichannel extension data 124 for a block. In one implementation, the fingerprint information for one block can also be inserted in the transmission direction after the multichannel extension data or somewhere between the multichannel extension data. Alternatively, the fingerprint information can also be transmitted in a separate data stream, or, for example, in a separate table which is, for example, associated with the multichannel extension data by means of an explicit block identifier, or where the association is implicitly given, namely by the order of the fingerprints in relation to the order of the multichannel extension data for the individual blocks. Other associations without explicit embedding can also be used.

FIG. 3a shows an apparatus for synchronizing multichannel extension data with an audio signal 114. In particular, the audio signal 114 includes block division information, as is illustrated based on FIG. 1. In addition to that, reference audio signal fingerprint information is associated with the multichannel extension data.

The audio signal with the block division information is supplied to a block detector 300, which is implemented to detect the block division information in the audio signal and to supply the detected block division information 302 to a fingerprint calculator 304. Further, the fingerprint calculator 304 receives the audio signal. An audio signal without block division information would be sufficient here, but the fingerprint calculator can also be implemented to use the audio signal with block division information for fingerprint calculation.

Now, the fingerprint calculator 304 calculates one fingerprint per block of the audio signal for a plurality of subsequent blocks in order to obtain a sequence of test audio signal fingerprints 306. In particular, the fingerprint calculator 304 is implemented to use the block division information 302 for calculating the sequence of test audio signal fingerprints 306.

The inventive synchronization apparatus, or the inventive synchronization method, is further based on a fingerprint extractor 308 for extracting a sequence of reference audio signal fingerprints 310 from the reference audio signal fingerprint information 120 as it is supplied to the fingerprint extractor 308.

Both the sequence of test fingerprints 306 and the sequence of reference fingerprints 310 are supplied to a fingerprint correlator 312, which is implemented to correlate the two sequences. The correlation result 314 yields an offset value that is an integer multiple (x) of the block length ($\Delta D$). Depending on this result, a compensator 316 is controlled for reducing or, in the best case, eliminating a time offset between the multichannel extension data 132 and the audio signal 114. At the output of the compensator 316, both the audio signal and the multichannel extension data are output in a synchronized form in order to be supplied to multichannel reconstruction, as will be discussed with reference to FIG. 10.

The synchronizer shown in FIG. 3a is shown in FIG. 10 at 1000. As has been illustrated with reference to FIG. 3a, the synchronizer 1000 receives the audio signal 114 and the multichannel extension data in non-synchronized form and provides the audio signal and the multichannel extension data in synchronized form to an upmixer 1102 on the output side. The upmixer 1102, also referred to as an "upmix" block, can now calculate, based on the audio signal and the multichannel extension data synchronized thereto, reconstructed multichannel audio signals L', C', R', LS' and RS'. These reconstructed multichannel audio signals represent an approximation to the original multichannel audio signals, as they have been illustrated at the input of the block 900 in FIG. 9. Alternatively, the reconstructed multichannel audio signals at the output of block 1102 in FIG. 10 also represent reconstructed audio objects or reconstructed audio objects already amended at certain positions, as is known from audio object coding. Now, the reconstructed multichannel audio signals have a maximum obtainable audio quality, due to the fact that synchronization of the multichannel extension data has been obtained in a sample-exact manner with the audio signal.

FIG. 3b shows a specific implementation of the compensator 316. The compensator 316 has two delay blocks, of which one block 320 can be a fixed delay block having a maximum delay and the second block 322 can be a block having a variable delay that can be controlled between a delay equal to zero and a maximum delay $D_{\max}$. Control takes place based on the correlation result 314. The fingerprint correlator 312 provides correlation offset control in integer multiples (x) of one block length ($\Delta D$). Due to the fact that fingerprint calculation has been performed in the fingerprint calculator 304 itself based on the block division information included in the audio signal, according to the invention, sample-exact synchronization is obtained although the fingerprint correlator only had to perform block-based correlation. Despite the fact that the fingerprint has been calculated block by block, i.e. represents the time curve of the audio signal and correspondingly the time curve of the multichannel extension data only in a relatively coarse manner, a sample-exact correlation is nevertheless obtained, merely due to the fact that the block division of the fingerprint calculator 304 has been synchronized in the synchronizer with regard to the block division that has been used for calculating the multichannel extension data block by block and which has, above all, been used for calculating the fingerprints embedded in the multichannel extension data stream or associated with the multichannel extension data stream.

With regard to the implementation of the compensator 316, it should be noted that two variable delays can also be used, such that the correlation result 314 controls both variable delay stages. Also, alternative implementation options within a compensator for synchronization purposes can be used for eliminating time offsets.

In the following, with reference to FIG. 6, a detailed implementation of the block detector 300 of FIG. 3a will be illustrated for the case in which the block division information is introduced into the audio signal as a watermark. The watermark extractor in FIG. 6 can be structured analogously to the watermark embedder of FIG. 5, but it does not have to be structured in an exactly analogous manner.

In the embodiment shown in FIG. 6, the audio signal with watermark is supplied to a block former 600, which generates subsequent blocks from the audio signal. One block is then supplied to a time/frequency converter 602 for transforming the block. Based on the spectral representation of the block or due to a separate calculation, a psychoacoustic module 604 is able to calculate a masking threshold for subjecting the block of the audio signal to prefiltering in a prefilter 606 by using this masking threshold. The module 604 and the prefilter 606 serve to increase the detection accuracy for the watermark. They can also be omitted, such that the output of the time/frequency converter 602 is directly coupled to a correlator 608. The correlator 608 is implemented to correlate the known pseudo noise sequence 500, which has already been used in the watermark embedding in FIG. 5, after a time/frequency conversion in a converter 502, with a block of the audio signal.

For block formation in the block 600, a test block division is predetermined that does not necessarily have to correspond to the final block division. Instead, the correlator 608 will now perform correlation across several blocks, for example across twenty or even more blocks. Thereby, the spectrum of the known noise sequence is correlated with the spectrum of every block at different delay values in the correlator 608, such that a correlation result 610 results after several blocks, which could, for example, look as shown in FIG. 7. A control 612 can monitor the correlation result 610 and perform peak detection. For that purpose, the control 612 detects a peak 700 that becomes more and more apparent as more blocks are used for correlation. As soon as a correlation peak 700 is detected, merely the x coordinate, i.e. the offset $\Delta n$ at which the peak has appeared, has to be determined. In an embodiment of the present invention, this offset $\Delta n$ indicates the number of samples by which the test block division has deviated from the block division actually used in the watermark embedding. From this knowledge about the test block division and the correlation peak 700, the control 612 now determines a corrected block division 614, e.g. according to the formula shown in FIG. 7. In particular, the offset value $\Delta n$ is subtracted from the test block division for calculating the corrected block division 614, which is then to be maintained by the fingerprint calculator 304 of FIG. 3a for calculating the test fingerprints.
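The peak search described above can be sketched as follows. This is a minimal illustration, not the extractor of FIG. 6: it works in the time domain instead of the frequency domain, omits the psychoacoustic prefilter, and uses a circular correlation per test block. The names `find_block_offset` and `demo`, the block length of 64 and the signal model are assumptions made for this example only.

```python
import random

def find_block_offset(signal, noise, block_len, num_blocks):
    """Return the lag (in samples) with the largest accumulated correlation.

    For each test block, the block is circularly correlated with the known
    noise sequence; the per-lag results are summed over several blocks so
    that the peak stands out from the host signal.
    """
    scores = [0.0] * block_len
    for b in range(num_blocks):
        block = signal[b * block_len:(b + 1) * block_len]
        for lag in range(block_len):
            scores[lag] += sum(
                block[(i + lag) % block_len] * noise[i] for i in range(block_len)
            )
    # Peak detection: the index of the maximum plays the role of the offset.
    return max(range(block_len), key=lambda lag: scores[lag])

def demo():
    rng = random.Random(0)  # fixed seed so the example is reproducible
    block_len, true_offset = 64, 17
    noise = [rng.choice((-1.0, 1.0)) for _ in range(block_len)]
    # Assumed signal model: the watermark repeats once per block, shifted by
    # true_offset samples against the test block division, plus a weak host.
    signal = [
        noise[(n - true_offset) % block_len] + 0.1 * ((n % 7) - 3) / 3.0
        for n in range(6 * block_len)
    ]
    return find_block_offset(signal, noise, block_len, num_blocks=4)
```

In this sketch the returned lag is the position inside a test block at which the watermark period starts. Whether it is added to or subtracted from the test block division depends on the sign convention chosen for the offset.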

Regarding the exemplary watermark extractor in FIG. 6, it should be noted that an extraction can also be performed alternatively, e.g. in the time domain and not in the frequency domain, that prefiltering can also be omitted, and that alternative ways can be used for calculating the delay, i.e. the sample offset value $\Delta n$. An alternative option is, for example, to test several test block divisions and to use the test block division providing the best correlation result either after one or after several blocks. Also, non-periodic watermarks can be used as correlation measures, i.e. non-periodic sequences, which could be even shorter than one block length.

Hence, for solving the association problem, a specific procedure on the transmitter side and the receiver side is advantageous in an embodiment of the present invention. On the transmitter side, calculation of time-variable and appropriate fingerprint information from the corresponding (mono or stereo) downmix audio signal can be performed. Further, these fingerprints can be entered regularly into the transmitted multichannel additional data stream as a synchronization help. This can be performed as a data field within the spatial audio coding side information organized block by block, or in such a manner that the fingerprint signal is transmitted as first or last information of the data block in order to be easily added or removed. Further, a watermark, such as a known noise sequence, can be embedded into the audio signal to be transmitted. This helps the receiver to determine the frame phase and to eliminate a frame-internal offset.

On the receiver side, two-stage synchronization is advantageous. In a first stage, the watermark is extracted from the received audio signal and the position of the noise sequence is determined. Further, the frame boundaries can be determined from the position of the noise sequence, and the audio data stream can be divided correspondingly. Within these frame boundaries, or block boundaries, the characteristic audio features, i.e. fingerprints, can be calculated across almost the same portions as were used within the transmitter, which increases the quality of the result of a later correlation. In a second stage, time-variable and appropriate fingerprint information is calculated from the corresponding stereo audio signal or mono audio signal, or, generally, from the downmix signal, wherein the downmix signal can also have more than two channels, as long as the downmix signal has fewer channels than there are channels or generally audio objects in the original audio signal prior to the downmix.

Further, the fingerprints can be extracted from the multichannel additional information, and a time offset between the multichannel additional information and the received signal can be determined by means of appropriate and also known correlation methods. An overall time offset consists of the frame phase and the offset between the multichannel additional information and the received audio signal. Further, the audio signal and the multichannel additional information can be synchronized for subsequent multichannel decoding by a downstream actively regulated delay compensation stage.

For obtaining the multichannel additional data, the multichannel audio signal is divided, for example, into blocks of a fixed size. In the respective block, a noise sequence also known to the receiver is embedded, or, generally, a watermark is embedded. In the same raster, block by block and simultaneously with, or at least synchronized to, the calculation of the multichannel additional data, a fingerprint is calculated which is suitable for characterizing the time structure of the signal as clearly as possible.

One embodiment for this is using the energy content of the current downmix audio signal of the audio block, for example in a logarithmic form, i.e. in a decibel-related representation. In this case, the fingerprint is a measure for the time envelope of the audio signal. For reducing the information amount to be transmitted, and for increasing the accuracy of the measurement value, this synchronization information can also be expressed as the difference to the energy value of the previous block, with subsequent appropriate entropy coding, such as Huffman coding, adaptive scaling and quantization.

With reference to FIG. 8 and generally with reference to FIG. 2, embodiments for calculating a fingerprint will be discussed below.

After a block division in a block dividing step 800, the audio signal is present in subsequent blocks. Thereupon, fingerprint value calculation is performed according to block 104b of FIG. 2, wherein the fingerprint value can, for example, be one energy value per block, as illustrated in a step 802. When the audio signal is a stereo audio signal, energy calculation of the downmix audio signal in the current block is performed according to the following equation:

$$E_{\text{monosum}} = \sum_{i=0}^{1152} \left( s_{\text{left}}(i)^2 + s_{\text{right}}(i)^2 \right)$$

In particular, the signal value $s_{\text{left}}(i)$ with the number i represents a time sample of a left channel of the audio signal. $s_{\text{right}}(i)$ is the i-th sample of a right channel of the audio signal. In the shown embodiment, the block length is 1152 audio samples, which is why the 1153 audio samples (including the sample for i=0) both from the left and the right downmix channel are each squared and summed. If the audio signal is a monophonic audio signal, the summation across channels is omitted. If the audio signal is a signal with, for example, three channels, the squared samples from three channels will be summed up. Further, it is advantageous to remove the (non-meaningful) steady components of the downmix audio signals prior to calculation.
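The energy calculation of step 802 can be sketched in a few lines. The function name `block_energy` is hypothetical, and removing the steady component by subtracting the per-block mean is only one possible choice; the text does not say how the steady component is removed.

```python
def block_energy(*channels):
    """Sum of squared samples over all channels of one block.

    Each channel is a list of samples belonging to the same block. The
    steady (DC) component is removed per channel by subtracting the block
    mean, which is an assumption of this sketch.
    """
    energy = 0.0
    for samples in channels:
        mean = sum(samples) / len(samples)
        energy += sum((s - mean) ** 2 for s in samples)
    return energy
```

Calling `block_energy(left)` covers the monophonic case, `block_energy(left, right)` the stereo case, and further channels can simply be appended as additional arguments.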

In a step 804, a minimum limitation of the energy is performed with a view to the subsequent logarithmic representation. For a decibel-related evaluation of the energy, a minimum energy offset $E_{\text{offset}}$ is provided, so that a useful logarithmic calculation results in the case of zero energy. This energy measure in dB describes a number range of 0 to 90 (dB) at an audio signal resolution of 16 bits. Hence, in block 804, the following equation will be implemented: $E_{\text{db}} = 10 \log_{10}\left(E_{\text{monosum}} + E_{\text{offset}}\right)$

For an exact determination of the time offset between the multichannel additional information and the received audio signal, not the absolute energy level value is used, but rather the slope or steepness of the signal envelope. Therefore, for correlation measurement in the fingerprint correlator 312 of FIG. 3a, the steepness of the energy envelope is used. Technically speaking, this signal derivative is calculated by a difference formation of the energy value with that of the previous block, according to the following equation: $E_{\text{db}}(\text{diff}) = E_{\text{db}}(\text{current\_block}) - E_{\text{db}}(\text{previous\_block})$

$E_{\text{db}}(\text{diff})$ is the difference value of the energy values of two successive blocks, in a dB representation, while $E_{\text{db}}$ is the energy in dB of the current block or the previous block, as is obvious from the above equation. This difference formation of energies is performed in a step 806.
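Steps 804 and 806 can be illustrated together. This is a small sketch with hypothetical function names; the default offset of 1.0 is an assumption chosen so that zero energy maps to 0 dB, since the text does not give a value for the offset.

```python
import math

def energy_db(energy, e_offset=1.0):
    """Energy in dB with a minimum offset so that zero energy stays finite."""
    return 10.0 * math.log10(energy + e_offset)

def db_differences(energies_db):
    """Difference of each block's dB energy to that of the previous block."""
    return [cur - prev for prev, cur in zip(energies_db, energies_db[1:])]
```

A sequence of N block energies yields N - 1 difference values, because the first block has no predecessor.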

It should be noted that this step is performed, for example, only in the encoder, i.e. in the fingerprint calculator 104 of FIG. 1, such that the fingerprint embedded in the multichannel extension data consists of difference coded values.

Alternatively, step 806 of the difference formation can also be implemented purely on the decoder side, i.e. in the fingerprint calculator 304 of FIG. 3a. In this case, the transmitted fingerprint only consists of non-difference coded fingerprints, and the difference formation according to step 806 is only performed within the decoder. This option is represented by the dotted signal flow line 807, which bridges the difference formation block 806. This latter option 807 has the advantage that the fingerprint still includes information about the absolute energy of the downmix signal, but necessitates a slightly higher fingerprint word length.

While blocks 802, 804, 806 belong to fingerprint value calculation according to 104b of FIG. 2, the subsequent steps 808 (scaling with amplification factor), 810 (quantization), 812 (entropy coding), or also the 1-bit quantization in block 814, belong to fingerprint post-processing according to the fingerprint post-processor 104c.

When scaling the energy (envelope of the signal) for optimal modulation according to block 808, it is ensured that in the subsequent quantization of this fingerprint both the number range is utilized maximally and the resolution at low energy values is improved. Therefore, additional scaling or amplification is introduced. The same can be realized either as a fixed or static weighting amount or via a dynamic amplification regulation adapted to the envelope signal. Combinations of a static weighting amount and an adapted dynamic amplification regulation can also be used. In particular, the following equation is followed: $E_{\text{scaled}} = E_{\text{db}}(\text{diff}) \cdot A_{\text{amplification}}(t)$

$E_{\text{scaled}}$ represents the scaled energy. $E_{\text{db}}(\text{diff})$ represents the difference energy in dB calculated by the difference formation in block 806, and $A_{\text{amplification}}$ is the amplification factor, which can depend on the time t when a dynamic amplification regulation is used. The amplification factor will depend on the envelope signal in that the amplification factor becomes smaller with a larger envelope and higher with a smaller envelope, in order to obtain a modulation of the available number range that is as uniform as possible. The amplification factor can be reproduced in particular in the fingerprint calculator 304 by measuring the energy of the transmitted audio signal, so that the amplification factor does not have to be transmitted explicitly.

In a block 810, the fingerprint calculated by block 808 is quantized. This is performed in order to prepare the fingerprint for entering into the multichannel additional information. This reduced fingerprint resolution has shown to be a good tradeoff with regard to bit requirement and reliability of the delay detection. In particular, values exceeding 255 can be limited to the maximum value of 255 with a saturation characteristic curve, as can be expressed, for example, by the equation below:

$$E_{\text{quantized}} = Q_{8\,\text{bits}}\left(E_{\text{scaled}}\right)$$

$E_{\text{quantized}}$ is the quantized energy value and represents a quantization index having 8 bits. $Q_{8\,\text{bits}}$ is the quantization operation, which assigns the quantization index of the maximum value 255 to any value greater than 255. It should be noted that finer quantizations with more than 8 bits or coarser quantizations with less than 8 bits can also be used, wherein the additional bit requirements decrease with coarser quantization, while the additional bit requirements increase with finer quantization with more bits, but the accuracy increases as well.
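A saturating 8-bit quantizer of this kind might look as follows. The function name `quantize_8bit`, the rounding to the nearest integer and the clamping of negative inputs to 0 are assumptions of this sketch; the text only specifies the saturation at 255.

```python
import math

def quantize_8bit(value):
    """Map a scaled fingerprint value to an index in 0..255 with saturation."""
    index = math.floor(value + 0.5)   # round half up (assumed rounding rule)
    return max(0, min(255, index))    # saturate instead of wrapping around
```

Saturation keeps an overrun from wrapping to a small index, which would otherwise turn a very steep envelope change into an apparently flat one.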

Thereupon, in a block 812, entropy coding of the fingerprint can take place. By evaluating statistical characteristics of the fingerprint, the bit requirements for the quantized fingerprint can be reduced further. An appropriate entropy coding method is, for example, Huffman coding. Statistically different frequencies of fingerprint values can be expressed by different code lengths, which, on average, reduces the bit requirements for representing the fingerprint.
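To illustrate how different frequencies of fingerprint values lead to different code lengths, the following sketch builds a Huffman code table from a list of quantized values. It is a generic textbook construction with a hypothetical function name, not the specific code table of any embodiment.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a dict mapping each distinct symbol to its bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one symbol only
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-breaker, partial code table). The unique
    # tie-breaker ensures that the dicts are never compared.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)  # two least frequent subtrees
        n2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]
```

For a value sequence in which one index occurs half of the time, that index receives a 1-bit code, while rare indices receive longer codes.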

The result of the entropy coding block 812 will then be written into the extension channel data stream, as is illustrated at 813. Alternatively, non-entropy coded fingerprints can be written into the bit stream as quantized values, as is illustrated at 811.

As an alternative to the energy calculation per block in step 802, a different fingerprint value can be calculated, as is illustrated in block 818.

As an alternative to the energy of a block, the crest factor of the power density spectrum (PSD crest) can be calculated. The crest factor is generally calculated as the quotient of the maximum value $X_{\max}$ of the signal in a block and the arithmetic average of the signal values $X_n$ (e.g. spectral values) in the block, as is illustrated exemplarily in the following equation, where N denotes the number of values $X_n$ in the block:

$$\text{PSD}_{\text{crest}} = \frac{X_{\max}}{\frac{1}{N} \sum_{n=1}^{N} X_n}$$
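As a minimal sketch of this quotient, assuming the block is given as a list of non-negative spectral values with a non-zero mean (the function name is hypothetical):

```python
def crest_factor(values):
    """Maximum value of a block divided by the arithmetic mean of the block."""
    mean = sum(values) / len(values)   # assumed to be non-zero
    return max(values) / mean
```

A flat spectrum gives a crest factor of 1, while a spectrum dominated by a single line gives a value close to the number of spectral values.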

Further, another method can be used in order to obtain a more robust synchronization. Instead of post-processing by means of blocks 808, 810, 812, 1-bit quantization can be used as an alternative fingerprint post-processing 104c (FIG. 2), as is illustrated in block 814. Here, additionally, 1-bit quantization is performed directly after the calculation and the difference formation of the fingerprint according to 802 or 818 in the encoder. It has been shown that this can increase the accuracy of the correlation. This 1-bit quantization is realized such that the fingerprint equals 1 when the new value is higher than the old one (slope positive) and equals -1 when the slope is negative. A negative slope is obtained when the new value is smaller than the old value.
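The 1-bit quantization can be sketched as a comparison of each block value with its predecessor. As an assumption of this example, a non-rising value is stored as bit 0 rather than as -1, so that the result can be used directly in bit-by-bit operations; equal values are treated as "not higher", in line with the abstract.

```python
def one_bit_fingerprint(values):
    """One bit per block transition: 1 if the value rose, otherwise 0."""
    return [1 if cur > prev else 0 for prev, cur in zip(values, values[1:])]
```

For example, the block values 3, 5, 4, 4, 9 give the bit sequence 1, 0, 0, 1.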

The inventive 1-bit quantization simplifies the correlation calculation in the fingerprint correlator 312 significantly. Based on the fact that the test fingerprint and the reference fingerprint are bit sequences, the correlation can be simplified to a simple XOR operation followed by summing up the bit-by-bit results of the XOR operation. Hence, when the sequence of test audio signal fingerprint values and the sequence of reference audio signal fingerprints are each a sequence of 1-bit values, wherein one bit each stands for a block of audio samples, the fingerprint correlator 312 of FIG. 3a is implemented to combine a bit sequence of the sequence of test audio signal fingerprints and a bit sequence of the reference audio signal fingerprints by a bit-by-bit XOR operation and to sum up the obtained bit results. The result of this summation represents a first correlation value. The bit sequences have a length of, e.g., 32 bits, or between, e.g., 10 bits and 100 bits.

Further, the fingerprint correlator 312 is implemented to combine a bit sequence of a sequence of test audio signal fingerprints or reference audio signal fingerprints offset by an offset value with the respective other sequence, also by a bit-by-bit XOR operation, and to sum up the obtained bit results, which results in a second correlation value. For the offset value for which the maximum correlation value was obtained, it can be determined that the test fingerprint and the reference fingerprint have matched. Hence, this offset value represents the correlation result, since the highest correlation value has been obtained for this specific offset value.
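The offset search over 1-bit fingerprints can be sketched as follows. Since a plain XOR yields 0 for agreeing bits, this example counts agreements (the window length minus the XOR sum) so that the best match has the highest score; that choice, the function names and the 32-bit window are assumptions of this sketch.

```python
import random

def agreement(test_bits, ref_bits):
    """Number of agreeing bit positions, i.e. length minus the XOR sum."""
    return sum(1 - (t ^ r) for t, r in zip(test_bits, ref_bits))

def best_offset(test_bits, ref_bits, window=32):
    """Block offset of the reference window that agrees best with the test bits."""
    probe = test_bits[:window]
    scores = {
        off: agreement(probe, ref_bits[off:off + window])
        for off in range(len(ref_bits) - window + 1)
    }
    return max(scores, key=scores.get)

def demo():
    rng = random.Random(1)                      # fixed seed for reproducibility
    ref = [rng.randint(0, 1) for _ in range(200)]
    test = ref[37:37 + 32]                      # test sequence lags by 37 blocks
    return best_offset(test, ref)
```

The returned offset is counted in blocks; multiplying it by the block length gives the delay in samples that the compensator has to remove.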

In addition to improving the synchronization results, this quantization also has an effect on the bandwidth for transmitting the fingerprint. While previously at least 8 bits had to be introduced for the fingerprint for providing a sufficiently accurate value, here, a single bit is sufficient. Since the fingerprint and its 1-bit counterpart are already determined in the transmitter, a more accurate calculation of the difference is obtained since the actual fingerprint is present with maximum resolution and thus even minimum changes between the fingerprints can be considered both in the transmitter and in the receiver. Further, it has been found that most subsequent fingerprints only differ minimally. This difference, however, will be eliminated by quantization prior to difference formation.

Depending on the implementation and when block-by-block accuracy is sufficient, the 1-bit quantization can be used as the specific fingerprint post-processing even independent of whether an audio signal with additional information is present or not, since the 1-bit quantization based on difference coding is already a robust and still accurate fingerprint method in itself, which can also be used for purposes other than synchronization, e.g. for the purpose of identification or classification.

As has been illustrated based on FIG. 11a, a calculation of the multichannel additional data is performed with the help of the multichannel audio data. The calculated multichannel additional information is subsequently extended by newly added synchronization information in the form of the calculated fingerprints by appropriate embedding into the bit stream.

The watermark/fingerprint hybrid solution allows a synchronizer to detect a time offset between the downmix signal and the additional data and to realize a time-correct adaptation, i.e. delay compensation between the audio signal and the multichannel extension data in the order of magnitude of +/- one sample value. Therewith, the multichannel association is reconstructed almost completely in the receiver, i.e. apart from a hardly noticeable time difference of several samples, which does not have a noticeable effect on the quality of the reconstructed multichannel audio signal.

The inventive fingerprint as calculated, for example, by a fingerprint calculator 104 or the fingerprint calculator 304, with or without block division information, can be used for characterizing a test audio signal. Therefore, means 104 or 304, respectively, is provided in order to obtain a sequence of test audio fingerprints from the test audio signal.

Further, a correlator, such as the correlator 312, is provided in order to correlate the sequence of binary values with different reference fingerprints provided in a reference database, wherein the reference database comprises, for every reference fingerprint, information about an audio signal associated with the reference fingerprint.

Based on these different correlations, i.e. the correlations of the test audio signal fingerprint, in the form of a sequence of 1-bit values, with the different reference fingerprints of the reference database, information about the test audio signal can be obtained.

Information about the test audio signal is, for example, an identification of the audio signal, for example the name of the piece and possibly the author, on what CD or which sound carrier this piece can be found, and where it can be ordered. An alternative characterization of an audio signal is to identify a test audio signal, for example, as an audio signal of a specific stylistic period or a specific style, or to identify the same as being from a certain band. Such a characterization can be made, for example, by determining not only qualitatively, but also quantitatively how the reference fingerprint relates to the test fingerprint or which distance exists between the two. This matching of fingerprint sequences or calculating the quantitative distance of fingerprint sequences, respectively, can take place, for example, after a correlation has taken place in order to eliminate the time offset between the reference fingerprint and the test fingerprint.
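A quantitative distance between fingerprint sequences can, for instance, be taken as the number of differing bits. The following sketch assumes that the time offset has already been removed, that the reference database is a plain dictionary from a label to a bit sequence, and that the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long bit sequences differ."""
    return sum(x ^ y for x, y in zip(a, b))

def identify(test_bits, reference_db):
    """Return the label of the closest reference fingerprint and its distance."""
    label = min(
        reference_db, key=lambda k: hamming_distance(test_bits, reference_db[k])
    )
    return label, hamming_distance(test_bits, reference_db[label])
```

The returned distance can serve as the quantitative measure mentioned above: a small distance suggests an identification, while moderate distances to several references might only support a coarser classification.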

Depending on the circumstances, the inventive method can be implemented in hardware or in software. The implementation can be made on a digital storage medium, in particular a disc, CD or DVD with electronically readable control signals that can cooperate with a programmable computer system such that the method is performed. Hence, generally, the invention also consists of a computer program product having a program code stored on a machine-readable carrier for performing the inventive method when the computer program product runs on a computer. In other words, the invention can be realized as a computer program having a program code for performing the method when the computer program runs on a computer.

While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
