Sound Recognition - Thesis: 2008

Monday, March 24, 2008

Sound Kills

This weekend I threw myself into the Java Sound world to figure out how to get the pitch (frequency) of a wave.
I've read and read and read, and after that I read even more! There is a lot of information on the web about this. I found out that the best way of getting the frequency is by applying the Fast Fourier Transform (FFT). This I had known for a while, but this time I took a deeper look at what the FFT is and, oh my god, I thought I was good at maths, but this took me soooo long to figure out. I mean, some pages made it sound so easy at the beginning and then complicated things one step at a time, to the point where I felt I was reading gibberish.
One of the best sites I found that explains the FFT is http://www.relisoft.com/Science/Physics/sound.html. It even gave me the code and a running application in Visual C++ that implements it. As I said, it takes patience, pen and paper to understand, but in the end I think I did get it.
Really, the FFT is just a faster way to compute the Discrete Fourier Transform (DFT).
Figure I shows the DFT equation.

[Figure I: the DFT equation]
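
For reference, the DFT of N samples x_0 ... x_{N-1} is usually written as below. This is the standard textbook notation, so the symbols are an assumption and may differ from the ones in Figure I:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, k n / N}, \qquad k = 0, 1, \dots, N-1
```

The FFT computes exactly these X_k values, only faster: on the order of N log N operations instead of N^2.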

Getting FFT code for Java was easy: just google "FFT" and "Java" and you will get many versions. I chose one that also implements a Complex class, which makes it more understandable.

Once I got this, I needed to figure out how to join the audio reading part with the FFT.
Most pages talk about sampling the sound and then feeding the values to the FFT, but none of them says how to do the sampling.
After many pages and reading almost everywhere, I came to the conclusion that the samples are simply the values read from the audio file. In other words, sampling is when I do:
"nBytesRead = audioInputStream.read(abData, 0, abData.length);"
If you want to read more about how to do this, and see some sample code, check out this page: http://forum.java.sun.com/thread.jspa?threadID=504397&start=0&tstart=0
I believe that is the one that helped me the most for this part, and now I am getting the frequencies for each of the sample blocks I obtain.
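
One detail worth spelling out: read() fills abData with raw bytes, and those bytes still have to be combined into sample values before they go into the FFT. Below is a minimal sketch of that step. It assumes the stream is 16-bit signed little-endian mono PCM (other formats need different decoding), and the class PcmSamples and its method toSamples are hypothetical names, not part of the Java Sound API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: turns the raw bytes from audioInputStream.read(...) into samples. */
class PcmSamples {

    /**
     * Decodes 16-bit signed little-endian mono PCM (an assumption about the format)
     * into doubles in [-1, 1). Only the first nBytesRead bytes of abData are used.
     */
    static double[] toSamples(byte[] abData, int nBytesRead) {
        int sampleCount = Math.max(0, nBytesRead) / 2;   // 2 bytes per 16-bit sample
        double[] samples = new double[sampleCount];
        ByteBuffer buffer = ByteBuffer.wrap(abData, 0, sampleCount * 2)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < sampleCount; i++) {
            samples[i] = buffer.getShort() / 32768.0;    // scale to [-1, 1)
        }
        return samples;
    }

    public static void main(String[] args) {
        // Two samples: 0x4000 (16384) and 0x8000 (-32768), low byte first.
        byte[] abData = {0x00, 0x40, 0x00, (byte) 0x80};
        double[] samples = toSamples(abData, abData.length);
        System.out.println(samples[0] + " " + samples[1]); // prints "0.5 -1.0"
    }
}
```

The resulting double[] is what would be handed to the FFT (as the real parts, with the imaginary parts set to zero).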

My current problem is that I don't know how to calculate the size of the sampling array that will be fed to the FFT method. At the moment I am using 4096.
Well, as far as I understand now, the array length should be a power of 2, that is 2, 4, 8, 16...1024, 2048, 4096, etc.
The thing is, how do I relate the size of the array to the time in milliseconds that we want (that is, 10 ms) for every sampling? And does the length depend on the file format (*.raw, *.wav, *.mp3, etc.)? ...The latter is probably not true, but I can neither prove nor disprove it.
I will try to find that out, but if you have any ideas, please tell me.
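
A rough sketch of how the two could be related: the number of samples in a time slice is the sample rate times the duration, and the number of bytes is that times the frame size. The format used here (44.1 kHz, 16-bit, mono) is only an assumption for the example, and the class WindowSize and its methods are hypothetical helpers. In the real program the AudioFormat would come from audioInputStream.getFormat().

```java
import javax.sound.sampled.AudioFormat;

/** Hypothetical helper relating a duration in milliseconds to frames and bytes. */
class WindowSize {

    /** Number of sample frames that span the given duration. */
    static int framesFor(AudioFormat format, double milliseconds) {
        return (int) Math.round(format.getFrameRate() * milliseconds / 1000.0);
    }

    /** Number of bytes to read for that many frames. */
    static int bytesFor(AudioFormat format, int frames) {
        return frames * format.getFrameSize();
    }

    /** Smallest power of two >= n (for n >= 1), usable as an FFT length. */
    static int nextPowerOfTwo(int n) {
        int p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    /** Duration in milliseconds covered by the given number of frames. */
    static double millisecondsFor(AudioFormat format, int frames) {
        return 1000.0 * frames / format.getFrameRate();
    }

    public static void main(String[] args) {
        // Assumed format: 44.1 kHz, 16-bit, mono, signed, little-endian PCM.
        AudioFormat format = new AudioFormat(44100f, 16, 1, true, false);
        int frames = framesFor(format, 10.0);
        System.out.println("frames for 10 ms:   " + frames);
        System.out.println("bytes for 10 ms:    " + bytesFor(format, frames));
        System.out.println("FFT length to use:  " + nextPowerOfTwo(frames));
        System.out.println("ms in 4096 frames:  " + millisecondsFor(format, 4096));
    }
}
```

Under these assumptions 10 ms is 441 frames (882 bytes), which would be zero-padded up to 512 for the FFT, while the current 4096-sample array covers roughly 93 ms. This also suggests the length depends on the decoded format (sample rate, bits per sample, channels) rather than on the file extension.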

I was also looking at the audio format we should use, and I think it should be MP3. I was hoping WAV would work, but for some reason, when I record my voice using Audacity, export it as a WAV file and then play it back with my Java program, I get a "mark/reset not supported" error.
At first I thought it was a problem with my program, but it seems to be a bug in the Java Sound API. The reported bug ID is: 6408764.
Don't get me wrong, I have some WAV files that do work; it is just that I cannot generate a WAV file with Audacity that works with the Java API, and because of this I had to rule out the WAV format. If you know of another audio recording application that can generate WAV files without triggering the mark/reset error, then we can consider it again, but for now MP3 is a good option: it can be generated with Audacity and it does work with the Java API.

For now I think I can say I found the frequency for a sampling array of 4096. And it is not that slow!

NEXT STEP

Playback changing the frequency!
Making the application run under Ubuntu.
Write about my findings for the thesis.

PENDING QUESTIONS

And as a reminder for myself, these are the questions pending for this part:

The frequency values that I am getting, are they correct or coherent? I need a comparison reference to verify them.
How do I relate milliseconds to the number of bytes that should be sampled (read)?
When reading a wav, mp3, wma, or any other audio format with the Java Audio API (JMF), will it give me the same values? The question comes up because each format has a different level of compression (different sizes), so when the API reads a file, does it decompress it and return the same, or almost the same, values as it would for the other formats?
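
For the first question, one possible comparison reference is a synthetic tone: generate a sine wave of a known frequency, run it through the analysis, and see whether the detected peak lands on it. The sketch below does that with a direct DFT rather than the FFT class I am actually using (slower, but short enough to check by hand). The class PitchCheck, its method names and the 44.1 kHz sample rate are all assumptions made for the example.

```java
/** Hypothetical self-check: does the spectrum peak match a known test tone? */
class PitchCheck {

    /** Generates count samples of a sine wave at the given frequency. */
    static double[] sine(double frequencyHz, double sampleRate, int count) {
        double[] samples = new double[count];
        for (int n = 0; n < count; n++) {
            samples[n] = Math.sin(2.0 * Math.PI * frequencyHz * n / sampleRate);
        }
        return samples;
    }

    /** Center frequency of the strongest bin, computed with a direct DFT (not an FFT). */
    static double peakFrequency(double[] samples, double sampleRate) {
        int size = samples.length;
        int bestBin = 1;
        double bestPower = -1.0;
        for (int k = 1; k < size / 2; k++) {            // skip DC, stay below Nyquist
            double re = 0.0;
            double im = 0.0;
            for (int n = 0; n < size; n++) {
                double angle = 2.0 * Math.PI * k * n / size;
                re += samples[n] * Math.cos(angle);
                im -= samples[n] * Math.sin(angle);
            }
            double power = re * re + im * im;           // squared magnitude of bin k
            if (power > bestPower) {
                bestPower = power;
                bestBin = k;
            }
        }
        return bestBin * sampleRate / size;             // bin index -> frequency in Hz
    }

    public static void main(String[] args) {
        double sampleRate = 44100.0;                    // assumed sample rate
        double[] tone = sine(440.0, sampleRate, 4096);  // concert A, 4096 samples
        System.out.println("detected peak: " + peakFrequency(tone, sampleRate) + " Hz");
    }
}
```

With 4096 samples at 44.1 kHz each bin is about 10.8 Hz wide, so the detected value can only be expected to land within about one bin of 440 Hz, not exactly on it. If the real FFT code gives a similar answer for the same tone, that is at least a sign the values are coherent.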

Sunday, March 16, 2008

Should have learned bash....

So I looked at the corpus and, since we wanted to shorten it a bit more but had no really good selection mechanism, I created a small program to do the selection for us :) I hacked it together as a short Java app. However, it reminds me that one day I should really learn to do this kind of simple text parsing/filtering with a nice scripting language. Perl, or just bash, would be nice to know. At least the basics... oh well, another time, I mean the Java program works! Committed both the filtered corpus and the program to the subversion repository... check it out! Will work a bit on the thesis text now.

Thursday, March 6, 2008

Creating a Corpus

I've been trying to create a corpus for more than a week now.
The reason it is taking so long is that I need to pick words from a pronouncing dictionary which I found at the CMU web site. The version I picked was "cmudict.0.1" because it is the shortest one of them all. It only has around 100,000 words. From all of those I am picking around 2000 words.

This process is not easy, it is long and booooooooring! But it also will not end when I finish, because I am sure that my first output will go past the 2000 words. So I was thinking that I can give this first output to you, so you can use criteria of your own to cut down the list.

The criteria I am currently using are:
- Short words
- As many different sounds as possible
- Mixtures of ups and downs within the same word
- Different endings, that is: ..s, ed, c, p, and all sorts of other short sounds.
- Words that are written one way but pronounced another (this criterion might not be too good... need some debate on this point)
- Common words
- Our names :P
- Words used in games

Today I finally finished it. I am sooo happy. The only problem is that now instead of having 2000 words I ended up with 3500. But it is a good start, I mean, I began with 100 000 words!!
Now dude, it is your turn to have a look at them
Have fun! :P ;)

Tuesday, March 4, 2008

Mastering the MASTER_GAIN control

MASTER GAIN ... Part II
Thanks for opening my eyes to the fact that MASTER_GAIN works on decibels (dB).
After reading your post I looked deeper into this control and found the Java class specification for FloatControl.Type (http://java.sun.com/j2se/1.4.2/docs/api/javax/sound/sampled/FloatControl.Type.html), where it describes what each of the controls means.
Besides the fact that the GAIN is in dB, it also states that it follows a logarithmic curve.
It gives a conversion function, from a linear value [0, 2] to the GAIN range [-80, 6].
These two ranges are rounded to whole numbers. The real GAIN range is [-80, 6.0206] and the conversion function is:

linearValue = pow(10.0, gainDB/20.0) ..(I)

and if I invert it I get:

gainDB = 20*log10(linearValue) ..(II)

So, if we look at the base-10 logarithmic curve (green line), we can see that X values in [0, 2] cover the whole gain range, and that only values above 1 give a positive sound gain (at X = 1 the gain is exactly 0 dB). We can also see that for values in [0, 1) the curve drops really fast, while for values in [1, 2] it grows more slowly, smoothly and almost linearly. So we can expect big differences between the dB values when X is below 1, but smaller ones when X is above 1.

So, to make this easier for us, I created a conversion method. As input I give the volume as a percentage, that is, an integer in [0, 100], together with the min and max values of the Gain control (see III), and with some magic (rescaling and applying the conversion function, in its inverted form II) I get a value in dB.

The method is something like this:
private static float getVolumeIndB(int linearValue, float minGainValue, float maxGainValue) .. (III)

So if I set 100 as the linear value, I get 6.0206, which is the loudest sound, and if I set 0 I get no sound at all: method III returns -80 dB.
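
The "magic" could look roughly like the sketch below. This is a reconstruction under assumptions, not necessarily the exact method: it rescales the percentage from [0, 100] to the linear range [0, 2], applies equation II, and clamps the result to the control's range. The class name GainConversion is hypothetical.

```java
/** Hypothetical sketch of conversion method III: percentage -> gain in dB. */
class GainConversion {

    /**
     * Rescales linearValue from [0, 100] to [0, 2], applies
     * gainDB = 20 * log10(x), and clamps to [minGainValue, maxGainValue].
     */
    static float getVolumeIndB(int linearValue, float minGainValue, float maxGainValue) {
        int percent = Math.max(0, Math.min(100, linearValue));
        double scaled = percent / 50.0;                  // 0..100 -> 0.0..2.0
        if (scaled == 0.0) {
            return minGainValue;                         // log10(0) is -infinity
        }
        double gainDB = 20.0 * Math.log10(scaled);       // equation II
        return (float) Math.max(minGainValue, Math.min(maxGainValue, gainDB));
    }

    public static void main(String[] args) {
        // Range taken from the text: [-80, 6.0206] dB.
        System.out.println(getVolumeIndB(0, -80f, 6.0206f));    // silence
        System.out.println(getVolumeIndB(50, -80f, 6.0206f));   // unity gain, 0 dB
        System.out.println(getVolumeIndB(100, -80f, 6.0206f));  // loudest
    }
}
```

In the player, the result would be passed to the MASTER_GAIN FloatControl with setValue(), using the control's own getMinimum() and getMaximum() as the two bounds. Note that with this mapping 50% already means "no change" (0 dB), and everything below it gets quiet very quickly.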

ESSAY
I've read the document you sent me. Mostly it is ok, although it seems that some parts should not be in the introduction but somewhere else. I will reserve my comments for now, but I would like to read your new version, and if possible please also send me the editable files, so that if I need to change something I can do so.
Thanks

ps. I will make sure by tomorrow we get the SVN working from the Internet and not only in the internal LAN. Sorry for that delay.

CORPUS
I will now continue with the corpus, and will send it or upload it to the SVN as soon as I am done.

;)

Sunday, March 2, 2008

Volume Manipulation... maybe

Finally it worked.
I managed to manipulate the volume of an audio file. I tried with MP3 and WAV and it worked.
The thing that I don't understand yet is that I am not controlling the control called "VOLUME".
Let me say it again:
I can manipulate some controls. When I print out which controls I am allowed to manipulate, I get:
- Master Gain
- Mute
- Balance
- Pan

As you can see from the list, the Volume control is not there.
So I tried each one of them and found out that PAN makes the sound come from a specific side, that is, I can specify whether the full sound will come from the left speaker, the right speaker, 50/50, or for example 20% from the left and 80% from the right. MASTER_GAIN, on the other hand, lets me play with the volume, but it is weird. If I put it on a scale from 0 (no sound) to 100 (full sound), when I reach around 70 I am not able to hear anything. And it is because of this behavior that I am not confident this control is the one handling the volume.
Do you have any idea what MASTER_GAIN means?

I will finish the corpus this week. I am half done.

We will talk later about that when I send it to you.
And I will also send you some pointers on how to write the thesis, which I got from uni.

Friday, February 29, 2008

Mumbai hotel day...

Soooo I finally managed to cram some writing out of my mind and onto... well, not paper, but at least its digital representation. So far it's not very good, but hey, that can wait as long as I get something down, right? Basically I've started on the introduction and background things. I tend to find things I need to cite from someone, so it's slow, but I'm filling up my .bib file :) I'm not 100% sure how to write a thesis, but I hope that once we get through the background stuff we won't need to do much more citing, since we are mainly developing something new.

On the issue of the phoneme corpus thing. As long as we have a marked-up speech corpus (doesn't really matter what it contains as long as it's a good mix of sounds) we should be able to use that, right? If we have the phoneme markup for such a corpus we can record our own, by just pronouncing those phoneme sequences. 2000 words might sound like a lot, but it's really just about 4 pages of text.

Anyways, now I'm going to head to the gym for a while so that when I get back to Sweden I'll be buff like the all-conquering Schwarzenegger (in his youth)!!! Muwahaha! :) Oh, I'll post up some pics from here too.

Pictures from India

A young monkey living in the Sanjay Gandhi National Park (not 100% sure on the name... of the park, not the monkey... I named the monkey Ronald Reagan... maybe he can grow up and one day create some kind of "Monkeygate" scandal? One can always dream! Come to think of it, it could be a girl monkey. Well, to me Ronald is gender neutral, so there you go :) )

<-- Picture is taken on Elephanta Island. I am standing in front of Shiva, God of Destruction. Most (all?) of the images of Shiva were destroyed/mutilated by the Portuguese when they came to India. That, if anything, is the definition of irony!

There is a lion in this picture. I also like to think that the image showcases my photographic craftsmanship... hehe

Gateway of India and the Taj seen from a boat. The Gateway of India is apparently a symbol of Indian independence. It was built for some English king, and the last English soldiers marched out of India through it!

Wednesday, February 27, 2008

Starting 0.1

Hej
After some "vacations" (not really, just poor time management) I am back and doing some stuff on the thesis.

SEARCH FOR A DICTIONARY
I started to look at some dictionaries at CMU (http://www.speech.cs.cmu.edu/cgi-bin/cmudict).
I took the one called cmudict.0.1 which is the "smallest one", it only has around 100,000 words. I am cutting them down and I hope my first draft will not go above 2000 words, but I think I should cut it down to 1000 in the end.

Probably I am just going nuts and should read up a bit more to specify a better corpus, but the reason I am doing this now is to have a big group of words we can choose from, and also to have as many mixtures of sounds as possible.

Recording 2000 sounds might be a "little" bit unrealistic, but we should talk about that.
I have a dictionary of about 25 (maybe more) words which are numbers. (that one we should do!!).

Question: Will we need a corpus only on the phonemes? I guess we will. Do you have an idea of where we could find one for free?

AUDIO FORMAT
This is another matter I was supposed to check.
I was looking into the RAW format and could find no good links or tutorials for handling it.
So what I am doing now is creating a prototype with JMF (Java Media Framework) to play with, and trying to get the pitch and control the volume.

I found a cool website that might help me with this. It is called jsResources.org (http://www.jsresources.org/faq_audio.html) and from there I am trying to create something we could use.
Initially I want to control the volume, and finally I will do the pitch, for which it seems I need to implement a "fast Fourier transform".

This Blog
I started this blog for 2 reasons. One is to force myself to work, that is, I will write comments, questions and findings here. The other reason is for you to know what I am doing, where I am getting stuck and what decisions I am taking.
