Thursday, March 6, 2008

Creating a Corpus

I've been trying to create a corpus for more than a week now.
The reason it is taking so long is that I need to pick words from a pronouncing dictionary I found on the CMU web site. I picked the "cmudict.0.1" version because it is the shortest of them all: it only has around 100,000 words. From all of those I am picking around 2000.

This process is not easy; it is long and booooooooring! It also will not end when I finish, because I am sure my first output will go past 2000 words. So I was thinking I can give this first output to you, so you can use criteria of your own to cut the list down.

The criteria I am currently using are:
- Short words
- As many different sounds as possible
- Mixtures of ups and downs within the same word
- Different endings, that is: ..s, ed, c, p, and all sorts of other short sounds.
- Words that are written one way but pronounced another (this criterion might not be too good... needs some debate on this point)
- Common words
- Our names :P
- Words used in games
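
A first pass over some of these criteria could probably be automated before the hand picking. Below is a rough Python sketch, not what I actually did, covering only the "short words", "different sounds" and "different endings" points. It assumes each dictionary line looks like `WORD  PH1 PH2 ...` and that comment lines start with `;;;`, which may not match cmudict.0.1 exactly. The names `parse_entries`, `shortlist`, `max_phonemes` and `target` are made up for this example.

```python
def parse_entries(lines):
    """Yield (word, phonemes) pairs from dictionary lines.

    Assumed format: 'WORD  PH1 PH2 ...', with ';;;' starting a comment line.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, *phonemes = line.split()
        if phonemes:
            yield word, phonemes


def shortlist(entries, max_phonemes=4, target=2000):
    """Greedily keep short words that add a new sound or a new ending."""
    seen_phonemes = set()
    seen_endings = set()
    picked = []
    for word, phonemes in entries:
        # "Short words": skip anything with too many phonemes.
        if len(phonemes) > max_phonemes:
            continue
        # Drop stress digits (e.g. AE1 -> AE) in case the file has them.
        bare = [p.rstrip("012") for p in phonemes]
        ending = bare[-1]
        new_sound = not set(bare) <= seen_phonemes
        new_ending = ending not in seen_endings
        if new_sound or new_ending:
            picked.append(word)
            seen_phonemes.update(bare)
            seen_endings.add(ending)
        if len(picked) >= target:
            break
    return picked
```

Something like `shortlist(parse_entries(open("cmudict.0.1")))` would then give a starting list, assuming the file sits in the current folder under that name. Common words, our names and game words would still have to be added by hand.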

Today I finally finished it. I am sooo happy. The only problem is that instead of 2000 words I ended up with 3500. But it is a good start. I mean, I began with 100,000 words!!
Now dude, it is your turn to have a look at them.
Have fun! :P ;)
