The Moby Project is a collection of public-domain lexical resources created by Grady Ward. The resources were dedicated to the public domain and are now mirrored at Project Gutenberg. As of 2007, it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations.
Hyphenator
The Moby Hyphenator II contains the hyphenations of 187,175 words and phrases (including 9,752 entries where no hyphenations are given, such as through and avoir). The character encoding appears to be MacRoman, and hyphenation is indicated by a bullet (character value 165 decimal, or A5 hexadecimal). Some entries, however, have a combination of actual hyphens and character 165, such as "bar•ber-sur•geon".
There is little to no documentation of the hyphenation choices made; the following examples might give some flavour of the style of hyphenation used: at•mos•phere; at•tend•ant; ca•pac•i•ty; un•col•or•a•ble.
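As a minimal sketch of reading this format in Python, assuming each entry arrives as raw MacRoman bytes (the encoding is only inferred, as noted above): decoding with Python's mac_roman codec turns byte A5 into the bullet character U+2022, which then serves as the split point. The helper name split_hyphenation is hypothetical, and literal hyphens are deliberately left inside their fragments.

```python
BULLET = "\u2022"  # MacRoman byte 0xA5 decodes to this character


def split_hyphenation(raw: bytes) -> list[str]:
    """Decode one entry (assumed MacRoman) and split it at the bullet marks."""
    text = raw.decode("mac_roman").strip()
    return text.split(BULLET)


# The mixed hyphen/bullet entry quoted in the text above.
entry = b"bar\xa5ber-sur\xa5geon"
print(split_hyphenation(entry))  # ['bar', 'ber-sur', 'geon']
```

An entry with no hyphenation given, such as through, comes back as a single-element list.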
Language
Moby Language II contains wordlists for five languages: French, German, Italian, Japanese, and Spanish.
However, some of the lists are contaminated: for example, the Japanese list contains English words such as abnormal and non-words such as abcdefgh and m,./. The lists are also sorted inconsistently: the French list is a straight alphabetical listing, while the German list gives the traditionally capitalized words in alphabetical order, followed by the traditionally lower-cased words in alphabetical order. The Italian list, however, contains no capitalized words whatsoever.
The foreign-language lists do not use accented characters, so a user would look up the French word être ("to be") as "e^tre".
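The "e^tre" example suggests a letter-plus-caret notation for the circumflex. A rough Python sketch of undoing it follows. Only the circumflex is attested by the example above; how other diacritics are written in the lists is not stated here, so they are not handled. The function name restore_circumflex is hypothetical.

```python
import re
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def restore_circumflex(word: str) -> str:
    """Turn each 'letter^' pair into the accented letter (assumed convention)."""
    # Swap the caret for a combining circumflex placed after the letter...
    marked = re.sub(
        r"([A-Za-z])\^",
        lambda m: m.group(1) + COMBINING_CIRCUMFLEX,
        word,
    )
    # ...then compose letter + combining mark into a single code point.
    return unicodedata.normalize("NFC", marked)


print(restore_circumflex("e^tre"))  # être
```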
Part-of-Speech
Moby Part-of-Speech contains 233,356 words fully described by part(s) of speech, listed in priority order. The format of the file is word\parts-of-speech, with the following parts of speech being identified:
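The word\parts-of-speech layout can be split on the backslash, as in the following Python sketch. The function name parse_pos_line and the sample codes X and Y are placeholders, not actual Moby codes. The sketch returns the code field as an unparsed string because the width of an individual code is not stated in this section.

```python
def parse_pos_line(line: str) -> tuple[str, str]:
    """Split one 'word\\parts-of-speech' line into (word, code string)."""
    # partition() splits at the first backslash only, so the code
    # field is returned exactly as it appears in the file.
    word, _, codes = line.rstrip("\r\n").partition("\\")
    return word, codes


# Placeholder entry: 'X' and 'Y' stand in for real part-of-speech codes.
word, codes = parse_pos_line("example\\XY\n")
print(word, codes)  # example XY
```

Because the parts of speech are listed in priority order, the start of the code string corresponds to the highest-priority part of speech.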
Pronunciator
The Moby Pronunciator II contains 177,267 words with corresponding pronunciations. The Project Gutenberg distribution also contains a copy of the cmudict v0.3. The file follows the format word[/part-of-speech] pronunciation. The part-of-speech field is used to disambiguate 770 of the words which have differing pronunciations depending on their part-of-speech. For example, for the words spelled close, the verb is pronounced /kloʊz/, whereas the adjective is pronounced /kloʊs/. The parts-of-speech have been assigned the following codes:
Following this is the pronunciation. Several special symbols are present:
The rest of the symbols are used to represent IPA characters, according to the following table:
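The word[/part-of-speech] pronunciation layout described in this section can be taken apart roughly as follows. This Python sketch assumes that whitespace separates the headword field from the pronunciation and that a slash in the headword field only ever introduces the part-of-speech code. The names PronEntry and parse_pron_line are hypothetical, and the sample lines use placeholder strings rather than real Moby codes or symbols.

```python
from typing import NamedTuple, Optional


class PronEntry(NamedTuple):
    word: str
    pos: Optional[str]   # None when the entry has no part-of-speech tag
    pronunciation: str   # left in the file's own symbol notation


def parse_pron_line(line: str) -> PronEntry:
    """Split one 'word[/part-of-speech] pronunciation' line."""
    # Assumption: the first run of whitespace ends the headword field.
    head, pronunciation = line.strip().split(None, 1)
    word, _, pos = head.partition("/")
    return PronEntry(word, pos or None, pronunciation)


# Placeholder data: 'X' and the PRON strings are not real Moby values.
print(parse_pron_line("close/X PRON1"))
print(parse_pron_line("word PRON2"))
```

Keeping the part of speech as an optional field lets the 770 disambiguated words and the untagged majority share one record type.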
Shakespeare
Moby Shakespeare contains the complete unabridged works of Shakespeare. This specific resource is not available from Project Gutenberg.
Thesaurus
The Moby Thesaurus II contains 30,260 root words, with 2,520,264 synonyms and related terms, an average of 83.3 per root word. Each line consists of a list of comma-separated values, with the first term being the root word, and all following words being related terms.
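The line format described above is simple enough to read with a plain split, as this Python sketch shows. The function name parse_thesaurus_line and the sample line are made up for illustration. The last statement re-derives the quoted average from the two totals.

```python
def parse_thesaurus_line(line: str) -> tuple[str, list[str]]:
    """Split one comma-separated line into (root word, related terms)."""
    root, *related = line.rstrip("\r\n").split(",")
    return root, related


# Illustrative line, not taken from the actual file.
root, related = parse_thesaurus_line("root,first term,second term\n")
print(root)     # root
print(related)  # ['first term', 'second term']

# Average related terms per root word, from the totals quoted above.
print(round(2_520_264 / 30_260, 1))  # 83.3
```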
Grady Ward placed this thesaurus in the public domain in 1996. It is also available as a Debian package.
Words
Moby Words II is the largest wordlist in the world. The distribution consists of the following 16 files:
External links
- Moby Project homepage
- Project Gutenberg downloads
- Searching for Rhymes with Perl; corresponding code
- Conversion to relational database (Dead link)