Phonem: Flexible class to convert text into a phoneme.

Version Information
Version: 0.3.2, updated: 2010-07-04
Developed with Delphi 5 and Perl
Download source code (Delphi, Perl) and phonem.exe (command line tool).
Content
For Delphi: TPhonem: A class derived from a TStringList that converts text into a phoneme.
For Perl: I included a small Perl module, phonem.pm, with two subs: one to phonemize and one to initialize the phonemization table. See phonem.pm for comments.
A replacement table for German is implemented as default. For other languages, you need to devise other tables.
phonem.exe is a command line tool that reads text from standard input, transforms it to phonemes and writes the result to standard output.
Licence
uPhonem.pas, phonem.exe and phonem.pm are freeware and come without any warranty or support; you use them at your own risk. You may even use them in commercial products as long as you mention me as the original author somewhere. If you modify the source code, you must keep the original copyright notice and publish the result under the same licence. For the exact conditions, see LICENCE.txt, which is contained in the download package.
I am looking forward to feedback via email.

TPhonem (Delphi) / phonemize() (Perl)

TPhonem is a class that allows you to do phonetic searches with Delphi. It converts a given input text to a phoneme. The algorithm is quite simple; the parameters are the hard part. Basically, I just convert character sequences to other sequences. Michael Abmayer and I once implemented in VBA the algorithm that was described in principle in the German computer magazine c’t (around 1990 or so).

Michael modified the replacement table to improve the results for German family names. This table is the default replacement table. You will not be happy if you want to phonemize languages other than German, unless you have a suitable replacement table.

So I ask anyone who has replacement tables that fit certain languages or needs to send them to me so that I can offer them for download here.

Examples: The names Mayer, Meier, Maier and Mayr are all variants that are phonetically identical. TPhonem converts all of them (with the default replacement table) to MAYR. Another example is Haydn vs. Heyden: both names are mapped to HAYDM.

Tweaking the replacement table is not only allowed but encouraged. Which table suits best depends very much on the words that you want to compare. To load an alternate replacement table, use the LoadFromFile method in the Delphi class or loadPhonemTableFromFile() in Perl.

The Perl implementation handles only scalars (strings), hash references and array references. In array references, all elements of the array are replaced with their phonemes. In hash references, all values (not the keys) are replaced with their phonemes.
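
The dispatch on the kind of input can be illustrated with a short Python sketch. This is not the Perl code from phonem.pm: the function name phonemize_any, the convert parameter and the in-place update of lists and dicts are assumptions made for the example, with Python lists and dicts standing in for Perl array and hash references.

```python
def phonemize_any(data, convert):
    """Apply `convert` to a string, to all items of a list,
    or to all values (not the keys) of a dict.

    `convert` is a stand-in for the real phonemization routine.
    Lists and dicts are modified in place and returned.
    """
    if isinstance(data, str):
        return convert(data)
    if isinstance(data, list):
        # Replace every element with its converted form.
        data[:] = [convert(item) for item in data]
        return data
    if isinstance(data, dict):
        # Only the values change; the keys stay as they are.
        for key in data:
            data[key] = convert(data[key])
        return data
    raise TypeError("unsupported input type: %s" % type(data).__name__)
```

In the example any string function can be passed as convert; a real phonemization routine would take its place.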

Technical details

About the algorithm:

  1. remove leading and trailing whitespace
  2. transform input string to upper case
  3. eliminate double, triple, … characters (ss, tt, and friends)
  4. apply each item of the translation table in the given order to the whole string as long as there are matches
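
As a rough illustration, the four steps can be sketched in Python as follows. This is not the Delphi or Perl code from the package. The function name, the guard against rules that would never terminate and the tiny TOY_RULES table are assumptions made for the example; in particular, TOY_RULES is not the default German table.

```python
import re

# Hypothetical toy rules for illustration only.
# This is NOT the default replacement table shipped with the package.
TOY_RULES = [("EI", "AY"), ("AI", "AY"), ("EY", "AY"), ("AYE", "AY")]


def phonemize(text, rules=TOY_RULES):
    s = text.strip()                    # 1. remove leading/trailing whitespace
    s = s.upper()                       # 2. transform to upper case
    s = re.sub(r"(.)\1+", r"\1", s)     # 3. collapse runs of the same character
    for match, replace in rules:        # 4. apply the rules in table order
        # Assumed guard: skip empty matches and rules whose replacement
        # contains the match, since those would loop forever.
        if not match or match in replace:
            continue
        while match in s:               # repeat as long as there are matches
            s = s.replace(match, replace)
    return s
```

With these toy rules, phonemize("Mayer"), phonemize("Meier"), phonemize("Maier") and phonemize("Mayr") all return "MAYR". Results for other words depend entirely on the table that is supplied.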

Although I have optimized the implementation a little, I advise against using TPhonem to transcribe large texts at once.

About the replacement table: The table is nothing but a list of match=replace pairs. Any occurrence of match in the input string is replaced with replace, in the order of the table. This is where you can influence the results of TPhonem.
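
A table in this form could be read into an ordered list of pairs as sketched below in Python. The sketch assumes one pair per line, split at the first "=". The function name parse_table and the skipping of lines without a match are assumptions for the example, not a description of what LoadFromFile or loadPhonemTableFromFile() do.

```python
def parse_table(text):
    """Turn lines of the form match=replace into an ordered list of
    (match, replace) tuples. Table order is preserved because the
    rules are applied in that order."""
    rules = []
    for line in text.splitlines():
        if "=" not in line:
            continue                    # not a match=replace pair
        match, replace = line.split("=", 1)
        if match:                       # ignore pairs with an empty match
            rules.append((match, replace))
    return rules
```

An empty replace is kept, so a line such as "H=" would delete every H. The sample rules used to try this out are invented and not taken from the default table.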

Changelog

0.3.2 — [ 2010-04-07 ]
Added pre-compiled phonem.exe to the package (no other changes, thus no new version number)
0.3.2 — [ 2007-12-20 ]
Included cleanup routine after LoadFromFile and LoadFromStream
Yet another change in the replacement table
Shipped the default list with the package as external file
Added Perl implementation
0.3.1 — [ 2007-12-13 ]
Submission by Dieter Dasberg: Flaw fix and new rules
Minor change in SetInput
0.3 — [ 2004-01-18 ]
Submissions by Peter Tiemann: Bug fix and new rules
0.2.3 — [ 2003-10-22 ]
Introduced some promising but experimental replacement rules
Some bug fixes (empty match, setlength)
0.2.1 — [ 2003-10-21 ]
Tweaked the replacement list a little bit
Better removal of duplicates (not only A-Z) (Thomas Bornhaupt)
0.2 — [ 2003-10-21 ]
Performance improvement in PhonemReplace
0.1 — [ 2003-10-20 ]
First implementation as TStringList descendant based on the VBA implementation that Michael and I created for the Niederösterreichischen Seniorenbund

Thanks

I want to thank Thomas Bornhaupt for providing a much faster routine for eliminating duplicate characters, and Marian Aldenhövel for his valuable comments.
Peter Tiemann has submitted some rules and a bug fix.
Dieter Dasberg has also submitted new rules and an improvement suggestion for D7.