wordaxe : Hyphenation by decomposition of compound words
The wordaxe library (formerly known as deco-cow)
provides Python programs with the ability
to automatically hyphenate words using an algorithm
which is based on decomposition of compound words into base words,
and is named DCWHyphenator in the code.
Currently, the DCWHyphenator supports only German.
Other Germanic languages such as Dutch and Danish
that make heavy use of compound words should profit from the
algorithm as well, as soon as someone provides a word base.
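To make the idea of decomposition concrete, here is a minimal sketch in Python. It is not the DCWHyphenator code: the tiny word list and the function names `decompose` and `hyphenate_compound` are assumptions made up for this illustration, and prefixes and suffixes are ignored entirely.

```python
# Toy word base, invented for this example. A real word base is far larger
# and also lists prefixes and suffixes, which this sketch does not handle.
BASE_WORDS = {"dampf", "schiff", "fahrt", "haus", "boot"}


def decompose(word, base_words=BASE_WORDS):
    """Return the list of base words that concatenate to `word`, or None.

    Matching is case-insensitive; the returned pieces keep the original case.
    """
    if not word:
        return []
    lower = word.lower()
    # Try the longest leading base word first, then recurse on the rest.
    for end in range(len(lower), 0, -1):
        if lower[:end] in base_words:
            rest = decompose(word[end:], base_words)
            if rest is not None:
                return [word[:end]] + rest
    return None


def hyphenate_compound(word):
    """Mark the boundaries between base words with '-'.

    Words that cannot be decomposed are returned unchanged.
    """
    parts = decompose(word)
    return word if parts is None else "-".join(parts)


if __name__ == "__main__":
    for sample in ("Dampfschifffahrt", "Hausboot", "Python"):
        print(sample, "->", hyphenate_compound(sample))
```

A word that contains anything outside the word base is left alone, which is why providing a word base is the main work needed to support another language.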
Apart from the DCWHyphenator, the library supports other algorithms:
- BaseHyphenator is language-independent and splits words
only after a few special characters (in particular, '-').
- PyHnjHyphenator (deprecated) is a pure Python implementation of the
hyphenation algorithm used in TeX and OpenOffice.
- wordaxe.plugins.PyHyphenHyphenator uses the PyHyphen
library, which works a lot better than pyhnj on my computer.
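The language-independent approach described for BaseHyphenator can be sketched in a few lines. This is an illustration only, not the library's code: the function name `split_points` is hypothetical, and the default set of special characters contains only '-' because the other characters are not listed here.

```python
def split_points(word, special_chars="-"):
    """Return the indices at which `word` may be broken.

    An index i means the word can be split into word[:i] and word[i:].
    A break is allowed only directly after one of `special_chars`,
    and never at the very start or the very end of the word.
    """
    return [
        i + 1
        for i, ch in enumerate(word[:-1])
        if i > 0 and ch in special_chars
    ]


if __name__ == "__main__":
    for sample in ("well-known", "state-of-the-art", "compound"):
        points = split_points(sample)
        pieces = [sample[a:b] for a, b in zip([0] + points, points + [len(sample)])]
        print(sample, "->", pieces)
```

Since no dictionary is involved, a word without any special character gets no break points at all.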
The library can be used as a simple add-on to the
ReportLab PDF library,
adding support for automatic hyphenation in Paragraphs.
The documentation is still far from complete.
For more information, look at the text document
install.txt in the distribution and the samples.
- Pure Python: The library itself can work in a pure Python environment,
it does not require any C libraries.
- ReportLab support: You can use the hyphenation library for automatic
hyphenation of Paragraphs in the
ReportLab PDF generation toolkit.
- Extendable: The word base is a simple ASCII text file (DEhyph.py),
containing base words, prefixes and suffixes.
If you encounter words the algorithm does not know,
you can just add the base words, prefixes and suffixes to the file as needed.
- HNJ support: Apart from the DCWHyphenator algorithm, the library
supports pyHNJ as well - using the libhnj C library or a pure Python routine.
- Generic: The library provides a generic interface for different
hyphenation algorithms. For example, you could buy a commercial hyphenation library
and use wordaxe as a wrapper for it.
For each possible hyphenation point, the interface (and the wordaxe implementation)
defines a quality (good: between base words, not so good: inside a base word)
and exactly how to substitute characters on the left and the right side:
for example, the English word "eighteen" should be hyphenated as "eight-teen",
and the German word "Schiffahrt" as "Schiff-fahrt".
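The following sketch shows one way such a hyphenation point could be represented. It is a hypothetical record written for this page, not the class used inside wordaxe: the field names, the two quality constants and the helper `apply_point` are all assumptions.

```python
from collections import namedtuple

# Hypothetical quality levels: higher is better.
INSIDE_BASE_WORD = 1   # "not so good": break inside a base word
BETWEEN_BASE_WORDS = 2  # "good": break between two base words

# index     : position in the word where the break happens
# quality   : one of the constants above
# left_add  : text appended to the left part (normally just "-")
# right_add : text prepended to the right part (normally empty)
HyphenationPoint = namedtuple(
    "HyphenationPoint", "index quality left_add right_add"
)


def apply_point(word, point):
    """Split `word` at `point` and return the (left, right) fragments."""
    left = word[:point.index] + point.left_add
    right = point.right_add + word[point.index:]
    return left, right


if __name__ == "__main__":
    # "Schiffahrt" gets its third "f" back when it is split.
    schiff = HyphenationPoint(6, BETWEEN_BASE_WORDS, "-", "f")
    print(apply_point("Schiffahrt", schiff))

    # An ordinary break needs no extra characters on the right side.
    plain = HyphenationPoint(4, BETWEEN_BASE_WORDS, "-", "")
    print(apply_point("Hausboot", plain))
```

Keeping the substitution text in the point itself means the code that breaks lines does not need to know any language-specific spelling rules; it can also compare the quality values when several break points fit on a line.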
Here are two PDF files containing hyphenated German text,
created with the deco-cow library and the ReportLab toolkit:
- Python 2.4: The library was developed with Python 2.6 and should work with Python 2.4 or newer.
It could be back-ported to 2.2 or 2.3, but that would certainly require some fiddling.
- ReportLab 2.3: If you want to use the hyphenation inside the
ReportLab toolkit (http://www.reportlab.org),
then you need ReportLab 2.3.
For RL versions 2.1 or 2.2, you should use wordaxe 0.3.0.
Older versions of the library work with ReportLab 2.0 and 1.19 (not recommended).
- PyHnj: If you want to use the libhnj C library (not recommended for German),
then you also need the pyHnj library from Danny Yoo.
- PyHyphen: If you want to use the PyHyphen-based hyphenator,
you have to install that library as well.
The wordaxe library itself is released under a dual license: Apache License 2.0 or FreeBSD license.
If you want to use pyHnj or ReportLab, see the corresponding licenses.
The hyphenation dictionary files were taken from the OpenOffice distribution;
they are licensed under the GNU LGPL.
You can download the source from the files section.
As usual on SourceForge.net, you can also browse the Subversion repository.
See the project page