Wikitext parser Sweble gets first public release
In a blog posting, Professor Dirk Riehle, who hired PhD student Hannes Dohrn in 2009 to work on what later became Sweble, explained that Wikitext, the markup language used to create content on Wikipedia and other Wikimedia sites, has long suffered from being poorly defined.
Wikitext had no formal grammar, no defined processing rules and no defined output format. Ironically, it was not an open standard either; in practice it was defined by 5,000 lines of PHP code. More than thirty earlier attempts to create parsers failed. Attempts to hide the complexity behind visual editors failed as well, because an editor needs the same knowledge of the grammar in order to work well. This also led to long-term doubts about the editability of Wikipedia.
Sweble addresses that problem by being a complete parser for Wikitext: it understands tables and templates and uses that information to generate abstract syntax trees, and soon document object models (DOM), which other tools can manipulate. Performance-wise, Sweble is currently slower than the PHP code, but it should be able to provide a basis for future development of Wikitext. Riehle noted that by "untying the content and data from MediaWiki, we are enabling an ecosystem of tools and technology around Wikipedia (and related projects') content so these projects can gain more speed and breadth".
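To illustrate why a syntax tree matters to tool builders, here is a minimal Python sketch of the general idea. It is not Sweble's API or code: the node classes, the `parse`, `link_targets` and `to_html` functions and the `/wiki/` URL prefix are all hypothetical, and the toy grammar covers only bold text and internal links, a tiny fraction of real Wikitext.

```python
import re
from dataclasses import dataclass
from html import escape


# Hypothetical node types for a toy syntax tree (not Sweble's classes).
@dataclass
class Text:
    value: str


@dataclass
class Bold:
    children: list


@dataclass
class Link:
    target: str
    label: str


# Tokens of the toy subset: a bold marker ''' or an internal link
# [[target]] / [[target|label]].
TOKEN = re.compile(r"'''|\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def parse(source):
    """Parse the toy subset into a list of AST nodes."""
    root = []
    stack = [root]  # stack[-1] is the node list currently being filled
    pos = 0
    for m in TOKEN.finditer(source):
        if m.start() > pos:
            stack[-1].append(Text(source[pos:m.start()]))
        if m.group(0) == "'''":
            if len(stack) > 1:
                stack.pop()  # closing marker: finish the bold node
            else:
                bold = Bold([])  # opening marker: start a bold node
                stack[-1].append(bold)
                stack.append(bold.children)
        else:
            target = m.group(1)
            stack[-1].append(Link(target, m.group(2) or target))
        pos = m.end()
    if pos < len(source):
        stack[-1].append(Text(source[pos:]))
    return root


def link_targets(nodes):
    """One possible tool: collect every link target in the tree."""
    targets = []
    for node in nodes:
        if isinstance(node, Link):
            targets.append(node.target)
        elif isinstance(node, Bold):
            targets.extend(link_targets(node.children))
    return targets


def to_html(nodes):
    """Another possible tool: render the same tree as HTML."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(escape(node.value))
        elif isinstance(node, Bold):
            parts.append("<b>" + to_html(node.children) + "</b>")
        else:
            # The /wiki/ prefix is an assumed URL scheme for illustration.
            parts.append('<a href="/wiki/%s">%s</a>'
                         % (escape(node.target), escape(node.label)))
    return "".join(parts)


if __name__ == "__main__":
    tree = parse("'''Sweble''' parses [[Wikitext|wiki markup]].")
    print(link_targets(tree))
    print(to_html(tree))
```

The point of the sketch is that both the link collector and the HTML renderer work on the same tree without knowing anything about the markup syntax, which is the kind of separation a parser such as Sweble is meant to make possible.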
Sweble is written in Java, licensed under Apache 2.0 and developed by the Open Source Research Group at the University of Erlangen. A Crystal Ball demo allows users to use Sweble to parse, pre-process and process existing Wikipedia entries into custom text formats or HTML; the developers hope that people can use that – or the library directly in their own software – to provide bug reports and feedback.