rxe: literate and composable regular expressions
Marton Trencseni - Sat 02 March 2019 - Python
Introduction
rxe is a thin wrapper around Python's re module (see the official re docs). The various rxe functions are wrappers around corresponding re patterns. For example, rxe.digit().one_or_more('a').whitespace() corresponds to \da+\s. Because rxe wraps sub-patterns in parentheses but wants to avoid creating unnamed capturing groups, the internal (equivalent) representation is actually \d(?:a)+\s. This pattern can always be retrieved with get_pattern().
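The equivalence of the two patterns can be checked with the standard re module alone. This is a small sketch that does not use rxe itself; the sample strings are made up for illustration.

```python
import re

# The readable form and the form the article says rxe stores internally.
plain = re.compile(r"\da+\s")
grouped = re.compile(r"\d(?:a)+\s")

# Illustrative inputs (not from the article).
samples = ["1aaa ", "7a\t", "1 ", "xa "]
results = {
    s: (plain.match(s) is not None, grouped.match(s) is not None)
    for s in samples
}

for text, (p, g) in results.items():
    print(repr(text), p, g)

# (?:...) groups do not capture, so the compiled pattern has no numbered groups.
print(grouped.groups)
```

Both patterns accept and reject the same inputs, and the non-capturing group leaves the group count at zero, which is why the wrapped form can stand in for the plain one.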
Github repo: https://github.com/mtrencseni/rxe
Motivation
Suppose you want to parse geo coordinates from a string, like (<latitude>,<longitude>), where each is a decimal. The raw regular expression would look like \(\d+\.\d+,\d+\.\d+\). This is hard to read and maintain for the next person, and diffs will be hard to understand and verify.
With rxe, you can write:
decimal = (rxe
    .one_or_more(rxe.digit())
    .literal('.')
    .one_or_more(rxe.digit())
)
coord = (rxe
    .literal('(')
    .exactly(1, decimal)
    .literal(',')
    .exactly(1, decimal)
    .literal(')')
)
Note how rxe allows the decimal regex to be re-used in the coord pattern! Although it's more code, it's much more readable.
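For comparison, the same kind of reuse can be approximated with plain string composition on top of re. This is only a sketch of the idea, not how rxe is implemented; the names DECIMAL, COORD and non_capturing are hypothetical helpers introduced here.

```python
import re

# Reusable building block: one decimal number such as 23.34.
DECIMAL = r"\d+\.\d+"


def non_capturing(pattern):
    """Wrap a sub-pattern in (?:...) so it composes without adding a capture group."""
    return f"(?:{pattern})"


# Compose the coordinate pattern from the reusable piece.
COORD = r"\(" + non_capturing(DECIMAL) + "," + non_capturing(DECIMAL) + r"\)"
coord_re = re.compile(COORD)

print(COORD)
print(coord_re.fullmatch("(23.34,11.0)") is not None)
```

The composition works, but every escape and group wrapper has to be written by hand, which is the bookkeeping a builder-style API takes over.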
Suppose you also want to allow an arbitrary amount of whitespace around each number. The diff for this change will be:
coord = (rxe
    .literal('(')
    .zero_or_more(rxe.whitespace()) # <--- line added
    .exactly(1, decimal)
    .zero_or_more(rxe.whitespace()) # <--- line added
    .literal(',')
    .zero_or_more(rxe.whitespace()) # <--- line added
    .exactly(1, decimal)
    .zero_or_more(rxe.whitespace()) # <--- line added
    .literal(')')
)
Okay, but we also want to extract the latitude and longitude, not just match them. Let's extract them, but in a readable way:
coord = (rxe
    .literal('(')
    .zero_or_more(rxe.whitespace())
    .exactly(1, rxe.named('lat', decimal)) # <--- line changed
    .zero_or_more(rxe.whitespace())
    .literal(',')
    .zero_or_more(rxe.whitespace())
    .exactly(1, rxe.named('lon', decimal)) # <--- line changed
    .zero_or_more(rxe.whitespace())
    .literal(')')
)
m = coord.match('(23.34, 11.0)')
print(m.group('lat'))
print(m.group('lon'))
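For reference, here is roughly what the same extraction looks like with re directly, using named groups and \s* for the optional whitespace. The pattern is written by hand for this sketch and is not the output of get_pattern(); parse_coord is a hypothetical helper.

```python
import re

DECIMAL = r"\d+\.\d+"

# Hand-written raw equivalent: optional whitespace around each named decimal.
coord_re = re.compile(
    rf"\(\s*(?P<lat>{DECIMAL})\s*,\s*(?P<lon>{DECIMAL})\s*\)"
)


def parse_coord(text):
    """Return (lat, lon) as floats, or None if the text is not a coordinate pair."""
    m = coord_re.match(text)
    if m is None:
        return None
    return float(m.group("lat")), float(m.group("lon"))


print(parse_coord("(23.34, 11.0)"))
```

The named groups play the same role as rxe.named('lat', ...) and rxe.named('lon', ...): the values are retrieved by name rather than by position.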
One more example: parsing email addresses. The regex is [\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}. The equivalent rxe code:
username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
domain = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
tld = rxe.at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
email = (rxe
    .exactly(1, username)
    .literal('@')
    .exactly(1, domain)
    .literal('.')
    .exactly(1, tld)
)
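To see what the raw email regex quoted above accepts, it can be exercised with re directly. This sketch uses only the regex from the text; looks_like_email and the sample addresses are made up for illustration.

```python
import re

# The raw regex quoted in the article.
EMAIL = re.compile(r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}")


def looks_like_email(text):
    """True if the whole string matches the article's email regex."""
    return EMAIL.fullmatch(text) is not None


for candidate in ["jane.doe+tag@example.co.uk", "a@b.c", "no-at-sign.example.com"]:
    print(candidate, looks_like_email(candidate))
```

The three pieces of the regex (user name, domain, top-level domain) line up with the username, domain and tld variables in the rxe version.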
Install
Use pip:
pip install git+https://github.com/mtrencseni/rxe
Then:
$ python
>>> from rxe import *
>>> r = rxe.digit().at_least(1, 'p').at_least(2, 'q')
>>> assert(r.match('1ppppqqqqq') is not None)
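As a cross-check that needs no install, the pattern in the session above presumably corresponds to a raw regex along the lines of \d(?:p){1,}(?:q){2,}. That translation is an assumption made for this sketch, not the verified output of get_pattern().

```python
import re

# Assumed raw equivalent of rxe.digit().at_least(1, 'p').at_least(2, 'q').
assumed = re.compile(r"\d(?:p){1,}(?:q){2,}")

# Illustrative inputs: the first is the string used in the session above.
checks = {
    s: assumed.match(s) is not None
    for s in ["1ppppqqqqq", "1pq", "1qq", "ppqq"]
}

for text, ok in checks.items():
    print(text, ok)
```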