rxe: literate and composable regular expressions

Marton Trencseni - Sat 02 March 2019 - Python

Introduction

rxe is a thin wrapper around Python's re module (see official re docs). The various rxe functions are wrappers around corresponding re patterns. For example, rxe.digit().one_or_more('a').whitespace() corresponds to \da+\s. Because rxe uses parentheses but wants to avoid unnamed groups, the internal (equivalent) representation is actually \d(?:a)+\s. This pattern can always be retrieved with get_pattern().
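To see that the non-capturing form behaves like the shorthand, here is a quick check using only the standard re module. It is a minimal sketch that does not call rxe; the sample strings and the variable names handwritten and generated are made up for illustration.

```python
import re

# The pattern as written by hand, and the non-capturing form quoted above
handwritten = re.compile(r'\da+\s')
generated = re.compile(r'\d(?:a)+\s')

# Illustrative inputs: some should match, some should not
samples = ['1aaa ', '7a\t', '1 ', 'aaa ', '12a ']
for s in samples:
    # Both patterns should agree on every sample
    assert bool(handwritten.match(s)) == bool(generated.match(s))

# (?:...) groups do not add numbered capture groups
assert generated.groups == 0
```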

Github repo: https://github.com/mtrencseni/rxe

Motivation

Suppose you want to parse geo coordinates from a string of the form (<latitude>,<longitude>), where each value is a decimal. The raw regular expression would look like \(\d+\.\d+,\d+\.\d+\). This is hard to read and maintain for the next person, and diffs are hard to understand and verify.

With rxe, you can write:

decimal = (rxe
  .one_or_more(rxe.digit())
  .literal('.')
  .one_or_more(rxe.digit())
)
coord = (rxe
  .literal('(')
  .exactly(1, decimal)
  .literal(',')
  .exactly(1, decimal)
  .literal(')')
)

Note how rxe allows the decimal regex to be re-used in the coord pattern! Although it's more code, it's much more readable.

Suppose you want to allow an arbitrary amount of whitespace around each value. The diff for this change will be:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(',')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(')')
)

Okay, but we also want to extract the latitude and longitude, not just match on them. Let's extract them, but in a readable way:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lat', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(',')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lon', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(')')
)

m = coord.match('(23.34, 11.0)')
print(m.group('lat'))
print(m.group('lon'))
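For comparison, the same extraction can be sketched with a hand-written pattern and Python's named groups. The regex below is an assumed counterpart of the coord pattern, not the output of get_pattern(), and the name coord_re is made up for this example.

```python
import re

# Hand-written counterpart of the coord pattern above; an assumption about
# the equivalent regex, not what rxe generates internally.
coord_re = re.compile(r'\(\s*(?P<lat>\d+\.\d+)\s*,\s*(?P<lon>\d+\.\d+)\s*\)')

m = coord_re.match('(23.34, 11.0)')
print(m.group('lat'))  # 23.34
print(m.group('lon'))  # 11.0
```

It does the same job in one line, but every added \s* and group name makes that line harder to review than the corresponding rxe diff.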

One more example: parsing email addresses. The regex is [\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}. The equivalent rxe code:

username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
domain = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
tld = rxe.at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
email = (rxe
    .exactly(1, username)
    .literal('@')
    .exactly(1, domain)
    .literal('.')
    .exactly(1, tld)
)
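As a sanity check on what this pattern accepts, the raw regex quoted above can be tried with plain re. This is a minimal sketch that does not use rxe; the addresses are made-up test inputs and email_re is a name chosen for the example.

```python
import re

# The raw email regex from the text, compiled as-is
email_re = re.compile(r'[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}')

# fullmatch requires the whole string to be an address
assert email_re.fullmatch('jane.doe+news@example.co.uk') is not None
assert email_re.fullmatch('no-at-sign.example.com') is None  # no '@'
assert email_re.fullmatch('user@host') is None               # no dot and TLD
```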

Install

Use pip:

pip install git+https://github.com/mtrencseni/rxe

Then:

$ python
>>> from rxe import *
>>> r = rxe.digit().at_least(1, 'p').at_least(2, 'q')
>>> assert(r.match('1ppppqqqqq') is not None)
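For readers who want to see what that session is checking, here is a rough plain-re counterpart. The pattern is an assumption about what at_least(n, ...) corresponds to (a {n,} quantifier on a non-capturing group); it is not taken from get_pattern(), and r_plain is a made-up name.

```python
import re

# Assumed hand-written counterpart of
# rxe.digit().at_least(1, 'p').at_least(2, 'q'); not rxe's actual output.
r_plain = re.compile(r'\d(?:p){1,}(?:q){2,}')

assert r_plain.match('1ppppqqqqq') is not None
assert r_plain.match('1pq') is None  # only one 'q', at least two required
```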