Link Search Menu Expand Document

Regular Expression Cheat Sheet

Regular Expression is useful when working with strings, but I often find it difficult to get started when I vaguely remember the regex grammar and the code syntax. This post is a summary of regex handy code for python I found online (and mainly from here).

1. Code Syntax

import re
text = "At 12, I bought 13 bananas and 3 oranges. Amazing!"
p = re.compile(r"\d+")

There are 4 methods:

  • match(): determine if the beginning of the string matches the RE
  • search(): return the first match
  • findall(): find all substrings that match the RE and return a list
  • finditer(): find all substrings that match the RE and return an iter

  • p.match(text) returns None becuase the begining is not a number
  • p.serch(text).group() returns 12 because it is the first set of numbers
  • p.findall(text) returns ['12', '12', '3'].

Or you don’t have to create a pattern object.

import re
re.search(r'\d+', text)
<re.Match object; span=(3, 5), match='12'>

2. RE Grammar

If you sort of know the grammar, check your RE here and skip the rest.

This RegEx Cookbook may be useful.

Otherwise, read below and here if you need more clarification.

Python’s Raw String to Handle Backslashes

Use Pyhton’s ring notation when using regex. Simply put an ‘r’ in front of an expression. This is to abtract away double interpretation of backslashes in regex and Python. raw st

Symbols

Anchor

  • ^Start start of string
  • end.$ end of string
  • \b one side is a word and the other is a white space. \B is a negation. \Babc\B matches abc in the middle of a word.

Quantifiers

  • * zero or more
  • + one or more
  • ? zero or one
  • {2} exactly 2
  • {2,} 2 or more
  • {2,5} 2 to up to 5

Character Classes

  • \d digit
  • \w alphanumeric and underscore
  • s white space, tab, and line break
  • . any characters

Capitalized classes (\D, \W, and \S) are their negations.

Capturing

  • a(bc) matches an a that followed by a sequence bc
  • a(?:bc) like previous, but the capturing group is disabled using ?:
  • a(?<foo>bc) the group is named <foo>. The result can be accessed like a dictionary.

Brackets

  • [a-z] matches any letter from a to z
  • [^a-zA-Z] a string that has not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression

Lookahead and Lookbehind

  • d(?=r) matches a d that is followed by an r. r is not in the match.
  • (?<=r)d matches a d that is preceded by an r. r is not in the match.
  • Use ! for their negations. d(?!r) and (?<!r)d.

Back reference [to be updated]

Use Case Examples [to be updated]