Python Text Processing Modules

Updated: 11/06/2021 by Computer Hope

This page describes the standard text modules in Python 3, and how to use them.

Overview of text processing

Text processing is one of a software developer's most common tasks. When a user provides input, it needs to be parsed, translated, broken into its component parts, sanitized, and manipulated in countless ways. Python's standard library provides a broad, high-level toolkit that helps you handle all of these tasks and more.

The modules described below provide a wide range of tools for working with strings and text in general.

Tip

For a tutorial about how to use text in Python, see How to extract specific portions of a text file using Python.

string: common string operations

The source code for the string module is located in the file string.py, and it contains the following tools.

string constants

The constants defined in the string module are:

string.ascii_letters
The concatenation of the ascii_lowercase and ascii_uppercase constants described below. This value is not locale-dependent.
string.ascii_lowercase
The lowercase letters 'abcdefghijklmnopqrstuvwxyz'. This value is not locale-dependent and will not change.
string.ascii_uppercase
The uppercase letters 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. This value is not locale-dependent and will not change.
string.digits
The string '0123456789'.
string.hexdigits
The string '0123456789abcdefABCDEF'.
string.octdigits
The string '01234567'.
string.punctuation
String of ASCII characters that are considered punctuation characters in the C locale.
string.printable
String of ASCII characters that are considered printable. This is a combination of digits, ascii_letters, punctuation, and whitespace.
string.whitespace
A string containing all ASCII characters that are considered whitespace. This includes the characters space, tab, linefeed, return, formfeed, and vertical tab.
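
For example, these constants can be inspected directly in the interpreter:

>>> import string
>>> string.digits
'0123456789'
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.whitespace
' \t\n\r\x0b\x0c'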

String formatting

The built-in string class provides the ability to do complex variable substitutions and value formatting via the format() method described in PEP 3101. The Formatter class in the string module allows you to create and customize string formatting behaviors using the same implementation as the built-in format() method.

class string.Formatter

The Formatter class has the following public methods:

format(format_string,  *args,  **kwargs)
format() is the primary API method. It takes a format string and an arbitrary set of positional and keyword arguments. format() is only a wrapper that calls vformat().
vformat(format_string, args, kwargs)
This function does the actual work of formatting. It is exposed as a separate function for cases where you want to pass in a predefined dictionary of arguments, rather than unpacking and repacking the dictionary as individual arguments using the *args and **kwargs syntax. vformat() does the work of breaking up the format string into character data and replacement fields. It calls the various methods described below.

Also, the Formatter defines many methods that are intended to be replaced by subclasses:

parse(format_string)
Loop over the format_string and return an iterable of tuples (literal_text, field_name, format_spec, conversion). This is used by vformat() to break the string into either literal text, or replacement fields.

The values in the tuple conceptually represent a span of literal text followed by a single replacement field. If there is no literal text (which can happen if two replacement fields occur consecutively), then literal_text will be a zero-length string. If there is no replacement field, then the values of field_name, format_spec and conversion will be None.
get_field(field_name, args,  kwargs)
Given field_name as returned by parse() (see above), convert it to an object to be formatted. Returns a tuple (obj, used_key). The default version takes strings of the form defined in PEP 3101, such as “0[name]” or “label.title”. The args and kwargs are as passed in to vformat(). The return value used_key has the same meaning as the key parameter to get_value().
get_value(key, args,  kwargs)
Retrieve given field value. The key argument will be either an integer or a string. If it's an integer, it represents the index of the positional argument in args; if it's a string, then it represents a named argument in kwargs.

The args parameter is set to the list of positional arguments to vformat(), and the kwargs parameter is set to the dictionary of keyword arguments.

For compound field names, these functions are only called for the first component of the field name; subsequent components are handled through normal attribute and indexing operations.

So for example, the field expression ‘0.name’ would cause get_value() to be called with a key argument of 0. The name attribute will be looked up after get_value() returns by calling the built-in getattr() function.

If the index or keyword refers to an item that does not exist, then an IndexError or KeyError should be raised.
check_unused_args(used_args,  args,  kwargs)
Implement checking for unused arguments if desired. The arguments to this function are the set of all argument keys that were actually referred to in the format string (integers for positional arguments, and strings for named arguments), and a reference to the args and kwargs that were passed to vformat(). The set of unused args can be calculated from these parameters. check_unused_args() is assumed to raise an exception if the check fails.
format_field(value,  format_spec)
format_field() calls the global format() built-in. The method is provided so that subclasses can override it.
convert_field(value,  conversion)
Converts the value (returned by get_field()) given a conversion type (as in the tuple returned by the parse() method). The default version understands ‘s’ (str), ‘r’ (repr) and ‘a’ (ascii) conversion types.
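
As a brief illustration of subclassing (the class name DefaultFormatter is hypothetical, not part of the module), the sketch below overrides get_value() so that missing named arguments produce a placeholder instead of raising KeyError:

from string import Formatter

class DefaultFormatter(Formatter):
    # Hypothetical subclass: fall back to a placeholder string for
    # missing named arguments instead of raising KeyError.
    def get_value(self, key, args, kwargs):
        if isinstance(key, str):
            return kwargs.get(key, '<missing>')
        return super().get_value(key, args, kwargs)

f = DefaultFormatter()
print(f.format('{name} is {age} years old', name='Ada'))
# prints: Ada is <missing> years old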

Syntax

The str.format() method and the Formatter class share the same syntax for format strings (although in the case of Formatter, subclasses can define their own format string syntax).

Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.

The grammar for a replacement field is as follows:

replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::=  [identifier | integer]
attribute_name    ::=  identifier
element_index     ::=  integer | index_string
index_string      ::=  <any source character except "]">
conversion        ::=  "r" | "s" | "a"
format_spec       ::=  <described in the next section>

In less formal terms, the replacement field can start with a field_name that specifies the object whose value is to be formatted and inserted into the output instead of the replacement field. The field_name is optionally followed by a conversion field, which is preceded by an exclamation point '!', and a format_spec, which is preceded by a colon ':'. These specify a non-default format for the replacement value.

The field_name itself begins with an arg_name that is either a number or a keyword. If it's a number, it refers to a positional argument, and if it's a keyword, it refers to a named keyword argument. If the numerical arg_names in a format string are 0, 1, 2, ... in sequence, they can all be omitted (not only some) and the numbers 0, 1, 2, ... will be automatically inserted in that order. Because arg_name is not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings '10' or ':-]') within a format string. The arg_name can be followed by any number of index or attribute expressions. An expression of the form '.name' selects the named attribute using getattr(), while an expression of the form '[index]' does an index lookup using __getitem__().

Changed in version 3.1: The positional argument specifiers can be omitted, so '{} {}' is equivalent to '{0} {1}'.

Some simple format string examples:

"First, thou shalt count to {0}" # References first positional argument
"Bring me a {}"                  # Implicitly references the first positional argument
"From {} to {}"                  # Same as "From {0} to {1}"
"My quest is {name}"             # References keyword argument 'name'
"Weight in tons {0.weight}"      # 'weight' attribute of first positional arg
"Units destroyed: {players[0]}"  # First element of keyword argument 'players'.

The conversion field causes a type coercion before formatting. Normally, the job of formatting a value is done by the __format__() method of the value itself. However, in some cases it is desirable to force a type to be formatted as a string, overriding its own definition of formatting. By converting the value to a string before calling __format__(), the normal formatting logic is bypassed.

Three conversion flags are currently supported: '!s' which calls str() on the value, '!r' which calls repr() and '!a' which calls ascii().

Some examples:

"Harold's a clever {0!s}"        # Calls str() on the argument first
"Bring out the holy {name!r}"    # Calls repr() on the argument first
"More {!a}"                      # Calls ascii() on the argument first

The format_spec field contains a specification of how the value should be presented, including such details as field width, alignment, padding, decimal precision and so on. Each value type can define its own “formatting mini-language” or interpretation of the format_spec.

Most built-in types support a common formatting mini-language, which is described in the next section.

A format_spec field can also include nested replacement fields within it. These nested replacement fields can contain only a field name; conversion flags and format specifications are not allowed. The replacement fields in the format_spec are substituted before the format_spec string is interpreted. This allows the formatting of a value to be dynamically specified.

Format Specification Mini-Language

“Format specifications” are used within replacement fields contained within a format string to define how individual values are presented (see Syntax). They can also be passed directly to the built-in format() function. Each formattable type may define how the format specification is to be interpreted.

Most built-in types implement the following options for format specifications, although some of the formatting options are only supported by the numeric types.

A general convention is that an empty format string ("") produces the same result as if you had called str() on the value. A non-empty format string often modifies the result.

The general form of a standard format specifier is:

format_spec ::=  [[fill]align][sign][#][0][width][,][.precision][type]
fill        ::=  <any character>
align       ::=  "<" | ">" | "=" | "^"
sign        ::=  "+" | "-" | " "
width       ::=  integer
precision   ::=  integer
type        ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"

If a valid align value is specified, it can be preceded by a fill character, which can be any character and defaults to a space if omitted. Note that it is not possible to use { and } as the fill character while using the str.format() method; however, this limitation doesn't affect the format() function.

The meaning of the various alignment options is as follows:

Option Meaning
'<'
Forces the field to be left-aligned in the available space (this is the default for most objects).
'>'
Forces the field to be right-aligned in the available space (this is the default for numbers).
'='
Forces the padding to be placed after the sign (if any) but before the digits. This is used for printing fields in the form ‘+000000120’. This alignment option is only valid for numeric types.
'^'
Forces the field to be centered in the available space.

Note that unless a minimum field width is defined, the field width is always the same size as the data to fill it, so that the alignment option has no meaning in this case.

The sign option is only valid for number types, and can be one of the following:

Option Meaning
'+'
indicates that a sign should be used for both positive and negative numbers.
'-'
indicates that a sign should be used only for negative numbers (this is the default behavior).
' ' (space)
indicates that a leading space should be used on positive numbers, and a minus sign on negative numbers.

The '#' option causes the “alternate form” to be used for the conversion. The alternate form is defined differently for different types. This option is only valid for integer, float, complex and Decimal types. For integers, when binary, octal, or hexadecimal output is used, this option adds the respective prefix '0b', '0o', or '0x' to the output value. For floats, complex and Decimal the alternate form causes the result of the conversion to always contain a decimal-point character, even if no digits follow it. Normally, a decimal-point character appears in the result of these conversions only if a digit follows it. Also, for 'g' and 'G' conversions, trailing zeros are not removed from the result.

The ',' option signals the use of a comma for a thousands separator. For a locale aware separator, use the 'n' integer presentation type instead.

Changed in version 3.1: Added the ',' option (see also PEP 378).

width is a decimal integer defining the minimum field width. If not specified, then the field width is determined by the content.

Preceding the width field by a zero ('0') character enables sign-aware zero-padding for numeric types. This is equivalent to a fill character of '0' with an alignment type of '='.

The precision is a decimal number indicating how many digits should be displayed after the decimal point for a floating point value formatted with 'f' and 'F', or before and after the decimal point for a floating point value formatted with 'g' or 'G'. For non-number types the field indicates the maximum field size: in other words, how many characters will be used from the field content. The precision is not allowed for integer values.
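
A couple of short examples of width, zero-padding, and precision, as described above:

>>> '{:08.2f}'.format(-3.14)     # sign-aware zero-padding, precision 2
'-0003.14'
>>> '{:.5}'.format('truncated')  # for strings, precision limits the field size
'trunc'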

Finally, the type determines how the data should be presented.

The available string presentation types are:

Type Meaning
's' String format. This is the default type for strings and may be omitted.
None The same as 's'.

The available integer presentation types are:

Type Meaning
'b' Binary format. Outputs the number in base 2.
'c' Character. Converts the integer to the corresponding unicode character before printing.
'd' Decimal Integer. Outputs the number in base 10.
'o' Octal format. Outputs the number in base 8.
'x' Hex format. Outputs the number in base 16, using lowercase letters for the digits above 9.
'X' Hex format. Outputs the number in base 16, using uppercase letters for the digits above 9.
'n' Number. This is the same as 'd', except that it uses the current locale setting to insert the appropriate number separator characters.
None The same as 'd'.

In addition to the above presentation types, integers can be formatted with the floating point presentation types listed below (except 'n' and None). When doing so, float() is used to convert the integer to a floating point number before formatting.

The available presentation types for floating point and decimal values are:

Type Meaning
'e' Exponent notation. Prints the number in scientific notation using the letter ‘e’ to indicate the exponent. The default precision is 6.
'E' Exponent notation. Same as 'e' except it uses an uppercase ‘E’ as the separator character.
'f' Fixed point. Displays the number as a fixed-point number. The default precision is 6.
'F' Fixed point. Same as 'f', but converts nan to NAN and inf to INF.
'g' General format. For given precision p >= 1, this rounds the number to p significant digits and then formats the result in either fixed-point format or in scientific notation, depending on its magnitude.

The precise rules are as follows: suppose that the result formatted with presentation type 'e' and precision p-1 would have exponent exp. Then if -4 <= exp < p, the number is formatted with presentation type 'f' and precision p-1-exp. Otherwise, the number is formatted with presentation type 'e' and precision p-1. In both cases insignificant trailing zeros are removed, and the decimal point is also removed if there are no remaining digits following it.

Positive and negative infinity, positive and negative zero, and nans, are formatted as inf, -inf, 0, -0 and nan respectively, regardless of the precision.

A precision of 0 is treated as equivalent to a precision of 1. The default precision is 6.
'G' General format. Same as 'g' except switches to 'E' if the number gets too large. The representations of infinity and NaN are uppercased, too.
'n' Number. This is the same as 'g', except that it uses the current locale setting to insert the appropriate number separator characters.
'%' Percentage. Multiplies the number by 100 and displays in fixed ('f') format, followed by a percent sign.
None Similar to 'g', except with at least one digit past the decimal point and a default precision of 12. This is intended to match str(), except you can add the other format modifiers.

Examples

This section contains examples of the new format syntax and comparison with the old %-formatting.

In most of the cases the syntax is similar to the old %-formatting, with the addition of the {} and with : used instead of %. For example, '%03.2f' can be translated to '{:03.2f}'.

The new format syntax also supports new and different options, shown in the following examples.

Accessing arguments by position:

>>> '{0}, {1}, {2}'.format('a', 'b', 'c')
'a, b, c'
>>> '{}, {}, {}'.format('a', 'b', 'c')  # 3.1+ only
'a, b, c'
>>> '{2}, {1}, {0}'.format('a', 'b', 'c')
'c, b, a'
>>> '{2}, {1}, {0}'.format(*'abc')      # unpacking argument sequence
'c, b, a'
>>> '{0}{1}{0}'.format('abra', 'cad')   # arguments' indices can be repeated
'abracadabra'

Accessing arguments by name:

>>> 'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
'Coordinates: 37.24N, -115.81W'
>>> coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
>>> 'Coordinates: {latitude}, {longitude}'.format(**coord)
'Coordinates: 37.24N, -115.81W'

Accessing arguments’ attributes:

>>> c = 3-5j
>>> ('The complex number {0} is formed from the real part {0.real} '
...  'and the imaginary part {0.imag}.').format(c)
'The complex number (3-5j) is formed from the real part 3.0 and the imaginary part -5.0.'
>>> class Point:
...     def __init__(self, x, y):
...         self.x, self.y = x, y
...     def __str__(self):
...         return 'Point({self.x}, {self.y})'.format(self=self)
...
>>> str(Point(4, 2))
'Point(4, 2)'

Accessing arguments’ items:

>>> coord = (3, 5)
>>> 'X: {0[0]};  Y: {0[1]}'.format(coord)
'X: 3;  Y: 5'

Replacing %s and %r:

>>> "repr() shows quotes: {!r}; str() doesn't: {!s}".format('test1', 'test2')
"repr() shows quotes: 'test1'; str() doesn't: test2"

Aligning the text and specifying a width:

>>> '{:<30}'.format('left aligned')
'left aligned                  '
>>> '{:>30}'.format('right aligned')
'                 right aligned'
>>> '{:^30}'.format('centered')
'           centered           '
>>> '{:*^30}'.format('centered')  # use '*' as a fill char
'***********centered***********'

Replacing %+f, %-f, and % f and specifying a sign:

>>> '{:+f}; {:+f}'.format(3.14, -3.14)  # show it always
'+3.140000; -3.140000'
>>> '{: f}; {: f}'.format(3.14, -3.14)  # show a space for positive numbers
' 3.140000; -3.140000'
>>> '{:-f}; {:-f}'.format(3.14, -3.14)  # show only the minus -- same as '{:f}; {:f}'
'3.140000; -3.140000'

Replacing %x and %o and converting the value to different bases:

>>> # format also supports binary numbers
>>> "int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)
'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
>>> # with 0x, 0o, or 0b as prefix:
>>> "int: {0:d};  hex: {0:#x};  oct: {0:#o};  bin: {0:#b}".format(42)
'int: 42;  hex: 0x2a;  oct: 0o52;  bin: 0b101010'

Using the comma as a thousands separator:

>>> '{:,}'.format(1234567890)
'1,234,567,890'

Expressing a percentage:

>>> points = 19
>>> total = 22
>>> 'Correct answers: {:.2%}'.format(points/total)
'Correct answers: 86.36%'

Using type-specific formatting:

>>> import datetime
>>> d = datetime.datetime(2010, 7, 4, 12, 15, 58)
>>> '{:%Y-%m-%d %H:%M:%S}'.format(d)
'2010-07-04 12:15:58'

Nesting arguments and more complex examples:

>>> for align, text in zip('<^>', ['left', 'center', 'right']):
...     '{0:{fill}{align}16}'.format(text, fill=align, align=align)
...
'left<<<<<<<<<<<<'
'^^^^^center^^^^^'
'>>>>>>>>>>>right'
>>>
>>> octets = [192, 168, 0, 1]
>>> '{:02X}{:02X}{:02X}{:02X}'.format(*octets)
'C0A80001'
>>> int(_, 16)
3232235521
>>>
>>> width = 5
>>> for num in range(5,12): 
...     for base in 'dXob':
...         print('{0:{width}{base}}'.format(num, base=base, width=width), end=' ')
...     print()
...
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8     8    10  1000
    9     9    11  1001
   10     A    12  1010
   11     B    13  1011

Template strings

Templates provide simpler string substitutions as described in PEP 292. Instead of the normal %-based substitutions, Templates support $-based substitutions, using the following rules:

  • $$ is an escape; it is replaced with a single $.
  • $identifier names a substitution placeholder matching a mapping key of "identifier". By default, "identifier" must spell a Python identifier. The first non-identifier character after the $ character terminates this placeholder specification.
  • ${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as "${noun}ification".

Any other appearance of $ in the string results in a ValueError being raised.

The string module provides a Template class that implements these rules.

The constructor of Template is:

class string.Template(template)

The constructor takes a single argument that is the template string.

The methods of Template are:

substitute(mapping, **kwds)
Performs the template substitution, returning a new string. The mapping is any dictionary-like object with keys that match the placeholders in the template. Alternatively, you can provide keyword arguments, where the keywords are the placeholders. When both mapping and kwds are given and there are duplicates, the placeholders from kwds take precedence.
safe_substitute(mapping, **kwds)
Like substitute(), except that if placeholders are missing from mapping and kwds, instead of raising a KeyError exception, the original placeholder appears in the resulting string intact. Also, unlike with substitute(), any other appearance of the $ simply returns $ instead of raising ValueError.

While other exceptions may still occur, this method is called “safe” because substitution always tries to return a usable string instead of raising an exception. In another sense, safe_substitute() may be anything other than safe, since it silently ignores malformed templates containing dangling delimiters, unmatched braces, or placeholders that are not valid Python identifiers.

Template instances also provide one public data attribute:

template
This is the object passed to the constructor's template argument. In general, don’t change it, but read-only access is not enforced.

Here is an example of how to use a Template:

>>> from string import Template
>>> s = Template('$who likes $what')
>>> s.substitute(who='tim', what='kung pao')
'tim likes kung pao'
>>> d = dict(who='tim')
>>> Template('Give $who $100').substitute(d)
Traceback (most recent call last):
...
ValueError: Invalid placeholder in string: line 1, col 11
>>> Template('$who likes $what').substitute(d)
Traceback (most recent call last):
...
KeyError: 'what'
>>> Template('$who likes $what').safe_substitute(d)
'tim likes $what'

Advanced usage: you can derive subclasses of Template to customize the placeholder syntax, delimiter character, or the entire regular expression used to parse template strings. To do this, you can override these class attributes:

  • delimiter – This is the literal string describing a placeholder introducing delimiter. The default value is $. Note that this should not be a regular expression, as the implementation will call re.escape() on this string as needed.
  • idpattern – This is the regular expression describing the pattern for non-braced placeholders (the braces are added automatically as appropriate). The default value is the regular expression [_a-z][_a-z0-9]*.
  • flags – The regular expression flags that will be applied when compiling the regular expression used for recognizing substitutions. The default value is re.IGNORECASE. Note that re.VERBOSE is always added to the flags, so custom idpatterns must follow conventions for verbose regular expressions.

Alternatively, you can provide the entire regular expression pattern by overriding the class attribute pattern. If you do this, the value must be a regular expression object with four named capturing groups. The capturing groups correspond to the rules given above, along with the invalid placeholder rule:

  • escaped – This group matches the escape sequence, e.g., $$, in the default pattern.
  • named – This group matches the unbraced placeholder name; it should not include the delimiter in the capturing group.
  • braced – This group matches the brace enclosed placeholder name; it should not include either the delimiter or braces in the capturing group.
  • invalid – This group matches any other delimiter pattern (usually a single delimiter), and it should appear last in the regular expression.
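
For example, a minimal sketch of a subclass that swaps the delimiter (the class name PercentTemplate is made up for illustration):

>>> from string import Template
>>> class PercentTemplate(Template):
...     delimiter = '%'   # use '%' placeholders instead of the default '$'
...
>>> PercentTemplate('%who likes %what').substitute(who='tim', what='kung pao')
'tim likes kung pao'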

Helper functions

string.capwords(s, sep=None)
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
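
For example:

>>> import string
>>> string.capwords('the quick  brown fox')
'The Quick Brown Fox'
>>> string.capwords('foo-bar-baz', '-')
'Foo-Bar-Baz'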

re: regular expression operations

This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings (str) or 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a bytes pattern or vice versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or allow special characters to be used without invoking their special meaning. This collides with Python's usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
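
As a quick illustration, both of the following patterns denote the same two-character regular expression \\ (match one literal backslash); the raw string form is easier to read:

>>> import re
>>> re.search(r'\\', 'C:\\temp').group()   # raw string pattern
'\\'
>>> re.search('\\\\', 'C:\\temp').group()  # same pattern, no raw string notation
'\\'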

It is important to note that most regular expression operations are available as module-level functions and methods on compiled regular expressions. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.

Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we’ll write RE's in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using a \number notation such as '\x00'.

The special characters are:

'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag is specified, this matches any character including a newline.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' finds two (empty) matches: one before the newline, and one at the end of the string.
'*'
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b's.
'+'
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b's; it will not match only ‘a’.
'?'
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not only '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
{m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? only matches three characters.
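
For example, comparing the greedy and non-greedy forms:

>>> import re
>>> re.match(r'a{3,5}', 'aaaaaa').group()   # greedy: takes five characters
'aaaaa'
>>> re.match(r'a{3,5}?', 'aaaaaa').group()  # non-greedy: takes three
'aaa'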
'\'
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python's parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it's highly recommended that you use raw strings for all but the simplest expressions.
[]
Used to indicate a set of characters. In a set:

  • Characters can be listed individually, e.g., [amk] will match 'a', 'm', or 'k'.
  • Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g., [a\-z]) or if it's placed as the first or last character (e.g., [a-]), it will match a literal '-'.
  • Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
  • Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
  • Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.
  • To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.
'|'
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B is not tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match is performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
(?...)
This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.
(?aiLmsux)
(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. This is useful if you want to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, as if the group were not named.

Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e., matching a string quoted with either single or double quotes), the following table shows the context of reference to group "quote", and the ways to reference it:

Context Ways to reference
in the same pattern itself
  • (?P=quote) (as shown)
  • \1
when processing match object m
  • m.group('quote')
  • m.end('quote') (etc.)
in a string passed to the repl argument of re.sub()
  • \g<quote>
  • \g<1>
  • \1
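
For example, an illustrative session using the quoted-string pattern above:

>>> import re
>>> m = re.match(r"(?P<quote>['\"]).*?(?P=quote)", "'hello' world")
>>> m.group('quote')
"'"
>>> re.sub(r"(?P<quote>['\"]).*?(?P=quote)", r'\g<quote>...\g<quote>', "'hello' world")
"'...' world"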
(?P=name)
A backreference to a named group; it matches whatever text was matched by the earlier group named name.
(?#...)
A comment; the contents of the parentheses are ignored.
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for '...' that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def finds a match in abcdef since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
This example looks for a word following a hyphen:

>>> m = re.search('(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
(?(id/name)yes-pattern|no-pattern)
Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn't. The no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor e-mail matching pattern, which will match with '<user@host.com>' and 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it is not interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
\A
Matches only at the start of the string.
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default, Unicode alphanumerics are the ones used, but this can be changed using the ASCII flag. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.
\B
Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters are Unicode alphanumerics or the underscore, although this can be changed using the ASCII flag.
\d
For Unicode (str) patterns: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also other digit characters. If the ASCII flag is used only [0-9] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [0-9] may be a better choice).

For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].
\D
Matches any character that is not a Unicode decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9] (but the flag affects the entire regular expression, so in such cases using an explicit [^0-9] may be a better choice).
\s
For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).

For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\S
Matches any character that is not a Unicode whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v] (but the flag affects the entire regular expression, so in such cases using an explicit [^ \t\n\r\f\v] may be a better choice).
\w
For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
\W
Matches any character that is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).
\Z
Matches only at the end of the string.
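
For example, a short session showing the effect of the ASCII flag on \w:

>>> import re
>>> re.findall(r'\w+', 'naïve café')            # Unicode matching is the default
['naïve', 'café']
>>> re.findall(r'\w+', 'naïve café', re.ASCII)  # \w restricted to [a-zA-Z0-9_]
['na', 've', 'caf']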

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \u      \U
\v      \x      \\

(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)

'\u' and '\U' escape sequences are only recognized in Unicode patterns. In bytes patterns they are not treated specially.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

re module contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

re.compile(pattern,  flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression's behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)
is equivalent to

result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
re.A
re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.

Note that for backward compatibility, the re.U flag still exists (and its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).
re.DEBUG
Display debug information about compiled expression.
re.I
re.IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale and works for Unicode characters as expected.
re.L
re.LOCALE
Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.
re.M
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
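
For example:

>>> import re
>>> re.findall(r'^\w+', 'one\ntwo\nthree')
['one']
>>> re.findall(r'^\w+', 'one\ntwo\nthree', re.MULTILINE)
['one', 'two', 'three']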
re.S
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace in the pattern is ignored, except when in a character class or preceded by an unescaped backslash. Also, when a line contains a '#' that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

That means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
re.search(pattern,  string,  flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
re.match(pattern,  string,  flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() only matches at the beginning of the string and not at the beginning of each line.

To locate a match anywhere in string, use search() instead (see also search() vs. match()).
re.fullmatch(pattern,  string,  flags=0)
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
re.split(pattern,  string,  maxsplit=0,  flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split('(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
That way, separator components are always found at the same relative indices in the result list.

Note that split will never split a string on an empty pattern match. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
re.findall(pattern,  string,  flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
re.finditer(pattern,  string,  flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
re.sub(pattern,  repl,  string,  count=0,  flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. The repl can be a string or a function; if it's a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
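
For example, an illustrative substitution that swaps two named groups:

>>> re.sub(r'(?P<first>\w+) (?P<last>\w+)', r'\g<last> \g<first>', 'Isaac Newton')
'Newton Isaac'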
re.subn(pattern,  repl,  string,  count=0,  flags=0)
Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
re.escape(pattern)
Escape all the characters in pattern except ASCII letters, numbers and '_'. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
re.purge()
Clear the regular expression cache.
exception re.error
Exception raised when a string passed to one of the functions here is not a valid regular expression (for example, it might contain unmatched parentheses) or when some other error occurs during compilation or matching. It is never an error if a string contains no match for a pattern.

Regular expression objects

Compiled regular expression objects support the following methods and attributes:

regex.search(string [, pos [, endpos]])
Scan through string looking for a location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

>>> pattern = re.compile("d")>>> pattern.search("dog")     # Match at index 0<_sre.SRE_Match object; span=(0, 1), match='d'>>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
regex.match(string [, pos [, endpos]])
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.
regex.fullmatch(string [, pos [, endpos]])
If the whole string matches this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o[gh]")>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".>>> pattern.fullmatch("ogre")     # No match as not the full string matches.>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.<_sre.SRE_Match object; span=(1, 3), match='og'>
regex.split(string,  maxsplit=0)
Identical to the split() function, using the compiled pattern.
regex.findall(string [, pos [, endpos]])
Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().
regex.finditer(string [, pos [, endpos]])
Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().
regex.sub(repl,  string,  count=0)
Identical to the sub() function, using the compiled pattern.
regex.subn(repl,  string,  count=0)
Identical to the subn() function, using the compiled pattern.
regex.flags
The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.
regex.groups
The number of capturing groups in the pattern.
regex.groupindex
A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.
regex.pattern
The pattern string from which the RE object was compiled.
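As a quick sketch of these attributes on a small, invented pattern:

>>> p = re.compile(r"(?P<word>\w+) (\d+)")
>>> p.groups
2
>>> dict(p.groupindex)
{'word': 1}
>>> p.pattern
'(?P<word>\\w+) (\\d+)'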

Match Objects

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

match.expand(template)
Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.
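A small sketch of expand(); the template strings here are invented for the example:

>>> m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")
>>> m.expand(r"\2, \1")
'Newton, Isaac'
>>> m.expand(r"\g<last>, \g<first>")
'Newton, Isaac'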
match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it's in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")>>> m.group(0)       # The entire match'Isaac Newton'>>> m.group(1)       # The first parenthesized subgroup.'Isaac'>>> m.group(2)       # The second parenthesized subgroup.'Newton'>>> m.group(1, 2)    # Multiple arguments give us a tuple.('Isaac', 'Newton')
If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")>>> m.group('first_name')'Malcolm'>>> m.group('last_name')'Reynolds'
Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'
If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.>>> m.group(1)                        # Returns only the last match.'c3'
match.groups(default=None)
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")>>> m.groups()('24', '1632')
If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")>>> m.groups()      # Second group defaults to None.('24', None)>>> m.groups('0')   # Now, the second group defaults to '0'.('24', '0')
match.groupdict(default=None)
Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")>>> m.groupdict(){'first_name': 'Malcolm', 'last_name': 'Reynolds'}
match.start([group])
match.end([group])
Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from e-mail addresses:

>>> email = "tony@tiremove_thisger.net">>> m = re.search("remove_this", email)>>> email[:m.start()] + email[m.end():]'[email protected]'
match.span([group])
For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). The group defaults to zero, the entire match.
match.pos
The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.
match.endpos
The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.
match.lastindex
The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2 if applied to the same string.
match.lastgroup
The name of the last matched capturing group, or None if the group didn’t have a name, or if no group was matched at all.
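A brief sketch of lastindex and lastgroup on expressions like those mentioned above:

>>> re.match("(a)b", "ab").lastindex
1
>>> re.match("(a)(b)", "ab").lastindex
2
>>> re.match("((a)(b))", "ab").lastindex
1
>>> re.match("(a)(?P<second>b)", "ab").lastgroup
'second'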
match.re
The regular expression object whose match() or search() method produced this match instance.
match.string
The string passed to match() or search().

Examples

Checking for a Pair

In this example, we’ll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))    # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair.match("717ak").group(1)
'7'
# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'
>>> pair.match("354aa").group(1)
'a'

Simulating scanf()

Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() token     regular expression
%c                .
%5c               .{5}
%d                [-+]?\d+
%e, %E, %f, %g    [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i                [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o                [-+]?[0-7]+
%s                \S+
%u                \d+
%x, %X            [-+]?(0[xX])?[\dA-Fa-f]+

To extract the file name and numbers from a string like:

/usr/sbin/sendmail - 0 errors, 4 warnings

...where you would use a scanf() format similar to:

%s - %d errors, %d warnings

...the equivalent regular expression would be:

(\S+) - (\d+) errors, (\d+) warnings
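Putting that together, a minimal sketch that extracts the fields with the regular expression above:

>>> line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
>>> m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)
>>> m.groups()
('/usr/sbin/sendmail', '0', '4')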

search() vs. match()

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")  # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

>>> re.match("c", "abcdef")  # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef")  # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>

Note, however, that in MULTILINE mode, match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' matches at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>

Making a phonebook

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python, as demonstrated in the following example that creates a phonebook.

First, here is the input. It would normally come from a file; here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

>>> [re.split(":? ", entry, 3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, 4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

Text munging

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

Finding all adverbs

findall() matches all occurrences of a pattern, not only the first one as search() does. For example, a writer who wanted to find all of the adverbs in some text might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

Finding all adverbs and their positions

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

Raw string notation

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object; span=(0, 1), match='\\'>

Writing a tokenizer

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single primary regular expression and to loop over successive matches:

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+*\/\-]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]'),        # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start() - line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' % (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The tokenizer produces the following output:

Token(typ='IF', value='IF', line=2, column=5)
Token(typ='ID', value='quantity', line=2, column=8)
Token(typ='THEN', value='THEN', line=2, column=17)
Token(typ='ID', value='total', line=3, column=9)
Token(typ='ASSIGN', value=':=', line=3, column=15)
Token(typ='ID', value='total', line=3, column=18)
Token(typ='OP', value='+', line=3, column=24)
Token(typ='ID', value='price', line=3, column=26)
Token(typ='OP', value='*', line=3, column=32)
Token(typ='ID', value='quantity', line=3, column=34)
Token(typ='END', value=';', line=3, column=42)
Token(typ='ID', value='tax', line=4, column=9)
Token(typ='ASSIGN', value=':=', line=4, column=13)
Token(typ='ID', value='price', line=4, column=16)
Token(typ='OP', value='*', line=4, column=22)
Token(typ='NUMBER', value='0.05', line=4, column=24)
Token(typ='END', value=';', line=4, column=28)
Token(typ='ENDIF', value='ENDIF', line=5, column=5)
Token(typ='END', value=';', line=5, column=10)

difflib — helpers for computing deltas

This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

class difflib.SequenceMatcher
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence containing no “junk” elements (the Ratcliff and Obershelp algorithm doesn’t address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each item appears in the sequence. If an item's duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
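For example, a quick similarity check (strings invented for this sketch):

>>> from difflib import SequenceMatcher
>>> round(SequenceMatcher(None, "abcd", "abcd abcd").ratio(), 3)
0.615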
class difflib.Differ
This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.

Each line of a Differ delta begins with a two-letter code:

Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

Lines beginning with '? ' attempt to guide the eye to intraline differences, and were not present in either input sequence. These lines can be confusing if the sequences contain tab characters.
class difflib.HtmlDiff
This class can be used to create an HTML table (or a complete HTML file containing the table) showing a side by side, line by line comparison of text with inter-line and intra-line change highlights. The table can be generated in either full or contextual difference mode.

The constructor for this class is:

__init__(tabsize=8,  wrapcolumn=None,  linejunk=None,  charjunk=IS_CHARACTER_JUNK)
Initializes instance of HtmlDiff.

tabsize is an optional keyword argument to specify tab stop spacing and defaults to 8.

wrapcolumn is an optional keyword to specify column number where lines are broken and wrapped, defaults to None where lines are not wrapped.

The linejunk and charjunk are optional keyword arguments passed into ndiff() (used by HtmlDiff to generate the side by side HTML differences). See ndiff() documentation for argument default values and descriptions.
The following methods are public:

make_file(fromlines,  tolines,  fromdesc='',  todesc='',  context=False,  numlines=5)
Compares fromlines and tolines (lists of strings) and returns a string that is a complete HTML file containing a table showing line by line differences with inter-line and intra-line changes highlighted.

fromdesc and todesc are optional keyword arguments to specify from/to file column header strings (both default to an empty string).

context and numlines are both optional keyword arguments. Set context to True when contextual differences are to be shown, else the default is False to show the full files. The numlines defaults to 5. When context is True numlines controls the number of context lines which surround the difference highlights. When context is False numlines controls the number of lines that are shown before a difference highlight when using the “next” hyperlinks (setting to zero would cause the “next” hyperlinks to place the next difference highlight at the top of the browser without any leading context).
make_table(fromlines,  tolines,  fromdesc='',  todesc='',  context=False,  numlines=5)
Compares fromlines and tolines (lists of strings) and returns a string that is a complete HTML table showing line by line differences with inter-line and intra-line changes highlighted.

The arguments for this method are the same as those for the make_file() method.
Tools/scripts/diff.py is a command-line front-end to this class and contains a good example of its use.
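As a hedged sketch of make_file() in use (the file names and contents here are invented):

import difflib

a = ['one\n', 'two\n', 'three\n']
b = ['one\n', 'tree\n', 'emu\n']
html = difflib.HtmlDiff(wrapcolumn=70).make_file(a, b, 'before.txt', 'after.txt')
with open('diff.html', 'w') as f:   # Open the result in a browser to view it.
    f.write(html)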
difflib.context_diff(a,  b,  fromfile='',  tofile='',  fromfiledate='',  tofiledate='',  n=3,  lineterm='\n')
Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in context diff format.

Context diffs are a compact way of showing only the lines that have changed plus a few lines of context. The changes are shown in a before/after style. The number of context lines is set by n which defaults to three.

By default, the diff control lines (those with *** or ---) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.

For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.

The context diff format normally has a header for file names and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.

>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
...     sys.stdout.write(line)
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
difflib.get_close_matches(word,  possibilities,  n=3,  cutoff=0.6)
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (often a string), and possibilities is a list of sequences against which to match word (often a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.

>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
difflib.ndiff(a,  b,  linejunk=None,  charjunk=IS_CHARACTER_JUNK)
Compare a and b (lists of strings); return a Differ-style delta (a generator generating the delta lines).

Optional keyword parameters linejunk and charjunk are for filter functions (or None):

linejunk: A function that accepts a single string argument, and returns true if the string is junk, or false if not. The default is None. There is also a module-level function IS_LINE_JUNK(), which filters out lines without visible characters, except for at most one pound character ('#') – however the underlying SequenceMatcher class does a dynamic analysis of which lines are so frequent as to constitute noise, and this usually works better than using this function.

charjunk: A function that accepts a character (a string of length 1), and returns true if the character is junk, or false if not. The default is the module-level function IS_CHARACTER_JUNK(), which filters out whitespace characters (a blank or tab; note: it's a bad idea to include newline in this!).

Tools/scripts/ndiff.py is a command-line front-end to this function.

>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> print(''.join(diff), end="")
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
difflib.restore(sequence,  which)
Return one of the two sequences that generated a delta.

Given a sequence produced by Differ.compare() or ndiff(), extract lines originating from file 1 or 2 (parameter which), stripping off line prefixes.

Example:

>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> diff = list(diff)  # materialize the generated delta into a list
>>> print(''.join(restore(diff, 1)), end="")
one
two
three
>>> print(''.join(restore(diff, 2)), end="")
ore
tree
emu
difflib.unified_diff(a,  b,  fromfile='',  tofile='',  fromfiledate='',  tofiledate='',  n=3,  lineterm='\n')
Compare a and b (lists of strings); return a delta (a generator generating the delta lines) in unified diff format.

Unified diffs are a compact way of showing only the lines that have changed plus a few lines of context. The changes are shown in an inline style (instead of separate before/after blocks). The number of context lines is set by n, which defaults to three.

By default, the diff control lines (those with ---, +++, or @@) are created with a trailing newline. This is helpful so that inputs created from io.IOBase.readlines() result in diffs that are suitable for use with io.IOBase.writelines() since both the inputs and outputs have trailing newlines.

For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free.

The unified diff format normally has a header for file names and modification times. Any or all of these may be specified using strings for fromfile, tofile, fromfiledate, and tofiledate. The modification times are normally expressed in the ISO 8601 format. If not specified, the strings default to blanks.

>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> for line in unified_diff(s1, s2, fromfile='before.py', tofile='after.py'):
...     sys.stdout.write(line)
--- before.py
+++ after.py
@@ -1,4 +1,4 @@
-bacon
-eggs
-ham
+python
+eggy
+hamster
 guido
difflib.IS_LINE_JUNK(line)
Return true for ignorable lines. The line is ignorable if line is blank or contains a single '#', otherwise it is not ignorable. Used as a default for parameter linejunk in ndiff() in older versions.
difflib.IS_CHARACTER_JUNK(ch)
Return true for ignorable characters. The character ch is ignorable if ch is a space or tab, otherwise it is not ignorable. Used as a default for parameter charjunk in ndiff().

SequenceMatcher objects

The SequenceMatcher class has this constructor:

class difflib.SequenceMatcher(isjunk=None,  a='',  b='',  autojunk=True)
Optional argument isjunk must be None (the default) or a one-argument function that takes a sequence element and returns true if and only if the element is “junk” and should be ignored. Passing None for isjunk is equivalent to passing lambda x: 0; in other words, no elements are ignored. For example, pass:

lambda x: x in " \t"
if you’re comparing lines as sequences of characters, and don’t want to synch up on blanks or hard tabs.

The optional arguments a and b are sequences to be compared; both default to empty strings. The elements of both sequences must be hashable.

The optional argument autojunk can be used to disable the automatic junk heuristic.

SequenceMatcher objects get three data attributes: bjunk is the set of elements of b for which isjunk is True; bpopular is the set of non-junk elements considered popular by the heuristic (if it's not disabled); b2j is a dict mapping the remaining elements of b to a list of positions where they occur. All three are reset whenever b is reset with set_seqs() or set_seq2().

SequenceMatcher objects have the following methods:

set_seqs(a, b)
Set the two sequences to be compared.

(SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.)

set_seq1(a)
Set the first sequence to be compared. The second sequence to be compared is not changed.
set_seq2(b)
Set the second sequence to be compared. The first sequence to be compared is not changed.
find_longest_match(alo,  ahi,  blo,  bhi)
Find longest matching block in a[alo:ahi] and b[blo:bhi].

If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] equals b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b.

>>> s = SequenceMatcher(None, " abcd", "abcd abcd")>>> s.find_longest_match(0, 5, 0, 9)Match(a=0, b=4, size=5)
If isjunk was provided, first the longest matching block is determined as above, but with the additional restriction that no junk element appears in the block. Then that block is extended as far as possible by matching (only) junk elements on both sides. So the resulting block never matches on junk except as identical junk happens to be adjacent to an interesting match.

Here's the same example as before, but considering blanks to be junk. That prevents ' abcd' from matching the ' abcd' at the tail end of the second sequence directly. Instead only the 'abcd' can match, and it matches the leftmost 'abcd' in the second sequence:

>>> s = SequenceMatcher(lambda x: x == " ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
Match(a=1, b=0, size=4)
If no blocks match, this returns (alo, blo, 0).

This method returns a named tuple Match(a, b, size).
get_matching_blocks()
Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j.

The last triple is a dummy, and has the value (len(a), len(b), 0). It is the only triple with n == 0. If (i, j, n) and (i', j', n') are adjacent triples in the list, and the second is not the last triple in the list, then i+n != i' or j+n != j'; in other words, adjacent triples always describe non-adjacent equal blocks.

>>> s = SequenceMatcher(None, "abxcd", "abcd")>>> s.get_matching_blocks()[Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
get_opcodes()

Return list of 5-tuples describing how to turn a into b. Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple has i1 == j1 == 0, and remaining tuples have i1 equal to the i2 from the preceding tuple, and, likewise, j1 equal to the previous j2.

The tag values are strings, with these meanings:

Value Meaning
'replace' a[i1:i2] should be replaced by b[j1:j2].
'delete' a[i1:i2] should be deleted. Note that j1 == j2 in this case.
'insert' b[j1:j2] should be inserted at a[i1:i1]. Note that i1 == i2 in this case.
'equal' a[i1:i2] == b[j1:j2] (the sub-sequences are equal).
For example:

>>> a = "qabxcd"
>>> b = "abycdf"
>>> s = SequenceMatcher(None, a, b)
>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
...     print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
...         tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
Result:

delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'
get_grouped_opcodes(n=3)
Return a generator of groups with up to n lines of context.

Starting with the groups returned by get_opcodes(), this method splits out smaller change clusters and eliminates intervening ranges which have no changes.

The groups are returned in the same format as get_opcodes().
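A small sketch of get_grouped_opcodes() with one line of context (the data is invented; each group is a list of 5-tuples):

>>> from difflib import SequenceMatcher
>>> a = 'one two three four five'.split()
>>> b = 'one two tree four five'.split()
>>> for group in SequenceMatcher(None, a, b).get_grouped_opcodes(1):
...     print(group)
[('equal', 1, 2, 1, 2), ('replace', 2, 3, 2, 3), ('equal', 3, 4, 3, 4)]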
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.
quick_ratio()
Return an upper bound on ratio() relatively quickly.
real_quick_ratio()
Return an upper bound on ratio() very quickly.

The three methods that return the ratio of matching to total characters can give different results due to differing levels of approximation, although quick_ratio() and real_quick_ratio() are always at least as large as ratio():

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0

Examples

This example compares two strings, considering blanks to be “junk”:

>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")

ratio() returns a float in [0, 1], measuring the similarity of the sequences. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches:

>>> print(round(s.ratio(), 3))
0.866

If you’re only interested in where the sequences match, get_matching_blocks() is handy:

>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by get_matching_blocks() is always a dummy, (len(a), len(b), 0), and this is the only case in which the last tuple element (number of elements matched) is 0.

To know how to change the first sequence into the second, use get_opcodes():

>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]

Differ objects

Note that Differ-generated deltas make no claim to be minimal diffs. To the contrary, minimal diffs are often counter-intuitive, because they synch up anywhere possible, sometimes on accidental matches 100 pages apart. Restricting synch points to contiguous matches preserves some notion of locality, at the occasional cost of producing a longer diff.

The Differ class has this constructor:

class difflib.Differ(linejunk=None,  charjunk=None)
Optional keyword parameters linejunk and charjunk are for filter functions (or None):

linejunk: A function that accepts a single string argument, and returns true if the string is junk. The default is None, meaning that no line is considered junk.

charjunk: A function that accepts a single character argument (a string of length 1), and returns true if the character is junk. The default is None, meaning that no character is considered junk.

Differ objects are used (deltas generated) via a single method:

compare(a, b)
Compare two sequences of lines, and generate the delta (a sequence of lines).

Each sequence must contain individual single-line strings ending with newlines. Such sequences can be obtained from the readlines() method of file-like objects. The delta generated also consists of newline-terminated strings, ready to be printed as-is via the writelines() method of a file-like object.

Examples

This example compares two texts. First we set up the texts, sequences of individual single-line strings ending with newlines (such sequences can also be obtained from the readlines() method of file-like objects):

>>> text1 = '''  1. Beautiful is better than ugly.
...   2. Explicit is better than implicit.
...   3. Simple is better than complex.
...   4. Complex is better than complicated.
... '''.splitlines(keepends=True)
>>> len(text1)
4
>>> text1[0][-1]
'\n'
>>> text2 = '''  1. Beautiful is better than ugly.
...   3.   Simple is better than complex.
...   4. Complicated is better than complex.
...   5. Flat is better than nested.
... '''.splitlines(keepends=True)

Next we instantiate a Differ object:

>>> d = Differ()

Note that when instantiating a Differ object we may pass functions to filter out line and character “junk.” See the Differ() constructor for details.

Finally, we compare the two:

>>> result = list(d.compare(text1, text2))

result is a list of strings, so let's pretty-print it:

>>> from pprint import pprint
>>> pprint(result)
['    1. Beautiful is better than ugly.\n',
 '-   2. Explicit is better than implicit.\n',
 '-   3. Simple is better than complex.\n',
 '+   3.   Simple is better than complex.\n',
 '?     ++\n',
 '-   4. Complex is better than complicated.\n',
 '?            ^                     ---- ^\n',
 '+   4. Complicated is better than complex.\n',
 '?           ++++ ^                      ^\n',
 '+   5. Flat is better than nested.\n']

As a single multi-line string it looks like this:

>>> import sys
>>> sys.stdout.writelines(result)
    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
-   3. Simple is better than complex.
+   3.   Simple is better than complex.
?     ++
-   4. Complex is better than complicated.
?            ^                     ---- ^
+   4. Complicated is better than complex.
?           ++++ ^                      ^
+   5. Flat is better than nested.

A command-line interface to difflib

This example shows how to use difflib to create a diff-like utility. It is also contained in the Python source distribution, as Tools/scripts/diff.py.

""" Command line interface to difflib.py providing diffs in four formats:
* ndiff:    lists every line and highlights interline changes.
* context:  highlights clusters of changes in a before/after format.
* unified:  highlights clusters of changes in an inline format.
* html:     generates side by side comparison with change highlights.
"""
import sys, os, time, difflib, optparse
def main():
    # Configure the option parser.
    usage = "usage: %prog [options] fromfile tofile"
    parser = optparse.OptionParser(usage)
    parser.add_option("-c", action="store_true", default=False,
                      help='Produce a context format diff (default)')
    parser.add_option("-u", action="store_true", default=False,
                      help='Produce a unified format diff')
    hlp = 'Produce HTML side by side diff (can use -c and -l in conjunction)'
    parser.add_option("-m", action="store_true", default=False, help=hlp)
    parser.add_option("-n", action="store_true", default=False,
                      help='Produce a ndiff format diff')
    parser.add_option("-l", "--lines", type="int", default=3,
                      help='Set number of context lines (default 3)')
    (options, args) = parser.parse_args()
    if len(args) == 0:
        parser.print_help()
        sys.exit(1)
    if len(args) != 2:
        parser.error("need to specify both a fromfile and tofile")
    n = options.lines
    fromfile, tofile = args # as specified in the usage string
    # we're passing these as arguments to the diff function
    fromdate = time.ctime(os.stat(fromfile).st_mtime)
    todate = time.ctime(os.stat(tofile).st_mtime)
    with open(fromfile) as fromf, open(tofile) as tof:
        fromlines, tolines = list(fromf), list(tof)
    if options.u:
        diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile,
                                    fromdate, todate, n=n)
    elif options.n:
        diff = difflib.ndiff(fromlines, tolines)
    elif options.m:
        diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile,
                                            tofile, context=options.c,
                                            numlines=n)
    else:
        diff = difflib.context_diff(fromlines, tolines, fromfile, tofile,
                                    fromdate, todate, n=n)
    # we're using writelines because diff is a generator
    sys.stdout.writelines(diff)
if __name__ == '__main__':
    main()

textwrap — Text wrapping and filling

The textwrap module provides some convenience functions, and TextWrapper, the class that does all the work. If you’re only wrapping or filling one or two text strings, the convenience functions should be good enough; otherwise, use an instance of TextWrapper for efficiency.

textwrap.wrap(text, width=70, **kwargs)
Wraps the single paragraph in text (a string) so every line is at most width characters long. Returns a list of output lines, without final newlines.

Optional keyword arguments correspond to the instance attributes of TextWrapper, documented below. width defaults to 70.

See the TextWrapper.wrap() method for additional details on how wrap() behaves.
textwrap.fill(text, width=70, **kwargs)
Wraps the single paragraph in text, and returns a single string containing the wrapped paragraph. fill() is shorthand for

"\n".join(wrap(text, ...))
In particular, fill() accepts exactly the same keyword arguments as wrap().
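For instance, a short sketch contrasting wrap() and fill() (sample text invented):

>>> import textwrap
>>> text = "The quick brown fox jumps over the lazy dog."
>>> textwrap.wrap(text, width=20)
['The quick brown fox', 'jumps over the lazy', 'dog.']
>>> print(textwrap.fill(text, width=20))
The quick brown fox
jumps over the lazy
dog.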
textwrap.shorten(text, width, **kwargs)
Collapse and truncate the given text to fit in the given width.

First the whitespace in text is collapsed (all whitespace is replaced by single spaces). If the result fits in the width, it is returned. Otherwise, enough words are dropped from the end so that the remaining words plus the placeholder fit within width:

>>> textwrap.shorten("Hello  world!", width=12)'Hello world!'>>> textwrap.shorten("Hello  world!", width=11)'Hello [...]'>>> textwrap.shorten("Hello world", width=10, placeholder="...")'Hello...'
Optional keyword arguments correspond to the instance attributes of TextWrapper, documented below. Note that the whitespace is collapsed before the text is passed to the TextWrapper fill() function, so changing the value of tabsize, expand_tabs, drop_whitespace, and replace_whitespace has no effect.
textwrap.dedent(text)
Remove any common leading whitespace from every line in text.

This can be used to make triple-quoted strings line up with the left edge of the display, while still presenting them in the source code in indented form.

Note that tabs and spaces are both treated as whitespace, but they are not equal: the lines "hello" and "\thello" are considered to have no common leading whitespace.

For example:

def test():
    # end first line with \ to avoid the empty line!
    s = '''\
    hello
      world
    '''
    print(repr(s))          # prints '    hello\n      world\n    '
    print(repr(dedent(s)))  # prints 'hello\n  world\n'
textwrap.indent(text,  prefix,  predicate=None)
Add prefix to the beginning of selected lines in text.

Lines are separated by calling text.splitlines(True).

By default, prefix is added to all lines that do not consist solely of whitespace (including any line endings).

For example:

>>> s = 'hello\n\n \nworld'
>>> print(indent(s, '+ ', lambda line: True))
+ hello
+
+ 
+ world
wrap(), fill() and shorten() work by creating a TextWrapper instance and calling a single method on it. That instance is not reused, so for applications that process many text strings using wrap() and/or fill(), it may be more efficient to create a TextWrapper object.

Text is preferably wrapped on whitespaces and right after the hyphens in hyphenated words; only then will long words be broken if necessary, unless TextWrapper.break_long_words is set to false.
class textwrap.TextWrapper(**kwargs)
The TextWrapper constructor accepts some optional keyword arguments. Each keyword argument corresponds to an instance attribute, so for example

wrapper = TextWrapper(initial_indent="* ")
is the same as

wrapper = TextWrapper()
wrapper.initial_indent = "* "
You can re-use the same TextWrapper object many times, and you can change any of its options through direct assignment to instance attributes between uses.

The TextWrapper instance attributes (and keyword arguments to the constructor) are as follows:

width
(default: 70) The maximum length of wrapped lines. As long as there are no individual words in the input text longer than width, TextWrapper guarantees that no output line will be longer than width characters.
expand_tabs
(default: True) If true, then all tab characters in text will be expanded to spaces using the expandtabs() method of text.
tabsize
(default: 8) If expand_tabs is true, then all tab characters in text will be expanded to zero or more spaces, depending on the current column and the given tab size.
replace_whitespace
(default: True) If true, after tab expansion but before wrapping, the wrap() method replaces each whitespace character with a single space. The whitespace characters replaced are as follows: tab, newline, vertical tab, formfeed, and carriage return ('\t\n\v\f\r').

Note: If expand_tabs is false and replace_whitespace is true, each tab character will be replaced by a single space, which is not the same as tab expansion.

Note: If replace_whitespace is false, newlines may appear in the middle of a line and cause strange output. For this reason, text should be split into paragraphs (using str.splitlines() or similar) which are wrapped separately.
drop_whitespace
(default: True) If true, whitespace at the beginning and ending of every line (after wrapping but before indenting) is dropped. Whitespace at the beginning of the paragraph, however, is not dropped if non-whitespace follows it. If whitespace being dropped takes up an entire line, the whole line is dropped.
initial_indent
(default: '') String that will be prepended to the first line of wrapped output. Counts towards the length of the first line. The empty string is not indented.
subsequent_indent
(default: '') String that will be prepended to all lines of wrapped output except the first. Counts towards the length of each line except the first.
fix_sentence_endings
(default: False) If true, TextWrapper attempts to detect sentence endings and ensure that sentences are always separated by exactly two spaces. This is generally desired for text in a monospaced font. However, the sentence detection algorithm is imperfect: it assumes that a sentence ending consists of a lowercase letter followed by one of '.', '!', or '?', possibly followed by one of '"' or "'", followed by a space. One problem with this algorithm is that it is unable to detect the difference between “Dr.” in

[...] Dr. Frankenstein's monster [...]
and “Spot.” in

[...] See Spot. See Spot run [...]
fix_sentence_endings is false by default.

Since the sentence detection algorithm relies on string.lowercase for the definition of “lowercase letter,” and a convention of using two spaces after a period to separate sentences on the same line, it is specific to English-language texts.
break_long_words
(default: True) If true, then words longer than width will be broken to ensure that no lines are longer than width. If it's false, long words are not broken, and some lines may be longer than width. (Long words will be put on a line by themselves, to minimize the amount by which width is exceeded.)
break_on_hyphens
(default: True) If true, wrapping occurs preferably on whitespaces and right after hyphens in compound words, as it is customary in English. If false, only whitespaces will be considered as potentially good places for line breaks, but you need to set break_long_words to false if you want truly insecable words. Default behaviour in previous versions was to always allow breaking hyphenated words.
max_lines
(default: None) If not None, then the output contains at most max_lines lines, with placeholder appearing at the end of the output.
placeholder
(default: '[...]') String that appears at the end of the output text if it's truncated.
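As a quick sketch combining several of the attributes above (the values are invented):

>>> import textwrap
>>> wrapper = textwrap.TextWrapper(width=30, initial_indent="* ",
...                                subsequent_indent="  ")
>>> print(wrapper.fill("These attributes can be mixed freely on one wrapper."))
* These attributes can be
  mixed freely on one wrapper.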

TextWrapper also provides some public methods, analogous to the module-level convenience functions:

wrap(text)
Wraps the single paragraph in text (a string) so every line is at most width characters long. All wrapping options are taken from instance attributes of the TextWrapper instance. Returns a list of output lines, without final newlines. If the wrapped output has no content, the returned list is empty.
fill(text)
Wraps the single paragraph in text, and returns a single string containing the wrapped paragraph.

unicodedata: Unicode database

This module provides access to the UCD (Unicode Character Database) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 6.3.0.

The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”. It defines the following functions:

unicodedata.lookup(name)
Look up character by name. If a character with the given name is found, return the corresponding character. If not found, KeyError is raised.
unicodedata.name(chr [, default])
Returns the name assigned to the character chr as a string. If no name is defined, default is returned, or, if not given, ValueError is raised.
unicodedata.decimal(chr [, default])
Returns the decimal value assigned to the character chr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.
unicodedata.digit(chr [, default])
Returns the digit value assigned to the character chr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.
unicodedata.numeric(chr [, default])
Returns the numeric value assigned to the character chr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.
unicodedata.category(chr)
Returns the general category assigned to the character chr as string.
unicodedata.bidirectional(chr)
Returns the bidirectional class assigned to the character chr as string. If no such value is defined, an empty string is returned.
unicodedata.combining(chr)
Returns the canonical combining class assigned to the character chr as integer. Returns 0 if no combining class is defined.
unicodedata.east_asian_width(chr)
Returns the east asian width assigned to the character chr as string.
unicodedata.mirrored(chr)
Returns the mirrored property assigned to the character chr as integer. Returns 1 if the character is identified as a “mirrored” character in bidirectional text, 0 otherwise.
unicodedata.decomposition(chr)
Returns the character decomposition mapping assigned to the character chr as string. An empty string is returned in case no such mapping is defined.
unicodedata.normalize(form,  unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g., gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e., replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two Unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn't, they may not compare equal.
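A short sketch of this equivalence, using the C-with-cedilla example above:

>>> import unicodedata
>>> single = '\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
>>> combined = 'C\u0327'   # 'C' followed by COMBINING CEDILLA
>>> single == combined
False
>>> unicodedata.normalize('NFC', combined) == single
True
>>> unicodedata.normalize('NFD', single) == combined
True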

Also, the module exposes the following constant:

unicodedata.unidata_version
The version of the Unicode database used in this module.
unicodedata.ucd_3_2_0
This is an object with the same methods as the entire module, but uses the Unicode database version 3.2 instead, for applications that require this specific version of the Unicode database (such as IDNA).

Examples:

>>> import unicodedata
>>> unicodedata.lookup('LEFT CURLY BRACKET')
'{'
>>> unicodedata.name('/')
'SOLIDUS'
>>> unicodedata.decimal('9')
9
>>> unicodedata.decimal('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
'Lu'
>>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
'AN'

stringprep — Internet string preparation

When identifying things (such as hostnames) in the Internet, it is often necessary to compare such identifications for “equality”. Exactly how this comparison is executed may depend on the application domain, e.g., whether it should be case-insensitive or not. It may be also necessary to restrict the possible identifications, to allow only identifications consisting of “printable” characters.

RFC 3454 defines a procedure for “preparing” Unicode strings in Internet protocols. Before passing strings onto the wire, they are processed with the preparation procedure, after which they have a certain normalized form. The RFC defines a set of tables, which can be combined into profiles. Each profile must define which tables it uses, and what other optional parts of the stringprep procedure are part of the profile. One example of a stringprep profile is nameprep, which is used for internationalized domain names.

The module stringprep only exposes the tables from RFC 3454. As these tables would be very large to represent as dictionaries or lists, the module uses the Unicode character database internally. The module source code itself was generated using the mkstringprep.py utility.

As a result, these tables are exposed as functions, not as data structures. There are two kinds of tables in the RFC: sets and mappings. For a set, stringprep provides the “characteristic function”, i.e., a function that returns true if the parameter is part of the set. For mappings, it provides the mapping function: given the key, it returns the associated value. Below is a list of all functions available in the module.

stringprep.in_table_a1(code)
Determine whether code is in table A.1 (Unassigned code points in Unicode 3.2).
stringprep.in_table_b1(code)
Determine whether code is in table B.1 (Commonly mapped to nothing).
stringprep.map_table_b2(code)
Return the mapped value for code according to table B.2 (Mapping for case-folding used with NFKC).
stringprep.map_table_b3(code)
Return the mapped value for code according to table B.3 (Mapping for case-folding used with no normalization).
stringprep.in_table_c11(code)
Determine whether code is in table C.1.1 (ASCII space characters).
stringprep.in_table_c12(code)
Determine whether code is in table C.1.2 (Non-ASCII space characters).
stringprep.in_table_c11_c12(code)
Determine whether code is in table C.1 (Space characters, union of C.1.1 and C.1.2).
stringprep.in_table_c21(code)
Determine whether code is in table C.2.1 (ASCII control characters).
stringprep.in_table_c22(code)
Determine whether code is in table C.2.2 (Non-ASCII control characters).
stringprep.in_table_c21_c22(code)
Determine whether code is in table C.2 (Control characters, union of C.2.1 and C.2.2).
stringprep.in_table_c3(code)
Determine whether code is in table C.3 (Private use).
stringprep.in_table_c4(code)
Determine whether code is in table C.4 (Non-character code points).
stringprep.in_table_c5(code)
Determine whether code is in table C.5 (Surrogate codes).
stringprep.in_table_c6(code)
Determine whether code is in table C.6 (Inappropriate for plain text).
stringprep.in_table_c7(code)
Determine whether code is in table C.7 (Inappropriate for canonical representation).
stringprep.in_table_c8(code)
Determine whether code is in table C.8 (Change display properties or are deprecated).
stringprep.in_table_c9(code)
Determine whether code is in table C.9 (Tagging characters).
stringprep.in_table_d1(code)
Determine whether code is in table D.1 (Characters with bidirectional property “R” or “AL”).
stringprep.in_table_d2(code)
Determine whether code is in table D.2 (Characters with bidirectional property “L”).
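For example, a set table answers a membership question and a mapping table returns the replacement string. A minimal interactive sketch (the characters below are chosen only for illustration):

>>> import stringprep
>>> stringprep.in_table_c12('\u00a0')  # NO-BREAK SPACE is a non-ASCII space character
True
>>> stringprep.in_table_c12('x')
False
>>> stringprep.map_table_b2('A')       # case-folding map used with NFKC
'a'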

readline — GNU readline interface

The readline module defines many functions to facilitate completion and reading/writing of history files from the Python interpreter. This module can be used directly or via the rlcompleter module. Settings made using this module affect the behaviour of both the interpreter's interactive prompt and the prompts offered by the built-in input() function.

Note

On macOS, the readline module may be implemented using the libedit library instead of GNU readline.

The configuration file for libedit is different from that of GNU readline. If you programmatically load configuration strings, you can check for the text “libedit” in readline.__doc__ to differentiate between GNU readline and libedit.
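For instance, a common pattern for handling both backends looks like the sketch below; the libedit binding string shown is the usual equivalent of "tab: complete":

import readline

if readline.__doc__ and "libedit" in readline.__doc__:
    # libedit (e.g., the macOS backend) uses its own binding syntax
    readline.parse_and_bind("bind ^I rl_complete")
else:
    # GNU readline syntax
    readline.parse_and_bind("tab: complete")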

The readline module defines the following functions:

readline.parse_and_bind(string)
Parse and execute single line of a readline init file.
readline.get_line_buffer()
Return the current contents of the line buffer.
readline.insert_text(string)
Insert text into the command line.
readline.read_init_file([filename])
Parse a readline initialization file. The default filename is the last filename used.
readline.read_history_file([filename])
Load a readline history file. The default filename is ~/.history.
readline.write_history_file([filename])
Save a readline history file. The default filename is ~/.history.
readline.clear_history()
Clear the current history. (Note: this function is not available if the installed version of GNU readline doesn’t support it.)
readline.get_history_length()
Return the desired length of the history file. Negative values imply unlimited history file size.
readline.set_history_length(length)
Set the number of lines to save in the history file. write_history_file() uses this value to truncate the history file when saving. Negative values imply unlimited history file size.
readline.get_current_history_length()
Return the number of lines currently in the history. (This is different from get_history_length(), which returns the maximum number of lines that will be written to a history file.)
readline.get_history_item(index)
Return the current contents of history item at index.
readline.remove_history_item(pos)
Remove history item specified by its position from the history.
readline.replace_history_item(pos, line)
Replace history item specified by its position with the given line.
readline.redisplay()
Change what's displayed on the screen to reflect the current contents of the line buffer.
readline.set_startup_hook([function])
Set or remove the startup_hook function. If function is specified, it will be used as the new startup_hook function; if omitted or None, any hook function already installed is removed. The startup_hook function is called with no arguments before readline prints the first prompt. (A short sketch using this hook follows the list below.)
readline.set_pre_input_hook([function])
Set or remove the pre_input_hook function. If function is specified, it will be used as the new pre_input_hook function; if omitted or None, any hook function already installed is removed. The pre_input_hook function is called with no arguments after the first prompt is printed and only before readline starts reading input characters.
readline.set_completer([function])
Set or remove the completer function. If function is specified, it will be used as the new completer function; if omitted or None, any completer function already installed is removed. The completer function is called as function(text, state), for state in 0, 1, 2, ..., until it returns a non-string value. It should return the next possible completion starting with text.
readline.get_completer()
Get the completer function, or None if no completer function is set.
readline.get_completion_type()
Get the type of completion being attempted.
readline.get_begidx()
Get the beginning index of the readline tab-completion scope.
readline.get_endidx()
Get the ending index of the readline tab-completion scope.
readline.set_completer_delims(string)
Set the readline word delimiters for tab-completion.
readline.get_completer_delims()
Get the readline word delimiters for tab-completion.
readline.set_completion_display_matches_hook([function])
Set or remove the completion display function. If function is specified, it will be used as the new completion display function; if omitted or None, any completion display function already installed is removed. The completion display function is called as function(substitution, [matches], longest_match_length) once each time matches need to be displayed.
readline.add_history(line)
Append a line to the history buffer, as if it was the last line typed.
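As promised above, here is a small sketch of the hook machinery: it pre-fills the line buffer before input() reads a line, using set_startup_hook(), insert_text(), and redisplay(). The helper name prefill_input is invented for this example:

import readline

def prefill_input(prompt, text):
    # One-shot startup hook: insert the suggested text before the prompt appears.
    def hook():
        readline.insert_text(text)
        readline.redisplay()
    readline.set_startup_hook(hook)
    try:
        return input(prompt)
    finally:
        readline.set_startup_hook()  # remove the hook again

# name = prefill_input("Name: ", "default-user")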

readline example

The following example demonstrates how to use the readline module's history reading and writing functions to automatically load and save a history file named .python_history from the user's home directory. The code below would normally be executed automatically during interactive sessions from the user's PYTHONSTARTUP file.

import atexit
import os
import readline

histfile = os.path.join(os.path.expanduser("~"), ".python_history")
try:
    readline.read_history_file(histfile)
except FileNotFoundError:
    # No history file yet; it will be created on exit.
    pass
atexit.register(readline.write_history_file, histfile)

Python actually runs equivalent code automatically when started in interactive mode.

The following example extends the code.InteractiveConsole class to support history save/restore.

import atexit
import code
import os
import readline

class HistoryConsole(code.InteractiveConsole):
    """Interactive console that restores and saves its command history."""
    def __init__(self, locals=None, filename="<console>",
                 histfile=os.path.expanduser("~/.console-history")):
        code.InteractiveConsole.__init__(self, locals, filename)
        self.init_history(histfile)

    def init_history(self, histfile):
        readline.parse_and_bind("tab: complete")
        if hasattr(readline, "read_history_file"):
            try:
                readline.read_history_file(histfile)
            except FileNotFoundError:
                pass
            # Persist the history when the interpreter exits.
            atexit.register(self.save_history, histfile)

    def save_history(self, histfile):
        readline.write_history_file(histfile)
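Assuming the class above, a minimal driver might look like this (InteractiveConsole.interact() accepts an optional banner string):

console = HistoryConsole()
console.interact("Python console with persistent history")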

rlcompleter — completion function for GNU readline

The rlcompleter module defines a completion function suitable for the readline module by completing valid Python identifiers and keywords.

When this module is imported on a Unix platform with the readline module available, an instance of the Completer class is automatically created and its complete() method is set as the readline completer.

Example:

>>> import rlcompleter
>>> import readline
>>> readline.parse_and_bind("tab: complete")
>>> readline. <TAB PRESSED>
readline.__doc__          readline.get_line_buffer(  readline.read_init_file(
readline.__file__         readline.insert_text(      readline.set_completer(
readline.__name__         readline.parse_and_bind(
>>> readline.

The rlcompleter module is designed for use with Python's interactive mode. Unless Python is run with the -S option, the module is automatically imported and configured (see Readline configuration).

On platforms without readline, the Completer class defined by this module can still be used for custom purposes.

Completer objects

Completer objects have the following method:

Completer.complete(text, state)
Return the state-th completion for text.

If called for text that doesn’t include a period character ('.'), it completes from names currently defined in __main__, builtins and keywords (as defined by the keyword module).

If called for a dotted name, it will try to evaluate anything without obvious side-effects (functions are not evaluated, but it can generate calls to __getattr__()) up to the last part, and find matches for the rest via the dir() function. Any exception raised during the evaluation of the expression is caught, silenced and None is returned.
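A Completer can also be driven by hand with a custom namespace. The names spam and spam_and_eggs below are invented for illustration, and the exact ordering of matches may vary across Python versions:

>>> from rlcompleter import Completer
>>> completer = Completer({'spam': 1, 'spam_and_eggs': 2})
>>> completer.complete('spam', 0)
'spam'
>>> completer.complete('spam', 1)
'spam_and_eggs'
>>> completer.complete('spam', 2) is None
True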

Binary data services

The modules described below provide basic services for manipulating binary data. Other operations on binary data, specifically concerning file formats and network protocols, are described in the relevant sections.

Some libraries described under Text Processing Services also work with either ASCII-compatible binary formats (for example, re) or all binary data (for example, difflib).

Also, see the documentation for Python's built-in binary data types in Binary Sequence Types — bytes, bytearray, memoryview.