HTML support in Python

Updated: 11/06/2021 by Computer Hope

This page describes the standard tools for working with HTML in Python 3, and how to use them.

html: HTML and XHTML parsing

The html module defines utilities to manipulate HTML.

html.escape(s, quote=True)
Convert the characters &, < and > in string s to HTML-safe sequences. Use this if you need to display text that might contain such characters in HTML. If the optional flag quote is true, the characters (") and (') are also translated; this helps for inclusion in an HTML attribute value delimited by quotes, as in <a href="...">.
html.unescape(s)
Convert all named and numeric character references (e.g., &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.
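
As a minimal sketch of the two functions above (the sample strings are made up for illustration and do not come from the original page), escaping and unescaping round-trip like this:

```python
import html

# Escape text before placing it inside HTML; quote=True (the default)
# also escapes double and single quotes.
raw = '<a href="page.html">Tom & Jerry</a>'
escaped = html.escape(raw)
print(escaped)

# With quote=False, only &, < and > are converted.
unquoted = html.escape('"quoted" & <tagged>', quote=False)
print(unquoted)

# unescape() reverses named, decimal, and hexadecimal references.
restored = html.unescape('&gt; &#62; &#x3e;')
print(restored)
```

Escaping with the default quote=True is the safer choice when the result may end up inside an attribute value.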

html.parser: Simple HTML and XHTML parser

This module defines a class HTMLParser that serves as the basis for parsing text files formatted in HTML (Hypertext Markup Language) and XHTML.

class html.parser.HTMLParser(strict=False, *,  convert_charrefs=False)
Create a parser instance.

If convert_charrefs is True, all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters. The default of False shown above applies to Python 3.4; the use of convert_charrefs=True is encouraged, and it is the default from Python 3.5 onward.

If strict is False (the default), the parser accepts and parses invalid markup. If strict is True, the parser raises an HTMLParseError exception when it is not able to parse the markup. The use of strict=True is discouraged: the strict argument is deprecated, and it was removed, together with HTMLParseError, in Python 3.5.

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.

This parser does not check that end tags match start tags or call the end tag handler for elements that are closed implicitly by closing an outer element.

An exception is defined as well:

exception html.parser.HTMLParseError
Exception raised by the HTMLParser class when it encounters an error while parsing and strict is True. This exception provides three attributes: msg is a brief message explaining the error, lineno is the number of the line on which the broken construct was detected, and offset is the number of characters into the line at which the construct starts.

Example HTML Parser application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

The output will then be:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

HTMLParser methods

HTMLParser instances have the following methods:

HTMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data must be str.
HTMLParser.close()
Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
HTMLParser.reset()
Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.
HTMLParser.getpos()
Return current line number and offset.
HTMLParser.get_starttag_text()
Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
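
The following sketch shows getpos() and get_starttag_text() being used from inside a handler. The class name TagLocator and the sample markup are hypothetical, chosen only for this illustration:

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Hypothetical helper: records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, offset); lines count from 1,
        # offsets from 0.
        line, offset = self.getpos()
        # get_starttag_text() returns the tag exactly as written.
        self.found.append((tag, line, offset, self.get_starttag_text()))

locator = TagLocator()
locator.feed('<p>\n<A  HREF="#top">top</a></p>')
locator.close()

for entry in locator.found:
    print(entry)
```

The tag name passed to the handler is lowercased, while get_starttag_text() keeps the original case and spacing, which is what makes it useful for regenerating input with minimal changes.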

The following methods are called when data or markup elements are encountered and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):

HTMLParser.handle_starttag(tag, attrs)
This method is called to handle the start of a tag (e.g., <div id="main">).

The tag argument is the name of the tag converted to lowercase. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is translated to lowercase, quotes around the value are removed, and character and entity references are replaced.

For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).

All entity references from html.entities are replaced in the attribute values.
HTMLParser.handle_endtag(tag)
This method is called to handle the end tag of an element (e.g., </div>).

The tag argument is the name of the tag converted to lowercase.
HTMLParser.handle_startendtag(tag, attrs)
Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<img ... />). This method may be overridden by subclasses which require this particular lexical information; the default implementation calls handle_starttag() and handle_endtag().
HTMLParser.handle_data(data)
This method is called to process arbitrary data (e.g., text nodes and the content of <script>...</script> and <style>...</style>).
HTMLParser.handle_entityref(name)
This method is called to process a named character reference of the form &name; (e.g., &gt;), where name is a general entity reference (e.g., 'gt'). This method is never called if convert_charrefs is True.
HTMLParser.handle_charref(name)
This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
HTMLParser.handle_comment(data)
This method is called when a comment is encountered (e.g., <!--comment-->).

For example, the comment <!-- comment --> causes this method to be called with the argument ' comment '.

The content of Internet Explorer conditional comments (condcoms) is also sent to this method, so, for <!--[if IE 9]>IE9-specific content<![endif]-->, this method will receive '[if IE 9]>IE9-specific content<![endif]'.
HTMLParser.handle_decl(decl)
This method is called to handle an HTML doctype declaration (e.g., <!DOCTYPE html>).

The decl parameter will be the entire contents of the declaration inside the <!...> markup (e.g., 'DOCTYPE html').
HTMLParser.handle_pi(data)
This method is called when a processing instruction is encountered. The data parameter contains the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.

Note: The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' causes the '?' to be included in data.
HTMLParser.unknown_decl(data)
This method is called when an unrecognized declaration is read by the parser.

The data parameter will be the entire contents of the declaration inside the <![...]> markup. Overriding this method in a derived class is sometimes useful. The base class implementation raises an HTMLParseError when strict is True.
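
As a rough sketch of how the handler methods above fit together, the subclass below collects link targets and notes XHTML-style empty tags. The class name LinkCollector and the sample markup are assumptions made for this example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values and notes empty tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.empty_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased
        # and references in values are already replaced.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <br/>.
        self.empty_tags.append(tag)
        # Keep the default behavior of calling the start and end handlers.
        super().handle_startendtag(tag, attrs)

collector = LinkCollector()
collector.feed('<p><A HREF="https://www.cwi.nl/">CWI</A><br/>'
               '<a href="?a=1&amp;b=2">query</a></p>')
collector.close()

print(collector.links)
print(collector.empty_tags)
```

Calling the base class handle_startendtag() preserves the documented default, so start-tag logic still runs for empty tags.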

Examples

The following class implements a parser that will be used to illustrate more examples:

from html.parser import HTMLParser
from html.entities import name2codepoint
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)
    def handle_endtag(self, tag):
        print("End tag  :", tag)
    def handle_data(self, data):
        print("Data     :", data)
    def handle_comment(self, data):
        print("Comment  :", data)
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)
    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)
    def handle_decl(self, data):
        print("Decl     :", data)
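# convert_charrefs=False keeps handle_entityref() and handle_charref()
# active; it is the default in Python 3.4 but not from Python 3.5 onward.
parser = MyHTMLParser(convert_charrefs=False)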

Parsing a doctype:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style
>>>
>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment  :  a comment
Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to '>'):

>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data     : buff
Data     : ered
Data     : text
End tag  : span

Parsing invalid HTML (e.g., unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
     attr: ('class', 'link')
     attr: ('href', '#main')
Data     : tag soup
End tag  : p
End tag  : a

Note that some invalid HTML is tolerated even in strict mode.

html.entities: Definitions of HTML general entities

This module defines four dictionaries: html5, name2codepoint, codepoint2name, and entitydefs.

html.entities.html5
A dictionary that maps HTML5 named character references to the equivalent Unicode character(s), e.g., html5['gt;'] == '>'. Note that the trailing semicolon is included in the name (e.g., 'gt;'). Some of the names are accepted by the standard even without the semicolon; in that case the name is present both with and without the ';'. See also html.unescape().
html.entities.entitydefs
A dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1.
html.entities.name2codepoint
A dictionary that maps HTML entity names to the Unicode codepoints.
html.entities.codepoint2name
A dictionary that maps Unicode codepoints to HTML entity names.
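
A short sketch of looking things up in the four dictionaries (the choice of 'gt' and 'amp' as sample names is only for illustration):

```python
from html.entities import codepoint2name, entitydefs, html5, name2codepoint

# Name to code point, and back again.
gt_codepoint = name2codepoint['gt']
gt_name = codepoint2name[gt_codepoint]
print(gt_codepoint, gt_name)

# entitydefs maps a name to its replacement text.
print(entitydefs['gt'])

# html5 keys normally carry the trailing semicolon; a few legacy names
# are present both with and without it.
print(html5['gt;'])
print('amp' in html5, 'amp;' in html5)
```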

XML parsing

Python's interfaces for processing XML are grouped in the xml package.

Warning

The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the "XML vulnerabilities" and "The defusedxml and defusedexpat Packages" sections.

It is important to note that modules in the xml package require that there be at least one SAX-compliant XML parser available. The Expat parser is included with Python, so the xml.parsers.expat module is available.

The documentation for the xml.dom and xml.sax packages is the definition of the Python bindings for the DOM and SAX interfaces.

The XML handling submodules are:

  • xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
  • xml.dom: the DOM API definition
  • xml.dom.minidom: a minimal DOM implementation
  • xml.dom.pulldom: support for building partial DOM trees
  • xml.sax: SAX2 base classes and convenience functions
  • xml.parsers.expat: the Expat parser binding

XML vulnerabilities

The XML processing modules are not secure against maliciously constructed data. An attacker can abuse XML features to carry out denial of service attacks, access local files, generate network connections to other machines, or circumvent firewalls.

The following table gives an overview of the known attacks and whether the various modules are vulnerable to them.

kind                       sax  etree    minidom  pulldom  xmlrpc
billion laughs             Yes  Yes      Yes      Yes      Yes
quadratic blowup           Yes  Yes      Yes      Yes      Yes
external entity expansion  Yes  No (1.)  No (2.)  Yes      No (3.)
DTD retrieval              Yes  No       No       Yes      No
decompression bomb         No   No       No       No       Yes
  1. xml.etree.ElementTree doesn’t expand external entities and raises a ParseError when an entity occurs.
  2. xml.dom.minidom doesn’t expand external entities and returns the unexpanded entity verbatim.
  3. xmlrpc.client doesn’t expand external entities and omits them.

billion laughs / exponential entity expansion
The Billion Laughs attack (also known as exponential entity expansion) uses multiple levels of nested entities. Each entity refers to another entity several times, and the final entity definition contains a small string. The exponential expansion results in several gigabytes of text and consumes lots of memory and CPU time.
quadratic blowup entity expansion
A quadratic blowup attack is similar to a Billion Laughs attack; it abuses entity expansion, too. Instead of nested entities, it repeats one large entity with several thousand characters over and over again. The attack isn’t as efficient as the exponential case, but it avoids triggering parser countermeasures that forbid deeply nested entities.
external entity expansion
Entity declarations can contain more than only text for replacement. They can also point to external resources or local files. The XML parser accesses the resource and embeds the content into the XML document.
DTD retrieval
Some XML libraries like Python's xml.dom.pulldom retrieve document type definitions from remote or local locations. The feature has similar implications as the external entity expansion issue.
decompression bomb
Decompression bombs (also known as ZIP bombs) apply to all XML libraries that can parse compressed XML streams, such as gzipped HTTP streams or LZMA-compressed files. For an attacker, they can reduce the amount of transmitted data by three orders of magnitude or more.

The defusedxml and defusedexpat Packages

defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data. The package also ships with example exploits and extended documentation on more XML exploits such as XPath injection.

defusedexpat provides a modified libexpat and a patched pyexpat module that have countermeasures against entity expansion DoS attacks. The defusedexpat module still allows a sane and configurable amount of entity expansions. The modifications may be included in some future release of Python, but are not included in any bug fix releases of Python because they break backward compatibility.

xml.etree.ElementTree: The ElementTree XML API

The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.

Warning

The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

XML tree tutorial

This is a short tutorial for using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.

XML tree and elements

XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.

Parsing XML

We’ll be using the following XML document as the sample data for this section:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

We can import this data by reading from a file:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

Or directly from a string:

root = ET.fromstring(country_data_as_string)

fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree. Check the documentation to be sure.
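
The difference in return types can be checked with a small sketch. The shortened document and the use of io.BytesIO in place of a real file are assumptions made to keep the example self-contained:

```python
import io
import xml.etree.ElementTree as ET

# A shortened, illustrative document; not the full sample above.
xml_text = '<data><country name="Panama"><rank>68</rank></country></data>'

# fromstring() returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse() accepts a file name or file object and returns an ElementTree;
# io.BytesIO stands in for a real file here.
tree = ET.parse(io.BytesIO(xml_text.encode('utf-8')))
root_from_file = tree.getroot()

print(type(root_from_string).__name__, type(tree).__name__)
print(root_from_file.tag, root_from_file[0].attrib)
```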

As an Element, root has a tag and a dictionary of attributes:

>>> root.tag
'data'
>>> root.attrib
{}

It also has children nodes over which we can iterate:

>>> for child in root:
...   print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}

Children are nested, and we can access specific child nodes by index:

>>> root[0][1].text
'2008'

Note: Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module's API rather than parsing from XML text can have comments and processing instructions in them; they are included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
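
A minimal sketch of the behavior described in the note, using a made-up two-element document: a comment in the input is dropped by the default parser, while a comment added through the API is serialized.

```python
import xml.etree.ElementTree as ET

# The comment in the input is skipped by the default parser.
root = ET.fromstring('<a><!-- dropped on input --><b/></a>')
child_count = len(root)
print(child_count)

# A comment created through the API is kept and serialized.
root.append(ET.Comment('added by code'))
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```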

Pull API for non-blocking parsing

Most parsing functions provided by this module require the whole document to be read at once before returning any result. It is possible to use an XMLParser and feed data into it incrementally, but it is a push API that calls methods on a callback target, which is too low-level and inconvenient for most needs. Sometimes what the user really wants is to be able to parse XML incrementally, without blocking operations, while enjoying the convenience of fully constructed Element objects.

The most powerful tool for doing this is XMLPullParser. It does not require a blocking read to obtain the XML data, and is instead fed with data incrementally with XMLPullParser.feed() calls. To get the parsed XML elements, call XMLPullParser.read_events(). Here is an example:

>>> parser = ET.XMLPullParser(['start', 'end'])
>>> parser.feed('<mytag>sometext')
>>> list(parser.read_events())
[('start', <Element 'mytag' at 0x7fa66db2be58>)]
>>> parser.feed(' more text</mytag>')
>>> for event, elem in parser.read_events():
...   print(event)
...   print(elem.tag, 'text=', elem.text)
...
end
mytag text= sometext more text

The obvious use case is applications that operate in a non-blocking fashion where the XML data is being received from a socket or read incrementally from some storage device. In such cases, blocking reads are unacceptable.

Because it's so flexible, XMLPullParser can be inconvenient to use for simpler use-cases. If you don’t mind your application blocking on reading XML data but would still like to have incremental parsing capabilities, take a look at iterparse(). It can be useful when you’re reading a large XML document and don’t want to hold it wholly in memory.
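
One possible way to use iterparse() for this is sketched below. The in-memory io.BytesIO source stands in for a large file on disk, and the trimmed data is an assumption for the example:

```python
import io
import xml.etree.ElementTree as ET

# io.BytesIO stands in for a large file on disk (an assumption for
# this sketch).
source = io.BytesIO(
    b'<data>'
    b'<country name="Liechtenstein"><rank>1</rank></country>'
    b'<country name="Singapore"><rank>4</rank></country>'
    b'</data>'
)

ranks = {}
# With no events argument, only "end" events are reported, so each
# element is fully populated when it arrives.
for event, elem in ET.iterparse(source):
    if elem.tag == 'country':
        ranks[elem.get('name')] = int(elem.find('rank').text)
        # Drop the element's children and text once it is processed.
        elem.clear()

print(ranks)
```

Clearing each element after use is what keeps memory bounded; without it the full tree is still built as parsing proceeds.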

Finding interesting elements

Element has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on). For example, Element.iter():

>>> for neighbor in root.iter('neighbor'):
...   print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}

Element.findall() finds only elements with a tag that are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element's text content. Element.get() accesses the element's attributes:

>>> for country in root.findall('country'):
...   rank = country.find('rank').text
...   name = country.get('name')
...   print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68

More sophisticated specification of which elements to look for is possible using XPath.

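Modifying an XML file

An Element can be changed in place, for example by assigning to its text or by setting attributes with Element.set(), and the result is written back with ElementTree.write(). Let's say we want to add one to each country's rank, and add an updated attribute to the rank element: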

>>> for rank in root.iter('rank'):
...   new_rank = int(rank.text) + 1
...   rank.text = str(new_rank)
...   rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

Our XML now looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

We can remove elements using Element.remove(). Let's say we want to remove all countries with a rank higher than 50:

>>> for country in root.findall('country'):
...   rank = int(country.find('rank').text)
...   if rank > 50:
...     root.remove(country)
...
>>> tree.write('output.xml')

Our XML now looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>

Building XML documents

The SubElement() function provides a convenient way to create new sub-elements for a given element:

>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
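
Building on the same idea, here is a small sketch that also sets attributes and text before serializing. The element names mirror the country data used earlier, but the document built here is only illustrative:

```python
import xml.etree.ElementTree as ET

root = ET.Element('data')
# Attributes can be passed as keyword arguments...
country = ET.SubElement(root, 'country', name='Liechtenstein')
rank = ET.SubElement(country, 'rank')
rank.text = '1'
# ...or as a dictionary.
ET.SubElement(country, 'neighbor', {'name': 'Austria'})

document = ET.tostring(root, encoding='unicode')
print(document)
```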

XPath support

This module provides limited support for XPath expressions for locating elements in a tree. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the module.

Example

Here's an example that demonstrates some of the XPath capabilities of the module. We’ll be using the countrydata XML document from the Parsing XML section:

import xml.etree.ElementTree as ET
root = ET.fromstring(countrydata)
# Top-level elements
root.findall(".")
# All 'neighbor' grand-children of 'country' children of the top-level
# elements
root.findall("./country/neighbor")
# Nodes with name='Singapore' with a 'year' child
root.findall(".//year/..[@name='Singapore']")
# 'year' nodes that are children of nodes with name='Singapore'
root.findall(".//*[@name='Singapore']/year")
# All 'neighbor' nodes that are the second child of their parent
root.findall(".//neighbor[2]")

Supported XPath syntax

Each syntax form is listed below, followed by its meaning.

tag
Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.
*
Selects all child elements. For example, */egg selects all grandchildren named egg.
.
Selects the current node. This is mostly useful at the beginning of the path, to indicate that it's a relative path.
//
Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
..
Selects the parent element. Returns None if the path attempts to reach the ancestors of the start element (the element find was called on).
[@attrib]
Selects all elements that have the given attribute.
[@attrib='value']
Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
[tag]
Selects all elements with a child named tag. Only immediate children are supported.
[position]
Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g., last()-1).

Predicates (expressions within square brackets) must be preceded by a tag name, an asterisk, or another predicate. The position predicates must be preceded by a tag name.
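
The predicates can be tried out with a sketch like the following. The data is a trimmed variant of the country document; the empty Iceland entry and the helper function names() are additions made purely for illustration:

```python
import xml.etree.ElementTree as ET

# A trimmed version of the country data, used only for this sketch.
root = ET.fromstring(
    '<data>'
    '<country name="Liechtenstein">'
    '<neighbor name="Austria" direction="E"/>'
    '<neighbor name="Switzerland" direction="W"/>'
    '</country>'
    '<country name="Singapore">'
    '<neighbor name="Malaysia" direction="N"/>'
    '</country>'
    '<country name="Iceland"/>'
    '</data>'
)

def names(elements):
    """Hypothetical helper: list the name attribute of each element."""
    return [e.get('name') for e in elements]

# [tag]: countries that have at least one neighbor child
with_neighbors = names(root.findall("./country[neighbor]"))
# [@attrib='value']: neighbors lying to the west
western = names(root.findall(".//neighbor[@direction='W']"))
# [position]: the last neighbor of each country
last_neighbors = names(root.findall("./country/neighbor[last()]"))
# [position]: the second-to-last neighbor, where one exists
second_to_last = names(root.findall("./country/neighbor[last()-1]"))

print(with_neighbors, western, last_neighbors, second_to_last)
```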

xml.etree functions

xml.etree.ElementTree.Comment(text=None)
Comment element factory. This factory function creates a special element that will be serialized as an XML comment by the standard serializer. The comment string can be either a bytestring or a Unicode string. text is a string containing the comment string. Returns an element instance representing a comment.

Note that XMLParser skips over comments in the input instead of creating comment objects for them. An ElementTree only contains comment nodes if they are inserted into the tree using one of the Element methods.
xml.etree.ElementTree.dump(elem)
Writes an element tree or element structure to sys.stdout. This function should be used for debugging only.

The exact output format is implementation dependent. In this version, it's written as an ordinary XML file.

elem is an element tree or an individual element.
xml.etree.ElementTree.fromstring(text)
Parses an XML section from a string constant. Same as XML(). text is a string containing XML data. Returns an Element instance.
xml.etree.ElementTree.fromstringlist(sequence,  parser=None)
Parses an XML document from a sequence of string fragments. The sequence is a list or other sequence containing XML data fragments. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an element instance.
xml.etree.ElementTree.iselement(element)
Checks if an object appears to be a valid element object. The element is an element instance. Returns a true value if this is an element object.
xml.etree.ElementTree.iterparse(source,  events=None,  parser=None)

Parses an XML section into an element tree incrementally, and reports what's going on to the user. The source is a file name or file object containing XML data. The events is a sequence of events to report back. The supported events are the strings "start", "end", "start-ns" and "end-ns" (the "ns" events are used to get detailed namespace information). If events is omitted, only "end" events are reported. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. The parser must be a subclass of XMLParser and can only use the default TreeBuilder as a target. Returns an iterator providing (event, elem) pairs.

Note that while iterparse() builds the tree incrementally, it issues blocking reads on source (or the file it names). As such, it's unsuitable for applications where blocking reads can’t be made. For fully non-blocking parsing, see XMLPullParser.

Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for "end" events instead.

xml.etree.ElementTree.parse(source, parser=None)
Parses an XML section into an element tree. The source is a file name or file object containing XML data. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an ElementTree instance.
xml.etree.ElementTree.ProcessingInstruction(target,  text=None)
PI element factory. This factory function creates a special element that will be serialized as an XML processing instruction. target is a string containing the PI target. The text is a string containing the PI contents, if given. Returns an element instance, representing a processing instruction.

Note that XMLParser skips over processing instructions in the input instead of creating objects for them. An ElementTree only contains processing instruction nodes if they are inserted into the tree using one of the Element methods.
xml.etree.ElementTree.register_namespace(prefix,  uri)
Registers a namespace prefix. The registry is global, and any existing mapping for either the given prefix or the namespace URI will be removed. The prefix is a namespace prefix. The uri is a namespace uri. Tags and attributes in this namespace will be serialized with the given prefix, if at all possible.
xml.etree.ElementTree.SubElement(parent,  tag,  attrib={},  **extra)
Subelement factory. This function creates an element instance, and appends it to an existing element.

The element name, attribute names, and attribute values can be either bytestrings or Unicode strings. The parent is the parent element. The tag is the subelement name. The attrib is an optional dictionary, containing element attributes. The extra contains additional attributes, given as keyword arguments. Returns an element instance.
xml.etree.ElementTree.tostring(element,  encoding="us-ascii",  method="xml", *,  short_empty_elements=True)
Generates a string representation of an XML element, including all subelements. The element is an Element instance. encoding is the output encoding (default is US-ASCII). Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated). The method is either "xml", "html" or "text" (default is "xml"). The short_empty_elements has the same meaning as in ElementTree.write(). Returns an (optionally) encoded string containing the XML data.
xml.etree.ElementTree.tostringlist(element,  encoding="us-ascii",  method="xml", *,  short_empty_elements=True)
Generates a string representation of an XML element, including all subelements. The element is an Element instance. The encoding is the output encoding (default is US-ASCII). Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated). The method is either "xml", "html" or "text" (default is "xml"). short_empty_elements has the same meaning as in ElementTree.write(). Returns a list of (optionally) encoded strings containing the XML data. It does not guarantee any specific sequence, except that b"".join(tostringlist(element)) == tostring(element).
xml.etree.ElementTree.XML(text, parser=None)
Parses an XML section from a string constant. This function can embed "XML literals" in Python code. The text is a string containing XML data. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an Element instance.
xml.etree.ElementTree.XMLID(text,  parser=None)
Parses an XML section from a string constant, and also returns a dictionary which maps from element IDs to elements. The text is a string containing XML data. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns a tuple containing an Element instance and a dictionary.
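
The factory and serialization functions above can be combined as in the following minimal sketch. The element names ("catalog", "book") and the attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small tree with the Element and SubElement factories.
root = ET.Element("catalog")
book = ET.SubElement(root, "book", {"id": "b1"}, lang="en")  # attrib dict plus an extra keyword
book.text = "Python basics"

# encoding="unicode" yields a str; any other encoding yields bytes.
as_text = ET.tostring(root, encoding="unicode")
as_bytes = ET.tostring(root)          # default encoding is US-ASCII
chunks = ET.tostringlist(root)        # list of bytes fragments

# XML() parses a string constant and returns the root Element.
parsed = ET.XML(as_text)

# XMLID() also returns a dictionary keyed by the value of each "id" attribute.
tree_root, ids = ET.XMLID(as_text)

print(as_text)          # <catalog><book id="b1" lang="en">Python basics</book></catalog>
print(ids["b1"].text)   # Python basics
```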

Element objects

class xml.etree.ElementTree.Element(tag,  attrib={},  **extra)
Element class. This class defines the Element interface, and provides a reference implementation of this interface.

The element name, attribute names, and attribute values can be either bytestrings or Unicode strings. The tag is the element name. The attrib is an optional dictionary, containing element attributes. The extra contains additional attributes, given as keyword arguments.

tag
A string identifying what kind of data this element represents (the element type, in other words).
text
The text attribute can hold additional data associated with the element. As the name implies, this attribute is usually a string but may be any application-specific object. If the element is created from an XML file, the attribute contains any text found between the element tags.
tail
The tail attribute can hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file, the attribute contains any text found after the element's end tag and before the next tag.
attrib
A dictionary containing the element's attributes. Note that while the attrib value is always a real mutable Python dictionary, an ElementTree implementation may choose to use another internal representation, and create the dictionary only if someone asks for it. To take advantage of such implementations, use the dictionary methods below whenever possible.
The following dictionary-like methods work on the element attributes.

clear()
Resets an element. This function removes all subelements, clears all attributes, and sets the text and tail attributes to None.
get(key,  default=None)
Gets the element attribute named key.

Returns the attribute value, or default if the attribute was not found.
items()
Returns the element attributes as a sequence of (name, value) pairs. The attributes are returned in an arbitrary order.
keys()
Returns the element's attribute names as a list. The names are returned in an arbitrary order.
set(key,  value)
Set the attribute key on the element to value.
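
As a short illustration of the dictionary-like methods above, the sketch below reads and changes the attributes of a single element. The "a" element and its attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

link = ET.Element("a", {"href": "http://example.org/"})

link.set("target", "_blank")               # add or overwrite an attribute
href = link.get("href")                    # value of an existing attribute
missing = link.get("title", "untitled")    # default is returned when absent
names = sorted(link.keys())                # sort, because the order is arbitrary
pairs = dict(link.items())                 # (name, value) pairs as a dict

link.text = "example.org"
link.clear()                               # drops attributes, children, text and tail
```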
The following methods work on the element's children (subelements).

append(subelement)
Adds the element subelement to the end of this element's internal list of subelements. Raises TypeError if subelement is not an Element.
extend(subelements)
Appends subelements from a sequence object with zero or more elements. Raises TypeError if a subelement is not an Element.
find(match,  namespaces=None)
Finds the first subelement matching match. The match may be a tag name or a path. Returns an element instance or None. namespaces is an optional mapping from namespace prefix to full name.
findall(match,  namespaces=None)
Finds all matching subelements, by tag name or path. Returns a list containing all matching elements in document order. The namespaces is an optional mapping from namespace prefix to full name.
findtext(match,  default=None,  namespaces=None)
Finds text for the first subelement matching match. The match may be a tag name or a path. Returns the text content of the first matching element, or default if no element was found. Note that if the matching element has no text content an empty string is returned. The namespaces is an optional mapping from namespace prefix to full name.
getchildren()
Deprecated since version 3.2: Use list(elem) or iteration.
getiterator(tag=None)
Deprecated since version 3.2: Use method Element.iter() instead.
insert(index,  subelement)
Inserts subelement at the given position in this element. Raises TypeError if subelement is not an Element.
iter(tag=None)
Creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order. If tag is not None or '*', only elements whose tag equals tag are returned from the iterator. If the tree structure is modified during iteration, the result is undefined.
iterfind(match,  namespaces=None)
Finds all matching subelements, by tag name or path. Returns an iterable yielding all matching elements in document order. The namespaces is an optional mapping from namespace prefix to full name.
itertext()
Creates a text iterator. The iterator loops over this element and all subelements, in document order, and returns all inner text.
makeelement(tag, attrib)
Creates a new element object of the same type as this element. Do not call this method, use the SubElement() factory function instead.
remove(subelement)
Removes subelement from the element. Unlike the find* methods, this method compares elements based on the instance identity, not on tag value or contents.
Element objects also support the following sequence type methods for working with subelements: __delitem__(), __getitem__(), __setitem__(), __len__().
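
The search, iteration, and sequence methods above might be used together as follows. This is only a sketch; the "library" document is invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.XML(
    "<library>"
    "<book><title>Dune</title><year>1965</year></book>"
    "<book><title>Emma</title></book>"
    "</library>"
)

first_book = root.find("book")                         # first match, or None
titles = [t.text for t in root.findall("book/title")]  # simple path expression
second_year = root[1].findtext("year", default="unknown")  # __getitem__ plus findtext
all_tags = [elem.tag for elem in root.iter()]          # depth-first, includes root
inner_text = "".join(root.itertext())                  # all text in document order

root.remove(first_book)       # compares by identity, not by tag or contents
remaining = len(root)         # __len__ counts direct children only
```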

Caution: Elements with no subelements test as False. This behavior will change in future versions, so use an explicit len(elem) or elem is None test instead.

element = root.find('foo')
if not element:  # careful!
    print("element not found, or element has no subelements")
if element is None:
    print("element not found")

ElementTree objects

class xml.etree.ElementTree.ElementTree(element=None,  file=None)
ElementTree wrapper class. This class represents an entire element hierarchy, and adds some extra support for serialization to and from standard XML.

element is the root element. The tree is initialized with the contents of the XML file if given.
_setroot(element)
Replaces the root element for this tree. This discards the current contents of the tree, and replaces it with the given element. Use with care. The element is an element instance.
find(match, namespaces=None)
Same as Element.find(), starting at the root of the tree.
findall(match, namespaces=None)
Same as Element.findall(), starting at the root of the tree.
findtext(match, default=None, namespaces=None)
Same as Element.findtext(), starting at the root of the tree.
getiterator(tag=None)
Deprecated since version 3.2: Use method ElementTree.iter() instead.
getroot()
Returns the root element for this tree.
iter(tag=None)
Creates and returns a tree iterator for the root element. The iterator loops over all elements in this tree, in document order. The tag is the tag to look for (default is to return all elements).
iterfind(match, namespaces=None)
Same as Element.iterfind(), starting at the root of the tree.
parse(source,  parser=None)
Loads an external XML section into this element tree. The source is a file name or file object. The parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns the section root element.
write(file,  encoding="us-ascii",  xml_declaration=None,  default_namespace=None,  method="xml", *,  short_empty_elements=True)
Writes the element tree to a file, as XML. The file is a file name, or a file object opened for writing. The encoding is the output encoding (default is US-ASCII). The xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 or Unicode (default is None). The default_namespace sets the default XML namespace (for "xmlns"). The method is either "xml", "html" or "text" (default is "xml"). The keyword-only short_empty_elements parameter controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.

The output is either a string (str) or binary (bytes). This is controlled by the encoding argument. If encoding is "unicode", the output is a string; otherwise, it's binary. Note that this may conflict with the type of file if it's an open file object; make sure you do not try to write a string to a binary stream and vice versa.
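
The pairing of encoding and stream type can be sketched with in-memory streams. The use of io.StringIO and io.BytesIO in place of real files is a choice made for this example only.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.XML("<doc><empty/><p>text</p></doc>"))

# encoding="unicode" produces str output, so it needs a text stream.
text_stream = io.StringIO()
tree.write(text_stream, encoding="unicode")

# Any other encoding produces bytes, so it needs a binary stream.
byte_stream = io.BytesIO()
tree.write(
    byte_stream,
    encoding="utf-8",
    xml_declaration=True,         # always emit the XML declaration
    short_empty_elements=False,   # write <empty></empty> instead of <empty />
)

print(text_stream.getvalue())
print(byte_stream.getvalue())
```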

This is the XML file that is going to be manipulated:

<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>

Example of changing the attribute "target" of every link in first paragraph:

>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element 'html' at 0xb77e6fac>
>>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
>>> p
<Element 'p' at 0xb77ec26c>
>>> links = list(p.iter("a"))   # Returns list of all links
>>> links
[<Element 'a' at 0xb77ec2ac>, <Element 'a' at 0xb77ec1cc>]
>>> for i in links:             # Iterates through all found links
...     i.attrib["target"] = "blank"
>>> tree.write("output.xhtml")

QName objects

class xml.etree.ElementTree.QName(text_or_uri, tag=None)
QName wrapper. This can wrap a QName attribute value, to get proper namespace handling on output. text_or_uri is a string containing the QName value, in the form {uri}local, or, if the tag argument is given, the URI part of a QName. If tag is given, the first argument is interpreted as a URI, and this argument is interpreted as a local name. QName instances are opaque.
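
A small sketch of both calling forms follows. The namespace URI is a placeholder chosen for this example, and the generated prefix in the output is picked by the serializer.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"   # hypothetical namespace URI

one_arg = ET.QName("{%s}item" % NS)   # full {uri}local form
two_arg = ET.QName(NS, "item")        # URI and local name given separately

# A QName can be used as an element tag; the serializer emits a prefix for it.
elem = ET.Element(two_arg)
serialized = ET.tostring(elem, encoding="unicode")

print(two_arg.text)   # {http://example.com/ns}item
print(serialized)
```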

TreeBuilder objects

class xml.etree.ElementTree.TreeBuilder(element_factory=None)
Generic element structure builder. This builder converts a sequence of start, data, and end method calls to a well-formed element structure. You can use this class to build an element structure using a custom XML parser, or a parser for another XML-like format. element_factory, when given, must be a callable accepting two positional arguments: a tag and a dict of attributes. It is expected to return a new element instance.

Methods:

close()
Flushes the builder buffers, and returns the toplevel document element. Returns an Element instance.
data(data)
Adds text to the current element. The data is a string. This should be either a bytestring, or a Unicode string.
end(tag)
Closes the current element. The tag is the element name. Returns the closed element.
start(tag, attrs)
Opens a new element. The tag is the element name. The attrs is a dictionary containing element attributes. Returns the opened element.
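
The builder can also be driven by hand, which shows how start, data, and end calls map onto the finished tree. The following is a minimal sketch with made-up element names.

```python
import xml.etree.ElementTree as ET

builder = ET.TreeBuilder()
builder.start("greeting", {"lang": "en"})
builder.data("Hello, ")        # becomes the text of <greeting>
builder.start("b", {})
builder.data("world")          # becomes the text of <b>
builder.end("b")
builder.data("!")              # becomes the tail of <b>
builder.end("greeting")

root = builder.close()         # returns the toplevel Element
result = ET.tostring(root, encoding="unicode")
print(result)   # <greeting lang="en">Hello, <b>world</b>!</greeting>
```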

Also, a custom TreeBuilder object can provide the following method:

doctype(name, pubid, system)
Handles a doctype declaration. The name is the doctype name. The pubid is the public identifier. The system is the system identifier. This method does not exist on the default TreeBuilder class.

XMLParser objects

class xml.etree.ElementTree.XMLParser(html=0,  target=None,  encoding=None)
This class is the low-level building block of the module. It uses xml.parsers.expat for efficient, event-based parsing of XML. It can be fed XML data incrementally with the feed() method, and parsing events are translated to a push API by invoking callbacks on the target object. If target is omitted, the standard TreeBuilder is used. The html argument was historically used for backward compatibility and is now deprecated. If encoding is given, the value overrides the encoding specified in the XML file.

Methods:

close()
Finishes feeding data to the parser. Returns the result of calling the close() method of the target passed during construction; by default, this is the toplevel document element.
doctype(name, pubid, system)
Deprecated (since Python 3.2). Define the TreeBuilder.doctype() method on a custom TreeBuilder target.
feed(data)
Feeds data to the parser. The data is encoded data.

XMLParser.feed() calls target's start(tag, attrs_dict) method for each opening tag, its end(tag) method for each closing tag, and data is processed by method data(data). XMLParser.close() calls target's method close(). XMLParser can be used not only for building a tree structure. This is an example of counting the maximum depth of an XML file:

>>> from xml.etree.ElementTree import XMLParser
>>> class MaxDepth:                     # The target object of the parser
...     maxDepth = 0
...     depth = 0
...     def start(self, tag, attrib):   # Called for each opening tag.
...         self.depth += 1
...         if self.depth > self.maxDepth:
...             self.maxDepth = self.depth
...     def end(self, tag):             # Called for each closing tag.
...         self.depth -= 1
...     def data(self, data):
...         pass            # We do not need to do anything with data.
...     def close(self):    # Called when all data has been parsed.
...         return self.maxDepth
...
>>> target = MaxDepth()
>>> parser = XMLParser(target=target)
>>> exampleXml = """
... <a>
...   <b>
...   </b>
...   <b>
...     <c>
...       <d>
...       </d>
...     </c>
...   </b>
... </a>"""
>>> parser.feed(exampleXml)
>>> parser.close()
4

XMLPullParser objects

class xml.etree.ElementTree.XMLPullParser(events=None)
A pull parser suitable for non-blocking applications. Its input-side API is similar to that of XMLParser, but instead of pushing calls to a callback target, XMLPullParser collects an internal list of parsing events and lets the user read from it. The events is a sequence of events to report back. The supported events are the strings "start", "end", "start-ns" and "end-ns" (the "ns" events are used to get detailed namespace information). If events is omitted, only "end" events are reported.

Methods:

feed(data)
Feed the given bytes data to the parser.
close()
Signal the parser that the data stream is terminated. Unlike XMLParser.close(), this method always returns None. Any events not yet retrieved when the parser is closed can still be read with read_events().
read_events()
Return an iterator over the events which are encountered in the data fed to the parser. The iterator yields (event, elem) pairs, where event is a string representing the type of event (e.g., "end") and elem is the encountered Element object.

Events provided in a previous call to read_events() are not yielded again. Events are consumed from the internal queue only when they are retrieved from the iterator, so multiple readers iterating in parallel over iterators obtained from read_events() have unpredictable results.
Note

XMLPullParser only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for "end" events instead.
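
The following sketch feeds a document in arbitrary chunks and drains the event queue after each one. The chunk boundaries and element names are invented for the example. Which events become available after a given feed() call can vary, so the code also drains the queue once more after close().

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))

seen = []
chunks = (b"<root><item>on", b"e</item><item>two</it", b"em></root>")
for chunk in chunks:
    parser.feed(chunk)
    # Retrieve whatever events are ready so far; this never blocks.
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))

parser.close()
# Events not yet retrieved when the parser is closed can still be read.
for event, elem in parser.read_events():
    seen.append((event, elem.tag))

print(seen)
```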

ElementTree exceptions

class xml.etree.ElementTree.ParseError
XML parse error, raised by the various parsing methods in this module when parsing fails. The string representation of an instance of this exception contains a user-friendly error message. Also, it has the following attributes available:

code
A numeric error code from the expat parser. See the documentation of xml.parsers.expat for the list of error codes and their meanings.
position
A tuple of line, column numbers, specifying where the error occurred.
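
A sketch of catching the exception and reading both attributes is shown below. The malformed input is invented for the example, and xml.parsers.expat.ErrorString() is used only to turn the numeric code into its message.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

broken = "<root>\n  <item>unclosed\n</root>"   # <item> is never closed

error_code = None
line = column = None
try:
    ET.XML(broken)
except ET.ParseError as err:
    error_code = err.code            # numeric expat error code
    line, column = err.position      # where the parser gave up
    print("parse failed:", err)
    print("expat message:", expat.ErrorString(error_code))
```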

xml.dom: The Document Object Model API

The Document Object Model, or "DOM," is a cross-language API from the World Wide Web Consortium (W3C) for accessing and modifying XML documents. A DOM implementation presents an XML document as a tree structure, or allows client code to build such a structure from scratch. It then gives access to the structure through a set of objects which provide well-known interfaces.

The DOM is extremely useful for random-access applications. SAX only allows you a view of one bit of the document at a time. If you are looking at one SAX element, you have no access to another. If you are looking at a text node, you have no access to a containing element. When you write a SAX application, you need to keep track of your program's position in the document somewhere in your code. SAX does not do it for you. Also, if you need to look ahead in the XML document, you are out of luck.

Some applications are impossible in an event driven model with no access to a tree. Of course, you could build some sort of tree yourself in SAX events, but the DOM allows you to avoid writing that code. The DOM is a standard tree representation for XML data.

The Document Object Model is being defined by the W3C in stages, or "levels" in their terminology. The Python mapping of the API is substantially based on the DOM Level 2 recommendation.

DOM applications often start by parsing some XML into a DOM. How this is accomplished is not covered at all by DOM Level 1, and Level 2 provides only limited improvements: There is a DOMImplementation object class which provides access to document creation methods, but no way to access an XML reader/parser/Document builder in an implementation-independent way. There is also no well-defined way to access these methods without an existing document object. In Python, each DOM implementation provides a function getDOMImplementation(). DOM Level 3 adds a Load/Store specification, which defines an interface to the reader, but this is not yet available in the Python standard library.

Once you have a DOM document object, you can access the parts of your XML document through its properties and methods. These properties are defined in the DOM specification; this portion of the reference manual describes the interpretation of the specification in Python.

The specification provided by the W3C defines the DOM API for Java, ECMAScript, and OMG IDL. The Python mapping defined here is based in large part on the IDL version of the specification, but strict compliance is not required (though implementations are free to support the strict mapping from IDL).

xml.dom module contents

xml.dom contains the following functions:

xml.dom.registerDOMImplementation(name,  factory)
Register the factory function with the name name. The factory function should return an object which implements the DOMImplementation interface. The factory function can return the same object every time, or a new one for each call, as appropriate for the specific implementation (e.g., if that implementation supports some customization).
xml.dom.getDOMImplementation(name=None,  features=())
Return a suitable DOM implementation. The name is either well-known, the module name of a DOM implementation, or None. If it is not None, imports the corresponding module and returns a DOMImplementation object if the import succeeds. If no name is given, and if the environment variable PYTHON_DOM is set, this variable is used to find the implementation.

If name is not given, this examines the available implementations to find one with the required feature set. If no implementation is found, raise an ImportError. The features list must be a sequence of (feature, version) pairs that are passed to the hasFeature() method on available DOMImplementation objects.
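
The lookup rules above can be tried with the well-known name "minidom", which refers to the standard library's xml.dom.minidom. In this sketch the name "example-dom" is hypothetical and registered only for illustration.

```python
import xml.dom

# Look up an implementation by its well-known name.
impl = xml.dom.getDOMImplementation("minidom")
supports_core = impl.hasFeature("core", "2.0")

# Without a name, search for an implementation that has the required features.
by_feature = xml.dom.getDOMImplementation(features=[("core", "2.0")])

# Register a factory under a custom name; the factory may return the same object each time.
xml.dom.registerDOMImplementation("example-dom", lambda: impl)
registered = xml.dom.getDOMImplementation("example-dom")
```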

Some convenience constants are also provided:

xml.dom.EMPTY_NAMESPACE
The value used to indicate that no namespace is associated with a node in the DOM. This is often found as the namespaceURI of a node, or used as the namespaceURI parameter to a namespaces-specific method.
xml.dom.XML_NAMESPACE
The namespace URI associated with the reserved prefix xml, as defined by Namespaces in XML (section 4).
xml.dom.XMLNS_NAMESPACE
The namespace URI for namespace declarations, as defined by Document Object Model (DOM) Level 2 Core Specification (section 1.1.8).
xml.dom.XHTML_NAMESPACE
The URI of the XHTML namespace as defined by XHTML 1.0: The Extensible Hypertext Markup Language (section 3.1.1).

Also, xml.dom contains a base Node class and the DOM exception classes. The Node class provided by this module does not implement any of the methods or attributes defined by the DOM specification; concrete DOM implementations must provide those. The Node class provided as part of this module does provide the constants used for the nodeType attribute on concrete Node objects; they are located in the class rather than at the module level to conform with the DOM specifications.

Objects in the DOM

The definitive documentation for the DOM is the DOM specification from the W3C.

Note that DOM attributes may also be manipulated as nodes instead of as simple strings. It is fairly rare that you must do this, however, so this usage is not yet documented.

Interface               Section                          Purpose
DOMImplementation       DOMImplementation objects        Interface to the underlying implementation.
Node                    Node objects                     Base interface for most objects in a document.
NodeList                NodeList objects                 Interface for a sequence of nodes.
DocumentType            DocumentType objects             Information about the declarations needed to process a document.
Document                Document objects                 Object which represents an entire document.
Element                 Element objects                  Element nodes in the document hierarchy.
Attr                    Attr objects                     Attribute value nodes on element nodes.
Comment                 Comment objects                  Representation of comments in the source document.
Text                    Text and CDATASection objects    Nodes containing textual content from the document.
ProcessingInstruction   ProcessingInstruction objects    Processing instruction representation.

An additional section describes the exceptions defined for working with the DOM in Python.

DOMImplementation objects

The DOMImplementation interface provides a way for applications to determine the availability of particular features in the DOM they are using. DOM Level 2 added the ability to create new Document and DocumentType objects using the DOMImplementation as well.

DOMImplementation.hasFeature(feature, version)
Return true if the feature identified by the pair of strings feature and version is implemented.
DOMImplementation.createDocument(namespaceUri, qualifiedName, doctype)
Return a new Document object (the root of the DOM), with a child Element object having the given namespaceUri and qualifiedName. The doctype must be a DocumentType object created by createDocumentType(), or None. In the Python DOM API, the first two arguments can also be None to indicate that no element child is to be created.
DOMImplementation.createDocumentType(qualifiedName, publicId, systemId)
Return a new DocumentType object that encapsulates the given qualifiedName, publicId, and systemId strings, representing the information contained in an XML document type declaration.
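
As a sketch of these methods, the example below uses xml.dom.minidom, the DOM implementation shipped with the standard library. The root element name "inventory" is made up.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

has_core = impl.hasFeature("core", "2.0")

# A document whose root element is <inventory>, with no namespace and no doctype.
doc = impl.createDocument(None, "inventory", None)
root_name = doc.documentElement.tagName

# Passing None for the first two arguments creates a document with no element child.
empty_doc = impl.createDocument(None, None, None)
```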

Node objects

All the components of an XML document are subclasses of Node.

Node.nodeType
An integer representing the node type. Symbolic constants for the types are on the Node object: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, NOTATION_NODE. This is a read-only attribute.
Node.parentNode
The parent of the current node, or None for the document node. The value is always a Node object or None. For Element nodes, this will be the parent element, except for the root element, in which case it will be the Document object. For Attr nodes, this is always None. This is a read-only attribute.
Node.attributes
A NamedNodeMap of attribute objects. Only elements have actual values for this; others provide None for this attribute. This is a read-only attribute.
Node.previousSibling
The node that immediately precedes this one with the same parent. For example, this may be the element whose end tag comes just before this element's start tag. Of course, XML documents are made up of more than elements, so the previous sibling could be text, a comment, or something else. If this node is the first child of the parent, this attribute will be None. This is a read-only attribute.
Node.nextSibling
The node that immediately follows this one with the same parent. See also previousSibling. If this is the last child of the parent, this attribute will be None. This is a read-only attribute.
Node.childNodes
A list of nodes contained in this node. This is a read-only attribute.
Node.firstChild
The first child of the node, if there are any, or None. This is a read-only attribute.
Node.lastChild
The last child of the node, if there are any, or None. This is a read-only attribute.
Node.localName
The part of the tagName following the colon if there is one, else the entire tagName. The value is a string.
Node.prefix
The part of the tagName preceding the colon if there is one, else the empty string. The value is a string, or None.
Node.namespaceURI
The namespace associated with the element name. This will be a string or None. This is a read-only attribute.
Node.nodeName
This has a different meaning for each node type; see the DOM specification for details. You can always get the information you would get here from another property such as the tagName property for elements or the name property for attributes. For all node types, the value of this attribute will be either a string or None. This is a read-only attribute.
Node.nodeValue
This has a different meaning for each node type; see the DOM specification for details. The situation is similar to that with nodeName. The value is a string or None.
Node.hasAttributes()
Returns true if the node has any attributes.
Node.hasChildNodes()
Returns true if the node has any child nodes.
Node.isSameNode(other)
Returns true if other refers to the same node as this node. This is especially useful for DOM implementations which use any sort of proxy architecture (because more than one object can refer to the same node).

Note: This is based on a proposed DOM Level 3 API that is still in the "working draft" stage, but this particular interface appears uncontroversial. Changes from the W3C will not necessarily affect this method in the Python DOM interface (though any new W3C API for this would also be supported).
Node.appendChild(newChild)
Add a new child node to this node at the end of the list of children, returning newChild. If the node was already in the tree, it is removed first.
Node.insertBefore(newChild, refChild)
Insert a new child node before an existing child. It must be the case that refChild is a child of this node; if not, ValueError is raised. newChild is returned. If refChild is None, it inserts newChild at the end of the children's list.
Node.removeChild(oldChild)
Remove a child node. oldChild must be a child of this node; if not, ValueError is raised. oldChild is returned on success. If oldChild isn't used further, its unlink() method should be called.
Node.replaceChild(newChild, oldChild)
Replace an existing node with a new node. It must be the case that oldChild is a child of this node; if not, ValueError is raised.
Node.normalize()
Join adjacent text nodes so that all stretches of text are stored as single Text instances. This simplifies processing text from a DOM tree for many applications.
Node.cloneNode(deep)
Clone this node. Setting deep means to clone all child nodes as well. This returns the clone.
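
The tree-editing methods above can be exercised as in the following sketch, which assumes xml.dom.minidom as the DOM implementation and uses a made-up "list" document.

```python
from xml.dom import minidom

doc = minidom.parseString("<list><a/><c/></list>")
root = doc.documentElement
a, c = root.childNodes                 # the two element children

b = doc.createElement("b")
root.insertBefore(b, c)                # children are now a, b, c
order = [n.nodeName for n in root.childNodes]

removed = root.removeChild(a)          # returns the removed node
removed.unlink()                       # release it, since it is not used further

root.appendChild(doc.createTextNode("x"))
root.appendChild(doc.createTextNode("y"))
before = len(root.childNodes)          # b, c, "x", "y"
root.normalize()                       # adjacent text nodes are joined
after = len(root.childNodes)           # b, c, "xy"

shallow = root.cloneNode(False)        # the element only
deep = root.cloneNode(True)            # the element and all descendants
```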

NodeList objects

A NodeList represents a sequence of nodes. These objects are used in two ways in the DOM Core recommendation: Element objects provide one as their list of child nodes, and the getElementsByTagName() and getElementsByTagNameNS() methods of Node return objects with this interface to represent query results.

The DOM Level 2 recommendation defines one method and one attribute for these objects:

NodeList.item(i)
Return the i-th item from the sequence, if there is one, or None. The index i is not allowed to be less than zero or greater than or equal to the length of the sequence.
NodeList.length
The number of nodes in the sequence.

Also, the Python DOM interface requires that some additional support is provided to allow NodeList objects to be used as Python sequences. All NodeList implementations must include support for __len__() and __getitem__(); this allows iteration over the NodeList in for statements and proper support for the len() built-in function.

If a DOM implementation supports modification of the document, the NodeList implementation must also support the __setitem__() and __delitem__() methods.
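
Both the DOM-style and the Python-style access can be seen in this small sketch, which assumes xml.dom.minidom and an invented "ul" document.

```python
from xml.dom import minidom

doc = minidom.parseString("<ul><li>one</li><li>two</li><li>three</li></ul>")
items = doc.getElementsByTagName("li")   # a NodeList of query results

count_attr = items.length                # DOM attribute
count_len = len(items)                   # Python sequence protocol
second = items.item(1).firstChild.data   # DOM method
last = items[-1].firstChild.data         # Python indexing
out_of_range = items.item(10)            # None instead of an exception
texts = [li.firstChild.data for li in items]   # iteration in a for clause
```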

DocumentType objects

Information about the notations and entities declared by a document (including the external subset if the parser uses it and can provide the information) is available from a DocumentType object. The DocumentType for a document is available from the Document object's doctype attribute; if there is no DOCTYPE declaration for the document, the document's doctype attribute will be set to None instead of an instance of this interface.

DocumentType is a specialization of Node, and adds the following attributes:

DocumentType.publicId
The public identifier for the external subset of the document type definition. This will be a string or None.
DocumentType.systemId
The system identifier for the external subset of the document type definition. This will be a URI as a string, or None.
DocumentType.internalSubset
A string giving the complete internal subset from the document. This does not include the brackets which enclose the subset. If the document has no internal subset, this should be None.
DocumentType.name
The name of the root element as given in the DOCTYPE declaration, if present.
DocumentType.entities
This is a NamedNodeMap giving the definitions of external entities. For entity names defined more than once, only the first definition is provided (others are ignored as required by the XML recommendation). This may be None if the information is not provided by the parser, or if no entities are defined.
DocumentType.notations
This is a NamedNodeMap giving the definitions of notations. For notation names defined more than once, only the first definition is provided (others are ignored as required by the XML recommendation). This may be None if the information is not provided by the parser, or if no notations are defined.
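
The attributes above can be inspected on a DocumentType created through the DOMImplementation interface. This sketch assumes xml.dom.minidom; the XHTML identifiers are used only as familiar sample values.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()
doctype = impl.createDocumentType(
    "html",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
)
doc = impl.createDocument("http://www.w3.org/1999/xhtml", "html", doctype)

print(doc.doctype.name)            # html
print(doc.doctype.publicId)
print(doc.doctype.systemId)
print(doc.doctype.internalSubset)  # None, since no internal subset was given

# A document parsed without a DOCTYPE declaration has doctype set to None.
plain = minidom.parseString("<note/>")
```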

Document objects

A Document represents an entire XML document, including its constituent elements, attributes, processing instructions, comments etc. Remember that it inherits properties from Node.

Document.documentElement
The one and only root element of the document.
Document.createElement(tagName)
Create and return a new element node. The element is not inserted into the document when it is created. You need to explicitly insert it with one of the other methods such as insertBefore() or appendChild().
Document.createElementNS(namespaceURI, tagName)
Create and return a new element with a namespace. The tagName may have a prefix. The element is not inserted into the document when it is created. You need to explicitly insert it with one of the other methods such as insertBefore() or appendChild().
Document.createTextNode(data)
Create and return a text node containing the data passed as a parameter. As with the other creation methods, this one does not insert the node into the tree.
Document.createComment(data)
Create and return a comment node containing the data passed as a parameter. As with the other creation methods, this one does not insert the node into the tree.
Document.createProcessingInstruction(target, data)
Create and return a processing instruction node containing the target and data passed as parameters. As with the other creation methods, this one does not insert the node into the tree.
Document.createAttribute(name)
Create and return an attribute node. This method does not associate the attribute node with any particular element. You must use setAttributeNode() on the appropriate Element object to use the newly created attribute instance.
Document.createAttributeNS(namespaceURI, qualifiedName)
Create and return an attribute node with a namespace. The qualifiedName may have a prefix. This method does not associate the attribute node with any particular element. You must use setAttributeNode() on the appropriate Element object to use the newly created attribute instance.
Document.getElementsByTagName(tagName)
Search for all descendants (direct children, children's children, etc.) with a particular element type name.
Document.getElementsByTagNameNS(namespaceURI,  localName)
Search for all descendants (direct children, children's children, etc.) with a particular namespace URI and localName. The localName is the part of the qualified name after the prefix.
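The creation and search methods above can be sketched with xml.dom.minidom as follows. This is a minimal illustration, not a definitive recipe; the element names (library, book), the attribute lang, and the text content are hypothetical.

```python
from xml.dom.minidom import getDOMImplementation

# Build an empty document with a hypothetical <library> root element.
doc = getDOMImplementation().createDocument(None, "library", None)
root = doc.documentElement

# Creation methods return detached nodes; nothing is in the tree yet.
book = doc.createElement("book")
title = doc.createTextNode("Dive Into DOM")
note = doc.createComment("added by hand")
assert book.parentNode is None

# Nodes must be inserted explicitly.
book.appendChild(title)
root.appendChild(note)
root.appendChild(book)

# An attribute node is attached to an element with setAttributeNode().
lang = doc.createAttribute("lang")
lang.value = "en"
book.setAttributeNode(lang)

# Search the whole document for elements by type name.
books = doc.getElementsByTagName("book")
serialized = root.toxml()
print(len(books), serialized)
```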

Element objects

Element is a subclass of Node, so inherits all the attributes of that class.

Element.tagName
The element type name. In a namespace-using document, it may have colons in it. The value is a string.
Element.getElementsByTagName(tagName)
Same as equivalent method in the Document class.
Element.getElementsByTagNameNS(namespaceURI,  localName)
Same as equivalent method in the Document class.
Element.hasAttribute(name)
Returns true if the element has an attribute named by name.
Element.hasAttributeNS(namespaceURI, localName)
Returns true if the element has an attribute named by namespaceURI and localName.
Element.getAttribute(name)
Return the value of the attribute named by name as a string. If no such attribute exists, an empty string is returned, as if the attribute had no value.
Element.getAttributeNode(attrname)
Return the Attr node for the attribute named by attrname.
Element.getAttributeNS(namespaceURI, localName)
Return the value of the attribute named by namespaceURI and localName as a string. If no such attribute exists, an empty string is returned, as if the attribute had no value.
Element.getAttributeNodeNS(namespaceURI, localName)
Return an attribute value as a node, given a namespaceURI and localName.
Element.removeAttribute(name)
Remove an attribute by name. If there is no matching attribute, a NotFoundErr is raised.
Element.removeAttributeNode(oldAttr)
Remove and return oldAttr from the attribute list, if present. If oldAttr is not present, NotFoundErr is raised.
Element.removeAttributeNS(namespaceURI, localName)
Remove an attribute by name. Note that it uses a localName, not a qname. No exception is raised if there is no matching attribute.
Element.setAttribute(name,  value)
Set an attribute value from a string.
Element.setAttributeNode(newAttr)
Add a new attribute node to the element, replacing an existing attribute if necessary if the name attribute matches. If a replacement occurs, the old attribute node is returned. If newAttr is already in use, InuseAttributeErr is raised.
Element.setAttributeNodeNS(newAttr)
Add a new attribute node to the element, replacing an existing attribute if necessary if the namespaceURI and localName attributes match. If a replacement occurs, the old attribute node is returned. If newAttr is already in use, InuseAttributeErr is raised.
Element.setAttributeNS(namespaceURI,  qname,  value)
Set an attribute value from a string, given a namespaceURI and a qname. Note that a qname is the whole attribute name, including any prefix. This differs from the methods above, which take a localName.
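A short sketch of the non-namespace attribute methods, using xml.dom.minidom. The markup and attribute values are made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical markup used only for this illustration.
doc = parseString('<a href="https://example.com" id="x1">link</a>')
elem = doc.documentElement

tag = elem.tagName                       # element type name
has_href = elem.hasAttribute("href")     # attribute is present
missing = elem.getAttribute("missing")   # absent attribute gives ""

# Set a new attribute, then fetch it as an Attr node.
elem.setAttribute("title", "Example")
title_node = elem.getAttributeNode("title")

# Remove an attribute that exists.
elem.removeAttribute("id")
has_id = elem.hasAttribute("id")
```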

Attr objects

Attr inherits from Node, so inherits all its attributes.

Attr.name
The attribute name. In a namespace-using document, it may include a colon.
Attr.localName
The part of the name following the colon if there is one, else the entire name. This is a read-only attribute.
Attr.prefix
The part of the name preceding the colon if there is one, else the empty string.
Attr.value
The text value of the attribute. This is a synonym for the nodeValue attribute.
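The following sketch reads these properties from a prefixed attribute parsed by xml.dom.minidom. The document and the xlink prefix are chosen only for illustration. For attributes without a prefix, the value of prefix may vary between implementations, so the sketch does not rely on it.

```python
from xml.dom.minidom import parseString

# Hypothetical namespace-using document.
doc = parseString(
    '<root xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#top"/>'
)
href = doc.documentElement.getAttributeNode("xlink:href")

name = href.name            # full name, including the prefix and colon
local = href.localName      # part after the colon
prefix = href.prefix        # part before the colon
value = href.value          # text value
same = href.value == href.nodeValue   # value mirrors nodeValue
```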

NamedNodeMap objects

NamedNodeMap does not inherit from Node.

NamedNodeMap.length
The length of the attribute list.
NamedNodeMap.item(index)
Return an attribute with a particular index. The order you get the attributes in is arbitrary, but is consistent for the life of a DOM. Each item is an attribute node. Get its value with the value attribute.
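As a minimal sketch, an element's attributes can be walked with length and item(). The img element and its attributes are hypothetical. Because the order is arbitrary, the results are collected into a dict rather than compared as a sequence.

```python
from xml.dom.minidom import parseString

doc = parseString('<img src="a.png" alt="A" width="10"/>')
attrs = doc.documentElement.attributes   # a NamedNodeMap

found = {}
for i in range(attrs.length):
    attr = attrs.item(i)          # each item is an attribute node
    found[attr.name] = attr.value
```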

Comment objects

Comment represents a comment in the XML document. It is a subclass of Node, but cannot have child nodes.

Comment.data
The content of the comment as a string. The attribute contains all characters between the leading <!-- and trailing -->, but does not include them.
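For instance, assuming a document parsed with xml.dom.minidom (the comment text here is made up):

```python
from xml.dom.minidom import parseString

doc = parseString("<root><!-- build 42 --></root>")
comment = doc.documentElement.firstChild

is_comment = comment.nodeType == comment.COMMENT_NODE
text = comment.data               # delimiters are not included
child_count = len(comment.childNodes)
```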

Text and CDATASection objects

The Text interface represents text in the XML document. If the parser and DOM implementation support the DOM's XML extension, portions of the text enclosed in CDATA marked sections are stored in CDATASection objects. These two interfaces are identical, but provide different values for the nodeType attribute.

These interfaces extend the Node interface. They cannot have child nodes.

Text.data
The content of the text node as a string.

Note: The use of a CDATASection node does not indicate that the node represents a complete CDATA marked section, only that the content of the node was part of a CDATA section. A single CDATA section may be represented by more than one node in the document tree. There is no way to determine whether two adjacent CDATASection nodes represent different CDATA marked sections.
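A small sketch of telling the two node types apart by nodeType. It assumes the parser behind xml.dom.minidom preserves CDATA sections as CDATASection nodes, which appears to be the default in CPython; other DOM implementations may deliver plain Text nodes instead.

```python
from xml.dom.minidom import parseString, Node

# Hypothetical document mixing ordinary text and a CDATA section.
doc = parseString("<root>plain<![CDATA[<raw> & text]]></root>")
children = doc.documentElement.childNodes

kinds = [child.nodeType for child in children]
combined = "".join(child.data for child in children)
```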

ProcessingInstruction objects

Represents a processing instruction in the XML document; this inherits from the Node interface and cannot have child nodes.

ProcessingInstruction.target
The content of the processing instruction up to the first whitespace character. This is a read-only attribute.
ProcessingInstruction.data
The content of the processing instruction following the first whitespace character.
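As an illustration, the split between target and data can be seen on a stylesheet processing instruction (the file name style.xsl is hypothetical):

```python
from xml.dom.minidom import parseString, Node

doc = parseString('<?xml-stylesheet type="text/xsl" href="style.xsl"?><root/>')
pi = doc.firstChild   # the processing instruction precedes the root element

is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target = pi.target    # text up to the first whitespace character
data = pi.data        # everything after it
```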

Exceptions

The DOM Level 2 recommendation defines a single exception, DOMException, and some constants that allow applications to determine what sort of error occurred. DOMException instances carry a code attribute that provides the appropriate value for the specific exception.

The Python DOM interface provides the constants, but also expands the set of exceptions so that a specific exception exists for each of the exception codes defined by the DOM. The implementations must raise the appropriate specific exception, each of which carries the appropriate value for the code attribute.

exception xml.dom.DOMException
Base exception class used for all specific DOM exceptions. This exception class cannot be directly instantiated.
exception xml.dom.DomstringSizeErr
Raised when a specified range of text does not fit into a string. This is not known to be used in the Python DOM implementations, but may be received from DOM implementations not written in Python.
exception xml.dom.HierarchyRequestErr
Raised when an attempt is made to insert a node where the node type is not allowed.
exception xml.dom.IndexSizeErr
Raised when an index or size parameter to a method is negative or exceeds the allowed values.
exception xml.dom.InuseAttributeErr
Raised when an attempt is made to insert an Attr node that is already present elsewhere in the document.
exception xml.dom.InvalidAccessErr
Raised if a parameter or an operation is not supported on the underlying object.
exception xml.dom.InvalidCharacterErr
This exception is raised when a string parameter contains a character that is not permitted in the context it's being used in by the XML 1.0 recommendation. For example, attempting to create an Element node with a space in the element type name causes this error to be raised.
exception xml.dom.InvalidModificationErr
Raised when an attempt is made to modify the type of a node.
exception xml.dom.InvalidStateErr
Raised when an attempt is made to use an object that is not defined or is no longer usable.
exception xml.dom.NamespaceErr
If an attempt is made to change any object in a way that is not permitted with regard to the Namespaces in XML recommendation, this exception is raised.
exception xml.dom.NotFoundErr
Exception when a node does not exist in the referenced context. For example, NamedNodeMap.removeNamedItem() will raise this if the node passed in does not exist in the map.
exception xml.dom.NotSupportedErr
Raised when the implementation does not support the requested type of object or operation.
exception xml.dom.NoDataAllowedErr
This is raised if data is specified for a node which does not support data.
exception xml.dom.NoModificationAllowedErr
Raised on attempts to modify an object where modifications are not allowed (such as for read-only nodes).
exception xml.dom.SyntaxErr
Raised when an invalid or illegal string is specified.
exception xml.dom.WrongDocumentErr
Raised when a node is inserted in a different document than it currently belongs to, and the implementation does not support migrating the node from one document to the other.

The exception codes defined in the DOM recommendation map to the exceptions described above according to this table:

Constant Exception
DOMSTRING_SIZE_ERR DomstringSizeErr
HIERARCHY_REQUEST_ERR HierarchyRequestErr
INDEX_SIZE_ERR IndexSizeErr
INUSE_ATTRIBUTE_ERR InuseAttributeErr
INVALID_ACCESS_ERR InvalidAccessErr
INVALID_CHARACTER_ERR InvalidCharacterErr
INVALID_MODIFICATION_ERR InvalidModificationErr
INVALID_STATE_ERR InvalidStateErr
NAMESPACE_ERR NamespaceErr
NOT_FOUND_ERR NotFoundErr
NOT_SUPPORTED_ERR NotSupportedErr
NO_DATA_ALLOWED_ERR NoDataAllowedErr
NO_MODIFICATION_ALLOWED_ERR NoModificationAllowedErr
SYNTAX_ERR SyntaxErr
WRONG_DOCUMENT_ERR WrongDocumentErr
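The relationship between the base class, a specific exception, and its code constant can be sketched as below. The exception is raised by hand purely for illustration, and the message text is hypothetical. In CPython, instantiating DOMException directly currently surfaces as a RuntimeError; treat that detail as an assumption about the implementation rather than part of the DOM API.

```python
import xml.dom

# DOMException is only a base class; instantiating it directly fails.
try:
    xml.dom.DOMException()
    base_instantiated = True
except RuntimeError:
    base_instantiated = False

# A specific exception can be caught through the base class, and its
# code attribute matches the corresponding constant.
try:
    raise xml.dom.NotFoundErr("hypothetical: node not in this map")
except xml.dom.DOMException as exc:
    caught_code = exc.code

codes_match = caught_code == xml.dom.NOT_FOUND_ERR
```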

Type mapping

The IDL types used in the DOM specification are mapped to Python types according to the following table.

IDL Type Python Type
boolean bool or int
int int
long int int
unsigned int int
DOMString str or bytes
null None

Accessor methods

The mapping from OMG IDL to Python defines accessor functions for IDL attribute declarations in much the way the Java mapping does. Mapping the IDL declarations

readonly attribute string someValue;
         attribute string anotherValue;

yields three accessor functions: a "get" method for someValue (_get_someValue()), and "get" and "set" methods for anotherValue (_get_anotherValue() and _set_anotherValue()). The mapping, in particular, does not require that the IDL attributes are accessible as normal Python attributes: object.someValue is not required to work, and may raise an AttributeError.

The Python DOM API, however, does require that normal attribute access work. This indicates the typical surrogates generated by Python IDL compilers are not likely to work, and wrapper objects may be needed on the client if the DOM objects are accessed via CORBA. While this does require some additional consideration for CORBA DOM clients, the implementers with experience using DOM over CORBA from Python do not consider this a problem. Attributes that are declared readonly may not restrict write access in all DOM implementations.

In the Python DOM API, accessor functions are not required. If provided, they should take the form defined by the Python IDL mapping, but these methods are considered unnecessary since the attributes are accessible directly from Python. "Set" accessors should never be provided for readonly attributes.

The IDL definitions do not fully embody the requirements of the W3C DOM API, such as the notion of certain objects, such as the return value of getElementsByTagName(), being "live". The Python DOM API does not require implementations to enforce such requirements.

xml.dom.minidom: Minimal DOM implementation

xml.dom.minidom is a minimal implementation of the Document Object Model interface, with an API similar to that in other languages. It is intended to be simpler than the full DOM and also significantly smaller. Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.

Warning

The xml.dom.minidom module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

DOM applications often start by parsing some XML into a DOM. With xml.dom.minidom, this is done through the parse functions:

from xml.dom.minidom import parse, parseString
dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name
datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file
dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

The parse() function can take either a file name or an open file object.

xml.dom.minidom.parse(filename_or_file, parser=None, bufsize=None)
Return a document from the given input. The filename_or_file may be either a file name, or a file-like object. The parser, if given, must be a SAX2 parser object. This function changes the document handler of the parser and activates namespace support; other parser configuration (like setting an entity resolver) must be done in advance.

If you have XML in a string, you can use the parseString() function instead:

xml.dom.minidom.parseString(string,  parser=None)
Return a Document that represents the string. This method creates an io.StringIO object for the string and passes that on to parse().

Both functions return a Document object representing the content of the document.
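A minimal sketch showing that both entry points produce equivalent documents. An in-memory io.BytesIO object stands in for an open file here, so the example does not depend on any path on disk.

```python
import io
from xml.dom.minidom import parse, parseString

xml_bytes = b"<myxml>Some data<empty/> some more data</myxml>"

# parse() accepts a file-like object as well as a file name.
dom_from_file = parse(io.BytesIO(xml_bytes))

# parseString() takes the XML text directly.
dom_from_string = parseString(xml_bytes.decode("utf-8"))

file_xml = dom_from_file.documentElement.toxml()
string_xml = dom_from_string.documentElement.toxml()
```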

What the parse() and parseString() functions do is connect an XML parser with a "DOM builder" that can accept parse events from any SAX parser and convert them into a DOM tree. The names of the functions are perhaps misleading, but are easy to grasp when learning the interfaces. The parsing of the document will be completed before these functions return; it's simply that these functions do not provide a parser implementation themselves.

You can also create a Document by calling a method on a "DOM Implementation" object. You can get this object either by calling the getDOMImplementation() function in the xml.dom package or the xml.dom.minidom module. Once you have a Document, you can add child nodes to it to populate the DOM:

from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)

Once you have a DOM document object, you can access the parts of your XML document through its properties and methods. These properties are defined in the DOM specification. The main property of the document object is the documentElement property. It gives you the main element in the XML document: the one that holds all others. Here is an example program:

dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"

When you are finished with a DOM tree, you may optionally call the unlink() method to encourage early cleanup of the now-unneeded objects. unlink() is an xml.dom.minidom-specific extension to the DOM API that renders the node and its descendants essentially useless. Otherwise, Python's garbage collector will eventually take care of the objects in the tree.

DOM objects

The definition of the DOM API for Python is given as part of the xml.dom module documentation. This section lists the differences between the API and xml.dom.minidom.

Node.unlink()
Break internal references in the DOM so that it is garbage collected on versions of Python without cyclic GC. Even when cyclic GC is available, using this can make large amounts of memory available sooner, so calling this on DOM objects as soon as they are no longer needed is good practice. This only needs to be called on the Document object, but may be called on child nodes to discard children of that node.

You can avoid calling this method explicitly using the with statement. The following code will automatically unlink dom when the with block is exited:

with xml.dom.minidom.parse(datasource) as dom:
    ... # Work with dom.
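A self-contained sketch of the same pattern, using parseString() on hypothetical markup so no file is required. The final check assumes the CPython behavior of emptying the child list on unlink.

```python
import xml.dom.minidom

with xml.dom.minidom.parseString("<root><child/></root>") as dom:
    # The tree is fully usable inside the block.
    names = [node.tagName for node in dom.documentElement.childNodes]

# Leaving the block calls unlink(), so the document no longer
# holds on to its children.
remaining = len(dom.childNodes)
```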
Node.writexml(writer,  indent="",  addindent="",  newl="")
Write XML to the writer object. The writer should have a write() method which matches that of the file object interface. The indent parameter is the indentation of the current node. The addindent parameter is the incremental indentation to use for subnodes of the current one. The newl parameter specifies the string to use to terminate lines.

For the Document node, an additional keyword argument encoding can specify the encoding field of the XML header.
Node.toxml(encoding=None)
Return a string or byte string containing the XML represented by the DOM node.

With an explicit encoding argument, the result is a byte string in the specified encoding. With no encoding argument, the result is a Unicode string, and the XML declaration in the resulting string does not specify an encoding. Encoding this string in an encoding other than UTF-8 is likely incorrect since UTF-8 is the default encoding of XML.
Node.toprettyxml(indent="\t", newl="\n", encoding=None)
Return a pretty-printed version of the document. The indent specifies the indentation string and defaults to a tab character; newl specifies the string emitted at the end of each line and defaults to \n.

The encoding argument behaves like the corresponding argument of toxml().
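The three serialization methods can be compared side by side. This is a sketch on a tiny hypothetical document; the exact whitespace shown in the test reflects recent CPython versions and may differ in older releases.

```python
import io
from xml.dom.minidom import parseString

dom = parseString("<a><b>x</b><c/></a>")

as_text = dom.toxml()                    # str, no encoding in the declaration
as_bytes = dom.toxml(encoding="utf-8")   # bytes in the requested encoding
pretty = dom.toprettyxml(indent="  ")    # two-space indentation per level

# writexml() writes to any object with a write() method.
buf = io.StringIO()
dom.documentElement.writexml(buf, addindent="  ", newl="\n")
written = buf.getvalue()
```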

DOM example

This example program is a fairly realistic example of a simple program. In this particular case, we do not take much advantage of the flexibility of the DOM.

import xml.dom.minidom
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""
dom = xml.dom.minidom.parseString(document)
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")
def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)
def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))
def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))
def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))
def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")
def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))
def handleToc(slides):
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))
handleSlideshow(dom)

minidom and the DOM standard

The xml.dom.minidom module is essentially a DOM 1.0-compatible DOM with some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward. The following mapping rules apply:

  • Interfaces are accessed through instance objects. Applications should not instantiate the classes themselves; they should use the creator functions available on the Document object. Derived interfaces support all operations (and attributes) from the base interfaces, plus any new operations.
  • Operations are used as methods. Since the DOM uses only in parameters, the arguments are passed in normal order (from left to right). There are no optional arguments. The void operations return None.
  • IDL attributes map to instance attributes. For compatibility with the OMG IDL language mapping for Python, an attribute foo can also be accessed through accessor methods _get_foo() and _set_foo(). readonly attributes must not be changed; this is not enforced at runtime.
  • The types short int, unsigned int, unsigned long long, and boolean all map to Python integer objects.
  • The type DOMString maps to Python strings. xml.dom.minidom supports either bytes or strings, but normally produces strings. Values of type DOMString may also be None where allowed to have the IDL null value by the DOM specification from the W3C.
  • const declarations map to variables in their respective scope (e.g., xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE); they must not be changed.
  • DOMException is currently not supported in xml.dom.minidom. Instead, xml.dom.minidom uses standard Python exceptions such as TypeError and AttributeError.
  • NodeList objects are implemented using Python's built-in list type. These objects provide the interface defined in the DOM specification, but with earlier versions of Python they do not support the official API. They are, however, much more "Pythonic" than the interface defined in the W3C recommendations.
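A few of the rules above can be sketched in code: a NodeList behaves like a Python list while still offering the DOM-style length and item(), and const declarations appear as class-level variables. The markup is hypothetical.

```python
from xml.dom.minidom import parseString, Node

doc = parseString("<r><i>1</i><i>2</i></r>")
items = doc.getElementsByTagName("i")

# List-style access: iteration, len(), and indexing.
values = [item.firstChild.data for item in items]
count = len(items)

# DOM-style access on the same object.
dom_count = items.length
first = items.item(0)

# A const declaration mapped to a class-level variable.
pi_type = Node.PROCESSING_INSTRUCTION_NODE
```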

The following interfaces have no implementation in xml.dom.minidom:

  • DOMTimeStamp
  • EntityReference

Most of these reflect information in the XML document that is not of general utility to most DOM users.