Process XML using regular expressions with Python

I wrote a post about using xml.etree.ElementTree as an XML parser in Python. However, this causes a problem: the XML parser automatically converts the HTML escape characters in the XML into Unicode characters, and unfortunately this is not what we want. I haven't fully understood the different encoding systems yet; anyway, here is a good post discussing Unicode and HTML escape characters in Python.

So I turned to a more straightforward method: using a regular expression (RE) to get the corresponding fields. Let's look at the example XML file again, which has a crazy ssid attribute. The XML parser converts the HTML escape sequences in it (the one for Ü, as well as &lt; and &amp;) into the Unicode characters they stand for.

<configuration auth="OPEN" encryption="NONE" type="wlan" ssid="FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>
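To make the problem concrete, here is a minimal sketch of what the parser does to this element. It assumes Python 3 and feeds the element in as a string instead of reading test.xml; the names SAMPLE and parsed_ssid are mine, just for illustration. Note that the Ü appears here as a literal character, exactly as in the line above, so only &lt; and &amp; are visibly converted.

```python
import xml.etree.ElementTree as ET

# The example element from this post, with plain double quotes.
SAMPLE = (
    '<configuration auth="OPEN" encryption="NONE" type="wlan" '
    'ssid="FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" '
    'hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>'
)

element = ET.fromstring(SAMPLE)
parsed_ssid = element.get("ssid")

# By this point the parser has replaced &lt; with < and &amp; with &,
# so the value no longer matches the text stored in the file.
print(parsed_ssid)
```

The printed value contains < and & where the file has &lt; and &amp;, which is exactly the conversion I want to avoid.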

If we just use an RE, we only need to look for a string that starts with ssid=" and ends with ".


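import re

# Read the whole XML file into a single string
with open('test.xml') as f:
    xml_string = f.read()

# Capture everything after ssid=" up to (not including) the next double quote
match_ssid = re.search(r'ssid="([^"]+)', xml_string)
print(match_ssid.group(1))
# Result: FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG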

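The RE [^"]+ means one or more characters other than ". Because the match stops at the first double quote, the captured group is exactly the raw attribute value, escapes included. REs are such a powerful tool for dealing with strings; I will explore them more in the future.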
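The same idea can be wrapped in a small helper so it works for any attribute, not only ssid. This is just a sketch under a few assumptions: the function name find_attribute is hypothetical, attribute values are always wrapped in double quotes, and only the first match in the text matters. It also returns None instead of raising an error when the attribute is missing.

```python
import re

# The example element from this post, with plain double quotes.
SAMPLE = (
    '<configuration auth="OPEN" encryption="NONE" type="wlan" '
    'ssid="FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" '
    'hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>'
)


def find_attribute(xml_text, name):
    """Return the raw text of the first name="..." attribute, or None."""
    # re.escape guards against special characters in the attribute name;
    # \b keeps "ssid" from matching inside a longer name such as "essid".
    pattern = r'\b' + re.escape(name) + r'="([^"]*)"'
    match = re.search(pattern, xml_text)
    return match.group(1) if match else None


print(find_attribute(SAMPLE, "ssid"))     # raw value, escapes untouched
print(find_attribute(SAMPLE, "hwaddr"))
print(find_attribute(SAMPLE, "channel"))  # not in the sample, so None
```

This is still plain string matching, not XML parsing, so it would also match text inside comments or other elements; for a small config file like this one that trade-off seems acceptable.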
