
Process XML using regular expression with Python

I wrote a post about using xml.etree.ElementTree as an XML parser in Python. However, this causes a problem: the XML parser automatically converts the HTML escape characters in the XML into Unicode characters, and unfortunately this is not what we want here. I don't fully understand the different encoding systems yet; anyway, here is a good post discussing Unicode and HTML escape characters in Python.

So I went with the straightforward method: use a regular expression (RE) to get the corresponding fields. Let's look at the example XML file again, which has a crazy ssid attribute. The XML parser converts the HTML escape characters, e.g. the one for Ü, into Unicode characters.

<configuration auth="OPEN" encryption="NONE" type="wlan" ssid="FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>
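To see the conversion concretely, here is a minimal sketch that parses the same element with xml.etree.ElementTree. It assumes the Ü is stored in the file as the numeric character reference &#220;; the real file may use a different form, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the Ü is written as the numeric character reference &#220;.
# The actual file may encode it differently.
xml_string = (
    '<configuration auth="OPEN" encryption="NONE" type="wlan" '
    'ssid="FIkU&#220;{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" '
    'hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>'
)

element = ET.fromstring(xml_string)
parsed_ssid = element.get('ssid')

# The parser has already decoded &#220;, &lt; and &amp;
print(parsed_ssid)
# FIkUÜ{uo8wfS<5&MXMqgve<lTG
```

The escaped text is gone from the parsed value, which is why the raw string from the file has to be read some other way.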

With an RE, we just look for a string that starts with ssid=" and ends at the next ".

import re
# Read the whole XML file into one string
with open('test.xml') as f:
    xml_string = f.read()
# Capture everything after ssid=" up to the next double quote
match_ssid = re.search(r'ssid="([^"]+)', xml_string)
print(match_ssid.group(1))
# Result: FIkUÜ{uo8wfS&lt;5&amp;MXMqgve&lt;lTG

The RE [^"]+ matches one or more characters other than ". REs are such a powerful tool for dealing with strings; I will explore them more in the future.
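The same idea can be wrapped in a small helper that works for any attribute. This is only a sketch: get_raw_attribute is a hypothetical name, the sample string assumes the Ü is stored as &#220;, and the pattern differs slightly from the one above (it uses [^"]* so that empty values match, requires the closing quote, and returns None instead of failing when the attribute is missing).

```python
import re

def get_raw_attribute(xml_string, name):
    """Return the raw, still-escaped value of attribute `name`, or None."""
    # \b keeps "ssid" from matching inside a longer name such as "essid"
    pattern = r'\b' + re.escape(name) + r'="([^"]*)"'
    match = re.search(pattern, xml_string)
    return match.group(1) if match else None

# Assumption: the Ü is written as &#220; in the file
sample = (
    '<configuration auth="OPEN" encryption="NONE" type="wlan" '
    'ssid="FIkU&#220;{uo8wfS&lt;5&amp;MXMqgve&lt;lTG" '
    'hwaddr="68:9f:f2:fd:da:e0" allocation="dhcp"/>'
)

print(get_raw_attribute(sample, 'ssid'))     # FIkU&#220;{uo8wfS&lt;5&amp;MXMqgve&lt;lTG
print(get_raw_attribute(sample, 'hwaddr'))   # 68:9f:f2:fd:da:e0
print(get_raw_attribute(sample, 'missing'))  # None
```

Note that this only handles attribute values in double quotes and returns the first match in the string, so it suits simple files like this one, not arbitrary XML.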
