I recently had to request the content of tree-like structures (in python). After a few iterations, I realized that I needed a language for such a task. I certainly prefer to read a simple request than a complex algorithm. After all, we already have something similar: XPath to request XML structures.
First (stupid) approach: create a comprehensive XML tree
My first approach was to transform my data into an XML tree, then request it with the corresponding XPath engine. It definitely works if you can create the tree structure (because it is not too large, nor does it take too long to create the tree).
Let's take an example (a very simple one, but that's for demo purposes). We want to request the file system. Let's say we want to find a folder that contains 2 things: a file with a "tgz" extension, and a sub-folder containing a "foo.txt" file. The folder should be located directly under the root node. (I want to avoid requests that need to traverse the whole tree)
A file system is definitely a tree structure. To keep it simple, we will create an XML structure with the following properties: the directories will have the "dir" name, and there will be two attribute: the name of the directory, and longname (the full name, given where the first node is). The files will be represented as "file" nodes, with the name and extensions as attributes "name" and "ext".
BTW, we have the following tree file structure:
C:\Tmp
Baz (dir)
Bar (dir)
A (dir)
text.tgz (file)
B (dir)
foo.txt (file)
Here is the code (with ElementTree 1.3a3-20070912-preview, taken from svn)
import unittest
from elementtree.ElementPath import findall
import elementtree.ElementTree as ET
import os, sys
def createXmlTree (start):
elements = {} # a dictionary of nodes
# Create the very root node
root = ET.Element ("dir")
root.set ("name", start)
root.set ("longname", start)
elements [start] = root
for nodes in os.walk (start):
parent = elements [nodes[0]] # we find which node we should be attanched to
for folder in nodes[1]: #nodes[1] is the children directories
elt = ET.SubElement (parent, "dir")
#Name of the folder
elt.set ("name", folder)
#Long name, including parent long name
long_name = os.sep.join ((parent.get("longname"), folder))
elt.set ("longname", long_name)
elements [long_name] = elt
for file in nodes [2]: #nodes[1] is the children files
elt = ET.SubElement (parent, "file")
elt.set ("name", file)
elt.set ("ext", os.path.splitext(file)[1][1:])
return ET.ElementTree(root)
class TestXPathOnRealXmlTree (unittest.TestCase):
def testOnCTmp (self):
doc = createXmlTree (r"C:\tmp")
result= [e for e in findall (doc.getroot(), "./dir/dir/file[@ext='tgz']/../dir/file[@name='foo.txt']/../..")]
assert len(result) == 1
assert result[0].get("name") == "A"
assert result[0].tag == "dir"
OK, the XPath request is not very simple, but it works quite well.
It says: for each "dir" element, find a "dir" element in it, check that it contains a file with a "ext" attribute named "tgz". If found come back to the "dir" element (with ..), then find a "dir" element under it, then find a "file" element under it, with a "name" attribute which is equal to "foo.txt". If found, come back to the previous dir (..), then the previous one (..).
Ok, it works, I can see my "A" directory.
But we have to create the entire XML document first. If we want to do this type of request against the entire file system, this is definitely not a good idea. Therefore, we will try another approach, with dynamically created elements.
(BTW, you can try to do the same exact thing with
lxml,
pdis-xpaths, and with
pyXml (no longer maintained, but still used)
Using ElementTree with dynamically created elements and XPath requestsNow we need to understand how the XPath engine traverse the XML tree, calling which functions and accessing which attributes.
For this, I investigated how the ElementTree XPath engine accessed the internal data from its Elements, and I re-implemented them as properties:
class DynamicElement (ET.Element):
def __init__ (self):
self._attrib = None
self._children_list = None
self.cachedTag = None
def _dyn_get_attrs (self):
raise NotImplementedError ("must have attributes")
def _dyn_get_children (self):
raise NotImplementedError ("must have children")
def _dyn_get_tag (self):
raise NotImplementedError ("must have a tag")
def _children (self):
if not self._children_list :
self._children_list = self._dyn_get_children ()
return self._children_list
_children = property (_children)
def attrib (self):
if not self._attrib :
self._attrib = self._dyn_get_attrs ()
return self._attrib
attrib = property (attrib)
def tag (self):
if not self.cachedTag :
self.cachedTag = self._dyn_get_tag()
return self.cachedTag
tag = property (tag)
class FileSystemElement (DynamicElement):
def __init__ (self, isDir, name, parent):
self.name = name
self.isDir = isDir
self.parent = parent
super (FileSystemElement,self).__init__ ()
def _dyn_get_tag (self):
return "dir" if self.isDir else "file"
# We need to have an attrib property that the XPath engine will call
def _dyn_get_attrs (self):
attrib = {}
attrib ["longname"] = os.path.join (self.parent.get("longname"), self.name) if self.parent else self.name
attrib ["name"] = self.name
if not self.isDir : attrib ["ext"] = os.path.splitext(self.name)[1][1:]
return attrib
def _dyn_get_children (self):
(dirs, files) = os.walk(self.get("longname")).next()[1:]
children_list = [FileSystemElement (isDir=True, name=dir, parent=self) for dir in dirs]
children_list.extend ([FileSystemElement (isDir=False, name=f, parent=self) for f in files])
return children_list
class TestXPathOnDynamicXmlTree (unittest.TestCase):
def testOnCTmp (self):
root = FileSystemElement(isDir = True, name="C:\\tmp", parent=None)
doc = ET.ElementTree(root)
result= [e for e in findall (doc.getroot(), "./dir/dir/file[@ext='tgz']/../dir/file[@name='foo.txt']/../..")]
assert len(result) == 1
assert result[0].get("name") == "A"
assert result[0].tag == "dir"
In fact, you just need to inherit the DynamicElement class, and to implement 3 functions, and you can have XPath requests against them.
Now a few examples (with root = FileSystemElement(isDir = True, name="put your root dir here", parent=None)):
- Finding all tarballs: findall (doc.getroot(), ".//file[@ext='tgz'])
- Finding all tarballs that are somewhere under all directories named "foo": findall (doc.getroot(), ".//dir[@name='foo']//file[@ext='tgz'])
- Finding all directories: findall (doc.getroot(), ".//dir)
Remarks:
- Don't use "..' in your XPath expression. The ElementTree XPath engine creates the entire tree structure if you do so... which is a problem if the tree is huge
- If you use "//' in the XPath expression and "findall", the XPath engine will generate all elements while traversing the entire tree again.
------------
Update: I just realized there was already something similar to request unix file system with
xfind