dimanche 29 mars 2009

XPath for my trees, continued

In my previous post, I demonstrated how to use XPath to request dynamically created trees. These dynamically created trees can be used to encapsulate any type of tree structure, making it possible to request any tree structure with XPath.

A few more remarks and ideas now:
  • This works also for infinite trees. For instance, I used this system to encapsulate web pages, the children of each page being the pages it links to. Well, it "works" as long as you don't let the XPath engine request the entire tree structure, and I found 2 issues so far: ".." will try to create a map with *all* nodes. And "//" is obviously an XPath idiom that will traverse the entire tree. Don't do that with very large tree...

  • You can also use this to request database structures: If an Element encapsulates a record in a table, you can see the children of it as the records in other tables with foreign keys pointing to it. Examples: Table1[@col1='foo']/Table2[@col2='bar'] will request all elements from Table2 where the attributes (the column) col2 = 'bar', and it has a link to Table1, where the col1 = 'foo'. Whao, I prefer to write that than SQL, but there are drawbacks: First, it will create a lot of SQL requests, and the XPath engine I used (ElementTree 1.3) will get *all* children of a node before taking only the ones that correspond to the request. If you have *many* children, you have a performance issue (I do...). Small (but time consuming, I think) enhancements to the ElementTree API would solve this problem. If the XPath engine used a function that requests only the children with specific attributes, I could fool it with my own implementation of the function (that would send an actual SQL request with the proper WHERE close).


So far I've used the idea for different tree structures, and the results are the following:
  • The XPath requests are usually very simple, much simpler in fact than the code there used to be to request the tree structure

  • You must be careful when writing the request. The XPath engine will do what you ask it to do... including trying to load the entire tree structure if necessary... even if it's infinite.

  • Enhancements to the ElementTree API would help a lot making the algorithms even more efficient. Today, the XPath engine loads too much data then sorts it.

  • The power of your XPath requests depends a lot from the attributes you can use. Therefore you have to expose attributes of the encapsulated data, so that they can be used. For instance, if you create an encapsulation of web documents, you probably want to expose the mime type of the "document".

  • You can also use mixed types of encapsulation: an Element can return any type of children, as long as they are elements. So... if you encapsulate a document that happen to be XML, don't hesitate to expose the root node of the XML document as the only child of your document... that enables you to create XPath requests like this one: "/dir[@name="my directory"]"/dir/file[@ext="xml"]/Tag1//Tag2", that will request all Tag2 in the xml documents that have a root element named "Tag1".

  • Effbot, thanks for the worked done in ElementTree!

  • But may be I should try other XPath implementations too? Today ElementTree XPath engine is limited, and my requests are therefore limited as well.

Xpath for my trees

I recently had to request the content of tree-like structures (in python). After a few iterations, I realized that I needed a language for such a task. I certainly prefer to read a simple request than a complex algorithm. After all, we already have something similar: XPath to request XML structures.

First (stupid) approach: create a comprehensive XML tree

My first approach was to transform my data into an XML tree, then request it with the corresponding XPath engine. It definitely works if you can create the tree structure (because it is not too large, nor does it take too long to create the tree).

Let's take an example (a very simple one, but that's for demo purposes). We want to request the file system. Let's say we want to find a folder that contains 2 things: a file with a "tgz" extension, and a sub-folder containing a "foo.txt" file. The folder should be located directly under the root node. (I want to avoid requests that need to traverse the whole tree)

A file system is definitely a tree structure. To keep it simple, we will create an XML structure with the following properties: the directories will have the "dir" name, and there will be two attribute: the name of the directory, and longname (the full name, given where the first node is). The files will be represented as "file" nodes, with the name and extensions as attributes "name" and "ext".

BTW, we have the following tree file structure:


C:\Tmp
Baz (dir)
Bar (dir)
A (dir)
text.tgz (file)
B (dir)
foo.txt (file)
Here is the code (with ElementTree 1.3a3-20070912-preview, taken from svn)

import unittest
from elementtree.ElementPath import findall
import elementtree.ElementTree as ET
import os, sys

def createXmlTree (start):

elements = {} # a dictionary of nodes

# Create the very root node
root = ET.Element ("dir")
root.set ("name", start)
root.set ("longname", start)
elements [start] = root

for nodes in os.walk (start):
parent = elements [nodes[0]] # we find which node we should be attanched to
for folder in nodes[1]: #nodes[1] is the children directories
elt = ET.SubElement (parent, "dir")
#Name of the folder
elt.set ("name", folder)
#Long name, including parent long name
long_name = os.sep.join ((parent.get("longname"), folder))
elt.set ("longname", long_name)
elements [long_name] = elt
for file in nodes [2]: #nodes[1] is the children files
elt = ET.SubElement (parent, "file")
elt.set ("name", file)
elt.set ("ext", os.path.splitext(file)[1][1:])

return ET.ElementTree(root)

class TestXPathOnRealXmlTree (unittest.TestCase):
def testOnCTmp (self):
doc = createXmlTree (r"C:\tmp")
result= [e for e in findall (doc.getroot(), "./dir/dir/file[@ext='tgz']/../dir/file[@name='foo.txt']/../..")]
assert len(result) == 1
assert result[0].get("name") == "A"
assert result[0].tag == "dir"
OK, the XPath request is not very simple, but it works quite well.

It says: for each "dir" element, find a "dir" element in it, check that it contains a file with a "ext" attribute named "tgz". If found come back to the "dir" element (with ..), then find a "dir" element under it, then find a "file" element under it, with a "name" attribute which is equal to "foo.txt". If found, come back to the previous dir (..), then the previous one (..).

Ok, it works, I can see my "A" directory.

But we have to create the entire XML document first. If we want to do this type of request against the entire file system, this is definitely not a good idea. Therefore, we will try another approach, with dynamically created elements.

(BTW, you can try to do the same exact thing with lxml, pdis-xpaths, and with pyXml (no longer maintained, but still used)

Using ElementTree with dynamically created elements and XPath requests

Now we need to understand how the XPath engine traverse the XML tree, calling which functions and accessing which attributes.

For this, I investigated how the ElementTree XPath engine accessed the internal data from its Elements, and I re-implemented them as properties:


class DynamicElement (ET.Element):
def __init__ (self):
self._attrib = None
self._children_list = None
self.cachedTag = None

def _dyn_get_attrs (self):
raise NotImplementedError ("must have attributes")
def _dyn_get_children (self):
raise NotImplementedError ("must have children")
def _dyn_get_tag (self):
raise NotImplementedError ("must have a tag")

def _children (self):
if not self._children_list :
self._children_list = self._dyn_get_children ()
return self._children_list
_children = property (_children)

def attrib (self):
if not self._attrib :
self._attrib = self._dyn_get_attrs ()
return self._attrib
attrib = property (attrib)

def tag (self):
if not self.cachedTag :
self.cachedTag = self._dyn_get_tag()
return self.cachedTag
tag = property (tag)


class FileSystemElement (DynamicElement):
def __init__ (self, isDir, name, parent):
self.name = name
self.isDir = isDir
self.parent = parent
super (FileSystemElement,self).__init__ ()

def _dyn_get_tag (self):
return "dir" if self.isDir else "file"

# We need to have an attrib property that the XPath engine will call
def _dyn_get_attrs (self):
attrib = {}
attrib ["longname"] = os.path.join (self.parent.get("longname"), self.name) if self.parent else self.name
attrib ["name"] = self.name
if not self.isDir : attrib ["ext"] = os.path.splitext(self.name)[1][1:]
return attrib

def _dyn_get_children (self):
(dirs, files) = os.walk(self.get("longname")).next()[1:]
children_list = [FileSystemElement (isDir=True, name=dir, parent=self) for dir in dirs]
children_list.extend ([FileSystemElement (isDir=False, name=f, parent=self) for f in files])

return children_list

class TestXPathOnDynamicXmlTree (unittest.TestCase):
def testOnCTmp (self):
root = FileSystemElement(isDir = True, name="C:\\tmp", parent=None)
doc = ET.ElementTree(root)
result= [e for e in findall (doc.getroot(), "./dir/dir/file[@ext='tgz']/../dir/file[@name='foo.txt']/../..")]
assert len(result) == 1
assert result[0].get("name") == "A"
assert result[0].tag == "dir"

In fact, you just need to inherit the DynamicElement class, and to implement 3 functions, and you can have XPath requests against them.

Now a few examples (with root = FileSystemElement(isDir = True, name="put your root dir here", parent=None)):

  • Finding all tarballs: findall (doc.getroot(), ".//file[@ext='tgz'])

  • Finding all tarballs that are somewhere under all directories named "foo": findall (doc.getroot(), ".//dir[@name='foo']//file[@ext='tgz'])

  • Finding all directories: findall (doc.getroot(), ".//dir)
Remarks:
  • Don't use "..' in your XPath expression. The ElementTree XPath engine creates the entire tree structure if you do so... which is a problem if the tree is huge
  • If you use "//' in the XPath expression and "findall", the XPath engine will generate all elements while traversing the entire tree again.


------------
Update: I just realized there was already something similar to request unix file system with xfind

Google earth, what's next ?

Google Earth has changed the way my kids understand the little planet we live on. They can play with it, zoom in and see the grand father house, etc.

But hey, there is much more to come, don't you think ? Here are a few ideas for Google Earth and Google Map in the following years:

Choose your date

Why can't I see the map of any time in the past? I may want to see how the earth was a few months ago, just to have a look at the snow limit in the mountains last year, or just to see how beautiful this forest was before they started digging for petrol.

How, by the way, I would like to see the satellite view of my house now. I mean: less than a few hours ago.

And I also want to see the differences between 2 maps of the same place, but at different times. Yes, please highlight only was is relevant.

Community-driven google map and GPS

I already found myself lost with the help of a GPS. This nice and smart piece of technology just did not know that a track was not a road. Annoying when you don't drive a 4WD.

Let's say we have a google-map GPS (I'm pretty sure it exists already).

Why doesn't google store the ways people actually take? If a road is never used by anybody, there may be a reason, right? May be we could tell the GPS users that this road does not seem very popular. Or the GPS could issue a warning and suggest another way

In order to do so, you would need to record, meter by meter, which road took which GPS. Annoying, really, for our privacy concerns. That is not very different from recording all the pages where people surf, but well...

But with that, we could also know the traffic jam statistics, the actual average speed you can drive on each road, etc...

Flight simulator

Well, it was expected: there is a flight simulator in Google Earth. But why can't we have real-time weather? Why can't we see the planes of all other players? Why can't we see the real cars that have a google-GPS?

Conclusion

In terms of privacy issues, we have seen nothing so far. There is no reason to stop now; and recording where people go, when, and even why (ex: this guy just googled a restaurant five minutes ago...) is around the corner.

How much will we be ready to give up in terms of privacy to get better services?

Don't know. This is scary, but some really useful services could be offered.

samedi 3 janvier 2009

Recursive dictionary

The is a note to self. Found this while googling dictionaries (author = Paul McGuire), and this can definitely be very useful.


from collections import defaultdict

class recursivedefaultdict(defaultdict):
    def __init__(self):
        self.default_factory = type(self)

Here is this recursivedefaultdict in action:

data = [
('A','B','Z',1),
('A','C','Y',2),
('A','C','X',3),
('B','A','W',4),
('B','B','V',5),
('B','B','U',6),
('B','D','T',7),
]

table = recursivedefaultdict()

for k1,k2,k3,v in data:
    table[k1][k2][k3] = v

for kk in sorted(table.keys()):
    print "-",kk
    for jj in sorted(table[kk].keys()):
        print " -",jj
        for ii in sorted(table[kk][jj].keys()):
            print " -",ii,table[kk][jj][ii]

Prints:

- A
  - B
    - Z 1
  - C
    - X 3
    - Y 2
- B
  - A
    - W 4
  - B
    - U 6
    - V 5
  - D
    - T 7