
Chapter 1: Introduction

Wisdom is not worth having if it cannot be used to invent a new way of preparing chickpeas.
-- A wise Catalan
in "Cien años de soledad"
Gabriel García Márquez

The goal of PyTables is to let the end user easily manipulate scientific data tables and array objects in a hierarchical structure. The foundation of the underlying hierarchical data organization is the excellent HDF5 library (see ).

It is important to note that this package is not intended to serve as a complete wrapper for the entire HDF5 API, but to provide a flexible, very Pythonic tool for dealing with arbitrarily large amounts of data (typically bigger than available memory) in tables and arrays organized in hierarchical, persistent disk storage.

A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure and all values in each field have the same data type. Fixed-length fields and strict data types may seem a strange requirement for an interpreted language like Python, but they serve a useful function if the goal is to save very large quantities of data (such as those generated by many scientific applications) in an efficient manner that reduces demand on CPU time and I/O.
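To see why fixed-length fields help, here is a minimal sketch that uses only Python's standard struct module. It is not how PyTables is implemented, and the three-field layout (a 16-byte name, a 64-bit integer and a 16-bit counter) is a hypothetical one chosen for illustration. Because every record has the same size, record i starts at byte i * size, so any record can be read without parsing the ones before it.

```python
import struct

# Hypothetical record layout: a 16-byte name, a signed 64-bit integer
# and an unsigned 16-bit integer. "<" fixes little-endian order and
# disables padding, so every record occupies exactly the same size.
RECORD = struct.Struct("<16sqH")


def pack_records(rows):
    """Concatenate (name, idnumber, adc_count) tuples into one buffer."""
    return b"".join(
        RECORD.pack(name.encode("ascii"), idnumber, adc_count)
        for name, idnumber, adc_count in rows
    )


def read_record(buffer, index):
    """Read record `index` directly; no earlier record is parsed."""
    name, idnumber, adc_count = RECORD.unpack_from(buffer, index * RECORD.size)
    # struct pads short names with NUL bytes; strip them on the way out.
    return name.rstrip(b"\x00").decode("ascii"), idnumber, adc_count


rows = [("Particle %d" % i, i * 1000, i * 3) for i in range(5)]
buffer = pack_records(rows)
third = read_record(buffer, 3)
```

The price of this direct addressing is that a name longer than 16 bytes cannot be stored, which is the trade-off the fixed-length requirement makes.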

To emulate records (which are mapped to C structs in HDF5) in Python, PyTables implements a special metaclass object that makes it easy to define all their fields and other properties. PyTables also provides a powerful interface for mining data in tables. Records in tables are also known, in the HDF5 naming scheme, as compound data types.

For example, you can define arbitrary tables in Python simply by declaring a class with the field names and type information, as in:

class Particle(IsDescription):
    name      = StringCol(16)   # 16-character String
    idnumber  = Int64Col()      # Signed 64-bit integer
    ADCcount  = UInt16Col()     # Unsigned short integer
    TDCcount  = UInt8Col()      # unsigned byte
    grid_i    = Int32Col()      # integer
    grid_j    = IntCol()        # integer (equivalent to Int32Col)
    pressure  = Float32Col(shape=(2,3)) # 2-D float array (single-precision)
    energy    = FloatCol(shape=(2,3,4)) # 3-D float array (double-precision) 

Then you pass this class to the table constructor, fill its rows with your values, and save (arbitrarily large) collections of them in a file for persistent storage. Afterwards, this data can be retrieved and post-processed quite easily with PyTables or even with another HDF5 application (in C, Fortran, Java or any other language that provides an interface to HDF5).

The next section describes the most interesting capabilities of PyTables.

1.1 Main Features

PyTables takes advantage of the powerful object orientation and introspection capabilities offered by Python to bring the following features to the user:

  • Support of table entities: Allows working with a large number of records, i.e. more than fit in memory.
  • Appendable tables: Records can be added to already created tables. This can be done without copying the dataset or redefining its structure, even between different Python sessions.
  • Multidimensional table cells: You can declare a column to consist of general array cells, not only scalars as in the majority of relational databases.
  • Support of arrays: Numeric (see ) and numarray (see ) arrays are a very useful complement to tables for keeping homogeneous table slices (such as selections of table columns).
  • Supports a hierarchical data model: This lets you structure all your data very clearly. PyTables builds up an object tree in memory that replicates the underlying file data structure. Access to the file objects is achieved by walking through this object tree and manipulating it.
  • Support of files bigger than 2 GB: The underlying HDF5 library can already do that (if your platform supports the C long long integer, or, on Windows, __int64), and PyTables automatically inherits this capability.
  • Can read generic HDF5 files: PyTables can access objects in generic HDF5 files provided they contain any combination of groups, compound type datasets (which are mapped to Table objects) or homogeneous datasets (which are mapped to Array objects). As these kinds of data are the most common ones saved in HDF5 format, PyTables can probably access most of the HDF5 files out there.
  • Data compression: It supports data compression (through the Zlib, LZO and UCL libraries) out of the box. This becomes important when you have repetitive data patterns and don't want to lose time searching for an optimized way to save them (i.e. it saves you data organization analysis time).
  • High performance I/O: On modern systems, and for large amounts of data, tables and array objects can be read and written at a speed limited only by the performance of the underlying I/O subsystem. Moreover, if your data is compressible, even faster than that!
  • Architecture-independent: PyTables has been carefully coded (as has HDF5 itself) with little-endian/big-endian byte ordering issues in mind. So, in principle, you can write a file on a big-endian machine (like a Sparc or MIPS) and read it on a little-endian one (like Intel or Alpha) without problems.
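The remark about repetitive data patterns in the compression item can be checked with a small experiment. The sketch below calls the standard zlib module directly (Zlib being one of the libraries named above) and compares a repetitive buffer with a random one of the same size. The buffer contents and the compression_ratio helper are made up for illustration; PyTables itself applies compression through HDF5 rather than through this call.

```python
import os
import zlib


def compression_ratio(data):
    """Return the original size divided by the zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


# Illustrative buffers of equal size: one with a repeating pattern,
# one filled with random bytes.
repetitive = b"0123456789" * 10000
random_bytes = os.urandom(len(repetitive))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_bytes)
print("repetitive: %.1fx, random: %.2fx" % (ratio_repetitive, ratio_random))
```

The repetitive buffer shrinks dramatically while the random one does not shrink at all, which is why compression can reduce the amount of data that has to cross the I/O subsystem only when the data has structure to exploit.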

1.2 The Object Tree

The hierarchical model of the underlying HDF5 library allows PyTables to manage tables and arrays in a tree-like structure. In order to achieve this, an object tree entity is dynamically created imitating the HDF5 structure on disk. The HDF5 objects are then accessed by walking through this object tree, and by looking at their metadata nodes you can get a nice picture of what kind of data is kept there.

The different nodes in the object tree are instances of PyTables classes. There are several types of those classes, but the most important ones are Group and Leaf. Group instances (which we will call groups from now on) are a grouping structure containing zero or more groups or leaves, together with supplementary metadata. Leaf instances (which we will call leaves) are containers for actual data and cannot contain further groups or leaves. The Table and Array classes are descendants of Leaf, and inherit all its properties.

Working with groups and leaves is similar in many ways to working with directories and files, respectively, in a Unix filesystem. As with Unix directories and files, objects in the object tree are often described by giving their full (or absolute) path names. In PyTables this full path can be specified either as a string (as in '/subgroup2/table3') or as a complete object path written in a form known as the natural name schema (as in file.root.subgroup2.table3).

The support for natural naming is a key aspect of PyTables: it means that the names of the instance variables of a node object are the same as the names of that node's children (see footnote 1). This is very Pythonic and comfortable in many cases, as you can check in the tutorial section 3.1.6.
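Both addressing styles can be illustrated with a toy tree. The Node class below is a hypothetical stand-in, not a PyTables class: it keeps its children in a dictionary and exposes them as attributes through __getattr__, which is one simple way to obtain natural naming. The names subgroup2 and table3 are taken from the paths quoted above.

```python
class Node:
    """Hypothetical tree node whose children are reachable as attributes."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def add(self, name):
        """Create, register and return a child node called `name`."""
        child = Node(name)
        self.children[name] = child
        return child

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so `name` and
        # `children` themselves are never routed through here.
        children = self.__dict__.get("children", {})
        try:
            return children[name]
        except KeyError:
            raise AttributeError(name) from None

    def get_node(self, path):
        """Resolve an absolute path such as '/subgroup2/table3'."""
        node = self
        for part in path.strip("/").split("/"):
            if part:
                node = node.children[part]
        return node


root = Node("/")
table3 = root.add("subgroup2").add("table3")

by_path = root.get_node("/subgroup2/table3")   # string path
by_name = root.subgroup2.table3                # natural naming
```

Both lookups return the same object. The attribute form is convenient at an interactive prompt, while the string form is handy when the path is computed at run time.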

You should also note that not all the data present in the file is loaded into the object tree, but only the metadata (i.e. special data that describes the structure of the actual data). The actual data is not read until you ask for it (by calling a method on a particular node). By making use of the object tree (the metadata) you can get information on the objects on disk such as table names, titles, column names, the data types of columns, the number of rows, or, in the case of arrays, the shape, the typecode, and so on. You can also traverse the tree in order to search for something, and when you find the data you are interested in you can read it and process it. In some sense, you can think of PyTables as a tool that provides the same introspection capabilities as Python objects, but applied to the persistent storage of large amounts of data.
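The split between metadata kept in memory and data read on demand can be sketched as follows. Everything here is a hypothetical model: the LazyLeaf class, its loader argument and the loads counter exist only for this illustration. It shows the idea of deferring the expensive read until a method is called, not the actual mechanism used by PyTables.

```python
class LazyLeaf:
    """Hypothetical leaf: metadata lives in memory, data is read on demand."""

    def __init__(self, name, shape, dtype, loader):
        self.name = name
        self.shape = shape      # metadata, always available
        self.dtype = dtype      # metadata, always available
        self._loader = loader   # callable standing in for a disk read
        self.loads = 0          # how many times the data was fetched

    def read(self):
        """Fetch the actual data; only this call touches the 'disk'."""
        self.loads += 1
        return self._loader()


leaves = [
    LazyLeaf("array2", (4,), "int32", lambda: [1, 2, 3, 4]),
    LazyLeaf("big", (1000000,), "float64", lambda: [0.0] * 1000000),
]

# Inspecting metadata does not trigger any load.
small = [leaf for leaf in leaves if leaf.shape[0] < 10]
loads_before = sum(leaf.loads for leaf in leaves)

# Only the leaf we are interested in is actually read.
data = small[0].read()
loads_after = sum(leaf.loads for leaf in leaves)
```

Searching by shape or type costs nothing beyond the metadata already in memory, and the large leaf is never materialized unless someone asks for it.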

To better understand the dynamic nature of this object tree entity, let's start with a first example and try to work out what kind of object tree the following script (you can find it in examples/objecttree.py) would create:

from tables import *

class Particle(IsDescription):
    identity = StringCol(length=22, dflt=" ", pos = 0)  # character String
    idnumber = Int16Col(1, pos = 1)  # short integer
    speed    = Float32Col(1, pos = 2)  # single-precision

# Open a file in "w"rite mode
fileh = openFile("objecttree.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root

# Create the groups:
group1 = fileh.createGroup(root, "group1")
group2 = fileh.createGroup(root, "group2")

# Now, create an array in the root group
array1 = fileh.createArray(root, "array1", ["string", "array"], "String array")
# Create one table in group1 and another in group2
table1 = fileh.createTable(group1, "table1", Particle)
table2 = fileh.createTable("/group2", "table2", Particle)
# Create the last array, in group1
array2 = fileh.createArray("/group1", "array2", [1,2,3,4])

# Now, fill the tables:
for table in (table1, table2):
    # Get the record object associated with the table:
    row = table.row
    # Fill the table with 10 records
    for i in xrange(10):
        # First, assign the values to the Particle record
        row['identity']  = 'This is particle: %2d' % (i)
        row['idnumber'] = i
        row['speed']  = i * 2.
        # This injects the Record values
        row.append()

    # Flush the table buffers
    table.flush()

# Finally, close the file (this also will flush all the remaining buffers!)
fileh.close()
	

This small program creates a simple HDF5 file, called objecttree.h5, with the structure that appears in figure 1.1. During creation, metadata in the object tree is updated in memory while the actual data is saved to disk, and when you close the file the object tree becomes unavailable. When you open the file again, however, the object tree is reconstructed in memory from the metadata on disk, so that you can work with it in exactly the same way as during the original creation process.
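A hedged sketch of that round trip follows. It assumes a recent PyTables 3.x release, where the camelCase calls used in this chapter (openFile, createGroup, createTable, createArray) are spelled open_file, create_group, create_table and create_array. The temporary file location, the reduced two-column Particle description and the summarize helper are choices made for this example only.

```python
import os
import tempfile

import tables


class Particle(tables.IsDescription):
    idnumber = tables.Int16Col(pos=0)    # short integer
    speed = tables.Float32Col(pos=1)     # single-precision float


def summarize(filename):
    """Reopen the file read-only and collect the shape of every leaf."""
    summary = {}
    with tables.open_file(filename, mode="r") as fileh:
        # The object tree is rebuilt here from the metadata stored on disk.
        for leaf in fileh.walk_nodes("/", classname="Leaf"):
            summary[leaf._v_pathname] = tuple(int(n) for n in leaf.shape)
    return summary


filename = os.path.join(tempfile.mkdtemp(), "objecttree.h5")

# First session: create a group, a table and an array, then close the file.
with tables.open_file(filename, mode="w") as fileh:
    group1 = fileh.create_group(fileh.root, "group1")
    table1 = fileh.create_table(group1, "table1", Particle)
    row = table1.row
    for i in range(10):
        row["idnumber"] = i
        row["speed"] = i * 2.0
        row.append()
    table1.flush()
    fileh.create_array("/group1", "array2", [1, 2, 3, 4])

# Second session: the tree is available again without recreating anything.
summary = summarize(filename)
```

Note that summarize reads only shapes, which are metadata; no table rows or array elements are fetched during the second session.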

Figure 1.1: An HDF5 example with 2 subgroups, 2 tables and 1 array.

In figure 1.2 you can see an example of the object tree created by reading the above objecttree.h5 file (in fact, such an object tree is always created when reading any supported generic HDF5 file). If you are going to become a PyTables user, take your time to understand it (see footnote 2). Doing so will also help you avoid programming mistakes.

Figure 1.2: An object tree example in PyTables.

1) I got this simple but powerful idea from the excellent Objectify module by David Mertz (see )
2) Bear in mind, however, that this diagram is not a standard UML class diagram; it is rather meant to show the connections between the PyTables objects and some of their most important attributes and methods.
