In this chapter, you will gain a deeper knowledge of PyTables internals. PyTables offers several places where you can improve the performance of your application. If you plan to deal with really large data, read this chapter carefully to learn how to get a significant boost for your code. If your dataset is small or medium-sized (say, up to 1 MB), you need not worry about this, as the default parameters in PyTables are already tuned to handle such sizes well.
Psyco (see ) is a kind of specialized compiler for Python that typically accelerates Python applications with no change in source code. You can think of Psyco as a kind of just-in-time (JIT) compiler, a little like Java's, that emits machine code on the fly instead of interpreting your Python program step by step. The result is that your unmodified Python programs run faster.
Psyco is very easy to install and use, so in most scenarios it is worth giving it a try. However, it only runs on Intel 386 architectures, so if you are using another architecture you are out of luck (at least until Psyco supports yours).
As an example, imagine that you have a small script that reads and selects data over a series of datasets, like this:
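from tables import openFile

def readFile(filename):
    "Select data from all the tables in filename"
    fileh = openFile(filename, mode = "r")
    result = []
    # Walk every Table node hanging from the root group
    for table in fileh("/", 'Table'):
        # Accumulate the matches of each table instead of overwriting them
        result.extend([ p['var3'] for p in table if p['var2'] <= 20 ])
    fileh.close()
    return result

if __name__=="__main__":
    print readFile("myfile.h5")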
In order to accelerate this piece of code, you can rewrite your main program to look like:
if __name__=="__main__":
    import psyco
    # Ask Psyco to compile readFile to machine code
    psyco.bind(readFile)
    print readFile("myfile.h5")
That's all! From now on, each time you execute your Python script, Psyco will apply its sophisticated algorithms to accelerate your calculations.
Graphs 5.1 and 5.2 show how much I/O speed improvement you can get by using Psyco. Looking at these figures, you can judge whether these improvements are of interest to you. In general, if you are not going to use compression, you will benefit from Psyco if your tables are medium-sized (1e+3 < nrows < 1e+6), and this advantage will progressively disappear as the number of rows grows well beyond one million. However, if you use compression, you will probably see improvements even beyond this limit (see section 5.2). As always, there is no substitute for experimentation with your own dataset.
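Because Psyco is not available on every architecture, a script may want to fall back to plain interpretation when the import fails. The following is a minimal sketch of that idea, written so that it runs with or without Psyco installed. The helper name bind_if_available is hypothetical (only psyco.bind() comes from Psyco), and select is just a stand-in for a function such as readFile.

```python
def bind_if_available(func):
    """Try to accelerate func with Psyco; return True if it was bound.

    Hypothetical helper: only psyco.bind() belongs to Psyco itself.
    """
    try:
        import psyco  # not installable on every platform or Python version
    except ImportError:
        return False  # keep running under the normal interpreter
    psyco.bind(func)
    return True


def select(rows):
    """Stand-in for readFile: keep var3 wherever var2 <= 20."""
    return [r["var3"] for r in rows if r["var2"] <= 20]


if __name__ == "__main__":
    accelerated = bind_if_available(select)
    print("Psyco active: %s" % accelerated)
    print(select([{"var2": 10, "var3": 1.5}, {"var2": 30, "var3": 2.5}]))
```

The function behaves identically in both cases; only its speed may differ, so the rest of the program does not need to know whether Psyco was found.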
One of the beauties of PyTables is that it supports compression on tables (but not on arrays; that may come later), although it is disabled by default. Compressing large amounts of data can be a somewhat controversial feature, because compression has a reputation for consuming a lot of CPU time. However, if you want to find out whether compression can help not only to reduce your dataset file size but also to improve your I/O efficiency, keep reading.
There is a common scenario where users need to save duplicated data in some record fields, while the other fields have varying values. In a relational database approach, such redundant data can normally be moved to other tables, and a relationship between the rows of the separate tables can be created. But that takes analysis and implementation time, and makes the underlying libraries more complex and slower.
PyTables transparent compression frees users from having to work out an optimal table layout: they can use fewer, not directly related, tables with a larger number of columns while still not cluttering the database too much with duplicated data (compression takes care of that). As a side effect, data selections become easier because more fields are available in a single table and can be referred to in the same loop. This normally leads to a simpler, yet powerful, way to process your data (although you should still be careful about which scenarios are suitable for compression).
The compression library used by default is Zlib (see ), and as HDF5 requires it, you can safely use it and expect that your HDF5 files can be read on any other platform that has the HDF5 libraries installed. Zlib provides a good compression ratio, with somewhat slow compression and reasonably fast decompression. Because of that, it is a good candidate for compressing your data.
However, in many situations (e.g. write once, read many times), it is critical to have very good decompression speed (at the expense of either less compression or more CPU time spent on compression, as we will see soon). This is why support for two additional compressors has been added to PyTables: LZO and UCL (see ). According to their author (and as checked by the author of this manual), LZO offers fairly fast compression (although a low compression ratio) and extremely fast decompression, while UCL achieves an excellent compression ratio (at the price of spending much more CPU time) while still allowing very fast decompression (very close to that of LZO). In fact, LZO and UCL are so fast when decompressing that, in general (depending on your data, of course), writing and reading a compressed table is actually faster (and sometimes much faster) than writing and reading an uncompressed one. This fact is very important, especially if you have to deal with very large amounts of data.
Be aware that LZO and UCL support in PyTables is not standard in HDF5, so files compressed with them cannot be read in contexts other than PyTables.
To give you a rough idea of what ratios can be achieved, and what resources are consumed, look at table 5.1. This table was obtained from synthetic data and with a somewhat outdated PyTables version (0.5), so take it just as a guide, because your mileage will probably vary. Also have a look at graphs 5.3 and 5.4 (these graphs were obtained with tables with different row sizes and a different PyTables version than the previous example, so do not try to compare the figures directly). They show how the speed of writing and reading rows evolves as the size (the number of rows) of the tables grows. Even though the size of a single row in these graphs is 56 bytes, you can most probably extrapolate these figures to other row sizes. If you are curious about how well compression can perform together with Psyco, look at graphs 5.5 and 5.6. As you can see, the results are pretty interesting.
| Compr. Lib | File size (MB) | Time writing (s) | Time reading (s) | Speed writing (Krow/s) | Speed reading (Krow/s) |
|---|---|---|---|---|---|
| NO COMPR | 244.0 | 24.4 | 16.0 | 18.0 | 27.8 |
| Zlib (lvl 1) | 8.5 | 17.0 | 3.11 | 26.5 | 144.4 |
| Zlib (lvl 6) | 7.1 | 20.1 | 3.10 | 22.4 | 144.9 |
| Zlib (lvl 9) | 7.2 | 42.5 | 3.10 | 10.6 | 145.1 |
| LZO (lvl 1) | 9.7 | 14.6 | 1.95 | 30.6 | 230.5 |
| UCL (lvl 1) | 6.9 | 38.3 | 2.58 | 11.7 | 185.4 |
Looking at the graphs, you can expect that, generally speaking, LZO will be the fastest at both compressing and decompressing, but the one that achieves the worst compression ratio (although that may be just fine for many situations). UCL is the slowest when compressing, but it is faster than Zlib when decompressing and, besides, it achieves very good compression ratios (generally better than Zlib). Zlib represents a balance between them: it is somewhat slow at compressing and the slowest at decompressing, but it normally achieves fairly good compression ratios.
So, if your ultimate goal is to read as fast as possible, choose LZO. If you want to reduce the size of your data as much as possible while retaining good read speed, choose UCL. If you do not care too much about the above parameters and/or portability is important to you, Zlib is your best bet.
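The recommendations above can be checked against the figures of table 5.1. The sketch below simply copies three columns of that table into a dictionary and picks the best entry for each criterion; the names TABLE_5_1 and best_by are illustrative and not part of PyTables.

```python
# Figures copied from table 5.1:
# (file size in MB, time writing in s, time reading in s)
TABLE_5_1 = {
    "NO COMPR": (244.0, 24.4, 16.0),
    "Zlib (lvl 1)": (8.5, 17.0, 3.11),
    "Zlib (lvl 6)": (7.1, 20.1, 3.10),
    "Zlib (lvl 9)": (7.2, 42.5, 3.10),
    "LZO (lvl 1)": (9.7, 14.6, 1.95),
    "UCL (lvl 1)": (6.9, 38.3, 2.58),
}

SIZE, WRITE, READ = 0, 1, 2  # column indexes in the tuples above


def best_by(column):
    """Return the row label with the smallest value in the given column."""
    return min(TABLE_5_1, key=lambda name: TABLE_5_1[name][column])


if __name__ == "__main__":
    print("Smallest file:  %s" % best_by(SIZE))
    print("Fastest write:  %s" % best_by(WRITE))
    print("Fastest read:   %s" % best_by(READ))
```

Keep in mind that these are the synthetic figures of one benchmark, so the ranking on your own data may differ.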
The compression level that I recommend for all compression libraries is 1. This is the lowest level of compression, but if you take the approach suggested above, the redundant data is normally found in the same row, so its locality is very high and such a low compression level should be enough to achieve a good compression ratio on your data tables, saving CPU cycles for other things. Nonetheless, in some situations you may want to check how the compression level affects your application.
You can select the compression library and level by setting the complib and compress keywords in the createTable method (see ??). A compression level of 0 disables compression completely (the default), 1 is the least CPU-demanding level, and 9 is the maximum level and the most CPU-intensive. Finally, keep in mind that LZO does not accept a compression level right now, so when using LZO, 0 means that compression is not active and any other value means that LZO is active.
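One way to check how the compression level affects highly redundant records is to try it outside PyTables first. The sketch below uses Python's standard zlib module directly (not the PyTables compression machinery) on synthetic fixed-size records. The record layout, the label value and the function names are assumptions made only for illustration.

```python
import struct
import zlib


def make_rows(nrows):
    """Build synthetic fixed-size records where one field repeats in every row."""
    # Assumed layout: int32 counter, 16-byte label (duplicated), float64 value
    record = struct.Struct("<i16sd")
    return b"".join(
        record.pack(i, b"detector-A", float(i % 7)) for i in range(nrows)
    )


def ratios(data, levels=(1, 6, 9)):
    """Map each zlib level to raw_size / compressed_size."""
    result = {}
    for level in levels:
        packed = zlib.compress(data, level)
        assert zlib.decompress(packed) == data  # the round trip must be lossless
        result[level] = len(data) / float(len(packed))
    return result


if __name__ == "__main__":
    data = make_rows(10000)
    for level, ratio in sorted(ratios(data).items()):
        print("zlib level %d: ratio %.1f" % (level, ratio))
```

If level 1 already gives a ratio close to that of level 9 on data shaped like yours, the extra CPU time of the higher levels is unlikely to pay off.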
The underlying HDF5 library used by PyTables takes the data in bunches of a certain length, called chunks, and writes them to disk as a whole; that is, the HDF5 library treats chunks as atomic objects and disk I/O is always performed in terms of complete chunks. This allows the application to define data filters that perform tasks such as compression, encryption, checksumming, etc. on entire chunks.
An in-memory B-tree is used to map chunk structures on disk. The more chunks that are allocated for a dataset, the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it is important to strike a balance between memory and I/O overhead (small B-trees) and data access time (big B-trees).
PyTables can determine an optimum chunk size to make the B-trees adequate for your dataset size if you help it by providing an estimate of the number of rows in a table. This must be done at table creation time by passing this value in the expectedrows keyword of the createTable method (see ??).
When your table size is bigger than 1 MB (take this figure only as a reference, not strictly), providing this guess of the number of rows will optimize access to your data. When the table size is larger than, say, 100 MB, you are strongly advised to provide such a guess; failing to do so may cause your application to perform very slow I/O operations and to demand huge amounts of memory. You have been warned!
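To get a feeling for how the chunk size drives the number of entries that the B-tree has to track, the following back-of-the-envelope sketch counts the chunks needed for a table. It assumes, for simplicity, that each chunk holds a whole number of rows. The chunk sizes used are hypothetical and are not the values PyTables or HDF5 actually choose.

```python
def chunk_count(nrows, rowsize, chunk_bytes):
    """Number of chunks needed when each chunk stores only whole rows."""
    rows_per_chunk = max(1, chunk_bytes // rowsize)
    return -(-nrows // rows_per_chunk)  # ceiling division


if __name__ == "__main__":
    nrows, rowsize = 1000000, 56  # one million rows of 56 bytes each
    for chunk_bytes in (1024, 65536):  # hypothetical chunk sizes
        print("%6d-byte chunks -> %d chunks"
              % (chunk_bytes, chunk_count(nrows, rowsize, chunk_bytes)))
```

With these numbers the small chunks require tens of thousands of B-tree entries while the large ones need fewer than a thousand, which is the trade-off described above.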
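As a rough aid, the thresholds above can be turned into a small check that estimates the table size from the number of rows and the row size. This is only a sketch: the function name is hypothetical, 1 MB is taken as 1024 * 1024 bytes, and the 1 MB and 100 MB marks are the loose reference figures from the text, not hard limits.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def expectedrows_advice(nrows, rowsize):
    """Classify an estimated table size against the 1 MB and 100 MB marks."""
    size = nrows * rowsize  # uncompressed size estimate, in bytes
    if size > 100 * MB:
        return "strongly recommended"
    if size > 1 * MB:
        return "recommended"
    return "optional"


if __name__ == "__main__":
    for nrows in (1000, 100000, 10000000):
        print("%8d rows of 56 bytes: expectedrows is %s"
              % (nrows, expectedrows_advice(nrows, 56)))
```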
If you have a huge tree in your data file with many nodes in it, creating the object tree can take a long time. Often, however, you are interested in accessing only a part of the complete tree, so you do not strictly need PyTables to build the entire object tree in memory, only the interesting part.
This is where the rootUEP parameter of the openFile() function (see ??) can be helpful. Imagine that you have a file called "test.h5" with the associated tree shown in figure 5.7, and you are interested only in the section marked in red. You can avoid building the whole object tree by telling openFile that your root will be the /Group2/Group3 group. That is:
fileh = openFile("test.h5", rootUEP="/Group2/Group3")
As a result, the actual object tree built will look like the one shown in figure 5.8.
Of course, this is a simple example where the rootUEP parameter is not really necessary. But when you have thousands of nodes in a tree, you will certainly appreciate the rootUEP parameter.
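The saving can be pictured with a toy model that uses plain dictionaries instead of a real HDF5 file. The tree below is a hypothetical layout (it does not reproduce figure 5.7), and the functions only mimic the idea of starting from a different root; they are not how PyTables implements rootUEP.

```python
# Hypothetical tree: groups are dictionaries, leaves are None
TREE = {
    "Group1": {"Table1": None, "Array1": None},
    "Group2": {
        "Table2": None,
        "Group3": {"Table3": None, "Array3": None},
    },
}


def subtree(tree, path):
    """Return the group found at an absolute path such as '/Group2/Group3'."""
    node = tree
    for name in path.strip("/").split("/"):
        if name:  # skip the empty name produced by the root path "/"
            node = node[name]
    return node


def count_nodes(node):
    """Count the nodes hanging below the given group (the group itself excluded)."""
    if node is None:  # a leaf has nothing below it
        return 0
    return sum(1 + count_nodes(child) for child in node.values())


if __name__ == "__main__":
    print("Nodes below /: %d" % count_nodes(subtree(TREE, "/")))
    print("Nodes below /Group2/Group3: %d"
          % count_nodes(subtree(TREE, "/Group2/Group3")))
```

In this toy layout only two of the eight nodes need to be visited when the root is moved to /Group2/Group3; on a tree with thousands of nodes the difference grows accordingly.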