ulif.openoffice.cachemanager – A Cache Manager

A manager for storing generated files.

class ulif.openoffice.cachemanager.Bucket(path)

A bucket where we store files with the same hash sums.

Warning

Bucket is not thread-safe!

Buckets store ‘source’ files and their representations. A representation is simply another file, optionally marked with a ‘suffix’. This is meant to be used like a certain office document (the ‘source’ file) for which different converted representations (for instance an HTML, or PDF version) might be stored.

For each source file there can be an arbitrary number of representations, as long as each representation provides a different ‘suffix’. The Bucket does not introspect the files and makes no assumptions about the file-type or format. So, you could store a PDF representation with an ‘xhtml’ suffix if you like.

The ‘suffix’ for a representation is a simple string and can be chosen by the user. Normally, you would choose something like ‘pdf’ for a PDF version of a certain source file.

Each bucket can hold several source files and knows which representations belong to which source file.

To make a distinction between different sources inside the same bucket, the bucket manages ‘markers’ which normally are simple stringified numbers, one for each source and the representations connected to it. You should, however, make no assumptions about the marker, except that it is a string.

Currently, you can store as many source files in a bucket as the maximum integer number can address.
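
The following is a minimal usage sketch of this model, relying only on the Bucket methods documented below; the directory and file names are invented for illustration:

>>> import os
>>> from ulif.openoffice.cachemanager import Bucket
>>> bucket = Bucket('home/samplebucket')
>>> open('letter.doc', 'w').write('A dummy source document.')
>>> open('letter.pdf', 'w').write('A dummy PDF representation.')
>>> open('letter.html', 'w').write('A dummy HTML representation.')
>>> src = os.path.abspath('letter.doc')
>>> bucket.storeResult(src, os.path.abspath('letter.pdf'), suffix='pdf')
>>> bucket.storeResult(src, os.path.abspath('letter.html'), suffix='html')
>>> pdf_path = bucket.getResultPath(src, suffix='pdf')
>>> stored_src, marker = bucket.getSourcePath(src)
>>> all_sources = list(bucket.getAllSourcePaths())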

create()

Create the default dirs for this bucket.

This method is called when instantiating a bucket.

You should therefore be aware that constructing a bucket will try to modify the file system.

getAllSourcePaths()

Get the paths of all source files stored in this bucket.

Returns a generator of paths.

getCurrentNum()

Get current source num.

getResultPath(path, suffix=None)

Get the cached result for path.

getResultPathFromMarker(marker, suffix=None)

Get the path of a result file stored with marker marker and suffix suffix.

If the path does not exist, None is returned.

getSourcePath(path)

Get a path to a source file that equals the file stored in path.

Returns a tuple (path, marker) or (None, None) if the source cannot be found.

setCurrentNum(num)

Set current source num.

storeResult(src_path, result_path, suffix=None)

Store file in result_path as representation of source in src_path.

Optionally store this result marked with a certain suffix string.

The result has to be a path to a single file.

If suffix is given, the representation will be stored marked with the suffix in order to be able to distinguish this representation from possible others.

If the source file given by src_path already exists in the bucket, the file in result_path will be stored as a representation of the already existing source file.

We determine whether an identical source file already exists in the bucket by comparing the file given in src_path byte-wise with the source files already stored in the bucket.

class ulif.openoffice.cachemanager.CacheManager(cache_dir, level=1)

A cache manager.

This cache manager caches processed files and their sources. It uses hashes and buckets to find paths of cached files quickly.

Overall it maps input files to output files. The cache manager is useful when the computation of an output file is expensive but its result is needed often.

A sample application is caching converted office files: as the conversion is expensive, we can store the results in the cache manager and retrieve them much more quickly whenever we need them. See cachemanager.txt for more information.

It also checks for hash collisions: if two input files give the same hash, they will be handled correctly.
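
In outline, typical usage looks like the following sketch; the cache directory and file names are invented, and the fully worked doctests in the Examples section below show the real outputs:

>>> import os
>>> from ulif.openoffice.cachemanager import CacheManager
>>> cm = CacheManager(cache_dir='home/democache')
>>> open('report.doc', 'w').write('Some document contents.')
>>> open('report.pdf', 'w').write('A fake PDF conversion result.')
>>> marker = cm.registerDoc(source_path=os.path.abspath('report.doc'),
...                         to_cache=os.path.abspath('report.pdf'),
...                         suffix='pdf')
>>> result_path = cm.getCachedFile(os.path.abspath('report.doc'),
...                                suffix='pdf')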

contains(path=None, marker=None, suffix=None)

Check whether the file in path, or the file marked by marker, with suffix suffix, is already cached.

This is a convenience method for easy checking of caching state for certain files. You can also get the information by using other API methods of CacheManager.

You must give exactly one of path or marker, not both.

The suffix parameter is optional.

Returns True or False.
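
Continuing the illustrative sketch above, both call forms would look like this; the True results are what the documented behaviour implies for a document that was just registered with suffix 'pdf':

>>> cm.contains(path=os.path.abspath('report.doc'), suffix='pdf')
True
>>> cm.contains(marker=marker, suffix='pdf')
True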

getAllSources(parent=None, level=0)

Return all source documents.

getBucketFromHash(hash_digest)

Get a bucket in which a source with ‘hash_digest’ would be stored.

Note

This call creates the appropriate bucket in the filesystem if it does not exist already!

getBucketFromPath(path)

Get a bucket in which the source given by path would be stored.

Note

This call creates the appropriate bucket in the filesystem if it does not exist already!

getCachedFile(path, suffix=None)

Check whether the file in path is already cached.

Returns the path of the cached file or None. Only ‘result’ files are looked up and returned, not sources.

This method does not modify the filesystem if an appropriate bucket does not yet exist.

getCachedFileFromMarker(marker, suffix=None)

Check whether a bucket exists for marker and suffix.

Returns the path to a file represented by that marker, or None.

A bucket exists if a document was already registered that returned that marker on registration.

The bucket might contain a representation of type suffix. If it does, the path to the file is returned; otherwise None.

getHash(path)

Get the hash of a file stored in path.

Currently we compute the MD5 digest.

Note for derived classes: the hash digest computed by this method should contain only characters that can easily be processed as path elements in URLs. For instance, slashes (which can occur in Base64-encoded strings) could make things difficult.
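
A sketch of a derived class that swaps in a SHA-256 hex digest instead; hexadecimal digests contain only characters that are safe as path elements and in URLs. This subclass is purely illustrative and not part of the package:

>>> import hashlib
>>> from ulif.openoffice.cachemanager import CacheManager
>>> class SHA256CacheManager(CacheManager):
...     def getHash(self, path=None):
...         # Read the file in chunks and return a path-safe hex digest.
...         hash_obj = hashlib.sha256()
...         with open(path, 'rb') as fd:
...             for chunk in iter(lambda: fd.read(8192), b''):
...                 hash_obj.update(chunk)
...         return hash_obj.hexdigest()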

prepareCacheDir()

Prepare the cache dir, create dirs, etc.

registerDoc(source_path, to_cache, suffix=None)

Store the file found in to_cache in a bucket, as a representation of the source file found in source_path.

If suffix is not None the representation will be stored under the suffix name. A suffix is only a name and the cache manager makes no assumptions about file types or similar.

Returns a marker string which can be used in conjunction with the appropriate cache manager methods to retrieve the file later on.
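
Continuing the illustrative sketch from above, the marker returned here can be fed back to getCachedFileFromMarker(); only the presence of a result is checked, since the exact path depends on the cache directory layout:

>>> marker = cm.registerDoc(source_path=os.path.abspath('report.doc'),
...                         to_cache=os.path.abspath('report.pdf'),
...                         suffix='pdf')
>>> cm.getCachedFileFromMarker(marker, suffix='pdf') is not None
True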

ulif.openoffice.cachemanager.internal_suffix(suffix=None)

The suffix used internally in buckets.

Examples

A cache manager tries to cache converted files, so that already converted documents do not have to be converted again.

Cache Manager

A cache manager expects a cache_dir parameter where it can store the cached files. If this parameter is set to None, no caching will be performed at all:

>>> from ulif.openoffice.cachemanager import CacheManager
>>> cm = CacheManager(cache_dir=None)

If we pass a path that already exists and is a file, the cache manager will complain but still be constructed.

If we pass a path that does not exist, it will be created:

>>> ls('home')
>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> ls('home')
d  mycachedir

The cache manager can register files, look for already created conversions and pass them back if found.

We look up a certain document which, as the cache is still empty, cannot be found. We create a dummy file for this purpose:

>>> import os
>>> open('dummysource.doc', 'w').write('Just a dummy file.')
>>> docsource = os.path.abspath('dummysource.doc')
>>> docsource_contents = open(docsource, 'r').read()
>>> cm.contains(docsource, suffix='pdf')
False

We can also pass the file contents as argument:

>>> #cm.contains(extension = 'pdf', data = docsource_contents)
False

The cache lookup is based on MD5 sums of source files. The source documents themselves are also stored, so that hash collisions can be handled (see the Collision Handling section below).

We can pass level to the constructor if we want a directory level different from 1:

>>> cm = CacheManager(cache_dir='home/mycachedir', level=2)
>>> cm.level
2

This will result in a different organization of the cached files and directories inside the caching directory. See the section below to learn more about this purely internal feature.

Setting the level after creation of a cache manager is not recommended.

Feeding the cache manager

We can register conversion results with the cache manager, which will then be available later on.

Caching files

To demonstrate this, we create a dummy source file and a dummy conversion result:

>>> import os
>>> open('dummysource.doc', 'w').write('Just a dummy file.')
>>> docsource = os.path.abspath('dummysource.doc')
>>> docsource_contents = open(docsource, 'r').read()
>>> open('dummyresult.pdf', 'w').write('I am not a real PDF.')
>>> pdfresult = os.path.abspath('dummyresult.pdf')

Now we can create a cache manager and register our stuff:

>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> cm.registerDoc(source_path=docsource,
...                to_cache=pdfresult)
'08867237840fabae77b838e9c9226eb2_1'

The string we get back here is a unique marker we can use to identify the registered file (see also the usage of markers below).

This will create the needed directories inside the cache dir and store the file to be cached in them:

>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/')
-  data
d  results
d  sources

The ‘data’ file contains some pickled management info.

While all sources with the same hash are stored in ‘sources’, the ‘results’ dir contains all results belonging to a certain source:

>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
-  source_1
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
-  result_1_default

Caching files with a ‘suffix’

We can, however, also store a file with a certain ‘suffix’ in order to cache several results for one source. For example we might want to cache a PDF and an HTML version of the same file.

To do so, we have to provide a suffix on doc registration:

>>> cm.registerDoc(source_path=docsource,
...                to_cache=pdfresult,
...                suffix='pdf')
'08867237840fabae77b838e9c9226eb2_1'

We get back the marker of the source file we used. It is the same as above. We now have several files stored in the bucket. First, the source file, which we store to be able to compare upcoming docs with it:

>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
-  source_1

Then, we store the result file:

>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
-  result_1__pdf
-  result_1_default

The cache manager notices that the delivered source is the same as the first time and so only stores the new result, with the suffix in its name.

This becomes more obvious when we register a certain result file as an HTML result:

>>> cm.registerDoc(source_path=docsource,
...                to_cache=pdfresult,
...                suffix='html')
'08867237840fabae77b838e9c9226eb2_1'
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
-  source_1
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
-  result_1__html
-  result_1__pdf
-  result_1_default

It is up to the caller to choose any suffix she likes.

Getting cache results

When we want to get the result for some input file, we can do so:

>>> cm.getCachedFile(docsource)
'/sample-buildout/home/mycachedir/.../results/result_1_default'
>>> cm.getCachedFile(docsource, suffix='pdf')
'/sample-buildout/home/mycachedir/.../results/result_1__pdf'
>>> cm.getCachedFile(docsource, suffix='html')
'/sample-buildout/home/mycachedir/.../results/result_1__html'

If a file was not cached yet, we will get None:

>>> cm.getCachedFile(docsource, suffix='blah') is None
True

Collision Handling

The cache manager relies very much on hash (MD5) digests to find a cached document quickly. However, hash collisions can occur.

We create a cache manager with a trivial hash algorithm to see this:

>>> from ulif.openoffice.cachemanager import CacheManager
>>> class NotHashingCacheManager(CacheManager):
...   def getHash(self, path=None):
...     return 'somefakedhash'
>>> cm_dir = 'home/newcachedir'
>>> cm = NotHashingCacheManager(cache_dir=cm_dir)

We create two sources to store:

>>> import os
>>> open('dummysource1.doc', 'w').write('Just a dummy file.')
>>> open('dummysource2.doc', 'w').write('Another dummy file.')
>>> docsource1 = os.path.abspath('dummysource1.doc')
>>> docsource2 = os.path.abspath('dummysource2.doc')

Now we create some dummy result files and register a pair of them with each source:

>>> open('dummyresult1.pdf', 'w').write('Fake result 1')
>>> open('dummyresult1.html', 'w').write('Fake result 2')
>>> open('dummyresult2.pdf', 'w').write('Fake result 3')
>>> open('dummyresult2.html', 'w').write('Fake result 4')
>>> result1 = os.path.abspath('dummyresult1.pdf')
>>> result2 = os.path.abspath('dummyresult1.html')
>>> result3 = os.path.abspath('dummyresult2.pdf')
>>> result4 = os.path.abspath('dummyresult2.html')
>>> m1 = cm.registerDoc(source_path=docsource1,
...                     to_cache=result1,
...                     suffix='pdf')
>>> m2 = cm.registerDoc(source_path=docsource1,
...                     to_cache=result2,
...                     suffix='html')
>>> m3 = cm.registerDoc(source_path=docsource2,
...                     to_cache=result3,
...                     suffix='pdf')
>>> m4 = cm.registerDoc(source_path=docsource2,
...                     to_cache=result4,
...                     suffix='html')

All these sources give the same hash and are therefore stored in the same bucket:

>>> ls(cm_dir, 'so', 'somefakedhash', 'sources')
-  source_1
-  source_2
>>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_1')
Just a dummy file.
>>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_2')
Another dummy file.

All results are connected to their respective source via a number in the filename:

>>> ls(cm_dir, 'so', 'somefakedhash', 'results')
-  result_1__html
-  result_1__pdf
-  result_2__html
-  result_2__pdf
>>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_1__pdf')
Fake result 1
>>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_2__pdf')
Fake result 3

Markers: Unique identifiers for cached files

We can use unique markers to distinguish between different files in a bucket. The markers are handed out by the cache manager. We already got such markers; they were returned when registering the files above:

>>> m1
'somefakedhash_1'
>>> m2
'somefakedhash_1'
>>> m3
'somefakedhash_2'

Note

You should not make any assumptions about the marker contents. It is only guaranteed to be a string.

Using these markers we can get cached files back directly:

>>> #cached_file_info = cm.getFileFromMarker(m1)
>>> #cached_file_info.filename
'result1'
>>> #cached_file_info.source_filename
'docsource1'
>>> #cached_file_info.path
'/.../so/somefakedhash/results/result_1_pdf'

If a marker is not valid, i.e. it is not linked with a file, we will get None:

>>> #cm.getFileFromMarker('blah') is None
True

Cache Maintenance

A cache manager can list all source files stored.

>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> [x for x in cm.getAllSources()]
['/.../mycachedir/08/08867237840fabae77b838e9c9226eb2/sources/source_1']