\title{Transparent on-the-fly data compression {\tt COS595}} \author{Matthias Blume\\ Dept.\ of Computer Science, Princeton University,\\ Princeton, NJ 08544} \maketitle \tableofcontents
\section{Introduction}
The C library on UNIX provides functions for calling the operating system. Replacing those functions can change a program's behavior without any changes to the program text. Replacing the functions [[open]], [[creat]], [[close]], [[read]] and [[write]], along with a couple of other routines, changes the file system interface. Since most information on today's computer systems is stored with high redundancy, it seems useful to apply data compression algorithms to it. If compression and uncompression are done by the file system interface itself, then they become transparent to the programs using it. Although a truly general solution can only be provided in the file system itself, it is nevertheless possible to approximate one to a high degree within the C library. I will show a sample implementation that does this.
\section{System call substitutes}
Since I am going to replace the functions [[open]], [[read]], [[write]] etc.\ with my own versions, I will not be able to use them to do actual input and output. Therefore it is necessary to write new versions of the original C library functions. The only things to change are the {\em names} of those routines. The following program text gives the implementation for MIPS-based machines. By using the {\em m4} macro processor, which is available on most UNIX systems, I take advantage of the fact that the sequences of instructions needed for the functions follow a common pattern.
<>= # include # include .text .globl _cerror define(sc,` .globl _sys_$1 .ent _sys_$1 _sys_$1: li v0, SYS_$1 syscall beq a3, zero, 9f j _cerror 9:$2 j ra .end') sc(open) sc(creat) sc(close) sc(read) sc(write) sc(dup) sc(lseek) sc(pipe,` sw v0, 0 (a0) sw v1, 4 (a0)') @
The second argument of the [[sc]] macro is needed to implement the [[pipe]] system call, which has to store two file descriptors into the locations given by its argument. @
\section{Overall structure}
My new versions of the file system interface functions are implemented in the file [[compress.c]]. The structure of the file can be described as follows:
<>= <> <> <> <> <> <> <> <> <> <> <> @
The necessary system header files are:
<>= # include # include # include # include # include # include @
\section{Replacing system calls}
In this implementation I provide substitutes for the following functions: \begin{itemize} \item [[creat]] - creates a new file or truncates it to 0 bytes; opens for writing \item [[open]] - opens an existing file for reading or writing \item [[close]] - releases the association between a file descriptor and the corresponding file \item [[read]] - reads a number of bytes from an open file into a given buffer \item [[write]] - writes a number of bytes from a given buffer to an open file \item [[dup]] - duplicates a file descriptor \item [[lseek]] - provides random access to arbitrary positions within an open file (if supported) \item [[tell]] - this is not a true system call and could be simulated by [[lseek]]; it returns the current position in the file \item [[pipe]] - opens an inter-process communication channel known as a {\em pipe}; returns two file descriptors, one for reading from the pipe, one for writing into the pipe \end{itemize}
The real file system interface is driven by the external functions:
<>= extern int _sys_open (const char *, int); extern int _sys_creat (const char *, int); extern int _sys_close (int); extern int _sys_read (int, char *, int);
extern int _sys_write (int, const char *, int); extern int _sys_dup (int fd); extern long _sys_lseek (int fd, long offset, int whence); extern int _sys_pipe (int [2]); @
and the replaced system call functions are:
<>= <> <> <> <> <> <> <> <> <> @
Data compression and uncompression are done by the Lempel-Ziv algorithm. It is necessary to maintain several independent compression or un\-com\-pres\-sion ``engines'', because many files can be open at once. There is no fixed relationship between the number of characters read from or written to the real file system and the number of characters seen by the program. This means that both compression and uncompression must be able to stop ``in the middle of the operation''. All the relevant variables that constitute the ``state'' of an ``engine'' have to be saved in a data structure, which in turn has to be associated with the file descriptor. The existence of the [[dup]] system call introduces a further requirement on the implementation: the same ``engine'' can be associated with more than one file descriptor. Currently I restrict myself to at most [[MAXFILES]] open files. This could easily be changed by using [[getdtablesize]] to find out the maximum number of open files allowed by the operating system.
<>= # define MAXFILES 256 @
The states of the ``engines'' are stored in structures of type [[struct cfd]].
<>= typedef struct cfd *cfd; struct cfd { struct methods *methods; int nbits; int shared; <> }; @
One crucial idea for dealing with the complexity of the problem is to adopt some ``object-oriented'' techniques. I use the member [[methods]] to point to a collection of function pointers. Depending on whether a file is read or written, compressed or plain, I need different algorithms to access the file. The table of [[methods]] makes this possible without complicated {\em if-then-else} chains all over the place.
<>= struct methods { int (* read) (int, unsigned char *, int); int (* write) (int, unsigned char *, int); int (* close) (int); long (* seek) (int, long, int); }; @
There is a global file table [[filetab]], indexed by file descriptors, which contains pointers to structures of type [[struct cfd]]. The contents of this table at a given position depend on the mode of operation used with the file descriptor in question. Possible modes are: \begin{itemize} \item the file is a plain file (no compression or uncompression) \item the file is written using compression \item the file is read using uncompression \item the file is opened for reading, but it is still unclear whether it is compressed or not \item the file is opened for reading; by reading the first few characters of the file the ``engine'' found that the file is not compressed; the program has not yet asked for all of those pre-read characters \item the file is unknown to the interface module---this usually happens if the file descriptor is inherited from a parent process (e.g.\ the shell) \end{itemize}
Descriptors for plain files and unknown file descriptors are passed directly to the real system calls. Therefore, these two modes are represented the same way: by a [[NULL]] pointer in the corresponding position of the file table. Two different method tables are needed for reading and writing in compressed mode. A third table is necessary to deal with the remaining modes of operation. A compressed file is recognized by the first three characters in the file, which are known as the [[compress_prefix]].
<>= <> <> <> @
<>= static unsigned char compress_prefix [] = { 0x1f, 0x9d, 0x10 }; @
<>= static cfd filetab [MAXFILES] = { NULL, }; @
The three collections of methods are:
<>= <> <> <> @
For writing:
<>= static struct methods cw_m = { refuse_io, write_compressed, close_compressed_write, refuse_seek, }; @
For reading:
<>= static struct methods cr_m = { read_compressed, refuse_io, close_compressed_read, refuse_seek, }; @
For deciding whether to read a compressed or a plain file:
<>= static struct methods ir_m = { read_prefix, refuse_io, close_prefix_read, prefix_seek, }; @
To define the above tables, I need the following prototype definitions:
<>= static int read_compressed (int fd, unsigned char *buf, int n); static int read_prefix (int fd, unsigned char *buf, int n); static int write_compressed (int fd, unsigned char *buf, int n); static int refuse_io (int fd, unsigned char *buf, int n); static int close_compressed_read (int fd); static int close_prefix_read (int fd); static int close_compressed_write (int fd); static long refuse_seek (int fd, long offset, int whence); static long prefix_seek (int fd, long offset, int whence); @
In order not to confuse other programs started by a combination of [[fork]] and [[exec]], I always use the real system call along with the replaced one. Therefore, the indices into [[filetab]] are real file descriptors provided by the operating system. [[creat]] always sets the mode of operation to ``write with compression'' by using the methods [[cw_m]]. As a general rule, I set the member [[nbits]] to zero when opening a file. This signals that the data structures have not been fully initialized yet. The member [[shared]] counts the number of [[filetab]] entries that point to the same data structure. The routines for closing a file use this to determine whether the last connection to the file is being closed. Only in this case can I perform cleanup operations like freeing the data structures associated with the file.
<>= int creat (const char *path, int mode) { int fd; init (); fd = _sys_creat (path, mode); if (fd < 0 || fd >= MAXFILES || (filetab [fd] = malloc (sizeof (struct cfd))) == NULL) return fd; filetab [fd]->nbits = 0; filetab [fd]->methods = &cw_m; filetab [fd]->shared = 1; return fd; } @
Opening a file for writing with [[open]] assumes that the file already exists. In this case it is not useful to write with compression, because the compressed data would interfere with what is already in the file. As a consequence I leave the entry in [[filetab]] unchanged. It turns out that new versions of the C library don't use [[creat]], but call [[open]] with some additional flags and parameters as specified by [[POSIX]]. This means that I have to simulate the desired behavior by calling [[creat]] from within [[open]] if necessary. Opening a file for reading is the most complex case. At some point I need to read the first few characters of the file to decide whether the file is compressed or not. The most natural place to do this seems to be the [[open]] routine. Unfortunately, this would violate the semantics of [[open]]. (Imagine opening a terminal file for reading!) The decision has to be delayed until the first call to [[read]] is executed. [[open]] sets the mode of operation to ``unclear whether to use decompression'' by using the method table [[ir_m]].
<>= int open (const char *path, int how, int mode) { int fd; if (how == (O_WRONLY | O_CREAT | O_TRUNC)) return creat (path, mode); init (); fd = _sys_open (path, how); if (fd < 0 || fd >= MAXFILES || how != 0) return fd; if ((filetab [fd] = malloc (sizeof (struct cfd))) == NULL) return fd; filetab [fd]->nbits = 0; filetab [fd]->methods = &ir_m; filetab [fd]->shared = 1; return fd; } @
Most of the remaining substitutes for system call functions follow a common pattern: \begin{itemize} \item check whether the file descriptor points to a non-[[NULL]] entry in [[filetab]] \item if not, simply use the original system call \item otherwise call the appropriate function from the methods table \end{itemize}
<>= int close (int fd) { return (fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) ? _sys_close (fd) : (* filetab [fd]->methods->close) (fd); } @
<>= int read (int fd, char *buf, int n) { return (fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) ? _sys_read (fd, buf, n) : (* filetab [fd]->methods->read) (fd, (unsigned char *) buf, n); } @
<>= int write (int fd, const char *buf, int n) { return (fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) ? _sys_write (fd, buf, n) : (* filetab [fd]->methods->write) (fd, (unsigned char *) buf, n); } @
[[dup]] simply copies what is in [[filetab]] at a given place to another place. The [[cfd]] member [[shared]] must be incremented.
<>= int dup (int fd) { int res = _sys_dup (fd); if (res < 0 || fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) return res; assert (res < MAXFILES); assert (filetab [res] == NULL); filetab [res] = filetab [fd]; filetab [res]->shared++; return res; } @
[[lseek]], again, follows the general pattern. There is one minor variation: if the arguments indicate that no actual repositioning is asked for, [[tell]] gets called.
<>= long lseek (int fd, long offset, int whence) { if (offset == 0 && whence == 1) return tell (fd); return (fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) ?
_sys_lseek (fd, offset, whence) : (* filetab [fd]->methods->seek) (fd, offset, whence); } @
[[tell]] is not a real system call. I simulate it using [[lseek]] if necessary. For files in compressed mode I keep track of the file position myself. Note that [[filepos]] is not initialized until the first [[read]] from or [[write]] to the file has been executed.
<>= long filepos; @
<>= long tell (int fd) { return (fd < 0 || fd >= MAXFILES || filetab [fd] == NULL) ? _sys_lseek (fd, 0L, 1) : (filetab [fd]->nbits < MINBITS) ? 0 : filetab [fd]->filepos; } @
Since [[pipe]] creates two file descriptors, I have to deal with two entries in [[filetab]]. Writing to the pipe is performed in compressed mode. This seems to imply that reading has to use compressed mode as well, but that is not the case. The most common use of pipes is in the context of [[fork]] and [[exec]]. It is very likely that the pipe will be written by another program, so I have to check whether that program uses compression or not.
<>= int pipe (int fd [2]) { cfd p0, p1; init (); if (_sys_pipe (fd) < 0) return -1; if (fd [0] >= MAXFILES || fd [1] >= MAXFILES) return 0; p0 = malloc (sizeof (struct cfd)); if (p0 == NULL) return 0; p1 = malloc (sizeof (struct cfd)); if (p1 == NULL) { free (p0); return 0; } p0->nbits = p1->nbits = 0; p0->methods = &ir_m; p1->methods = &cw_m; p0->shared = p1->shared = 1; filetab [fd [0]] = p0; filetab [fd [1]] = p1; return 0; } @
\section{Initialization}
The library function [[atexit]] provides a way to register a function that will be called when the program exits (i.e.\ when it calls [[exit]]). I use this to register a function that closes all the open files found in [[filetab]]. As you might have noticed, [[init]] is called from all of the functions that create non-[[NULL]] entries in [[filetab]].
<>= static void cleanup (void) { int i; for (i = 0; i < MAXFILES; i++) if (filetab [i] != NULL) (* filetab [i]->methods->close) (i); } @
The registration of [[cleanup]] is done exactly once.
<>= static void init (void) { static int initialized = 0; if (initialized == 0) { atexit (cleanup); initialized = 1; } } @
\section{Compression}
Compression employs the adaptive Lempel-Ziv algorithm. Tables are constructed as data are written. Every sequence of characters ever seen by the algorithm (which uses a greedy heuristic to construct these sequences) is associated with a unique code. The [[cfd]] member [[nextcode]] always holds the next available code, to be associated with the next new sequence. Because the algorithm writes data that are not always aligned to byte boundaries, I have to use a buffer whose size in bits is a multiple of both the current code size [[nbits]] and eight. Since the maximum code size is sixteen, a buffer of at most 16 bytes is required.
<>= # define TABSIZE 8192 # define MAXBITS 16 # define MINBITS 9 # define FIRSTCODE 256 @
I use [[struct cfd]] for both compression and uncompression. Some of the members of [[struct cfd]] are only used for one of the two, not for both. In order to save some space, these members are placed into a union.
<>= unsigned long nextcode; unsigned char buf [MAXBITS]; int bitpos; union { struct { <> } c; struct { <> } d; } u; @
[[lastcode]] holds the code of the character sequence seen so far. [[codes]] is a hashtable that describes the mapping of strings to codes.
<>= struct centry **codes; unsigned long lastcode; @
The hashtable used by this algorithm has fixed size and uses chaining to deal with collisions.
The data structure for the chaining is described by:
<>= struct centry { unsigned short w; unsigned char c; unsigned short code; struct centry *next; }; @
Here, [[w]] is the code for the string without the last character, [[c]] is that last character, [[code]] is the code for the whole sequence and [[next]] holds the next entry of the chain.
<>= <> <> <> <> @
This is the hashtable management:
<>= <> <> @
<>= # define hash(x,y) (((x)<<8|(y))%TABSIZE) @
There is not very much to say about hashtable lookup. The only important thing to note is that I use a ``move-to-front'' heuristic to speed things up.
<>= static struct centry *lookup (cfd fd, unsigned char c) { unsigned lc = fd->u.c.lastcode; struct centry **start = &fd->u.c.codes [hash(lc, c)]; struct centry **cur = start; struct centry *tmp; while (*cur != NULL && ((*cur)->w != lc || (*cur)->c != c)) cur = & (*cur)->next; if (*cur == NULL) return NULL; else { tmp = *cur; *cur = tmp->next; tmp->next = *start; *start = tmp; return tmp; } } @
In order to write a number of bits it is necessary to use the [[buf]] member of the [[cfd]] structure, because I cannot write fractions of a byte. It is just a matter of shifting and masking bits correctly. I give the definition of [[invmask]] here as well, although it is used only later, for reading bits.
<>= # define mask(x,n) ((x) & (~(~0 << (n)))) # define invmask(x,n) ((x) & (~0 << (n))) @
<>= static int output (cfd fd, int ifd) { unsigned char *byte = fd->buf + fd->bitpos / 8; int bit = fd->bitpos % 8; unsigned code = fd->u.c.lastcode; *byte = mask (*byte, bit) | code << bit; byte++; code >>= 8 - bit; if (fd->nbits + bit > 16) { *byte++ = code; code >>= 8; } *byte = code; fd->bitpos += fd->nbits; if (fd->bitpos == 8 * fd->nbits) { if (_sys_write (ifd, (char *) fd->buf, fd->nbits) < 0) return -1; fd->bitpos = 0; } return fd->bitpos; } @
To be able to write arbitrary arrays of characters I need to suspend compression not after a certain number of characters {\em written}, but after any number of characters {\em seen}. This means that character sequences that are collapsed into one code may extend across multiple calls to [[write]]. [[write]] first checks whether this is the very first call to [[write]] for this file and, if so, initializes the data structures. Remember that [[creat]] and [[pipe]] set [[nbits]] to zero to indicate this situation. After all possible codes have been used, no further entries are added to the hashtable---[[write]] has to live with what is in the table.
<>= static int write_compressed (int ifd, unsigned char *buf, int n) { cfd fd = filetab [ifd]; int i, h; struct centry *tmp; unsigned char c; if (n == 0) return 0; <> while (i-- > 0) { c = *buf++; if ((tmp = lookup (fd, c)) == NULL) { <> <> fd->nextcode++; fd->u.c.lastcode = c; } else fd->u.c.lastcode = tmp->code; } fd->filepos += n; return n; } @
A value of zero in [[nbits]] indicates that the data structures have to be initialized.
<>= if (fd->nbits == 0) { if (_sys_write (ifd, (char *) compress_prefix, sizeof (compress_prefix)) < 0) return -1; fd->nbits = MINBITS; fd->nextcode = FIRSTCODE; fd->bitpos = 0; fd->u.c.codes = malloc (TABSIZE * sizeof (struct centry *)); if (fd->u.c.codes == NULL) { errno = ENOMEM; return -1; } for (i = 0; i < TABSIZE; i++) fd->u.c.codes [i] = NULL; fd->filepos = 0; fd->u.c.lastcode = *buf++; i = n - 1; } else i = n; @
<>= if (output (fd, ifd) < 0) return -1; @
As long as not all possible codes have been used, codes for new sequences have to be introduced.
<>= if (fd->nextcode < (1L << MAXBITS)) { if ((tmp = malloc (sizeof (struct centry))) == NULL) { errno = ENOMEM; return -1; } tmp->w = fd->u.c.lastcode; tmp->c = c; tmp->code = fd->nextcode; if (fd->nextcode == (1L << fd->nbits)) { if (fd->bitpos > 0) { if (_sys_write (ifd, (char *) fd->buf, fd->nbits) < 0) return -1; fd->bitpos = 0; } fd->nbits++; } h = hash (fd->u.c.lastcode, c); tmp->next = fd->u.c.codes [h]; fd->u.c.codes [h] = tmp; } @
An important thing to note is that I cannot simply close a file using the operating system call. It may be the case (in fact, it always is) that there is still some accumulated code in [[lastcode]] waiting to be written out. I also have to make sure that I only really close the file if the last reference to it is being abandoned. Unlike during a switch from [[nbits]] to [[nbits]]+1, where I always flush the {\em entire} buffer [[buf]] (up to [[nbits]] bytes), when closing the file I write only those parts of [[buf]] that really contain written bits. This provides the necessary information about the end of the file to the uncompression algorithm.
<>= static int close_compressed_write (int ifd) { cfd fd = filetab [ifd]; int res, i; struct centry *run, *next; if (--fd->shared == 0 && fd->nbits > 0) { for (i = 0; i < TABSIZE; i++) if ((run = fd->u.c.codes [i]) != NULL) do { next = run->next; free (run); run = next; } while (run != NULL); free (fd->u.c.codes); if (output (fd, ifd) < 0) return -1; if (fd->bitpos > 0 && _sys_write (ifd, (char *) fd->buf, (fd->bitpos + 7) / 8) < 0) return -1; } res = _sys_close (ifd); if (fd->shared == 0) free (fd); filetab [ifd] = NULL; return res; } @
\section{Uncompression}
The algorithm to uncompress compressed files is a little more complicated. First, I need a stack (described by the members [[stack]], [[stacktop]] and [[stacksize]]) to reverse the sequence of characters that I obtain from a code. The stack is realized as a rubber-band array, which is automatically expanded when necessary. Furthermore, [[buflen]] keeps track of the number of characters that have actually been read from the file system---fewer characters than [[nbits]] indicates the end of the file. [[oldcode]] holds the last code that has been read and [[finchar]] is the final character produced from the last code. These are necessary to deal correctly with the {\em ``AwAwA''} phenomenon, where a code can be read that is not in the table yet. The ``hashtable'' [[htab]] is a rubber-band array which contains, for each code, the associated prefix (i.e.\ the code for the string without the last character) together with that last character. Since entries are made in sequential order, it is not necessary to use a hash function. The entry for a code {\em k} is at position {\em k}-256, because the first codes 0-255 stand for themselves and don't need to be stored in the table.
Here are the missing members of [[struct cfd]]:
<>= long tabsize; struct { unsigned w; unsigned char c; } *htab; unsigned stacksize; unsigned stacktop; unsigned char *stack; int buflen; unsigned short oldcode; unsigned char finchar; @
Uncompression is split into the following tasks:
<>= <> <> <> <> <> @
The stack management maintains a rubber-band array. The array can only expand; therefore ``popping'' items from the stack can be ``in-lined''. [[push]] is more complicated and gets its own function:
<>= # define STACKGROWTH 64 @
<>= static int push (cfd fd, unsigned char c) { unsigned char *s; if (fd->u.d.stacktop >= fd->u.d.stacksize) { if ((s = realloc (fd->u.d.stack, fd->u.d.stacksize + STACKGROWTH)) == NULL) { errno = ENOMEM; return -1; } fd->u.d.stack = s; fd->u.d.stacksize += STACKGROWTH; } fd->u.d.stack [fd->u.d.stacktop++] = c; return 0; } @
Reading the bits into the buffer is a little trickier than writing. Consider a pipe: if the pipe contains fewer characters than required, then only those bytes are delivered; the system call blocks for empty pipes only. Therefore [[refill_buffer]] repeats the call to [[_sys_read]] until either the buffer is completely filled or [[_sys_read]] signals an end-of-file or error condition.
<>= static int refill_buffer (cfd fd, int ifd) { int n, r; n = 0; r = 0; while (n < fd->nbits && (r = _sys_read (ifd, (char *) (fd->buf + n), fd->nbits - n)) > 0) n += r; if (r < 0) return -1; fd->u.d.buflen = 8 * n; fd->bitpos = 0; return 0; } @
The [[input]] function is very much like [[output]], except that it has to report when the end of the file has been reached. I use -1 to signal the end of the file and -2 to signal an error. [[buflen]] describes what is really in the buffer (the number of bits).
<>= static int input (cfd fd, int ifd) { unsigned char *byte; int bit; unsigned code; if (fd->bitpos + fd->nbits > fd->u.d.buflen) { if (refill_buffer (fd, ifd)) return -2; if (fd->u.d.buflen == 0) return -1; } byte = fd->buf + fd->bitpos / 8; bit = fd->bitpos % 8; code = invmask (*byte, bit) >> bit; byte++; bit = 8 - bit; if (fd->nbits - bit >= 8) { code |= *byte++ << bit; bit += 8; } code |= mask (*byte, fd->nbits - bit) << bit; fd->bitpos += fd->nbits; return code; } @
There is still the problem that [[read]] has to check first whether the content of the file is compressed or not. If [[read]] detects that the file is not compressed, then it has to arrange for {\em all} file descriptors associated with the file to be read using the plain operating system call. This is done by the function [[mark_uncompressed]].
<>= static void mark_uncompressed (cfd fd) { int i; for (i = 0; i < MAXFILES; i++) if (filetab [i] == fd) filetab [i] = NULL; free (fd); } @
The first attempt to read a file is used to check whether the file is compressed. To do this, I try to read the first three characters in the file and compare them with [[compress_prefix]]. If I don't get three characters or if those characters do not coincide with [[compress_prefix]], then the file is considered not compressed. Otherwise I simply change the method table to [[cr_m]] and call [[read_compressed]]. In the case that the file is not compressed, the characters read are part of the data and have to be placed into the buffer that is the argument to [[read]]. After doing so, the file has to be marked as uncompressed. If [[read]] asked for more than three characters, then [[_sys_read]] will try to get those. Difficulties arise if [[read]] asked for fewer characters than were received by the first [[_sys_read]]. These characters have to be kept for further calls to [[read]].
(Resetting the file pointer to the beginning might be impossible, because the file can be a terminal device or a pipe.) I set [[nbits]] to 1 to indicate that the file has already been found to be uncompressed. In this case [[read_prefix]] fetches the characters from the buffer instead of reading them from the file system. When all those characters are ``eaten up'' I mark the file as uncompressed.
<>= static int read_prefix (int ifd, unsigned char *buf, int n) { cfd fd = filetab [ifd]; int l; if (n == 0) return 0; if (fd->nbits == 0) { if ((l = _sys_read (ifd, (char *) fd->buf, sizeof compress_prefix)) < 0) return -1; if (l == sizeof (compress_prefix) && memcmp (fd->buf, compress_prefix, sizeof compress_prefix) == 0) { fd->methods = &cr_m; return read_compressed (ifd, buf, n); } fd->nbits = 1; fd->filepos = 0; fd->bitpos = l; } if (n < fd->bitpos - fd->filepos) { memcpy (buf, fd->buf + fd->filepos, n); fd->filepos += n; return n; } memcpy (buf, fd->buf + fd->filepos, fd->bitpos - fd->filepos); if (n > fd->bitpos - fd->filepos) { l = _sys_read (ifd, (char *) (buf + fd->bitpos - fd->filepos), n - (fd->bitpos - fd->filepos)); if (l < 0) l = 0; n = l + fd->bitpos - fd->filepos; } mark_uncompressed (fd); return n; } @
This is the code for reading compressed files. Note that the function initializes all the relevant data structures if [[nbits]] equals zero. After that it reads codes, reconstructs the corresponding character strings using the stack and places those characters into the buffer. Usually some characters remain, which have to be kept until the next call to [[read]]. The table of codes is again constructed as the algorithm goes. The uncompression algorithm always lags one step behind, so it may happen that a code is not yet in the table. In this case, the sequence of characters can be reconstructed using [[finchar]] and [[oldcode]].
<>= static int read_compressed (int ifd, unsigned char *buf, int n) { cfd fd = filetab [ifd]; int i = n; unsigned incode, c; int cin; <> for (;;) { <> <> c = incode = cin; <> <> <> fd->u.d.oldcode = incode; } } @
The first time [[read]] gets called for a compressed file, the following code will be executed:
<>= if (fd->nbits == 0) { if (n == 0) return 0; fd->nbits = MINBITS; fd->nextcode = FIRSTCODE; fd->bitpos = 8 * MINBITS; fd->filepos = 0; fd->u.d.tabsize = TABSIZE; if ((fd->u.d.htab = malloc (TABSIZE * sizeof (*fd->u.d.htab))) == NULL) { errno = ENOMEM; return -1; } fd->u.d.stacksize = STACKGROWTH; fd->u.d.stacktop = 0; if ((fd->u.d.stack = malloc (STACKGROWTH)) == NULL) { errno = ENOMEM; return -1; } fd->u.d.buflen = 8 * MINBITS; cin = input (fd, ifd); if (cin < 0) return cin == -1 ? 0 : -1; *buf++ = cin; --i; fd->u.d.oldcode = cin; fd->u.d.finchar = cin; } @
First, [[read]] has to empty the stack:
<>= while (fd->u.d.stacktop > 0 && i > 0) { *buf++ = fd->u.d.stack [--fd->u.d.stacktop]; --i; } @
Then it can try to get another code, if necessary:
<>= if (i == 0 || (cin = input (fd, ifd)) == -1) { fd->filepos += n - i; return n - i; } if (cin < -1) return -1; @
It may happen that a code is not in the table yet. [[oldcode]] and [[finchar]] contain enough information to deduce what has to be in the table:
<>= if (c >= fd->nextcode) { if (c > fd->nextcode) { errno = EIO; return -1; } if (push (fd, fd->u.d.finchar)) return -1; c = fd->u.d.oldcode; } @
A code is analyzed from right to left. Therefore, I need to use the stack to reverse the order of the characters:
<>= while (c >= FIRSTCODE) { if (push (fd, fd->u.d.htab [c - FIRSTCODE].c)) return -1; c = fd->u.d.htab [c - FIRSTCODE].w; } fd->u.d.finchar = c; if (push (fd, c)) return -1; @
Unless all possible codes are already used, I have to insert a new code into the table.
<>= if (fd->nextcode < (1L << MAXBITS)) { if (fd->nextcode - FIRSTCODE >= fd->u.d.tabsize) { fd->u.d.tabsize += TABSIZE; if ((fd->u.d.htab = realloc (fd->u.d.htab, fd->u.d.tabsize * sizeof (*fd->u.d.htab))) == NULL) { errno = ENOMEM; return -1; } } fd->u.d.htab [fd->nextcode - FIRSTCODE].c = c; fd->u.d.htab [fd->nextcode - FIRSTCODE].w = fd->u.d.oldcode; if (fd->nextcode == (1L << fd->nbits) - 1 && fd->nbits < MAXBITS) { fd->nbits++; if (refill_buffer (fd, ifd)) return -1; } fd->nextcode++; } @
Closing a file that has been read is not as difficult as closing one that has been written, because no buffers have to be flushed. Nevertheless, it has to take care of freeing the data structures. I have two different routines for closing, one for [[cr_m]], the other for [[ir_m]].
<>= static int close_compressed_read (int ifd) { cfd fd = filetab [ifd]; if (--fd->shared == 0) { if (fd->nbits > 0) { free (fd->u.d.htab); free (fd->u.d.stack); } free (fd); } filetab [ifd] = NULL; return _sys_close (ifd); } @
<>= static int close_prefix_read (int ifd) { cfd fd = filetab [ifd]; if (--fd->shared == 0) free (fd); filetab [ifd] = NULL; return _sys_close (ifd); } @
\section{Miscellaneous IO-substitutes}
Reading from a file that has been opened for writing, or vice versa, is not allowed.
<>= static int refuse_io (int fd, unsigned char *buf, int n) { errno = EINVAL; return -1; } @
As far as [[lseek]] is concerned, a compressed file (regardless of whether it is read or written) behaves like a pipe, i.e.\ [[lseek]] returns -1.
<>= static long refuse_seek (int fd, long offset, int whence) { errno = ESPIPE; return -1; } @
An attempt to [[lseek]] a file that is opened for reading and has not (yet) proven to contain compressed data forces the file to be treated as a plain file.
<>= static long prefix_seek (int ifd, long offset, int whence) { cfd fd = filetab [ifd]; mark_uncompressed (fd); return _sys_lseek (ifd, offset, whence); } @
\section{Examples}
The remainder of the text gives a collection of sample code that provides some evidence that the implementation is correct. I start with a simple copy program. The program takes one or two command line arguments; the first is the input file, while the second names the output file and defaults to the standard output. The most interesting property of this program is that it can be used both as a substitute for [[compress]] and as a replacement for [[uncompress]]. Furthermore, it also works on plain files. Consider: \begin{itemize} \item {\tt t plain-input compressed-output} \item {\tt t compressed-input compressed-output} \item {\tt t plain-input >plain-output} \item {\tt t compressed-input >plain-output} \end{itemize}
<>= # include # include # include int main (int argc, char **argv) { FILE *in, *out; int c; if (argc != 3 && argc != 2) { fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]); exit (1); } if ((in = fopen (argv [1], "r")) == NULL) { fprintf (stderr, "Cannot open file %s for reading: %s\n", argv [1], strerror (errno)); exit (1); } if (argc == 2) out = stdout; else if ((out = fopen (argv [2], "w")) == NULL) { fprintf (stderr, "Cannot open file %s for writing: %s\n", argv [2], strerror (errno)); exit (1); } while ((c = getc (in)) != EOF) putc (c, out); fclose (in); fclose (out); return 0; } @
The next example does the same thing without using the standard library. Instead, it calls [[open]], [[read]] etc.\ directly.
<>=
# include <stdio.h>
# include <string.h>
# include <errno.h>

int main (int argc, char **argv)
{
  int ifd, ofd;
  int n;
  char buf [4096];

  if (argc != 3 && argc != 2) {
    fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]);
    exit (1);
  }
  if ((ifd = open (argv [1], 0)) < 0) {
    fprintf (stderr, "Cannot open file %s for reading: %s\n",
             argv [1], strerror (errno));
    exit (1);
  }
  if (argc == 2)
    ofd = 1;
  else if ((ofd = creat (argv [2], 0666)) < 0) {
    fprintf (stderr, "Cannot open file %s for writing: %s\n",
             argv [2], strerror (errno));
    exit (1);
  }
  while ((n = read (ifd, buf, 512)) > 0)
    write (ofd, buf, n);
  close (ifd);
  close (ofd);
  return 0;
}
@
The next example uses [[dup]] and performs interleaved writes through both file descriptors. Of course, this isn't necessary, but it shows that [[dup]] works as expected.

<>=
# include <stdio.h>
# include <string.h>
# include <errno.h>

int main (int argc, char **argv)
{
  int ifd, oofd, dofd;
  int n;
  char buf [512];

  if (argc != 3 && argc != 2) {
    fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]);
    exit (1);
  }
  if ((ifd = open (argv [1], 0)) < 0) {
    fprintf (stderr, "Cannot open file %s for reading: %s\n",
             argv [1], strerror (errno));
    exit (1);
  }
  if (argc == 2)
    oofd = 1;
  else if ((oofd = creat (argv [2], 0666)) < 0) {
    fprintf (stderr, "Cannot open file %s for writing: %s\n",
             argv [2], strerror (errno));
    exit (1);
  }
  dofd = dup (oofd);
  while ((n = read (ifd, buf, 512)) > 0)
    write (oofd, buf, n / 2), write (dofd, buf + n / 2, n - n / 2);
  close (ifd);
  close (oofd);
  close (dofd);
  return 0;
}
@
The following program is the most complex example. It shows the use of pipes in the framework of [[fork]] and [[exec]]. By executing [[fork]] twice, I create a sequential arrangement of three processes connected by two pipes. Pipe [[p1]] connects the {\em child} with the {\em parent}, while [[p2]] provides a channel from the {\em grandchild} to the {\em child}. The {\em child} starts another program by calling either [[execl]] or [[execvp]], depending on what command line arguments are given.
The idea is to pipe compressed data to a {\em filter} and read it back. Simple ``filters'' like [[cat]] or [[tee]] don't change the data, so in this case the {\em parent} should see what the {\em grandchild} wrote. It is worth trying to put {\tt uncompress~-C} into the place of the filter. (The behavior should not change, because [[read]] automatically detects that the data in [[p1]] are not in compressed format.) Note that all calls to [[fork]] and [[exec]] have to be executed {\em before any actual input or output takes place}.

<>=
# include <stdio.h>
# include <string.h>
# include <errno.h>
# include <sys/wait.h>

# define check(x) \
  ((x)<0?(fprintf(stderr,"%s(%d): (%s) < 0 (%s)\n", \
                  __FILE__, __LINE__, #x, strerror(errno)), \
          exit(1)):0)

int main (int argc, char **argv)
{
  char buf [512];
  int n;
  int p1 [2], p2 [2];
  int f;

  check (pipe (p1));
  check (f = fork ());
  if (f > 0) {
    check (close (p1 [1]));
    while (check (n = read (p1 [0], buf, 512)), n > 0)
      check (write (1, buf, n));
    putc ('\n', stderr);
    check (close (p1 [0]));
    check (wait (NULL));
    exit (0);
  } else {
    check (close (1));
    check (dup (p1 [1]));
    check (close (p1 [0]));
    check (close (p1 [1]));
    check (pipe (p2));
    check (f = fork ());
    if (f > 0) {
      check (close (p2 [1]));
      check (close (0));
      check (dup (p2 [0]));
      check (close (p2 [0]));
      if (argc == 1)
        check (execl ("/bin/cat", "cat", NULL));
      else
        check (execvp (argv [1], argv + 1));
    } else {
      check (close (p2 [0]));
      while (check (n = read (0, buf, 512)), n > 0)
        check (write (p2 [1], buf, n));
      check (close (p2 [1]));
      exit (0);
    }
  }
}
@
A small program shows the use of pipes through the [[popen]] interface:

<>=
# include <stdio.h>
# include <assert.h>

int main (int argc, char **argv)
{
  FILE *fp;
  int c;

  assert (argc == 2);
  fp = popen (argv [1], "r");
  assert (fp != NULL);
  while ((c = getc (fp)) != EOF)
    putchar (c);
  pclose (fp);
  return 0;
}
@
A final test case makes sure that [[read_prefix]] works correctly even when the number of characters asked for is less than the length of the [[compress_prefix]].
The program always reads only one character at a time. Try it with a plain file and remember how [[read]] is implemented.

<>=
# include <assert.h>

int main (int argc, char **argv)
{
  char c;
  int fd;

  assert (argc == 2);
  fd = open (argv [1], 0);
  assert (fd >= 0);
  while (read (fd, &c, 1) == 1)
    write (1, &c, 1);
  close (fd);
  return 0;
}
@
\section{Conclusions}

The examples in the previous section show that the new file system interface really does hide the details of data compression, providing this service in a fashion transparent to the programs that use it.

However, everything works only as long as a single program has complete control over an open file. In a multi-tasking environment like UNIX this is generally not true, which puts some severe constraints on the usage of the new interface. The implementation presented in this paper assumes that files opened for writing by [[open]] are not compressed. What if the data in such a file actually {\em are} compressed?

Recently, some efforts have been made to integrate data compression into file system implementations themselves. Only there does one have the opportunity to know about {\em every} access to a file, so that things can be synchronized properly.

Nevertheless, the given program text can be useful in many circumstances. I, for instance, tried to integrate it with [[VSCM]], my implementation of {\em Scheme}. Memory dumps written by [[VSCM]] are less than half as big as before, and everything still works fine. Admittedly, the fact that files containing symbolic expressions are written in compressed mode as well may be a little surprising and annoying. (This suggests a switch to turn off compression, but such a switch would expose details of the implementation, which is what I wanted to avoid in the first place.)

\section{Indexes}
\subsection{Code Chunks}
\nowebchunks
\subsection{Identifiers}
\nowebindex