Splash

(Small PERL-Like List And String Handling)

SPLASH is a C++ Class Library based on Perl data types.

Download the Unix version from here
Download the DOS/Windows version from here

Following is an article I wrote on SPLASH that was published in the October 1993 issue of The C Users Journal.


The SPLASH class library

I have used Perl for many years and often wished that some of the data types and functions provided in Perl were available when I was writing my c programs. When I started writing c++ code I realized that the list and string data types that I used in Perl scripts could be faithfully duplicated by a relatively simple class library. The reason for this is that c++ allows you to create user defined types. Additionally c++ templates allow you to create a list class that can store any data type.

Some benefits of basing this class library on the existing Perl language are that the behavior and functions of the data types are already defined, so I can concentrate on the implementation. Also the functions and data types are known to be useful and effective, as they have been in use for years.

I am only going to talk about the string and list data types in this article. However SPLASH implements associative arrays and array slices.

A brief description of Perl

For those unfamiliar with Perl it is a public domain scripting language written by Larry Wall. It can run on most platforms from Unix to MSDOS. Perl provides list, string, numeric and associative array data types, and simplified file I/O. It also allows easy access to some very powerful regular expression functions. (See sidebar 2: What is a regular expression?)

I call it a scripting language, because like Unix csh or MSDOS .BAT files you don't have to compile the script, just edit and run. Perl is very useful for processing files and extracting information from them, or translating files in some way or other. It is most often used by UNIX system administrators for writing scripts that run automatically at frequent intervals and do some relatively complex task of cleaning up log files or accounting system usage etc. I have used Perl for varying tasks from a global search and replace script to calculating my monthly compuserve bill from the dialing log generated by Procomm plus.

A brief description of SPLASH

The SPLASH (Simple Perl-like String And List Handling) class library is a c++ implementation of the Perl list and string data types. It includes the regular expression, substitute and translate operations on strings.

For a function by function description of SPLASH see the sidebar 1: SPLASH function reference.

SPLASH provides a dynamic list template class (SPList<T>) that allows you to add and extract data from the top of a list (push(), pop()), and from the bottom of the list (unshift(), shift()).

For example a FIFO (First In, First Out list) could be implemented by:

{
  SPList<int> mylist; // put 1 and 2 onto the list 
  mylist.push(1); 
  mylist.push(2); 
  y= mylist.shift(); // y gets 1 
  x= mylist.shift(); // x gets 2;
} 

The list can be sorted (uses the '<' operator on the elements) and reversed. If you make a list of your own data type, then you must define an operator<(), so that the sorting order will be correct.

For example to sort a list in reverse order:-

{
  SPList<float> mylist, sortedlist; // sorts the list, then reverses the
                                    // resulting list, and returns it
  sortedlist = mylist.sort().reverse();
} 

A list can be inserted anywhere within another list, or elements can be deleted from a list (splice()). Any individual element within a list can be randomly accessed using the '[]' operator.

The list is 'dynamic', which means it automatically shrinks and expands as necessary, without the user having to worry about memory management.

The string class (SPString) implements a dynamic string data type. The offset of a sub-string can be found (index()). A sub-string may be extracted or assigned to, effectively overwriting or deleting the sub-string (substr()). The string may be used anywhere a const char* can be used, so it can be passed as a parameter to the standard c str...() functions. The standard comparison operators (< > == etc) are defined, so intuitive operations can be applied to the strings. String concatenation is implemented by defining the '+' and '+=' operators.

Some of the most powerful functions are the regular expression matching operations which allow a regular expression to be applied to a string, returning a boolean on the success or failure of the match, or even returning a list of sub-strings that matched (m()). The substitute and replace function (s()) allows you to replace a successful regular expression match within the string with an arbitrary sub-string. The translate function (tr()) will replace all occurrences of a specific character, or range of characters. For example mystr.tr("a-z, "A-Z") will convert all lowercase characters in mystr with the equivalent uppercase character.

The string list class (SPStringList) is basically an SPList<SPString> class with some added functionality specific to lists of strings. It lets you grep ("get a regular expression") within the list, returning a list of strings that match. You can generate a list of strings from a single string by splitting it on a given regular expression (split()). The inverse operation (join()) will generate a single string by concatenating a list of strings separated by a given string.

All the classes have streams operators for input and output, to allow convenient file or stream processing.

The design process

The trap I always fall into when writing c++ programs, is trying to be too object oriented. I get carried away trying to make my programs fit exactly into the object oriented design paradigm. I find concepts such as data hiding, or encapsulation, easier to use than things like inheritance. The Regexp class is a good example of where encapsulation really worked in SPLASH. The Regexp class effectively wraps itself around an existing c regular expression package. The reason this worked so well is that I used two different regex libraries during the design process, one was the standard regex library provided with most unix systems, the other one is Henry Spencers's public domain regex library. I ended up using the latter, because it matches Perl more closely. The interface to the Regexp class was such that when I changed the underlying c library I didn't have to change any of the implementations that used the Regexp class.

While designing SPLASH it was not always intuitively obvious when to use public inheritance, private inheritance or layering. I found Scott Meyers book to be very helpful when deciding which method to use. In the first implementation of SPLASH I derived both SPList and SPString from a dynamic array class called MArray, that I had written a year previously for another project. MArray (listing 1) is a very simple array class that will grow when elements are added to it. The only interface to it is the overloaded operator[]() (for indexing), an add() function and a remove() function. Even though its interface and implementation are definitely not optimal for either class, it did get me going pretty quickly. Once the functionality of SPLASH was frozen and tested, I wrote two new classes to replace MArray. These two classes (SPListBase and VarString) were tailored specifically to the SPList and SPString classes, thereby improving efficiency, and proving that encapsulation really works. None of the test programs using either SPList or SPString needed changing, even though the underlying implementation of the two classes changed radically.

The implementation

It could be argued that neither VarString nor SPListBase needed to be written, and that the functionality of these classes could just as easily have been put directly into SPString and SPList. The reason I decided to use a separate class in each case was that the required interface to these classes was very well defined, and the mechanisms therein were completely self contained. In addition I could optimize the low level implementation of each without having to change the, more complicated, SPList and SPString classes. For instance I might decide to re-implement SPListBase as a linked list instead of a linear array. This may be better suited to some list applications. If I were to do this I could effectively replace SPListBase with a linked list class with no changes needed to SPList.

Description of SPListBase

SPListBase (listing 3) implements the mechanism for the dynamic list, it is a minimalist implementation of a linear array very similar to MArray, with one optimization. Because elements can be added to the front or back of the list, room is kept at both ends, so that data doesn't have to be shuffled up each time an element is added to the front of the list. This is accomplished using an offset (the private variable first) to indicate where the beginning of the array is. first initially points to the middle of the array.

For performance reasons the array never shrinks, although this would be trivial to implement if it was required. Also note the judicious use of const on all parameters which are never changed by the function. Other than reducing the chances of unexpected bugs, it also allows some compilers to do better optimization. I also make all functions which do not alter the state of the class (ie don't write to any variables in the class) const functions. This makes the intent of the functions clear to people reading the code. It also means that these functions can be called from within other const functions. The syntax for const functions can be seen in listing 3 on the count() member function.

SPListBase could be reused by any program that requires a dynamic array, this has some bearing on why I made some members private and some protected. I call any function that uses this class a client, whereas a class that inherits from this class I have called a derived class. grow() is a private member because I don't want it used outside of the class, it implies a knowledge of how the list is stored, and this kind of knowledge needs to be hidden from any client or derived class. A client should be able to use this class without any knowledge of how the data is actually stored. Hence the interface to SPListBase is very basic. There is a way to randomly access any element within the array, a way to find out how many elements there are in the array, and two ways to add an element to the array. There is only one way to remove an element from the array, and that is using the protected member compact(). I decided to make this a protected member because if I were to reuse this class as a generic dynamic array, the compact() function would not normally be used, however a derived class usually has more intimate knowledge about its base class than a client would. Protected member functions are only accessible within the class and by any derived class.

There are two versions of the operator[](). One is a const version and the other returns a reference to an element, which means that the element can be overwritten with new data. The const version implies that the data being indexed is constant data, and therefore must exist, so I generate an assert error if the index is out of range. The other version could be used for adding data to the array, so I silently expand the array if the index is out of range. I do this because it is the behavior of Perl lists.

Description of VarString

VarString (listing 2) is similar to SPListBase, in that it manages a variable length array, and is a minimal implementation. It has 5 constructors, which allow a VarString to be generated from different data types. Two of them allow a VarString to be generated from a char *, either a complete c string or a partial one, the latter is the mechanism used to implement substrings. The conversion operator (operator const char * ()) allows a VarString to be converted to a const char *, so that it can be passed to any of the standard str...() library functions. As you can see all this does is to return the address of the private array stored in VarString. This could be very dangerous if the client decided to overwrite the array, so I specifically return a const char *, which will stop this from happening. Again a client of VarString should make no assumptions about how the string is stored internally to VarString and I am trying to use the compiler to enforce this rule.

An example of how the internal data representation may change is that if I wanted to store binary strings in VarString, the '\0' could no longer have any specific meaning, unlike in a c string where it marks the end of the string. So VarString could store a counted string in this case, and whenever the VarString needed to be converted to a c string a '\0' could be appended to the array returned by the conversion operator.

I have also added a member function called remove() which by default will delete 1 character from the array, but if a second parameter is given it will delete that number of characters.

A useful TempString class

A useful, but simple, class is TempString (listing 7). This is a mechanism for allocating a string for temporary usage, which will automatically destroy itself when it passes out of scope. It replaces the often used code sequence:-

{
  char *astring, *tmp;
  // .... 
  tmp= malloc(strlen(astring));
  strcpy(tmp, astring); // use tmp for something like strtok()
  free(tmp);
} 

Because the compiler generates the destructor call when the TempString passes out of scope the storage is automatically released, so it reduces the chance of a memory leak in your programs. Of course you have to be careful to not pass the address of the TempString to something that will use it after it has been released, but even c++ can't save you from every logic mistake.

Why inherit from SPListBase, but layer VarString?

(Or When to use and not to use inheritance)

The question of when to inherit and when to layer is a thorny one. The books say that inheritance implies that the relationship between the derived class and the base class is one of "IS A". When I applied this test to SPList and SPListBase I decided that SPList "is NOT a" SPListBase, but instead is "implemented in terms of" SPListBase. "Implemented in terms of" can be achieved in two ways, one is layering, where the object you are using is declared as a variable in the class that is using it. Another way is to privately inherit from the class. Private inheritance means that only the derived class will have access to the protected and public members of the base class. In other words it is exactly as if the base class were a private member of the deriving class. So which one to use? According to Myers, layering is the preferred method, and he recommends you use that, unless you need access to a protected member in the layered class. SPListBase does indeed contain a protected member (compact()), so I used private inheritance in this case.

One reason that I don't consider SPList "IS A" SPListBase is that I only want users of SPList to see the Perl-like interface instead of the raw interface to SPListBase.

SPString also fails the "IS A" test with regards to VarString. SPString is "implemented in terms of" VarString. VarString is a raw interface to a dynamic string type, and SPString packages this up with a Perl-like interface. Accordingly SPString doesn't inherit from VarString, instead it defines a VarString as a private member (layering). VarString is a private data member because I want to hide the actual data storage method from users of SPString, in case I decide to change it in the future.

A good example of an "IS A" relationship is that of SPStringList (listing 6) and SPList. SPStringList publicly inherits from SPList (SPList<SPString> to be exact) because it "IS A" SPList, it just adds a some more functionality to SPList. A client of SPStringList has access to all the functions in SPList, in addition SPStringList offers a few functions that are specific to a list of SPStrings.

Description of SPList

SPList (listing 4) implements the Perl-like list data type. What the user of SPList sees is an interface similar to the various operators that can be applied to a Perl list.

SPList is privately derived from SPListBase. This means that all the member functions within SPList have access to the protected and public member functions of SPListBase, but the client of an SPList cannot use, or even see, any member functions within SPListBase.

SPList is a template class, this means that when it is used in a declaration the client must specify a type in angle brackets. The compiler then generates the code for the class, substituting the given type where appropriate. This saves me having to rewrite the class four times to support four different data types in the list.

For example:

// Generate a list of Integers
SPList<int> intlist; 
// Generate of list of pointers to char
SPList<char *> strlist; 

The implementation of most of SPList is fairly straightforward, so I'll just mention a few of the more interesting features.

The operator void *(){ return count(); }, is a trick that is used to allow the variable name to be used in an if statement as a boolean test. The return value in this case is count(), which of course will return true if the list is not empty, and false if the list is empty. so:-

{
  SPList<int> alist;
  if(alist) cout << "I'm not empty" << endl;
  else cout << "I'm empty" << endl; // this will get executed
  alist.push(1);
  if(alist) cout << "I'm not empty" << endl; // this will get executed
  else cout << "I'm empty" <<endl;
} 

Note, in listing 4, the two ways I have made functions in SPListBase available to users of SPList. the operator[] functions implicitly call the base class equivalent, whereas count uses another syntax which is given in the ARM page 244, section 11.3 however I have found that many c++ compilers will not accept this syntax.

Description of SPString

SPString (listing 5) implements the Perl string data type, it is more oriented towards processing data in files than manipulating strings, so probably is not a good general purpose string data type. Because it has the conversion operator const char *() an SPString can be used as a parameter to any of the standard c string routines. In addition the compare and equality operators have been defined 3 different ways, so a c string can be used in a comparison operation either on the right or the left of the operator. For example:

{
  SPString mystring, yourstring;
  // ....
  if(mystring != "Hello")
  ....
  if("abcd" < mystring)
  ....
  if(mystring >= yourstring)
  ...
} 

In order for the c string (a char *) to be used on the left of a binary operator it must be defined as a friend function so:-

friend int operator<(const char *s, const SPString& sp)
{
  return (strcmp(s, sp.pstr) < 0);
} 

This puts the scope of this function outside of the class it is defined within, as described in the ARM page 248 section 11.4.

As can be seen the actual implementation just calls a regular c string compare function. This also demonstrates operator overloading, where the operator<() is overloaded to take an SPString and a const char *.

Similarly the '+' operator has been overloaded to do a concatenation of two strings, it can take an SPString and an SPString, an SPString and a const char *, an SPString and a char, and finally a const char * and an SPString, the latter being implemented as a friend function.

Why define an operator<(const SPString& s) and operator<(const char *s) when a conversion operator from SPString to const char * is available? Why not just let the compiler generate the conversion, especially when it is going to take place anyway when it calls the strcmp() function?

The reason is the following compiler error message generated on the following lines of code:

{
  SPString s1, s2;
  // ...
 if(s1 == s2)
 // .... 
} 
error: ambiguous call: operator== ( class SPString, class SPString)
choice of operator ==()s:
  int operator ==(const class SPString, const char *);
  int operator ==(const char *, const SPString&); 

This very helpful error message explains the problem. Because we have a friend function to allow a char * to be on either side of the binary operator, the compiler is unable to decide which of the two SPStrings to convert to a const char *, so it gives this error message, and the way I fixed the problem was to disambiguate the call by writing a specific operator for each case.

Substrings

I wanted to hide the substring class from the user, so I made it a nested class within the SPString class, this way the user has no access to the substring class. Unfortunately not all c++ compilers support nesting of classes (where one class is defined within another class), and not all compilers handle the calling of member functions in the enclosing class from the nested class the same way. The Borland c++ 3.1 and cfront compilers do support nesting, so that is how I implemented it. An alternative method, which has a similar effect, is to define the class in the normal way, but put all the constructors in the private section and declare SPString to be a friend of subclass. This means that a user will not be able to instantiate this class, but SPString will be able to use it. In either method SPString must make substring a friend, so substring can access SPString's private variable pstr.

Even though the user has no access to a substring, the SPString member function substr() returns a substring, so how can this work? By providing a constructor in SPString that can create an SPString from a substring, and by providing an assignment operator that allows a substring to be assigned to an SPString.

The substring class basically just remembers the offset within the SPString, the length of the substring and the address of the first character of the substring. This is necessary in order to allow the substr() function to appear on the left hand side of an expression, so you can do the following:-

{
  SPString mystr= "0123456789";
  mystr.substr(3, 2)= "Ab"; // mystr becomes "012Ab56789"
} 

As the string will shrink or expand depending on whether the string being assigned is smaller or larger than the target substring a problem that needs to be overcome is if the substrings overlap, so I check to see if the SPString I am assigning to is the same as the source string, and if so I make a copy of the source string, then copy it in, this way I don't corrupt the source string as I am copying it.

SPString::insert() is a private member function because it is only used by substring. It basically inserts a given length string into the current string, expanding or contracting the string array as required.

Stream input for SPString

The input stream operator for the SPString class is shown in listing 8. This will allow a line of text (terminated by a newline) to be loaded into an SPString variable. The input can come from any stream type, be it an in-memory stream (strstream) or the standard input stream (cin).

For example:

{
  SPString instr;
  while(cin){ // read until EOF
    cin >> instr;
    //read in line
    // process line
  }
} 

The tricky part is leaving the stream in a good state so that the typical syntax (above) used for reading a file will work. This is tricky because a line terminated by end of file that is not terminated with a '\n' first is considered a complete line, but the state the stream is left in is critical for the while(cin) construction to work correctly. Specifically the line must be read, but not leave the stream in a fail state. The code in listing 8 will correctly return the last line if it is terminated by EOF with the stream still in a non-fail condition, but at EOF, so the next call will fail as required, and the while loop will terminate at the proper time.

Class interdependencies

When a class A references another class B as anything but a pointer or a reference, the class B must already have been defined, (ie the compiler must know what the size of the class is and what member functions it contains).

This can be a problem if two classes refer to each other (a cyclic dependency). This can happen if class A contains a data member of class B and class B contains a data member of class A. The only way I know of getting around this problem is to change one of the data members to a pointer. Then a forward declaration of the class is all that is needed.

I had many cyclic dependencies in SPLASH, and I was constantly shuffling declarations and definitions of classes around to solve them. In one case I had to change the data member to a pointer, with the accompanying syntax changes in all the implementations that used that data member.

The problem becomes even worse when templates are involved, as some compilers need to see the entire definition of a template member function before you can use it. Cyclic dependencies in this case can be very difficult to overcome, so you should avoid cyclic dependencies in template classes or use a compiler that is more intelligent in the way it handles templates. I am still trying to get SPLASH to compile with the GNU g++ and Zortech c++ compilers for exactly these reasons. Borland's c++ and cfront compilers seem more tolerant in these cases, but now I have to rewrite some of the code to break these cyclic dependencies, if I had known ahead of time it would have been easier to avoid them.

class templates

One of the things to note when using templates is that all the implementation of template member functions should go into the header file. This is so the compiler can expand a template when it is needed, as mentioned above the compiler must have seen the template definition before it can expand the function, much like an inline function.

Some compilers allow you to put the template definitions in a separate file and then it uses some technique to figure out what file to look at for the definition as it needs it.

Some compilers seems to have difficulties "finding" a particular template, even though it appears to be staring them in the face. A work around for this behavior is to specify the expansion in your program explicitly:

// Pull in the output operator for a list of ints
ostream SPList<int>::operator<<(ostream& os, int i); 

This usually clears up any problems. Template handling currently varies from compiler to compiler and appears to be the least portable c++ construct I have used so far.

Summary

The SPLASH distribution contains a number of examples of how to use the various features of this class library, including associative arrays and array slices which have not been covered here, and examples of how to use the stream operators for processing text files.

I haven't been able to cover all the interesting things I discovered about c++ while I was writing this class library. I hope I have pointed out some of the pit falls and some useful techniques. I encourage you to get a copy of the source code and see how the various functions have been implemented, it is not the best and most efficient code, but then it wasn't designed to be, it was designed to work. I find the functions very useful for a wide range of applications, and reusability of classes is finally becoming a reality for me.

Bibliography

  • Scott Meyers, Effective c++, Addison-Wesley
  • ARM, Ellis & Stroustrup, The Annotated C++ Reference Manual, Addison-Wesley

Sidebars

Sidebar 1. SPLASH function reference

class SPList<T>	- A list of any type specified by T
    T& operator[n] 	- returns a reference to the element at index
                          'n' (0 based). Generates an ASSERT error
                          if n < 0, can be used as an lvalue. Using
                          an index greater than the current size of the
                          list will cause the list to grow upto that
                          index. The values inbetween will be undefined
                          if <T> is a built-in type.
    operator void*()	- returns NULL if the list is empty,
			  can be used like: if(list) // not empty
    int isempty()	- returns TRUE if the list is empty, as an
			  alternative for the previous technique
    void reset()	- clears all elements in the list, but doesn't
			  actually free up the storage until it is
			  destroyed
    int count()		- returns the number of elements in the list
    int scalar()	- Perl-like Synonym (alias) for count()
    T pop()		- removes and returns the last element on the
			  list. If the list is empty the value returned
			  is usually undefined
    void push(T x)	- puts the single value 'x' at the end of the
			  list
    void push(SPList<T> l)
		 	- puts all the elements in the list 'l' at the
			  end of the list.
    T shift()		- removes and returns the first element
			  in the list
    int unshift(T x)	- puts the single value 'x' at the start
			  of the list
    int unshift(SPList<T> l)
			- puts all the elements in the list 'l'
			  at the start of the list
    SPList<T> reverse()
			- returns a list that is in the reverse
			  order
    SPList<T> splice(offset, len, SPList<T> l)
			- removes 'len' elements from 'offset' (0 based)
			  and inserts all the elements in the list 'l'
			  at the same position
				
    SPList<T> splice(offset, len)
			- removes 'len' elements from 'offset' (0 based)
    SPList<T> splice(offset)
			- removes all the elements from 'offset'
			  (0 based)
    SPList<T> sort()	- returns a list that has been sorted according
			  to the rules that T::operator<() returns
			  for the type T.
class SPStringList	- everything SPList does and ...
    int split(str [,pat] [,limit])
			- appends the results of splitting the string
			  'str' to the list. If 'pat' is specified then
			  any string that matches the RE 'pat' is
			  considered a separator to split on, the
			  default is white-space. If 'limit' is specified
			  then no more than that number of elements is
			  generated. If 'limit' is not specified, then empty
			  entries are stripped from the end of the list.
			  If the RE includes subexpressions then they
			  are inserted into the list as well.
			  If 'pat' is equal to the string "' '" then
			  a special case is done which matches awks
			  handling of whitespace. If 'pat' is an empty
			  string "", then all characters are split into
			  the list
    SPString join([pat])
			- Returns the string that is the result of
			  combining all the elements in the list, and
			  separating them by 'pat'. If 'pat' is omitted
			  then the elements are separated by a space
    int m(const char *exp, targ [,opts)
			- Appends to the list all the subexpression
			  matches that occured when applying the regular
			  expression 'exp' to the string 'targ'.
			  The number of matches is returned. The first
			  element generated is the entire matched string
			  opts: (a const char * with default "")
			      i - Forces case insensitive match
    
    SPStringList grep(const char *exp [,opts])
			- returns a list of all the elements that
			  matched the regular expression 'exp'.
			  opts: (a const char * with default "")
			      i - Forces the search to be case insensitive
class SPString		- A standard c null-terminated string may be
			  used anywhere that a SPString can be used
			  and vice-versa.
			- Individual characters may be read with
			  the '[]' operator.
    int length()	- returns the length of the string
    char chop()		- removes and returns the last character in the
			  string
    int index(SPString str [, offset])
			- returns the offset in the string that matches
			  the string 'str', starting at position
			  'offset' if specified, otherwise searches the
			  entire string.
			  Returns -1 if no match is found
    int rindex(SPString str [, offset])
			- returns the offset in the string that matches
			  the string 'str', starting at the end of the
			  string - 'offset' if specified, otherwise
			  searches the entire string.
			  Returns -1 if no match is found
    substring substr(offset [, len])
			- returns the substring within the string that
			  starts at 'offset' and is 'len' characters, if
			  'len' is omitted the rest of the string is
			  returned.
			  This may be used as an lvalue, in which case
			  the characters are removed, and the RHS of the
			  expression in inserted at the same postion.
    SPStringList split([,pat] [,limit])
			- same as SPStringList::split() but returns
			  a list of splits
    operator<		- These operators do what you would expect

    operator>
    operator<=
    operator>=
    operator==
    operator!=
    operator+		- returns the result of contenating two or more
			  strings
    operator+=		- replaces the LHS of the expression with the
			  concatenation of the LHS with the RHS
    int m(const char *exp [,opts])
			- returns 0 if the regular expression 'exp'
			  fails to match the string. Returns 1 if a
			  match was made
			  opts: (a const char * with default "")
			      i - Forces case insensitive match
    int m(const Regexp& exp)
			- Same as above but takes a precompiled
			  regular expression.
    int m(const char *exp, SPStringList& l [,opts])
			- Loads the list 'l' with all subexpression
			  matches of the regular expression 'exp' with
			  the string. Returns 0 if no matches were made.
			  Returns the number of matches if any
			  opts: (a const char * with default "")
			      i - Forces case insensitive match
    int m(const Regexp& exp, SPStringList& l)
			- Same as above but takes a precompiled
			  regular expression.
    int tr(search, repl [,opts])
			- replaces all occurrences of characters in 'search'
			  with the equivalent character in 'repl'. If 'repl'
			  is empty then just counts the characters.
			  opts: (a const char *) default is ""
			     c - complements the 'search' pattern. Replaces
				 all characters that are not in 'search', with
				 the last character specified in 'repl'.
			     d - deletes characters in 'search' that don't have an
				 equivalent 'repl'
			    cd - deletes characters not in 'search' 
			     s - compresses sequences of translated characters
				 in resulting string
    int s(exp, repl [,opts])
			- substitute the first substring matched by
			  'exp' with the string 'repl'. $& in 'repl'
			  will be replaced by the entire matching
			  string, $1 - $9 will be replaced by the
			  respective subexpression match. \$ or \\
			  will insert a $ or \ respectively.
			  opts: (a const char *) default is ""
			     g - causes all occurrences of 'exp' in
				 the string to be replaced by 'repl'
			     i - Forces case insensitive matching				
class Regexp		- Henry Spencers regular expression package
			  oo-ized
    Regexp(exp [,opts])	- Compiles a regular expression that can be
			  passed to one of the m() functions.
			  opts: (an int with default 0)
			      Regexp::nocase - Forces case insensitive
					       match
    int match(targ)	- returns 1 if the compield RE matches 'targ'
			  returns 0 if not
    
    int groups()	- returns 1 + the number of subexpression
			  matches found in the last match(). If the
			  previous match() succeeded then the whole
			  match is included in the count (hence +1)
    
    Range getgroup(n)	- returns the range of the 'n'th subgroup.
			  'n' = 0 is the range of the entire match
SPStringList m(exp, targ [,opts])
			- returns a list of all the subexpression
			  matches that occured when applying the
			  regular expression 'exp' to the string
			  'targ'. element 0 of the list is the first
			  subexpression, element 1 the next, etc.
			  opts: (a const char * with default "")
			      i - Forces case insensitive match
xin >> astring		- Text from the stream xin is loaded into
			  astring, the text is expected to be
			  terminated by '\n', which is removed from
			  the stream, but not put into astring.
			  asring is cleared first.
xin >> astringlist	- Each Text line, as defined above, is loaded
			  into an element of astringlist, which
			  is reset first.

 

Sidebar 2. What is a regular expression?

(Extracted from the Gnu Emacs Info page on regular expressions)

Regular expression matching allows you to test whether a string fits into a specific syntactic shape. You can also search a string for a substring that fits a pattern.

A regular expression describes a set of strings. The simplest case is one that describes a particular string; for example, the string `foo' when regarded as a regular expression matches `foo' and nothing else. Nontrivial regular expressions use certain special constructs so that they can match more than one string. For example, the regular expression `foo\|bar' matches either the string `foo' or the string `bar'; the regular expression `c[ad]*r' matches any of the strings `cr', `car', `cdr', `caar', `cadddar' and all other such strings with any number of `a''s and `d''s.


Listings

 

Listing 1

/*
 * marray.h
 * MArray means My Array or Minimal Array Class
 */
#ifndef	_MARRAY_H
#define _MARRAY_H
#include    <assert.h>
template<class T>
class MArray
{
private:
  enum{ALLOCINC=20};
  T *a;
  int cnt;
  int allocated;
  int allocinc;
  void grow(void);	// make array grow
protected:
  void remove(const int i); // This is protected
			    // because it is dangerous
public:
  MArray(int n= ALLOCINC) // default constructor   
  {
  a= new T[n];
  cnt= 0;
  allocated= n;
  allocinc= n;
  }
  MArray(const MArray<T>& n); // copy constructor
  // assignment operator
  MArray<T>& MArray<T>::operator=(const MArray<T>& n);
  virtual ~MArray(){ // destructor
    delete [] a;
  }
  // index operator
  T& operator[](const int i) const{
    // generate an assert error if indices don't make sense
    assert((i >= 0) && (i < cnt));
    return a[i];
  }
  int count(void) const{ return cnt; } // length of array
  void add(const T& n);    // add an element
  // add an element at a specific place
  void add(const int i, const T& n); 
};
// copy constructor
template <class T>
MArray<T>::MArray(const MArray<T>& n)
{
  allocated= n.allocated;
  allocinc= n.allocinc;
  cnt= n.cnt;
  a= new T[allocated];
  for(int i=0;i<cnt;i++) a[i]= n.a[i];
}
// assignment operator
template <class T>
MArray<T>& MArray<T>::operator=(const MArray<T>& n){
  // check we are not assigning to ourselves
  if(this == &n) return *this;
  delete [] a; // get rid of old one
  allocated= n.allocated;
  allocinc= n.allocinc;
  cnt= n.cnt;
  a= new T[allocated];
  for(int i=0;i<cnt;i++) a[i]= n.a[i];
  return *this;
};
// a private member function used by the
// implementation to grow the array
template <class T>
void MArray<T>::grow(){
  allocated += allocinc;
  T *tmp= new T[allocated];
  for(int i=0;i<cnt;i++) tmp[i]= a[i];
  delete [] a;
  a= tmp;
};
// add an element to the end of the array
template <class T>
void MArray<T>::add(const T& n){
  if(cnt >= allocated) grow();
  a[cnt++]= n;
};
// add an element to the array at the index ip
template <class T>
void MArray<T>::add(const int ip, const T& n){
  assert((ip >= 0) && (ip <= cnt));
  if(cnt >= allocated) grow();
  for(int i=cnt;i>ip;i--) // shuffle up
  a[i]= a[i-1];
  a[ip]= n;
  cnt++;
};
// remove the element at the index ip, and shuffle the
// rest of the elements down
template <class T>
void MArray<T>::remove(const int n){
int i;
  assert((n >= 0) && (n < cnt));
  // shuffle down starting at n
  for(i=n;i<cnt-1;i++){
    a[i]= a[i+1];
  }
  cnt--;
};
#endif

Listing 2

// This class takes care of the
// mechanism behind variable length
// strings
class VarString
{
private:
  enum{ALLOCINC=32};
  char *a;
  int len;
  int allocated;
  int allocinc;
  inline void grow(int n= 0);
public:
  line VarString(int n= ALLOCINC);
  inline VarString(const VarString& n);
  inline VarString(const char *);
  inline VarString(const char* s, int n);
  inline VarString(char);
  ~VarString(){delete [] a;}
  VarString& operator=(const VarString& n);
  VarString& operator=(const char *);
  inline const char operator[](const int i) const;
  inline char& operator[](const int i);
  operator const char *() const{ return a; }
  int length(void) const{ return len; }
  void add(char);
  void add(const char *);
  void add(int, const char *);
  void remove(int, int= 1);
  void erase(void){ len= 0; }
};

Listing 3

// This is a partial listing of
// the base class for SPList,
// it handles the underlying dynamic
// array mechanism
template<class T>
class SPListBase
{
private:
  enum{ALLOCINC=20}; // default array size
  T *a;	        // the actual data
  int cnt;	// the number of elements in the array
  int first;    // the start of the array
  int allocated;// amount allocated for the array
  int allocinc; // amount to re-allocate to array
  void grow(int amnt= 0, int newcnt= -1); // make array grow
protected:
  void compact(const int i); // remove an element from the array
public:
  SPListBase(int n= ALLOCINC) // default constructor
  {
    a= new T[n];
    cnt= 0;
    first= n>>1; // start off in the middle
    allocated= n;
    allocinc= n;
  }
  // copy constructor
  SPListBase(const SPListBase<T>& n);
  // assignment operator
  SPListBase<T>& SPListBase<T>::operator=(const SPListBase<T>& n);
  virtual ~SPListBase(){ // destuctor
    delete [] a;
  }
  // index operator
  INLINE T& operator[](const int i);
  const T& operator[](const int i) const{
    // generate an assert error if indices don't make sense
    assert((i >= 0) && (i < cnt));
    return a[first+i];
  }
  int count(void) const{ return cnt; }
  // add an element to the end of the array
  void add(const T& n){
    if(cnt+first >= allocated) grow();
    a[first+cnt]= n;
    cnt++;
  }
  
  // add an element into middle of array
  void add(const int i, const T& n);
  // reset array
  void erase(void){ cnt= 0; first= (allocated>>1);}
};
// Allow the array to grow if index is out of range
template <class T>
INLINE T& SPListBase<T>::operator[](const int i)
{
  assert((i >= 0) && (first >= 0) && ((first+cnt) <= allocated));
  int indx= first+i;
    
  if(indx >= allocated){  // need to grow it
    // index as yet unused element
    grow((indx-allocated)+allocinc, i+1); 
    indx= first+i; // first will have changed in grow()
  }
  assert(indx >= 0 && indx < allocated);
  if(i >= cnt) cnt= i+1;  // it grew
  return a[indx];
}

Listing 4

// Partial listing of SPList
template <class T>
class SPList: private SPListBase<T>
{
public:
 SPList(int sz= 10): SPListBase<T>(sz){}
// ...
    
 // make some SPListBase functions
 // available to clients
 T& operator[](const int i)
 {return SPListBase<T>::operator[](i);}
 const T& operator[](const int i) const
 {return SPListBase<T>::operator[](i);}
 SPListBase<T>::count;   // some compilers don't like this
//...
 // pop the next item off the top of the list
 T pop(void)
 {
  T tmp;
  int n= count()-1;
  if(n >= 0){
   tmp= (*this)[n];
   compact(n);
  }
  return tmp;
 }
 // push the item onto the top of the list
 void push(const T& a){ add(a);}
 // shift the first itemoff the start of the list
 T shift(void)
 {
  T tmp= (*this)[0];
  compact(0);
  return tmp;
 }
    
 // put the item onto the start of the list
 int unshift(const T& a)
 {
  add(0, a);
  return count();
 }
 // ....
};

Listing 5
// Partial listing of SPString
// header
class SPString
{
private:
  // variable length string mechanism
  VarString pstr;
    
public:
  // forward declare
  class substring;
    
  // ....
  SPString(const substring& sb) : pstr(sb.pt, sb.len){}
  SPString& operator=(const substring& sb);
  operator const char*() const{return pstr;}
  const char operator[](int n) const{ return pstr[n]; }
  int length(void) const{ return pstr.length(); }
  substring substr(int offset, int len= -1);
  substring substr(const Range& r){ return substr(r.start(), r.length());}
        
  SPStringList split(const char *pat= "[ \t\n]+", int limit= -1);
    
  int operator<(const SPString& s) const
  { return (strcmp(pstr, s) < 0); }
  // ditto for > <= >= == !=
  int operator>(const char *s) const
  { return (strcmp(pstr, s) > 0); }
  // ditto for < <= >= == !=
  friend int operator==(const char *s, const SPString& sp)
  { return (strcmp(s, sp.pstr) == 0); }
  // ditto for > <= >= < !=
  friend substring;
private:
  void insert(int pos, int len, const char *pt, int nlen);
  class substring
  {
  public:
    int pos, len;
    SPString& str;
    char *pt;
    substring(SPString& os, int p, int l) : str(os)
    {
      if(p > os.length()) p= os.length();
      if((p+l) > os.length()) l= os.length() - p;
      pos= p; len= l;
      if(p == os.length()) pt= 0; // append to end of string
      else pt= &os.pstr[p];
    }
    void operator=(const SPString& s)
    {
       if(&str == &s){ // potentially overlapping
         VarString tmp(s);
         str.insert(pos, len, tmp, strlen(tmp));
        }else str.insert(pos, len, s, s.length());
    }
        
    void operator=(const substring& s)
    {
      if(&str == &s.str){ // potentially overlapping
        VarString tmp(s.pt, s.len);
        str.insert(pos, len, tmp, strlen(tmp));
      }else str.insert(pos, len, s.pt, s.len);
    }
    void operator=(const char *s)
    {
      str.insert(pos, len, s, strlen(s));
    }
  };
};

Listing 6

// SPStringList
class SPStringList: public SPList<SPString>
{
public:
  SPStringList(int sz= 6):SPList<SPString>(sz){}
  // copy lists, need to duplicate all internal strings
  SPStringList(const SPStringList& n);
  // assignment operator
  SPStringList& operator=(const SPList<SPString>& n);
 
  int split(const char *str, const char *pat= "[ \t\n]+", int limit= -1);
  SPString join(const char *pat= " ");
  int m(const char *rege, const char *targ, const char *opts="");
  friend SPStringList m(const char *pat, const char *str, const char *opts="")
  {
    SPStringList l;
    l.m(pat, str, opts);
    l.shift(); // remove the first element which would be $&
    return l;
  }
  SPStringList grep(const char *rege, const char *opts="");
};

Listing 7

// just a mechanism for self deleting
// strings which can be hacked
class TempString
{
private:
  char *str;
public:
  // create a TempString from a char *
  TempString(const char *s)    
  {
    str= new char[strlen(s) + 1];
    strcpy(str, s);
  }
  // create a TempString from a substring of char *  
  TempString(const char *s, int len)    
  {
    str= new char[len + 1];
    if(len) strncpy(str, s, len);
    str[len]= '\0';
  }
  // remove storage when done
  ~TempString(){ delete [] str; }
  // convert a TempString into a char *
  operator char*() const { return str; }
};

Listing 8

// Input stream operator for SPString
// allows the reading of long lines
// and lines terminated by EOF and not '\n'
istream& operator>>(istream& ifs, SPString& s)
{
char c, buf[132];
  s= ""; // initialise with an empty string
  // read a buffer from the input stream
  ifs.get(buf, sizeof buf); 
  if(ifs){ // if previous operation was ok
    s += buf; 	// append buffer to string
    // if its a long line continue appending to string
    while(ifs.good() && (c=ifs.get()) != '\n'){
      ifs.putback(c);
      // append to line
      if(ifs.get(buf, sizeof buf)) s += buf;
    }
  }
  return ifs;    
}