Svenska städer
Strings in SGI STL
This is an attempt to answer some of the questions related to the use
of strings with SGI STL.
What's wrong with the string class defined by the draft standard?
There are several problems, but the most serious ones relate to the
specification for lifetimes of references to characters in a string.
The second committee draft disallows the expression
s[1] == s[2]
where s is a nonconstant
string. This is not simply an oversight; current reference counted
implementations may fail for more complicated examples. They may fail even
for s[1] == s[2] if the string s is
simultaneously examined (merely examined, not necessarily modified) by
another thread. It is hard to define precisely what constitutes a
correct use of one of the current reference counted implementation.
This problem was partially addressed at the July 1997 meeting of the
C++ standardization committee; the solution was to adopt more
complicated rules about reference lifetimes. Unfortunately, these new
rules still do not address the multi-threading issues.
A related problem was pointed out in the French national body comments
on the second committee draft. The following program produces the
wrong answer for most reference counted basic_string
implementations that we have tested; the problem is that, if two
strings share a common representation, they are vulnerable to
modification through a pre-existing reference or iterator.
# include <string>
# include <stdio.h>
main() {
string s("abc");
string t;
char & c(s[1]);
t = s; // Data typically shared between s and t.
c = 'z'; // How many strings does this modify?
if (t[1] == 'z') {
printf("wrong\n");
} else {
printf("right\n");
}
}
The draft standard (as well as common sense) says that updating a
reference to one of s's elements should only modify
s, not t as well; the fact that s and
t might share a representation is an implementation detail
that should have no effect on program behavior. Given the design of
basic_string, though, it is very difficult for a
reference-counted implementation to satisfy that requirement.
The only known way for a reference-counted implementation to avoid
this problem is to mark a string as unsharable whenever there might be
an existing reference or iterator to that string. That is, whenever a
program obtains a reference or an iterator to a string (e.g. by
using operator[] or begin()), that particular string
will no longer use reference counting; assignment and copy
construction will copy the string's elements instead of just copying a
pointer. (We are not aware of any implementation that uses this
technique and that also attempts to be thread-safe.)
This is a drastic solution: since almost all ways of referring to
characters involve references or iterators, this solution implies, in
effect, that the only strings that can be reference-counted are the
ones that are never used. In practice, then, a reference counted
implementation of basic_string can't achieve the performance
gains that one might otherwise expect, since reference counting is
forbidden in all but a few special cases.
A different solution is to abandon the goal of reference-counted
strings altogether, and to provide a non-reference-counted
implementation of basic_string instead. The draft standard
permits non-reference-counted implementations, and several vendors
already provide them. The performance characteristics
of a non-reference-counted basic_string are predicable,
and are very similar to those of a vector<char>:
copying a string, for example, is always an O(N)
operation.
In this implementation, basic_string
does not use reference counting.
I have been using a reference counted implementation, and it works fine.
Why haven't I seen problems?
The current implementations do work correctly, most of the time:
preserving a reference to a character in a string is uncommon.
(Although preserving iterators to strings may be more frequent, and
exactly the same issues apply to iterators.) Some less contrived
sequential programs also fail, though, or else behave differently on
different platforms.
Multi-threaded applications that use a reference counted
basic_string are likely to fail intermittently, perhaps once
every few months; these intermittent failures are difficult to
reproduce and debug. But it is likely that a large fraction of
multi-threaded clients will fail occasionally, thus making such a
library completely inappropriate for multi-threaded use.
So what should I use to represent strings?
There are several possible options, which are appropriate under different
circumstances:
- Ropes
-
Use the rope package provided by the
SGI STL. This provides all functionality that's likely to be needed.
Its interface is similar to the current draft standard, but different
enough to allow a correct and thread-safe implementation. It should
perform reasonably well for all applications that do not require very
frequent small updates to strings. It is the only alternative that
scales well to very long strings, i.e. that could easily be used to
represent a mail message or a text file as a single string.
The disadvantages are:
- Single character replacements are slow. Consequently STL algorithms
are likely to be slow when updating ropes. (Insertions near the beginning
take roughly the same amount of time as single character replacements, and
much less time than corresponding insertions for the other string
alternatives.)
-
The rope implementation stretches
current compiler technology. Portability and compilation time may be
an issue in the short term. Pthread performance on non-SGI platforms
will be an issue until someone provides machine-specific fast
reference counting code. (This is also likely to be an issue for
other reference counted implementations.)
- C strings
-
This is likely to be the most efficient way to represent a large
collection of very short strings. It is by far the most space
efficient alternative for small strings. For short strings, the C
library functions in <string.h> provide an efficient set of
tools for manipulating such strings. They allow easy communication
with the C library. The primary disadvantages are that
- Operations such as concatenation and substring are much more
expensive than for ropes if the
strings are long. A C string is not a good representation for a text
file in an editor.
- The user needs to be aware of sharing between string representations.
If strings are assigned by copying pointers, an update to one string may
affect another.
- C strings provide no help in storage management. This may be a
major issue, although a garbage collector can help alleviate it.
- vector<char>
-
If a string is treated primarily as an array of characters, with
frequent in-place updates, it is reasonable to represent it as vector<char> or vector<wchar_t>. The same is true
if it will be modified by STL container algorithms. Unlike C strings,
vectors handle internal storage management automatically, and
operations that modify the length of a string are generally more
convenient.
Disadvantages are:
- Vector assignments are much more expensive than C string pointer
assignments; the only way to share string representations is to pass
pointers or references to vectors.
- Most operations on entire strings (e.g. assignment, concatenation)
do not scale well to long strings.
- A number of standard string operations (e.g. concatenation
and substring) are not provided with the usual syntax, and must be
expressed using generic STL algorithms. This is usually not hard.
- Conversion to C strings is currently slow, even for short strings.
That may change in future implementations.
What about mstring.h, as supplied with SGI's 7.1 compiler?
This package was a minimal adaptation of the freely available Modena
strings package. It was intended as a stopgap. We do not intend to
develop it further.
It shares some of the reference lifetime problems of other
implementations that try to conform to the draft standard. Its exact
semantics were never well-defined. Under rare conditions, it will
have unexpected semantics for single-threaded applications. It fails
on the example given above. We strongly discourage use for
multi-threaded applications.
Copyright ©
1999 Silicon Graphics, Inc. All Rights Reserved.
TrademarkInformation