The Better String Library FAQ

by Paul Hsieh
Last updated: 12-30-2006



Q. What is Bstrlib?

A. Bstrlib is a string data type library for C and C++. It was written primarily with safety in mind, but also has performance and functionality advantages while remaining totally interoperable with ordinary char * usage.

Q. What makes Bstrlib so much safer than using the standard C/C++ library?

A. Bstrlib contains these mechanisms as part of its API:

  1. memory management

    The memory required to hold a string as it is modified is automatically allocated and managed in Bstrlib functions.

  2. buffer overflow detection

    The Bstrlib API has well defined interpretations for all legal values of its parameters including character indexes and lengths that fall outside the boundaries of a given bstring.

  3. error propagation

    Well defined error conditions are detected rather than leading to undefined behavior. So scenarios with multiple failure conditions don't need to be littered with large amounts of error detection -- it is typically sufficient to put a single check at the end of a long series of Bstrlib operations. (The C++ API uses exception handling.)

  4. write protection

    bstrings declared statically or constructed from unqualified sources are write protected. bstrings can also be made write protected dynamically.

  5. aliasing support

    Unlike most C libraries, aliased parameters are detected and supported and are given the most natural interpretation.

  6. reduction of undefined scenarios

    C's standard library (as well as many other libraries implemented in C) is littered with a minefield of undefined behaviors that result from a myriad of semantic conditions, even when the parameters passed to a function are, in and of themselves, legal. Bstrlib, as a matter of policy, does not allow this to happen. If your parameters are legal with respect to their type (independent of their meaning as that parameter), then using the Bstrlib API function will not lead to any undefined behavior. (What a concept!)
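
The error propagation idea in item 3 can be sketched in plain C. This is only an illustration, not Bstrlib's implementation: struct sk_str and the sk_ functions are hypothetical names invented here. The point it shows is that when constructors return NULL on failure and every operation treats NULL as a legal input that yields an error code, one check after a series of calls is enough.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SK_OK   0
#define SK_ERR (-1)

/* Hypothetical length-tracked string (illustration only). */
struct sk_str {
    int mlen;    /* bytes allocated for data */
    int slen;    /* bytes in use, not counting the trailing '\0' */
    char *data;
};

/* Create from a C string; returns NULL on any failure, including s == NULL. */
struct sk_str *sk_from_cstr(const char *s) {
    struct sk_str *b;
    size_t n;
    if (s == NULL) return NULL;
    n = strlen(s);
    if (n >= (size_t)INT_MAX) return NULL;
    b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = malloc(n + 1);
    if (b->data == NULL) { free(b); return NULL; }
    memcpy(b->data, s, n + 1);
    b->slen = (int)n;
    b->mlen = (int)n + 1;
    return b;
}

/* Append a C string, growing the buffer as needed. NULL arguments are
   legal inputs that simply produce SK_ERR. Assumption: s does not point
   into b->data (this sketch does not handle aliasing). */
int sk_append(struct sk_str *b, const char *s) {
    size_t n;
    if (b == NULL || s == NULL || b->data == NULL) return SK_ERR;
    n = strlen(s);
    if (n >= (size_t)(INT_MAX - b->slen)) return SK_ERR;
    if (b->slen + (int)n + 1 > b->mlen) {
        int want = b->slen + (int)n + 1;
        char *p = realloc(b->data, (size_t)want);
        if (p == NULL) return SK_ERR;   /* b is left unchanged */
        b->data = p;
        b->mlen = want;
    }
    memcpy(b->data + b->slen, s, n + 1);
    b->slen += (int)n;
    return SK_OK;
}

/* Free the string; destroying NULL is an error, not a crash. */
int sk_destroy(struct sk_str *b) {
    if (b == NULL) return SK_ERR;
    free(b->data);
    free(b);
    return SK_OK;
}

/* Build "Hello, <name>!" with a single error check at the end. */
struct sk_str *sk_greet(const char *name) {
    struct sk_str *b = sk_from_cstr("Hello, ");
    int rc = SK_OK;
    rc |= sk_append(b, name);   /* SK_ERR if b or name is NULL */
    rc |= sk_append(b, "!");
    if (rc != SK_OK) { sk_destroy(b); return NULL; }
    return b;
}
```

Calling sk_greet(NULL) returns NULL without touching invalid memory, because each step reports the failure instead of acting on it.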

Q. Doesn't all the safety of Bstrlib lead to increased overhead versus standard char buffer usage?

A. Bstrlib uses a (length, data) internal representation rather than '\0' termination. Because of this, functions which require length determinations have dramatically improved performance versus the corresponding standard C library functions (see the benchmarks in the feature comparisons page.) Where performance is concerned, this means that string manipulations of strings that are not very small will favor Bstrlib.
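
To make the (length, data) point concrete, here is a minimal sketch in plain C. The struct sk_view type and its functions are hypothetical names used for illustration only and are not Bstrlib's own declarations. It shows why length queries and mismatched-length comparisons need no scan for a terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical (length, data) view of a buffer (illustration only). */
struct sk_view {
    size_t len;
    const char *data;
};

/* Length is a stored field, so reading it costs O(1); strlen must scan. */
size_t sk_len(struct sk_view v) {
    return v.len;
}

/* Equality can reject on a length mismatch before touching any bytes. */
int sk_equal(struct sk_view a, struct sk_view b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

A side effect of this representation is that a view such as { 11, "hello\0world" } keeps all 11 bytes, whereas strlen on the same data stops at 5.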

In addition, the Bstrlib API is substantially more functional than the C library. This means that function call overhead is better amortized (by virtue of not needing to call as many such functions) than the C library.

A minimal usage of Bstrlib, measured on a variety of compilers, shows an additional object code size of between 18K and 28K.

All this being said, extremely trivial manipulations on very small strings may execute marginally faster when using the C library directly.

Q. By always allocating memory for strings, does this open Bstrlib up to denial of service attacks when receiving user input?

A. Bstrlib contains a function bSecureInput () in the bstraux module which addresses this issue; it takes an optional maximum length parameter for user input. So malformed input cannot cause unusually large amounts of resources to be wasted.
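
The idea of capping input length can be sketched as follows. This is not bSecureInput () itself; sk_read_bounded, sk_getc_fn and struct sk_src are hypothetical names, and details such as keeping the terminator in the result are assumptions of this sketch. It reads through a caller-supplied character source and never stores or consumes more than maxlen bytes.

```c
#include <stdlib.h>

/* Hypothetical character source: returns the next byte, or a negative
   value at end of input. */
typedef int (*sk_getc_fn)(void *ctx);

/* Read at most maxlen bytes, stopping after termchar or at end of input.
   Returns a malloc'd, '\0' terminated buffer (caller frees) and stores
   the byte count in *outlen; returns NULL on error. */
char *sk_read_bounded(int maxlen, int termchar, sk_getc_fn next,
                      void *ctx, int *outlen) {
    int cap = 16, len = 0, c;
    char *buf, *p;
    if (maxlen < 0 || next == NULL || outlen == NULL) return NULL;
    if (cap > maxlen) cap = maxlen;
    buf = malloc((size_t)cap + 1);
    if (buf == NULL) return NULL;
    while (len < maxlen && (c = next(ctx)) >= 0) {
        if (len == cap) {
            /* Double the capacity, but never beyond maxlen. */
            cap = (cap > maxlen / 2) ? maxlen : cap * 2;
            p = realloc(buf, (size_t)cap + 1);
            if (p == NULL) { free(buf); return NULL; }
            buf = p;
        }
        buf[len++] = (char)c;
        if (c == termchar) break;
    }
    buf[len] = '\0';
    *outlen = len;
    return buf;
}

/* Example source for the sketch: walks over a C string. */
struct sk_src { const char *s; };

int sk_src_getc(void *ctx) {
    struct sk_src *r = ctx;
    return *r->s ? (unsigned char)*r->s++ : -1;
}
```

With a limit of 8, a 40 character line yields exactly 8 bytes, so the memory used is bounded by the caller rather than by the sender.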

Q. What is the relationship between bstrings and CBString?

A. CBString is a C++ class that substantially uses bstrings to implement its functionality. Both provide nearly the same functionality. Like STL's std::string or MFC's CString, CBString uses operator overloading, exception handling, and STL to maximally leverage C++ functionality. The [] operator is additionally safe via bounds checking. CBString throws exceptions as a result of any error encountered, while Bstrlib propagates such errors.

Q. Doesn't the bounds protection decrease the performance of Bstrlib?

A. For most operations, no. This only affects per-character operations, where by default Bstrlib favors safety over performance. However, one can always achieve higher performance by gaining direct access to the buffer as necessary. I.e., it's easy to be safe and fast most of the time, and requires just a little effort (it's just a little more same-line typing) to be less-safe but always fast.
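
The trade-off between checked per-character access and direct buffer access might look like this in plain C. The struct sk_text type and both functions are hypothetical and only illustrate the two styles; they are not Bstrlib's accessors.

```c
#include <stddef.h>

/* Hypothetical (length, data) string (illustration only). */
struct sk_text {
    int slen;
    const unsigned char *data;
};

/* Checked access: every call validates the index and returns the
   caller's fallback value instead of reading out of bounds. */
int sk_char_or(const struct sk_text *s, int pos, int fallback) {
    if (s == NULL || s->data == NULL || pos < 0 || pos >= s->slen)
        return fallback;
    return s->data[pos];
}

/* Direct access: validate once, then loop over the raw buffer with no
   per-character check. Returns -1 for an invalid string. */
int sk_count(const struct sk_text *s, unsigned char c) {
    int i, n = 0;
    if (s == NULL || s->data == NULL || s->slen < 0) return -1;
    for (i = 0; i < s->slen; i++)
        n += (s->data[i] == c);
    return n;
}
```

The checked form is the natural default for isolated lookups; the direct form suits tight loops where the bounds are established once up front.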

Q. What advantage does Bstrlib::CBString have over std::string?

A. std::string is an STL generic which will be somewhat slower than CBString because of it. std::string also does not contain a lot of standard character string manipulations like format, findreplace, split and join. CBString implements useful write protection (that extends into per character protection.) With std::string it is cumbersome to be bounds protected, and easy to be unsafe (this is opposite to Bstrlib::CBString, where a cast and a dereference is required to drop safety.)

Q. Isn't using some of the string library extensions such as strlcat and snprintf sufficient to overcome the main safety problems of char * buffer usage so as to make Bstrlib unnecessary?

A. No. String manipulation, by its very nature, requires memory management to go hand in hand with each operation. These solutions do not provide any such thing and continue to defer the problem to the programmer. Arbitrary preconceived length cut-offs are just not adequate. They lead to the typical double problem of "make the buffer large enough for all reasonable cases": memory is wasted in the most typical cases, while the uncommon cases still do not work correctly.
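
A small sketch using only standard C99 snprintf shows what "deferring the problem to the programmer" means in practice. The function names sk_join_fixed and sk_join_alloc are hypothetical. The fixed-buffer version stays in bounds but truncates, and only a caller who checks the return value will notice; the second version has to do its own measuring and allocation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Fixed buffer: never overflows, but the result may be cut short.
   Returns 1 if the output was truncated (or formatting failed), else 0. */
int sk_join_fixed(char *out, size_t outsz, const char *dir, const char *file) {
    int need = snprintf(out, outsz, "%s/%s", dir, file);
    return need < 0 || (size_t)need >= outsz;
}

/* Sized to fit: measure with a NULL buffer first, then allocate exactly
   what is needed. Returns a malloc'd string (caller frees) or NULL. */
char *sk_join_alloc(const char *dir, const char *file) {
    int need = snprintf(NULL, 0, "%s/%s", dir, file);
    char *out;
    if (need < 0) return NULL;
    out = malloc((size_t)need + 1);
    if (out != NULL)
        snprintf(out, (size_t)need + 1, "%s/%s", dir, file);
    return out;
}
```

With an 8 byte buffer, joining "/usr/local" and "include" leaves "/usr/lo" in the buffer, which is memory safe but is not the path that was asked for.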

Buffer overflows only represent the very worst problem with string support in the C/C++ languages. The other safety features in Bstrlib still give it a notable advantage versus any typical function augmentation based on the C library as a foundation.

A survey of other modern programming languages also shows that a larger complement of string manipulation functions is required, something that neither the C library, nor STL, nor their extensions provide.

Q. Aren't all these safety mechanisms really only an issue with beginner programmers? Is there any advantage for experienced programmers?

A. The safety features are not meant merely to coddle the beginner programmer. Bstrlib uses policies that minimize the number of required reallocs, implements correct alias detection, performs no action when trying to destroy a write protected bstring, accepts NULL parameters, deals with out-of-range indexes, does not inhibit thread safety and adds functionality found in other more modern programming languages.

While any sufficiently skilled programmer can duplicate (and possibly improve on) any of this functionality, Bstrlib offers all these features in one well tested package without the need for "reinventing the wheel". The fact that other string libraries exist but are unable to match Bstrlib feature for feature is evidence that doing so is, perhaps, easier said than done.
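
As one example of the policies listed above, write protection and "no action on destroy" can be sketched as follows. The struct sk_buf type, the convention that mlen <= 0 marks a read-only string, and the function names are all assumptions of this sketch rather than a description of Bstrlib's internals.

```c
#include <stdlib.h>

/* Hypothetical string header; in this sketch mlen <= 0 means the
   string is write protected. */
struct sk_buf {
    int mlen;
    int slen;
    char *data;
};

/* A NULL string is treated as protected so callers cannot modify it. */
int sk_is_protected(const struct sk_buf *b) {
    return b == NULL || b->mlen <= 0;
}

/* Overwrite one character; refuses protected strings and bad indexes. */
int sk_set_char(struct sk_buf *b, int pos, char c) {
    if (sk_is_protected(b) || b->data == NULL) return -1;
    if (pos < 0 || pos >= b->slen) return -1;
    b->data[pos] = c;
    return 0;
}

/* Release the heap buffer of a writable string. A protected string is
   left completely untouched and -1 is returned. */
int sk_release(struct sk_buf *b) {
    if (sk_is_protected(b)) return -1;
    free(b->data);
    b->data = NULL;
    b->slen = 0;
    b->mlen = 0;   /* now counts as protected, so a second release is a no-op */
    return 0;
}
```

Because the protection flag lives in the string header, a string wrapped around a literal or another module's buffer can be passed around freely without any risk that a callee modifies or frees it.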

Q. Isn't Bstrlib really appropriate just for applications that are being written from scratch?

A. Not at all! Bstrlib is highly interoperable with standard char * strings. Complete conversion from char *'s to bstrings is usually not necessary, since the library contains key functions for mixing the two, and extracting a NUL ('\0') terminated char * from a bstring is a trivial operation. Bstrlib also contains macros which replace the semantics of "pointer arithmetic" with a safe superset (string segment arithmetic) of such functionality. So migrating from char * usage to Bstrlib can be done incrementally without semantic impediments.
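
The notion of "string segment arithmetic" as a safe replacement for pointer arithmetic can be illustrated with a short sketch. The struct sk_seg type and sk_mid function are hypothetical and do not reproduce Bstrlib's macros; the clamping rules chosen here are assumptions. The sketch shows that a (pointer, length) segment can be narrowed freely while out-of-range requests collapse to a shorter or empty segment instead of pointing outside the buffer.

```c
#include <stddef.h>

/* Hypothetical read-only segment of an existing buffer. */
struct sk_seg {
    const char *data;
    int len;
};

/* Safe analogue of "s + pos, limited to len bytes". Requests that fall
   partly or wholly outside the source are clamped to what exists. */
struct sk_seg sk_mid(struct sk_seg s, int pos, int len) {
    struct sk_seg r = { NULL, 0 };
    if (s.data == NULL || s.len < 0) return r;
    if (len < 0) len = 0;
    if (pos < 0) { len += pos; pos = 0; }   /* drop the part before the start */
    if (pos > s.len) pos = s.len;
    if (len > s.len - pos) len = s.len - pos;
    if (len < 0) len = 0;
    r.data = s.data + pos;
    r.len = len;
    return r;
}
```

For a segment covering "hello world", sk_mid(s, 6, 100) yields the 5 bytes of "world", where the equivalent raw pointer code would read past the end of the buffer.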

Q. Isn't using a different programming language a better answer than using Bstrlib?

A. The C (and C++) language has been much maligned for its severe lack of safety. And when one takes into account the C standard library as well, one is inevitably led to the conclusion that this criticism is justified. While basically every other modern high level language in existence is far safer than C in all modes of usage, they also all pretty universally find themselves compromising in one way or another:

  1. Performance

    Even if some alternative language can theoretically perform comparably with C (or if someone rigs a benchmark to make it suggest that the performance of C can be equalled or beaten), there simply is no denying the centuries (if not millennia) of man years of research and effort put into C compiler optimizers coupled with the more natural mapping of the C language to native machine language versus other languages. This explains the rise of C++ over other languages like Eiffel or Ada -- C++ is able to leverage all of the results from the C world, while those languages are not.

  2. Portability and Availability

    Actually, from a semantics point of view, C/C++ are horribly unportable. But this does not detract from the fact that just about every serious computing platform today has a C or C++ compiler available for it. C/C++ at best can be described as *syntactically* portable, but for most development projects (which will typically be limited to one or at most a few target platforms) this is definitely good enough.

    While Java is semantically portable (as part of the standard, which in and of itself is far more rigorously and usefully enforced than the ANSI C standard), it is not available on legacy platforms, and it has been maligned and actively attacked by certain aggressive proprietary software development houses. Regardless of the righteousness of the cause, one must make practical decisions which may require taking a pass on using Java.

  3. Developer familiarity

    There is simply a larger and better skill base amongst C/C++ programmers than others. While Java's programmer base is certainly growing to challenge this, it has not yet achieved a situation where the available pool of programmers of comparable skill level favors it.

  4. Libraries and tools

    Given C's 20+ year head start over languages like Perl, Python and Java, there is clearly much more available for it in terms of libraries and tools.

None of these points should be taken as the final word on other programming languages. I make them only to demonstrate that there certainly exist compelling reasons for some to not switch to another programming language even for modern software development.

All that being said, Bstrlib helps solve a problem in what is one of the weakest areas for the C language -- the C standard library's pathetic string support. Other languages like Perl and Python have made a point of having really good and highly functional string support without suffering from anything akin to a "buffer overflow" that really puts the C standard library to shame. Use of Bstrlib substantially reduces this advantage.

So if one is compelled to switch languages because they believe that C is just an unsafe language, the existence of Bstrlib is in effect saying "maybe not".

Q. Does Bstrlib include regular expression searching?

A. No. There is no single de facto regular expression standard, regular expressions are weaker than other parsing mechanisms (such as context free, LALR grammars, etc.) and each requires a fairly non-trivial implementation in and of itself. That said, Bstrlib is totally interoperable with char * usage. Any other char * compatible library can be used in conjunction with Bstrlib. Thus, most available regular expression libraries (in particular including PCRE) can be used with Bstrlib.

Q. Does Bstrlib support Unicode?

A. No, not at this time. Unfortunately, I have no experience with Unicode at all. Unicode, it turns out, is a very complicated and evolving standard. I am not averse to creating a solution for Unicode; however, it's obvious that such a thing would have to be designed very carefully. I am open to feedback on this issue, including suggestions on what the scope should be and how such support might be added.

Q. Does Bstrlib support garbage collection?

A. No. Certainly not directly. bstrings and CBStrings need to be correctly destroyed much like any allocated memory or object in C/C++ to avoid memory leaks. That said, both implementations allocate memory via malloc, realloc or new, so the garbage collection mechanisms that do exist for C/C++ (such as the Boehm garbage collector) should integrate with Bstrlib without issue.

Schemes such as "reference counting" do not work in a language like C without a lot of hand holding (ADT construction, copying and destruction would have to precisely track references to any contained strings) and can inhibit the creation of thread safe solutions.

Q. Is Bstrlib thread safe?

A. The thread safety in Bstrlib is comparable to that of a linked list or any other self-contained ADT (abstract data type) rather than that of a system-defined atomic data-type such as sig_atomic_t. I.e., reading and writing to the same bstring at the same time leads to undefined results. However, there is no restriction whatsoever on manipulating or reading two different bstrings at the same time. Also, clearly there can be any number of readers for a single bstring. In particular, Bstrlib has no static/globally written state, so using it won't lead to hidden or unavoidable race conditions. Every function in Bstrlib is also re-entrant, and can be called even by dynamically linked code provided the caller has a proper context for calling malloc/realloc/free.

To support shared read/write semantics, it is sufficient and recommended that exclusion/critical-sections be handled at higher layers in your code by just making sure that the same bstring is not being modified by more than one thread at once (this will bring the thread safety characteristics to the level of, say, malloc/realloc/free with respect to the shared heap.) This consideration is typical for ADT manipulation in multithreaded programming, and therefore should not lead to any burden that is not expected or already present.
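
Handling exclusion at a higher layer might be sketched like this with POSIX threads. The struct sk_shared type and sk_shared_append function are hypothetical, and the fixed 64 byte text field is a simplification standing in for a real string object. What it shows is that the lock belongs to the code that shares the string, not to the string routines themselves.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical shared record: the application layer owns the mutex. */
struct sk_shared {
    pthread_mutex_t lock;
    char text[64];    /* stand-in for a string object shared by threads */
    size_t len;
};

/* Writers serialize on the lock; the copy itself is ordinary,
   lock-free string code. Returns 0 on success, -1 on failure. */
int sk_shared_append(struct sk_shared *s, const char *piece) {
    size_t n;
    int rc = -1;
    if (s == NULL || piece == NULL) return -1;
    n = strlen(piece);
    if (pthread_mutex_lock(&s->lock) != 0) return -1;
    if (n < sizeof s->text - s->len) {
        memcpy(s->text + s->len, piece, n + 1);
        s->len += n;
        rc = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}
```

A record would typically be set up as struct sk_shared s = { PTHREAD_MUTEX_INITIALIZER, "", 0 }; readers that need a consistent snapshot would take the same lock.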

Q. Why does Bstrlib implement abstracted stream reading functionality, but not abstracted stream writing functionality?

A. Stream based writing was added to the bstraux module of Bstrlib.

Q. How do I convince my programming organization to use Bstrlib?

A. Bstrlib is easy to migrate to with numerous good properties:

  1. Bstrlib/CBString has a short learning curve (there are not that many functions/methods)
  2. Bstrlib allows for very concise code. This will inevitably lead to much more maintainable code.
  3. Bstrlib's safety features will lead to improved reliability. It will raise the level of less experienced programmers, while complementing the capabilities of more experienced programmers.
  4. Bstrlib uses a safety model that can be educational to those that use it. Some of the ideas used in Bstrlib (read-only strings, complete alias support, absolutely minimized undefined behaviors) are rare or just completely unseen in other existing libraries. Programmers who use Bstrlib can be motivated to use its techniques in other code they write.
  5. Bstrlib has complete interoperability with ordinary '\0' terminated char * buffers. This means that using Bstrlib does not burn any bridges or in any way compromise the ability to use other libraries which rely on char * buffers. This also provides a way to migrate to use of Bstrlib in an incremental fashion.
  6. Bstrlib/CBString is totally portable from compilation to run time behavior. It does not require UNIX tools or MFC or any other common but non-standard mechanisms. (Use of STL and exception handling can be turned off.)
  7. Bstrlib is well tested and in fact comes with an extensive unit test.
  8. Bstrlib comes with various utility functions which support "net strings", CSV, base 64, UU and Y codecs which make it ideal for dealing with MIME.
  9. It is dual licensed under both the BSD license and the GNU General Public License. This means it can be used on any project and with any vendor without serious issue.

Q. How does Bstrlib compare to other solutions?

A. See the Bstrlib feature comparison table or refer to the Bstrlib documentation which gives more detailed comparisons between existing string library alternatives as well as with the standard C library.

