Strings are a fundamental data type in most languages and the .NET platform is no different. What follows is a Q&A on strings that (hopefully) serves as a refresher course on basic string knowledge and operations that programmers use on a regular basis.
What is a string and how is it stored in .NET?
In .NET, a string is a sequence of Unicode characters used to represent text, stored internally as UTF-16 code units. Strings are serializable, comparable, equatable, enumerable and cloneable. Unlike strings in C and some other languages, they are not null-terminated. In fact, strings can have null characters embedded within them.
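To see the embedded-null behaviour for yourself, here is a minimal sketch. The variable name is just for illustration.

```csharp
using System;

// Top-level statements (C# 9+); wrap in a Main method for older compilers.
string withNull = "abc\0def";               // '\0' is the null character (U+0000)

Console.WriteLine(withNull.Length);         // 7: the null counts like any other character
Console.WriteLine(withNull.IndexOf('\0'));  // 3: it sits in the middle of the string
```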
What is a Unicode character?
The Unicode Standard was designed to codify all the characters of all the languages in the world. Each character is assigned a unique number called a code point. The standard defines several encodings that specify how a code point is turned into bytes; .NET strings use UTF-16, which encodes each code point as a sequence of one or two 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and in .NET is stored in a Char structure. Most commonly used characters (code points up to U+FFFF) fit in a single Char, while code points above U+FFFF are stored as two Char values called a surrogate pair.
What's the difference between "string" and "System.String"?
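As a rough illustration of code points versus 16-bit values, the sketch below compares a character that fits in one Char with one that needs two. The sample characters are my own choice.

```csharp
using System;

// Top-level statements (C# 9+).
string letter = "A";            // U+0041 fits in a single 16-bit Char
string emoji = "\U0001F600";    // U+1F600 is above U+FFFF, so it needs two Chars

Console.WriteLine(letter.Length);                    // 1
Console.WriteLine(emoji.Length);                     // 2
Console.WriteLine(char.IsHighSurrogate(emoji[0]));   // True
Console.WriteLine(char.IsLowSurrogate(emoji[1]));    // True

// Recover the original code point from the pair of Chars.
int codePoint = char.ConvertToUtf32(emoji, 0);
Console.WriteLine(codePoint.ToString("X"));          // 1F600
```

So Length counts Char values, not user-perceived characters.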
None. The lowercase version seen in C# code is merely an alias to the framework-provided System.String class.
What's the difference between a char and a string data type?
A Char represents a single 16-bit Unicode value (in most cases one character), whereas a string represents any number of them. The System.Char data type is a struct, so it is a value type and cannot be null. Char literals are defined with single quotes whereas string literals are defined with double quotes.
Are strings value- or reference-types?
Strings are objects in .NET, which makes them reference types. However, they are unusual in that they exhibit value-type semantics, since they are immutable and operations on them create new strings.
Are strings immutable?
Yes, as noted above. Once created, a string cannot be modified; any operation that appears to modify a string actually creates a new string.
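A quick sketch of what immutability looks like in practice (names are illustrative):

```csharp
using System;

// Top-level statements (C# 9+).
string original = "hello";
string upper = original.ToUpperInvariant();   // returns a new string

Console.WriteLine(original);   // hello (unchanged)
Console.WriteLine(upper);      // HELLO
Console.WriteLine(object.ReferenceEquals(original, upper));   // False: two separate objects
```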
How are "empty" strings defined?
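Since strings are reference types, a string variable can be set to null, but a null reference is not the same thing as an empty string. An empty string is a zero-length string, "", and the system-provided String.Empty field holds exactly that value. The static String.IsNullOrEmpty() method checks for both cases at once.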
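The difference between a null reference and an empty string can be sketched like this. The variable names are mine.

```csharp
using System;

// Top-level statements (C# 9+).
string? nothing = null;          // no string object at all
string empty = String.Empty;     // a real string object with zero characters

Console.WriteLine(empty == "");                      // True
Console.WriteLine(empty.Length);                     // 0
Console.WriteLine(nothing == empty);                 // False: null is not the empty string
Console.WriteLine(string.IsNullOrEmpty(nothing));    // True
Console.WriteLine(string.IsNullOrEmpty(empty));      // True
```

Calling nothing.Length here would throw a NullReferenceException, which is why a combined check such as String.IsNullOrEmpty() is handy.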
How big can strings be?
Like other CLR objects, a string cannot exceed the maximum object size allowed in the GC heap, which is 2GB. If you are more interested in character count, use the Length instance property to obtain the number of characters (Char objects) in the string. Note that embedded nulls are counted by the Length property.
What is a character encoding?
As mentioned earlier, strings in .NET are sequences of Unicode characters. Encoding is the process of transforming a set of Unicode characters into a byte sequence. Whenever you want to transform a .NET string into a byte array you'll need to use an encoding. The same encoding must be used to transform the byte array back to a string! Note that, by default, the encoding classes replace a character they cannot encode or decode with "?" or a substitute character rather than reporting an error.
What encodings are supported by .NET?
The framework provides the Encoding class, which supports ASCII, UTF-7, UTF-8, UTF-32, and of course Unicode (UTF-16). For Unicode and UTF-32 encoding both little-endian and big-endian byte orders are supported. These terms refer to whether the least-significant or the most-significant byte comes first in the byte stream. The .NET Framework can detect the byte order of encoded data by checking its first few bytes for a byte order mark: by convention, UTF-16 data starts with FE FF if it is big-endian and FF FE if it is little-endian.
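Here is a small sketch of a round trip through an encoding, plus what happens when the encoding cannot represent a character. The sample text is my own.

```csharp
using System;
using System.Text;

// Top-level statements (C# 9+).
string text = "caf\u00E9";                          // "café", with é as U+00E9

// String -> bytes -> string, using the same encoding both ways.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
string roundTrip = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(utf8Bytes.Length);                // 5: é takes two bytes in UTF-8
Console.WriteLine(roundTrip == text);               // True

// ASCII has no é, so by default it is replaced with '?'.
string lossy = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
Console.WriteLine(lossy);                           // caf?
```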
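To make byte order concrete, this sketch prints the byte order marks and the encoded bytes for the two UTF-16 encodings that ship with the framework (Encoding.Unicode is little-endian, Encoding.BigEndianUnicode is big-endian).

```csharp
using System;
using System.Text;

// Top-level statements (C# 9+).
string littleBom = BitConverter.ToString(Encoding.Unicode.GetPreamble());
string bigBom = BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble());
Console.WriteLine(littleBom);   // FF-FE
Console.WriteLine(bigBom);      // FE-FF

// The same character, U+0041, with its two bytes in opposite orders.
string littleA = BitConverter.ToString(Encoding.Unicode.GetBytes("A"));
string bigA = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("A"));
Console.WriteLine(littleA);     // 41-00
Console.WriteLine(bigA);        // 00-41
```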
What is string normalization?
Some Unicode characters have multiple equivalent binary representations consisting of sets of combining and/or composite Unicode characters. The existence of multiple representations for a single character complicates searching, sorting, matching, and other operations, so the Unicode standard defines a process called normalization that returns one binary representation when given any of the equivalent binary representations of a character. In .NET, the String.Normalize() method returns a new string whose binary representation is in a particular Unicode normalization form. The .NET Framework currently supports normalization forms C, D, KC, and KD.
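A minimal sketch of the problem and the fix, using the letter é, which can be written either as one code point or as "e" plus a combining accent. The example strings are my own.

```csharp
using System;
using System.Text;

// Top-level statements (C# 9+).
string composed = "\u00E9";       // é as a single code point
string decomposed = "e\u0301";    // e followed by a combining acute accent

// They look the same on screen but differ as Char sequences.
Console.WriteLine(composed == decomposed);                         // False

// After normalizing to the same form, they match.
string normalized = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(composed == normalized);                         // True

// Form D goes the other way and splits é into two Chars.
Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length);   // 2
```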
How do you compare 2 strings? What algorithm is used?
There is an instance method available for strings called CompareTo() that accepts a second string and returns a negative integer if the first string is "less than" the second, a positive integer if the first string is "greater than" the second, or zero if they are the same. This is a linguistic operation (explanation below) therefore the comparison is culture-sensitive. This, and most other string comparisons, compare the actual string values rather than the object references.
There are other ways to perform comparisons. The == operator can be used for string equality checks. Under the covers it calls the static String.Equals(a, b) method, which performs an ordinal (case-sensitive and culture-insensitive) comparison of the string values.
There is also a String.Compare(s1, s2) static method that does culture-aware string comparisons. You should prefer this one if the strings could be null. Note that comparing a null to a null returns 0 (a match).
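The following sketch pulls these comparison options together. The sample values are mine, and the second "apple" is deliberately built at runtime so that it is a separate object.

```csharp
using System;

// Top-level statements (C# 9+).
string a = "apple";
string b = "banana";
string builtApple = new string(new[] { 'a', 'p', 'p', 'l', 'e' });

Console.WriteLine(a.CompareTo(b) < 0);                        // True: "apple" sorts before "banana"

// == compares values, not references.
Console.WriteLine(a == builtApple);                           // True
Console.WriteLine(object.ReferenceEquals(a, builtApple));     // False

// The static Compare method tolerates nulls.
Console.WriteLine(string.Compare(null, null));                // 0
Console.WriteLine(string.Compare(null, a) < 0);               // True: null sorts before any string
```

By contrast, calling CompareTo() on a null reference would throw a NullReferenceException.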
What's the difference between Compare and CompareOrdinal?
Both are static methods used to perform string comparisons, but CompareOrdinal ignores culture and is always case-sensitive. In general, string operations in .NET can be considered ordinal or linguistic. An ordinal operation acts on the numeric value of each Char object. A linguistic operation acts on the value of the String, taking into account culture-specific casing, sorting, formatting, and parsing rules. Linguistic operations execute in the context of an explicitly declared culture or the implicit current culture. An ordinal comparison is automatically case-sensitive because the lowercase and uppercase versions of a character have different code points. As a rule of thumb, a culture-sensitive comparison is appropriate for sorting, and an ordinal comparison is appropriate for equality checking.
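Here is a rough sketch of how the two kinds of comparison can disagree. The linguistic result depends on the culture data available on the machine, so treat that comment as the typical outcome rather than a guarantee.

```csharp
using System;

// Top-level statements (C# 9+).

// Ordinal: compares Char values, and 'a' (0x61) is greater than 'B' (0x42).
int ordinal = string.CompareOrdinal("a", "B");
Console.WriteLine(ordinal > 0);        // True

// Linguistic: alphabetical order, so "a" typically sorts before "B".
int linguistic = string.Compare("a", "B", StringComparison.InvariantCulture);
Console.WriteLine(linguistic);

// Ordinal comparisons are case-sensitive unless you ask otherwise.
bool sameOrdinal = string.Equals("a", "A", StringComparison.Ordinal);
bool sameIgnoreCase = string.Equals("a", "A", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(sameOrdinal);        // False
Console.WriteLine(sameIgnoreCase);     // True
```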
What's the difference between a string and a StringBuilder?
Whereas the String class represents an immutable sequence of Unicode characters, the StringBuilder class can be used to represent a mutable string of characters. Highly repetitive string manipulation operations (such as concatenation) are best done with a StringBuilder.
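As a small illustration, this sketch builds a comma-separated list in a loop with a StringBuilder instead of repeated string concatenation. The loop contents are made up for the example.

```csharp
using System;
using System.Text;

// Top-level statements (C# 9+).
var builder = new StringBuilder();

for (int i = 0; i < 5; i++)
{
    // Append modifies the builder's buffer; no new string is created per iteration.
    builder.Append(i).Append(',');
}

builder.Length--;                       // drop the trailing comma
string result = builder.ToString();     // one string created at the end

Console.WriteLine(result);              // 0,1,2,3,4
```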
What is string interning?
As mentioned previously, strings are objects and they are immutable. Because of this there is no real need to ever have multiple objects that represent the same string. To take advantage of this, the .NET Framework employs string interning, whereby an intern pool holds a single object for each distinct interned string. String literals are automatically interned, and you can also force interning of other strings via the Intern() method. The IsInterned() method reports whether a given string is already in the pool.
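The sketch below shows interning at work. The second string is built at runtime on purpose, so it starts out as a separate object from the literal.

```csharp
using System;

// Top-level statements (C# 9+).
string literal = "hello";                                          // literals are interned
string built = new string(new[] { 'h', 'e', 'l', 'l', 'o' });      // same value, new object

Console.WriteLine(literal == built);                               // True: same value
Console.WriteLine(object.ReferenceEquals(literal, built));         // False: different objects

// Intern returns the pooled instance for this value, which is the literal's object.
string interned = string.Intern(built);
Console.WriteLine(object.ReferenceEquals(literal, interned));      // True

// IsInterned returns the pooled reference, or null if the value is not in the pool.
Console.WriteLine(string.IsInterned(built) != null);               // True
```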
Update (Dec-2009): Eric Lippert has a great article describing how interning and reference/value comparisons create interesting edge cases.
How can you replace a portion of a string?
Use the Replace method as follows: string output = input.Replace("searchTerm","replacementValue"); This returns a new string with every occurrence replaced and leaves input unchanged.
How would you reverse a string in-place?
See my earlier article on this.
What is a verbatim string literal?
Certain characters have special meaning and therefore need to be "escaped" when included in a string. This concept won't be new to most programmers, but the term verbatim string might be. A verbatim string is simply a string literal that is prefixed with @ before the double quotes, which signifies that backslash escape sequences are not processed and a backslash is just a backslash. Of course, it still uses the double quote to mark the start and end of the string, and a pair of double quotes can be used to represent an embedded double quote character.
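A short sketch comparing a regular literal with a verbatim one. The file path is made up.

```csharp
using System;

// Top-level statements (C# 9+).
string regular = "C:\\temp\\new.txt";     // each backslash must be escaped
string verbatim = @"C:\temp\new.txt";     // backslashes are taken literally

Console.WriteLine(regular == verbatim);   // True: same characters either way

// Inside a verbatim string, a doubled quote stands for one quote character.
string quoted = @"She said ""hi""";
Console.WriteLine(quoted);                // She said "hi"
```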
How would you create a string that consists of a single character repeated X times?
Use the constructor that takes a char and a count (this creates a string of 80 consecutive hashes; note the single quotes, since the first argument is a char): string a = new string('#', 80);
How would you convert a string to a numeric?
Use the Convert class and code similar to this (for an integer): int i = Convert.ToInt32(someString);
Alternatively, you can call the static Parse() method on the target data type (for example int.Parse) and pass it the string to convert.
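Here is a quick sketch of the conversion options side by side. TryParse is not mentioned above, but it is worth knowing because it reports failure with a return value instead of an exception. The input strings are examples only.

```csharp
using System;
using System.Globalization;

// Top-level statements (C# 9+).
int viaConvert = Convert.ToInt32("42");
int viaParse = int.Parse("42", CultureInfo.InvariantCulture);
Console.WriteLine(viaConvert == viaParse);      // True

// Convert treats a null string as 0, whereas int.Parse(null) throws.
int fromNull = Convert.ToInt32((string?)null);
Console.WriteLine(fromNull);                    // 0

// TryParse returns false on bad input and sets the out value to 0.
bool parsed = int.TryParse("not a number", out int fallback);
Console.WriteLine(parsed);                      // False
Console.WriteLine(fallback);                    // 0
```

Both Convert.ToInt32 and int.Parse throw a FormatException on input like "not a number", so TryParse is usually the better fit for user-supplied text.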
21 Mar 2009 Damien Wintour