





Parsing a String to Obtain an Object: Parse

In the preceding section, I explained how to take an object and obtain a string representation of that object. In this section, I’ll talk about the opposite: how to take a string and obtain an object representation of it. Obtaining an object from a string isn’t a very common operation, but it does occasionally come in handy. Microsoft felt it necessary to formalize a mechanism by which strings can be parsed into objects.

Any type that can parse a string offers a public, static method called Parse. This method takes a String and returns an instance of the type; in a way, Parse acts as a factory. In the FCL, a Parse method exists on all of the numeric types as well as for DateTime, TimeSpan, and a few other types (such as the SQL data types).

Let’s look at how to parse a string into a number type. Almost all of the numeric types (Byte, SByte, Int16/UInt16, Int32/UInt32, Int64/UInt64, Single, Double, Decimal, and BigInteger) offer at least one Parse method. Here I’ll show you just the Parse method defined by the Int32 type. (The Parse methods for the other numeric types work similarly to Int32’s Parse method.)

 

public static Int32 Parse(String s, NumberStyles style, IFormatProvider provider);

 

Just from looking at the prototype, you should be able to guess exactly how this method works. The String parameter, s, identifies a string representation of a number you want parsed into an Int32 object. The System.Globalization.NumberStyles parameter, style, is a set of bit flags that identify characters that Parse should expect to find in the string. And the IFormatProvider parameter, provider, identifies an object that the Parse method can use to obtain culture-specific information, as discussed earlier in this chapter.

For example, the following code causes Parse to throw a System.FormatException because the string being parsed contains a leading space.

 

Int32 x = Int32.Parse(" 123", NumberStyles.None, null);

 

To allow Parse to skip over the leading space, change the style parameter as follows.

 

Int32 x = Int32.Parse(" 123", NumberStyles.AllowLeadingWhite, null);

 

See the .NET Framework SDK documentation for a complete description of the bit symbols and common combinations that the NumberStyles enumerated type defines.
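To make the effect of the style flags concrete, here is a minimal sketch that combines several NumberStyles flags and contrasts them with NumberStyles.None. The class and helper names (ParseStylesSketch, ParseLoose, IsRejectedByNone) are hypothetical names I made up for illustration, and I pass CultureInfo.InvariantCulture only so that the result does not depend on the machine's culture.

```csharp
using System;
using System.Globalization;

public static class ParseStylesSketch {
   // Hypothetical helper: accepts surrounding white space and a leading sign.
   public static Int32 ParseLoose(String s) {
      NumberStyles style = NumberStyles.AllowLeadingWhite
                         | NumberStyles.AllowTrailingWhite
                         | NumberStyles.AllowLeadingSign;
      return Int32.Parse(s, style, CultureInfo.InvariantCulture);
   }

   // Hypothetical helper: reports whether Parse rejects s under NumberStyles.None.
   public static Boolean IsRejectedByNone(String s) {
      try {
         Int32.Parse(s, NumberStyles.None, CultureInfo.InvariantCulture);
         return false;
      }
      catch (FormatException) {
         return true;
      }
   }

   public static void Main() {
      Console.WriteLine(ParseLoose("  -123 "));     // Displays "-123"
      Console.WriteLine(IsRejectedByNone(" 123"));  // Displays "True"
      Console.WriteLine(IsRejectedByNone("123"));   // Displays "False"
   }
}
```

Because the flags are bits, you combine them with the | operator; each flag you leave out is something Parse will refuse to accept in the string.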

Here’s a code fragment showing how to parse a hexadecimal number.

 

Int32 x = Int32.Parse("1A", NumberStyles.HexNumber, null);
Console.WriteLine(x); // Displays "26"


This Parse method accepts three parameters. For convenience, many types offer additional overloads of Parse so you don’t have to pass as many arguments. For example, Int32 offers four overloads of the Parse method.

 

// Passes NumberStyles.Integer for style
// and thread's culture's provider information.
public static Int32 Parse(String s);

 

// Passes thread's culture's provider information.
public static Int32 Parse(String s, NumberStyles style);



 

// Passes NumberStyles.Integer for the style parameter.

public static Int32 Parse(String s, IFormatProvider provider);

 

// This is the method I've been talking about in this section.
public static Int32 Parse(String s, NumberStyles style,
   IFormatProvider provider);
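The following sketch shows how the provider argument changes what these overloads accept. It is only an illustration under my own assumptions: ParseProviderSketch and ParseDotGrouped are hypothetical names, and I build a custom NumberFormatInfo (which implements IFormatProvider) instead of naming a specific culture so that the example behaves the same on any machine.

```csharp
using System;
using System.Globalization;

public static class ParseProviderSketch {
   // Hypothetical helper: parses integers whose digit groups are separated by '.'.
   public static Int32 ParseDotGrouped(String s) {
      // InvariantInfo is read-only, so work on a writable clone.
      NumberFormatInfo nfi = (NumberFormatInfo) NumberFormatInfo.InvariantInfo.Clone();
      nfi.NumberDecimalSeparator = ",";
      nfi.NumberGroupSeparator = ".";
      return Int32.Parse(s, NumberStyles.AllowThousands, nfi);
   }

   public static void Main() {
      // Three-argument overload with a custom provider.
      Console.WriteLine(ParseDotGrouped("1.234.567"));   // Displays "1234567"

      // Same style, but the invariant culture groups digits with ','.
      Console.WriteLine(Int32.Parse("1,234",
         NumberStyles.AllowThousands, CultureInfo.InvariantCulture)); // Displays "1234"

      // Two-argument overload: NumberStyles.Integer is assumed for the style.
      Console.WriteLine(Int32.Parse(" 42", CultureInfo.InvariantCulture)); // Displays "42"
   }
}
```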

 

The DateTime type also offers a Parse method.

 

public static DateTime Parse(String s, IFormatProvider provider, DateTimeStyles styles);

 

This method works just as the Parse method defined on the number types except that DateTime’s Parse method takes a set of bit flags defined by the System.Globalization.DateTimeStyles enumerated type instead of the NumberStyles enumerated type. See the .NET Framework SDK documentation for a complete description of the bit symbols and common combinations the DateTimeStyles type defines.

For convenience, the DateTime type offers three overloads of the Parse method.

 

// Passes thread's culture's provider information
// and DateTimeStyles.None for the style
public static DateTime Parse(String s);

 

// Passes DateTimeStyles.None for the style

public static DateTime Parse(String s, IFormatProvider provider);

 

// This is the method I've been talking about in this section.
public static DateTime Parse(String s,
   IFormatProvider provider, DateTimeStyles styles);

 

Parsing dates and times is complex. Many developers have found the Parse method of the DateTime type too forgiving in that it sometimes parses strings that don’t contain dates or times. For this reason, the DateTime type also offers a ParseExact method that accepts a picture format string that indicates exactly how the date/time string should be formatted and how it should be parsed.

For more information about picture format strings, see the DateTimeFormatInfo class in the .NET Framework SDK.
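As a rough illustration of ParseExact, the sketch below accepts only strings that match the picture "yyyy-MM-dd". The picture string, the sample inputs, and the names ParseExactSketch, ParseIsoDate, and IsIsoDate are my own choices for this example; your application would pick whatever picture matches its data.

```csharp
using System;
using System.Globalization;

public static class ParseExactSketch {
   // Hypothetical helper: accepts only strings shaped like "2016-03-03".
   public static DateTime ParseIsoDate(String s) {
      return DateTime.ParseExact(s, "yyyy-MM-dd", CultureInfo.InvariantCulture);
   }

   // Hypothetical helper: reports whether s matches the picture.
   public static Boolean IsIsoDate(String s) {
      try {
         ParseIsoDate(s);
         return true;
      }
      catch (FormatException) {
         return false;
      }
   }

   public static void Main() {
      DateTime d = ParseIsoDate("2016-03-03");
      Console.WriteLine(d.Year + "/" + d.Month + "/" + d.Day); // Displays "2016/3/3"
      Console.WriteLine(IsIsoDate("03/03/2016"));              // Displays "False"
   }
}
```

The second call fails even though the string clearly contains a date, which is exactly the strictness that Parse lacks.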


 

 

 
 

Encodings: Converting Between Characters and Bytes

In Win32, programmers all too frequently have to write code to convert Unicode characters and strings to Multi-Byte Character Set (MBCS) characters and strings. I’ve certainly written my share of this code, and it’s very tedious to write and error-prone to use. In the CLR, all characters are represented as 16-bit Unicode code values and all strings are composed of 16-bit Unicode code values. This makes working with characters and strings easy at run time.

At times, however, you want to save strings to a file or transmit them over a network. If the strings consist mostly of characters readable by English-speaking people, saving or transmitting a set of 16-bit values isn’t very efficient because half of the bytes written would contain zeros. Instead, it would be more efficient to encode the 16-bit values into a compressed array of bytes and then decode the array of bytes back into an array of 16-bit values.

Encodings also allow a managed application to interact with strings created by non-Unicode systems. For example, if you want to produce a file readable by an application running on a Japanese version of Windows 95, you have to save the Unicode text by using the Shift-JIS (code page 932) encoding. Likewise, you’d use Shift-JIS encoding to read a text file produced on a Japanese Windows 95 system into the CLR.

Encoding is typically done when you want to send a string to a file or network stream by using the System.IO.BinaryWriter or System.IO.StreamWriter type. Decoding is typically done when you want to read a string from a file or network stream by using the System.IO.BinaryReader or System.IO.StreamReader type. If you don’t explicitly select an encoding, all of these types default to using UTF-8. (UTF stands for Unicode Transformation Format.) However, at times, you might want to explicitly encode or decode a string. Even if you don't want to explicitly do this, this section will give you more insight into the reading and writing of strings from and to streams.


Fortunately, the FCL offers some types to make character encoding and decoding easy. The two most frequently used encodings are UTF-16 and UTF-8:

■ UTF-16 encodes each 16-bit character as 2 bytes. It doesn’t affect the characters at all, and no compression occurs, so its performance is excellent. UTF-16 encoding is also referred to as Unicode encoding. Also note that UTF-16 can be used to convert from little-endian to big-endian and vice versa.

■ UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some characters as 3 bytes, and some characters as 4 bytes. Characters with a value below 0x0080 are compressed to 1 byte, which works very well for characters used in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes. UTF-8 is an extremely popular encoding, but it’s less efficient than UTF-16 if you encode many characters with values of 0x0800 or above.

Although the UTF-16 and UTF-8 encodings are by far the most common, the FCL also supports some encodings that are used less frequently:

■ UTF-32 encodes all characters as 4 bytes. This encoding is useful when you want to write a simple algorithm to traverse characters and you don’t want to have to deal with characters taking a variable number of bytes. For example, with UTF-32, you do not need to think about surrogates because every character is 4 bytes. Obviously, UTF-32 is not an efficient encoding in terms of memory usage and is therefore rarely used for saving or transmitting strings to a file or network. This encoding is typically used inside the program itself. Also note that UTF-32 can be used to convert from little-endian to big-endian and vice versa.

■ UTF-7 encoding is typically used with older systems that work with characters that can be expressed using 7-bit values. You should avoid this encoding because it usually ends up expanding the data rather than compressing it. The Unicode Consortium has deprecated this encoding.

■ ASCII encodes the 16-bit characters into ASCII characters; that is, any 16-bit character with a value of less than 0x0080 is converted to a single byte. Any character with a value greater than 0x007F can’t be converted, so that character’s value is lost. For strings consisting of characters in the ASCII range (0x00 to 0x7F), this encoding compresses the data in half and is very fast (because the high byte is just cut off). This encoding isn’t appropriate if you have characters outside of the ASCII range because the character’s values will be lost.

Finally, the FCL also allows you to encode 16-bit characters to an arbitrary code page. As with the ASCII encoding, encoding to a code page is dangerous because any character whose value can’t be expressed in the specified code page is lost. You should always use UTF-16 or UTF-8 encoding unless you must work with some legacy files or applications that already use one of the other encodings.
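The byte sizes described above can be checked with a short sketch like the following. The sample characters and the names EncodingSizeSketch and Sizes are my own choices for illustration; I picked one character from each value range mentioned in the UTF-8 description, plus one character that is stored as a surrogate pair.

```csharp
using System;
using System.Text;

public static class EncodingSizeSketch {
   // Hypothetical helper: byte counts of s in UTF-8, UTF-16, and UTF-32 (in that order).
   public static Int32[] Sizes(String s) {
      return new Int32[] {
         Encoding.UTF8.GetByteCount(s),
         Encoding.Unicode.GetByteCount(s),
         Encoding.UTF32.GetByteCount(s)
      };
   }

   public static void Main() {
      String[] samples = {
         "A",             // U+0041, below 0x0080
         "\u00E9",        // U+00E9, between 0x0080 and 0x07FF
         "\u65E5",        // U+65E5, 0x0800 or above
         "\uD83D\uDE00"   // U+1F600, stored as a surrogate pair
      };
      foreach (String s in samples) {
         Int32[] n = Sizes(s);
         Console.WriteLine("UTF-8={0}, UTF-16={1}, UTF-32={2}", n[0], n[1], n[2]);
      }

      // ASCII can't represent U+00E9; the original value is lost
      // and a '?' (0x3F) byte is produced in its place.
      Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("\u00E9")));
   }
}
```

This displays the following.

UTF-8=1, UTF-16=2, UTF-32=4
UTF-8=2, UTF-16=2, UTF-32=4
UTF-8=3, UTF-16=2, UTF-32=4
UTF-8=4, UTF-16=4, UTF-32=4
3F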


When you need to encode or decode a set of characters, you should obtain an instance of a class derived from System.Text.Encoding. Encoding is an abstract base class that offers several static readonly properties, each of which returns an instance of an Encoding-derived class.

Here’s an example that encodes and decodes characters by using UTF-8.

 

using System;
using System.Text;

public static class Program {
   public static void Main() {
      // This is the string we're going to encode.
      String s = "Hi there.";

      // Obtain an Encoding-derived object that knows how
      // to encode/decode using UTF8
      Encoding encodingUTF8 = Encoding.UTF8;

      // Encode a string into an array of bytes.
      Byte[] encodedBytes = encodingUTF8.GetBytes(s);

      // Show the encoded byte values.
      Console.WriteLine("Encoded bytes: " +
         BitConverter.ToString(encodedBytes));

      // Decode the byte array back to a string.
      String decodedString = encodingUTF8.GetString(encodedBytes);

      // Show the decoded string.
      Console.WriteLine("Decoded string: " + decodedString);
   }
}

 

This code yields the following output.

 

Encoded bytes: 48-69-20-74-68-65-72-65-2E

Decoded string: Hi there.

 

In addition to the UTF8 static property, the Encoding class also offers the following static properties: Unicode, BigEndianUnicode, UTF32, UTF7, ASCII, and Default. The Default property returns an object that is able to encode/decode using the user’s code page as specified by the Language For Non-Unicode Programs option of the Region/Administrative dialog box in Control Panel. (See the GetACP Win32 function for more information.) However, using the Default property is discouraged because your application’s behavior would be machine-setting dependent, so if you change the system’s default code page or if your application runs on another machine, your application will behave differently.

In addition to these properties, Encoding also offers a static GetEncoding method that allows you to specify a code page (by integer or by string) and returns an object that can encode/decode using the specified code page. You can call GetEncoding, passing "Shift-JIS" or 932, for example.


When you first request an encoding object, the Encoding class’s property or GetEncoding method constructs a single object for the requested encoding and returns this object. If an already-requested encoding object is requested in the future, the Encoding class simply returns the object it previously constructed; it doesn’t construct a new object for each request. This reuse reduces the number of objects in the system and reduces pressure on the garbage-collected heap.
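Here is a small sketch of looking up an encoding by name and by code page number, and of the reuse just described. I use UTF-8 (code page 65001) rather than Shift-JIS only because I can be sure it is available everywhere; on newer .NET runtimes, code page encodings such as 932 may need an encoding provider registered first, so treat that substitution as my assumption. GetEncodingSketch and SamePropertyInstance are hypothetical names.

```csharp
using System;
using System.Text;

public static class GetEncodingSketch {
   // Hypothetical helper: true if two reads of Encoding.UTF8 yield the same object.
   public static Boolean SamePropertyInstance() {
      Encoding first = Encoding.UTF8;
      Encoding second = Encoding.UTF8;
      return Object.ReferenceEquals(first, second);
   }

   public static void Main() {
      // Look up the same encoding by name and by code page number.
      Encoding byName = Encoding.GetEncoding("utf-8");
      Encoding byNumber = Encoding.GetEncoding(65001);

      Console.WriteLine(byName.CodePage);          // Displays "65001"
      Console.WriteLine(byNumber.WebName);         // Displays "utf-8"
      Console.WriteLine(SamePropertyInstance());   // Displays "True"
   }
}
```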

Instead of calling one of Encoding’s static properties or its GetEncoding method, you could also construct an instance of one of the following classes: System.Text.UnicodeEncoding, System.Text.UTF8Encoding, System.Text.UTF32Encoding, System.Text.UTF7Encoding, or System.Text.ASCIIEncoding. However, keep in mind that constructing any of these classes creates new objects in the managed heap, which hurts performance.

Four of these classes, UnicodeEncoding, UTF8Encoding, UTF32Encoding, and UTF7Encoding, offer multiple constructors, providing you with more control over the encoding and preamble. (Preamble is sometimes referred to as a byte order mark or BOM.) The first three aforementioned classes also offer constructors that let you tell the class to throw exceptions when decoding an invalid byte sequence; you should use these constructors when you want your application to be secure and resistant to invalid incoming data.
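To show what the throw-on-invalid constructors buy you, the sketch below decodes the same malformed UTF-8 bytes twice: once with an explicitly constructed UTF8Encoding that throws, and once with the object returned by Encoding.UTF8. The byte values and the names StrictDecodingSketch, StrictDecoderThrows, and LenientDecode are my own, chosen for illustration.

```csharp
using System;
using System.Text;

public static class StrictDecodingSketch {
   // 0xC3 starts a 2-byte sequence, but 0x28 is not a valid continuation byte.
   private static readonly Byte[] s_invalid = { 0xC3, 0x28 };

   // Hypothetical helper: true if the strict decoder rejects the bytes.
   public static Boolean StrictDecoderThrows() {
      // Arguments: encoderShouldEmitUTF8Identifier = false, throwOnInvalidBytes = true
      Encoding strict = new UTF8Encoding(false, true);
      try {
         strict.GetString(s_invalid);
         return false;
      }
      catch (DecoderFallbackException) {
         return true;
      }
   }

   // Hypothetical helper: decodes with Encoding.UTF8, which substitutes
   // U+FFFD for the invalid byte instead of throwing.
   public static String LenientDecode() {
      return Encoding.UTF8.GetString(s_invalid);
   }

   public static void Main() {
      Console.WriteLine(StrictDecoderThrows());                     // Displays "True"
      Console.WriteLine(LenientDecode().IndexOf('\uFFFD') >= 0);    // Displays "True"
   }
}
```

With the lenient decoder, bad input silently turns into replacement characters; with the strict one, your code gets a chance to reject the data.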

You might want to explicitly construct instances of these encoding types when working with a BinaryWriter or a StreamWriter. The ASCIIEncoding class has only a single constructor and therefore doesn’t offer any more control over the encoding. If you need an ASCIIEncoding object, always obtain it by querying Encoding’s ASCII property; this returns a reference to a single ASCIIEncoding object. If you construct ASCIIEncoding objects yourself, you are creating more objects on the heap, which hurts your application’s performance.

After you have an Encoding-derived object, you can convert a string or an array of characters to an array of bytes by calling the GetBytes method. (Several overloads of this method exist.) To convert an array of bytes to an array of characters or a string, call the GetChars method or the more useful GetString method. (Several overloads exist for both of these methods.) The preceding code demonstrated calls to the GetBytes and GetString methods.

All Encoding-derived types offer a GetByteCount method that obtains the number of bytes necessary to encode a set of characters without actually encoding. Although GetByteCount isn’t especially useful, you can use this method to allocate an array of bytes. There’s also a GetCharCount method that returns the number of characters that would be decoded without actually decoding them. These methods are useful if you’re trying to save memory and reuse an array.

The GetByteCount/GetCharCount methods aren’t that fast because they must analyze the array of characters/bytes in order to return an accurate result. If you prefer speed to an exact result, you can call the GetMaxByteCount or GetMaxCharCount method instead. Both methods take an integer specifying the number of bytes or number of characters and return a worst-case value.
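The two sizing strategies can be sketched as follows. This is only an illustration: ByteCountSketch, EncodeExact, and EncodeIntoWorstCase are hypothetical names, and I use UTF-8 and the "Hi there." string from the earlier example as stand-ins for your own data.

```csharp
using System;
using System.Text;

public static class ByteCountSketch {
   // Hypothetical helper: sizes the buffer exactly; GetByteCount scans the characters.
   public static Byte[] EncodeExact(String s) {
      Encoding e = Encoding.UTF8;
      Byte[] buffer = new Byte[e.GetByteCount(s)];
      e.GetBytes(s, 0, s.Length, buffer, 0);
      return buffer;
   }

   // Hypothetical helper: sizes the buffer for the worst case without scanning.
   // Returns the number of bytes actually written into buffer.
   public static Int32 EncodeIntoWorstCase(String s, out Byte[] buffer) {
      Encoding e = Encoding.UTF8;
      buffer = new Byte[e.GetMaxByteCount(s.Length)];
      return e.GetBytes(s, 0, s.Length, buffer, 0);
   }

   public static void Main() {
      String s = "Hi there.";
      Console.WriteLine(EncodeExact(s).Length);          // Displays "9"

      Byte[] big;
      Int32 used = EncodeIntoWorstCase(s, out big);
      Console.WriteLine(used);                           // Displays "9"
      Console.WriteLine(big.Length >= used);             // Displays "True"
   }
}
```

The worst-case buffer is larger than needed, so you must keep track of how many bytes were actually written.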

Each Encoding-derived object offers a set of public read-only properties that you can query to obtain detailed information about the encoding. See the .NET Framework SDK documentation for a description of these properties.


To illustrate most of the properties and their meanings, I wrote the following program that displays the property values for several different encodings.

 

using System;
using System.Text;

public static class Program {
   public static void Main() {
      foreach (EncodingInfo ei in Encoding.GetEncodings()) {
         Encoding e = ei.GetEncoding();
         Console.WriteLine("{1}{0}" +
            "\tCodePage={2}, WindowsCodePage={3}{0}" +
            "\tWebName={4}, HeaderName={5}, BodyName={6}{0}" +
            "\tIsBrowserDisplay={7}, IsBrowserSave={8}{0}" +
            "\tIsMailNewsDisplay={9}, IsMailNewsSave={10}{0}",

            Environment.NewLine,
            e.EncodingName, e.CodePage, e.WindowsCodePage,
            e.WebName, e.HeaderName, e.BodyName,
            e.IsBrowserDisplay, e.IsBrowserSave,
            e.IsMailNewsDisplay, e.IsMailNewsSave);
      }
   }
}

 

Running this program yields the following output (abridged to conserve paper).

 

IBM EBCDIC (US-Canada)
	CodePage=37, WindowsCodePage=1252
	WebName=IBM037, HeaderName=IBM037, BodyName=IBM037
	IsBrowserDisplay=False, IsBrowserSave=False
	IsMailNewsDisplay=False, IsMailNewsSave=False

OEM United States
	CodePage=437, WindowsCodePage=1252
	WebName=IBM437, HeaderName=IBM437, BodyName=IBM437
	IsBrowserDisplay=False, IsBrowserSave=False
	IsMailNewsDisplay=False, IsMailNewsSave=False

IBM EBCDIC (International)
	CodePage=500, WindowsCodePage=1252
	WebName=IBM500, HeaderName=IBM500, BodyName=IBM500
	IsBrowserDisplay=False, IsBrowserSave=False
	IsMailNewsDisplay=False, IsMailNewsSave=False

Arabic (ASMO 708)
	CodePage=708, WindowsCodePage=1256
	WebName=ASMO-708, HeaderName=ASMO-708, BodyName=ASMO-708
	IsBrowserDisplay=True, IsBrowserSave=True
	IsMailNewsDisplay=False, IsMailNewsSave=False

Unicode
	CodePage=1200, WindowsCodePage=1200
	WebName=utf-16, HeaderName=utf-16, BodyName=utf-16
	IsBrowserDisplay=False, IsBrowserSave=True
	IsMailNewsDisplay=False, IsMailNewsSave=False

Unicode (Big-Endian)
	CodePage=1201, WindowsCodePage=1200
	WebName=unicodeFFFE, HeaderName=unicodeFFFE, BodyName=unicodeFFFE
	IsBrowserDisplay=False, IsBrowserSave=False
	IsMailNewsDisplay=False, IsMailNewsSave=False

Western European (DOS)
	CodePage=850, WindowsCodePage=1252
	WebName=ibm850, HeaderName=ibm850, BodyName=ibm850
	IsBrowserDisplay=False, IsBrowserSave=False
	IsMailNewsDisplay=False, IsMailNewsSave=False

Unicode (UTF-8)
	CodePage=65001, WindowsCodePage=1200
	WebName=utf-8, HeaderName=utf-8, BodyName=utf-8
	IsBrowserDisplay=True, IsBrowserSave=True
	IsMailNewsDisplay=True, IsMailNewsSave=True

 

Table 14-3 covers the most commonly used methods offered by all Encoding-derived classes.

 

TABLE 14-3 Methods of the Encoding-Derived Classes

 

Method	Description
GetPreamble	Returns an array of bytes indicating what should be written to a stream before writing any encoded bytes. Frequently, these bytes are referred to as BOM bytes. When you start reading from a stream, the BOM bytes automatically help detect the encoding that was used when the stream was written so that the correct decoder can be used. For some Encoding-derived classes, this method returns an array of 0 bytes (that is, no preamble bytes). A UTF8Encoding object can be explicitly constructed so that this method returns a 3-byte array of 0xEF, 0xBB, 0xBF. A UnicodeEncoding object can be explicitly constructed so that this method returns a 2-byte array of 0xFE, 0xFF for big-endian encoding or a 2-byte array of 0xFF, 0xFE for little-endian encoding. The default is little-endian.
Convert	Converts an array of bytes specified in a source encoding to an array of bytes specified by a destination encoding. Internally, this static method calls the source encoding object’s GetChars method and passes the result to the destination encoding object’s GetBytes method. The resulting byte array is returned to the caller.
Equals	Returns true if two Encoding-derived objects represent the same code page and preamble setting.
GetHashCode	Returns the encoding object’s code page.
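The GetPreamble and Convert rows of the table can be exercised with a short sketch like this one. The names PreambleConvertSketch, Hex, and Utf16ToUtf8 are hypothetical helpers I introduced for the example; the "Hi" string is just sample data.

```csharp
using System;
using System.Text;

public static class PreambleConvertSketch {
   // Hypothetical helper: formats bytes as hexadecimal pairs separated by '-'.
   public static String Hex(Byte[] bytes) {
      return BitConverter.ToString(bytes);
   }

   // Hypothetical helper: re-encodes little-endian UTF-16 bytes as UTF-8 bytes.
   public static Byte[] Utf16ToUtf8(Byte[] utf16Bytes) {
      return Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
   }

   public static void Main() {
      // Preambles of the objects returned by Encoding's static properties.
      Console.WriteLine(Hex(Encoding.UTF8.GetPreamble()));             // Displays "EF-BB-BF"
      Console.WriteLine(Hex(Encoding.Unicode.GetPreamble()));          // Displays "FF-FE"
      Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetPreamble())); // Displays "FE-FF"

      // A UTF8Encoding constructed with 'false' has no preamble.
      Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // Displays "0"

      // Convert bytes from one encoding to another.
      Byte[] utf16 = Encoding.Unicode.GetBytes("Hi");
      Console.WriteLine(Hex(utf16));                // Displays "48-00-69-00"
      Console.WriteLine(Hex(Utf16ToUtf8(utf16)));   // Displays "48-69"
   }
}
```

Note that GetBytes never writes the preamble for you; it is up to the code writing the stream to emit the preamble bytes first if it wants them.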

 


Date: 2016-03-03; view: 785

