Binary/F/Wes

From CommonJS Spec Wiki

The following is a proposal by Wes Garland to revise the handling of transcoding in Binary/F.

Design Notes

  • The basic idea is that the binary API needs:
    • fast conversion between Unicode buffers and Strings, without forcing intermediate object allocation
    • simple line-oriented character set encoding/decoding (iconv charsets)
    • more complicated charset work (non-Unicode byte streams) pushed into the Encodings API
  • Observations:
    • All Unicode encodings can represent 100% of Unicode
    • It is possible to transcode "bad" Unicode from one encoding to another
    • Correctly transcoding between UTF-8, -16 and -32 can be implemented by an average programmer without difficulty or the need for libiconv
    • All Unicode encodings "know" how many code units are required to represent an entire character by looking at only the first code unit. This makes handling truncated sequences possible.
  • The reason I have added the optional Object o to some function signatures is to allow multiple out parameters without incurring new object construction overhead.
  • The way JavaScript Strings are treated in this specification fragment makes it possible and reasonable for implementations that have underlying UTF-8 Strings (like v8) to implement utf8-buffer -> String without any actual conversion.
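
The observation about leading code units can be illustrated with a minimal sketch. The helper names below (utf8SequenceLength, utf16SequenceLength, completePrefixLength) are hypothetical and not part of this proposal, and a plain Uint8Array stands in for a binary buffer.

```javascript
// Hypothetical helpers, not part of the proposal.

// Number of bytes in the UTF-8 sequence that starts with leadByte,
// or 0 if leadByte is a continuation byte or cannot start a sequence.
function utf8SequenceLength(leadByte) {
  if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;
}

// Number of 16-bit units in the UTF-16 sequence that starts with unit.
function utf16SequenceLength(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF ? 2 : 1; // high surrogate starts a pair
}

// Number of leading bytes that form whole UTF-8 sequences; a truncated
// trailing sequence is left out rather than reported as an error.
function completePrefixLength(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const n = utf8SequenceLength(bytes[i]);
    if (n === 0) throw new Error("byte " + i + " cannot start a sequence");
    if (i + n > bytes.length) break; // truncated trailing sequence
    i += n;
  }
  return i;
}
```

For the bytes 0x41 0xE2 0x82 ("A" followed by the first two bytes of a three-byte sequence), completePrefixLength reports 1, so a caller can hold the last two bytes back until more input arrives.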

The Unicode character sets, for the purposes of this specification, are:

  • UTF-8
  • UTF-16
  • UTF-32

UCS-4 will be accepted as an alias for UTF-32. UCS-2 and UTF-7 are not supported by the "Unicode" functions, although they may be recognized as non-special character sets.

In this specification, Strings will be considered to be UTF-16, encoded with the native byte order, with no leading BOM. This is equivalent to the iconv encoding UTF-16BE on big-endian machines and UTF-16LE on little-endian machines. This consideration does not reflect actual implementation detail in the underlying engine, but rather the view offered to script.
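
As an illustration of this view of Strings, the following sketch detects the native byte order and lays out a String's code units as bytes with no BOM. The function names are hypothetical, and typed arrays are used here only as a stand-in for the proposal's buffers.

```javascript
// Hypothetical helpers, not part of the proposal.

// Name of the iconv encoding that matches the platform's native UTF-16.
function nativeUtf16Name() {
  const probe = new Uint16Array([0x0102]);           // stored in native byte order
  const firstByte = new Uint8Array(probe.buffer)[0]; // inspect the lowest address
  return firstByte === 0x01 ? "UTF-16BE" : "UTF-16LE";
}

// Bytes of a String viewed as native-order UTF-16 with no leading BOM.
function stringToNativeUtf16Bytes(string) {
  const units = new Uint16Array(string.length);
  for (let i = 0; i < string.length; i++) {
    units[i] = string.charCodeAt(i); // one 16-bit code unit per String position
  }
  return new Uint8Array(units.buffer);
}
```

On a little-endian machine the String "A" becomes the two bytes 0x41 0x00; on a big-endian machine it becomes 0x00 0x41. In both cases no BOM precedes the data.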

For performance reasons, this specification does not require implementations to perform transcoding validation when converting between Unicode character sets; instead, it is acceptable to transcode invalid code points from one Unicode encoding to another. Implementations are, however, encouraged to provide a method for performing transcoding validation for at least debugging builds of the underlying platform.
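
The trade-off between unvalidated transcoding and an optional validation mode might look like the sketch below. The validate flag and both function names are assumptions made for illustration; the proposal does not define how validation is switched on.

```javascript
// Hypothetical helpers, not part of the proposal.

// True for code points that are valid Unicode scalar values.
function isUnicodeScalarValue(cp) {
  return Number.isInteger(cp) && cp >= 0 && cp <= 0x10FFFF &&
         !(cp >= 0xD800 && cp <= 0xDFFF); // surrogates are not scalar values
}

// Encodes one code point as UTF-8 bytes. Without validation, an invalid
// code point such as a lone surrogate is encoded mechanically.
function encodeCodePointUtf8(cp, validate = false) {
  if (validate && !isUnicodeScalarValue(cp)) {
    throw new Error("invalid code point 0x" + cp.toString(16));
  }
  const n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  const leadPrefix = [0x00, 0xC0, 0xE0, 0xF0][n - 1];
  const bytes = [leadPrefix | (cp >> (6 * (n - 1)))];
  for (let k = n - 2; k >= 0; k--) {
    bytes.push(0x80 | ((cp >> (6 * k)) & 0x3F)); // 6 payload bits per continuation byte
  }
  return bytes;
}
```

With validation off, encodeCodePointUtf8(0xD800) yields the bytes 0xED 0xA0 0x80, which is the kind of "bad" Unicode the text allows to pass through; with validate set to true the same call throws.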

Definitions

Encode
to transform a String to a byte-oriented buffer
Decode
to transform a byte-oriented buffer into a String
Transcode
to transform one byte-oriented buffer into another


Constructor Methods

Object Buffer(String string, String charset, [Number length])
  • creates and returns a new Object having a buffer of length bytes. If length is unspecified, the buffer will be exactly big enough to hold the encoded data
  • string is encoded to the character set identified by charset in the new buffer
  • throws an Error if encoding fails, even if the failure is due to an undersized buffer
  • if buffer is undersized and charset is a Unicode character set, the Error object will be augmented with a "commonjs.binary.encode_buffer_underrun" property indicating how many more bytes would have been required to succeed
  • if buffer is undersized and charset is not a Unicode character set (but is a valid iconv character set), the Error object will be augmented with a "commonjs.binary.encode_buffer_underrun" property having an undefined value
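
The underrun calculation for a Unicode character set can be sketched as follows, assuming UTF-8 as the target. The helpers utf8ByteLength and planBufferLength are hypothetical; only the property name "commonjs.binary.encode_buffer_underrun" comes from the text above.

```javascript
// Hypothetical helpers, not part of the proposal.

// Number of bytes needed to encode string as UTF-8.
function utf8ByteLength(string) {
  let bytes = 0;
  for (let i = 0; i < string.length; i++) {
    const unit = string.charCodeAt(i);
    const next = string.charCodeAt(i + 1); // NaN past the end of the string
    if (unit < 0x80) bytes += 1;
    else if (unit < 0x800) bytes += 2;
    else if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      bytes += 4; // a surrogate pair encodes as one four-byte sequence
      i++;
    } else bytes += 3;
  }
  return bytes;
}

// Returns the buffer length to allocate, or throws an augmented Error
// when the requested length is too small for the encoded data.
function planBufferLength(string, length) {
  const needed = utf8ByteLength(string);
  if (length === undefined) return needed; // exactly big enough
  if (length < needed) {
    const err = new Error("buffer too small to encode string");
    err["commonjs.binary.encode_buffer_underrun"] = needed - length;
    throw err;
  }
  return length;
}
```

A caller that catches the Error can add the underrun value to the length it originally asked for and retry with a buffer that is large enough.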

Static Methods

Object unicodeTranscode({Object buffer, [String charset, [Number offset, [Number length]]]} target, {Object buffer, [String charset, [Number offset, [Number length]]]} source, [Object o])
  • Copies from source's buffer to target's buffer, transcoding from one Unicode character set to another, returning an object o
  • The default values for charset, offset, and length are "UTF-8", 0, and buffer.length for both source and target
  • Transcode behaviour is unspecified if source === target and the ranges [source.offset...source.offset + length] and [target.offset...target.offset + length] overlap
  • Transcode input is source.buffer[offset] through source.buffer[offset + length - 1]; only entire characters are transcoded; a trailing partial character is not an error
  • o.encoded holds the number of bytes from source.buffer which were transcoded by this operation
  • o.used holds the number of bytes written into the target.buffer by this operation
  • Transcoding errors will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in the source buffer which held the leading byte of the unencodable character
  • The contents of o.encoded and o.used are unaffected if this function throws an exception
  • No properties other than o.encoded and o.used will be affected by this method
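
A minimal sketch of the behaviour described for unicodeTranscode, limited to UTF-8 input and UTF-16LE output, is shown below. It is not a reference implementation: the function name is hypothetical, Uint8Array stands in for the buffer objects, the charset fields are ignored, and stopping quietly when the target is full is an assumption, since the text does not say what happens in that case.

```javascript
// Hypothetical sketch, not part of the proposal: UTF-8 to UTF-16LE only.
function transcodeUtf8ToUtf16LE(target, source, o = {}) {
  const { buffer: src, offset: sOff = 0, length: sLen = src.length } = source;
  const { buffer: dst, offset: dOff = 0, length: dLen = dst.length } = target;
  const sEnd = Math.min(sOff + sLen, src.length);
  const dEnd = Math.min(dOff + dLen, dst.length);
  let i = sOff, j = dOff;
  while (i < sEnd) {
    const lead = src[i];
    const n = lead < 0x80 ? 1 : (lead & 0xE0) === 0xC0 ? 2 :
              (lead & 0xF0) === 0xE0 ? 3 : (lead & 0xF8) === 0xF0 ? 4 : 0;
    if (n === 0) {
      const err = new Error("invalid UTF-8 lead byte");
      err["commonjs.binary.transcode_error_offset"] = i;
      throw err; // o has not been touched yet
    }
    if (i + n > sEnd) break; // trailing partial character: not an error
    // Lead byte payload, then 6 bits from each continuation byte.
    let cp = n === 1 ? lead : lead & (0xFF >> (n + 1));
    for (let k = 1; k < n; k++) cp = (cp << 6) | (src[i + k] & 0x3F);
    const units = cp > 0xFFFF
      ? [0xD800 + ((cp - 0x10000) >> 10), 0xDC00 + ((cp - 0x10000) & 0x3FF)]
      : [cp];
    if (j + 2 * units.length > dEnd) break; // assumption: stop when target is full
    for (const u of units) {
      dst[j++] = u & 0xFF; // low byte first (little-endian)
      dst[j++] = u >> 8;
    }
    i += n;
  }
  o.encoded = i - sOff; // bytes consumed from source
  o.used = j - dOff;    // bytes written to target
  return o;
}
```

Passing the same object o on every call avoids allocating a result object per call, which is the reason the design notes give for the optional parameter.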

Instance Methods

String toString([String charset, [Number offset, [Number length]]])
  • decodes this buffer, starting at byte offset for length bytes, as though this buffer was encoded with the character set identified by charset, returning a new String
  • The default values for charset, offset, and length are "UTF-8", 0, and this.length respectively.
  • This routine operates only on whole strings; truncated Unicode characters are to be treated as transcoding errors
  • A transcoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property
  • If charset identifies a Unicode character set, Error["commonjs.binary.transcode_error_offset"] will contain the number of bytes between the first byte to be decoded and the leading byte of the encoded character which could not be decoded. For example, a buffer containing a three-byte UTF-8 sequence with a corrupted third byte would set Error["commonjs.binary.transcode_error_offset"] to 0.
  • If charset does not identify a Unicode character set, Error["commonjs.binary.transcode_error_offset"] will have an undefined value.
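
The strict behaviour of toString for UTF-8 input, including the relative error offset, can be sketched as below. The function strictUtf8ToString is hypothetical and takes the bytes as a Uint8Array argument rather than being a method on a buffer; defaulting length to the remainder of the array is an assumption that differs from the default stated above.

```javascript
// Hypothetical sketch, not part of the proposal.
function strictUtf8ToString(bytes, offset = 0, length = bytes.length - offset) {
  const end = offset + length;
  let out = "";
  let i = offset;
  while (i < end) {
    const lead = bytes[i];
    const n = lead < 0x80 ? 1 : (lead & 0xE0) === 0xC0 ? 2 :
              (lead & 0xF0) === 0xE0 ? 3 : (lead & 0xF8) === 0xF0 ? 4 : 0;
    // A bad lead byte or a truncated sequence is an error in this routine.
    let ok = n > 0 && i + n <= end;
    let cp = n === 1 ? lead : lead & (0xFF >> (n + 1));
    for (let k = 1; ok && k < n; k++) {
      const b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) ok = false; // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (!ok || cp > 0x10FFFF) {
      const err = new Error("cannot decode character");
      // Distance from the first decoded byte to the failing lead byte.
      err["commonjs.binary.transcode_error_offset"] = i - offset;
      throw err;
    }
    out += String.fromCodePoint(cp);
    i += n;
  }
  return out;
}
```

Decoding the bytes 0xE2 0x82 0x41 from offset 0 throws with an offset of 0, matching the corrupted-third-byte example above, because the offset names the leading byte of the failing character rather than the byte that exposed the problem.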
String unicodeToString([String charset, [Number offset, [Number length, [Object o]]]])
  • decodes this buffer, starting at byte offset for length bytes, as though this buffer was encoded with the character set identified by charset, returning a new String
  • The default values for charset, offset, and length are "UTF-8", 0, and this.length respectively.
  • Specifying a non-Unicode character set name in charset will cause this function to throw an Error
  • Only entire characters are decoded - a trailing partial character is not an error
  • o.encoded holds the number of bytes from this buffer which were decoded by this operation
  • A decoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in this buffer which held the leading byte of the undecodable character
  • The content of o.encoded is unaffected if this function throws an exception
  • No properties other than o.encoded will be affected by this method
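
One use of o.encoded is decoding a byte stream that arrives in chunks, where a character may be split across two chunks. The sketch below assumes UTF-8 input held in Uint8Array chunks and uses the standard TextDecoder for the actual decoding; unicodeToStringSketch and decodeChunks are hypothetical names, and TextDecoder's replacement of malformed bytes with U+FFFD is not behaviour this proposal specifies.

```javascript
// Hypothetical sketch, not part of the proposal.

// Decodes only the whole characters in bytes and reports in o.encoded
// how many bytes were consumed.
function unicodeToStringSketch(bytes, o = {}) {
  let end = bytes.length;
  // Walk back over at most three continuation bytes to find the last lead byte.
  let i = end - 1;
  for (let steps = 0; i >= 0 && steps < 3 && (bytes[i] & 0xC0) === 0x80; steps++) i--;
  if (i >= 0) {
    const lead = bytes[i];
    const n = lead < 0x80 ? 1 : (lead & 0xE0) === 0xC0 ? 2 :
              (lead & 0xF0) === 0xE0 ? 3 : (lead & 0xF8) === 0xF0 ? 4 : 0;
    if (n > end - i) end = i; // last sequence is truncated: leave it undecoded
  }
  const text = new TextDecoder("utf-8").decode(bytes.subarray(0, end));
  o.encoded = end;
  return text;
}

// Decodes a sequence of chunks, carrying undecoded bytes into the next round.
function decodeChunks(chunks) {
  const o = {}; // reused for every call, so no per-call result allocation
  let pending = new Uint8Array(0);
  let text = "";
  for (const chunk of chunks) {
    const joined = new Uint8Array(pending.length + chunk.length);
    joined.set(pending);
    joined.set(chunk, pending.length);
    text += unicodeToStringSketch(joined, o);
    pending = joined.subarray(o.encoded);
  }
  return text;
}
```

Because a trailing partial character is not an error, the caller needs no special handling for chunk boundaries beyond keeping the bytes past o.encoded.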
Object unicodeFromString(String string, [String charset, [Number offset, [Number length, [Object o]]]])
  • encodes string into this buffer, starting at byte offset and writing at most length bytes, using the character set identified by charset, returning the object o
  • The default values for charset, offset, length, and o are "UTF-8", 0, this.length, and {} respectively
  • Specifying a non-Unicode character set in charset will cause this function to throw an Error
  • o.encoded holds the number of characters encoded from string
  • o.used holds the number of bytes written into this buffer by this operation
  • Only complete characters will be encoded: a string whose last position contains the first half of a surrogate pair will have o.encoded === string.length - 1. This is not an error condition.
  • A transcoding error will throw an exception, with the Error object augmented with a "commonjs.binary.transcode_error_offset" property containing the position in string which held the unencodable character
  • The contents of o.encoded and o.used are unaffected if this function throws an exception
  • No properties other than o.encoded and o.used will be affected by this method
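
The bookkeeping described for unicodeFromString can be sketched for the UTF-8 case as follows. The function name is hypothetical, the buffer is passed as a Uint8Array argument instead of being this, and stopping quietly when the byte range is full is an assumption. Unpaired surrogates elsewhere in the string are encoded without validation, as the specification permits.

```javascript
// Hypothetical sketch, not part of the proposal: UTF-8 only.
function unicodeFromStringSketch(bytes, string, offset = 0,
                                 length = bytes.length - offset, o = {}) {
  const end = offset + length;
  let i = 0;      // position in string, counted in 16-bit code units
  let j = offset; // position in bytes
  while (i < string.length) {
    const unit = string.charCodeAt(i);
    let cp = unit;
    let units = 1;
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      if (i + 1 === string.length) break; // first half of a pair at the end: not an error
      const low = string.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        units = 2;
      }
    }
    const n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (j + n > end) break; // assumption: stop when the byte range is full
    bytes[j++] = [0x00, 0xC0, 0xE0, 0xF0][n - 1] | (cp >> (6 * (n - 1)));
    for (let k = n - 2; k >= 0; k--) {
      bytes[j++] = 0x80 | ((cp >> (6 * k)) & 0x3F);
    }
    i += units;
  }
  o.encoded = i;       // String positions consumed
  o.used = j - offset; // bytes written
  return o;
}
```

When the string ends with the first half of a surrogate pair, the caller can prepend string.slice(o.encoded) to the next piece of text so that the pair is encoded once its second half arrives.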