Binary/E

From CommonJS Spec Wiki

STATUS: PROPOSAL

This proposal, based on Binary/D, defines types and methods for constructing, manipulating, and transcoding binary data in both mutable and immutable forms. ByteString resembles its immutable text analog, String, and ByteArray resembles its mutable analog, Array. Both types constrain their data to the byte domain and provide a memory-safe interface to underlying contiguous, random-access byte collections.

An overview of the supported types, methods, signatures, and corresponding return types is on Google Docs.

Specification

The "binary" top-level module must export ByteArray, ByteString, and Range.

Binary
a non-constructable base type for ByteString and ByteArray
ByteString
immutable, byte quantized
ByteArray
mutable, explicitly resizable, byte quantized
Range
a sub-range wrapper object

Binary

A base type for ByteString and ByteArray. Calling and constructing Binary must throw a TypeError.

ByteString

A ByteString is an immutable, fixed-width representation of a C unsigned char (byte) array that does not implicitly terminate at the first 0-valued byte.

ByteString instances are comparable with the == and === operators based on equal order and respective values of their content.

Constructor

A ByteString function must exist. Both calling and constructing must return a ByteString Object.

ByteString() instanceof ByteString;
new ByteString() instanceof ByteString;
ByteString() instanceof Binary;
new ByteString() instanceof Binary;
typeof ByteString() === "object";
typeof new ByteString() === "object";
ByteString()
Construct an empty byte string.
ByteString(byteString)
Copies byteString.
ByteString(byteArray)
Use the contents of byteArray. For performance, transparently to the layer specified here, the "ByteArray" may relinquish ownership of its underlying byte buffer and switch to a copy-on-write mode.
ByteString(arrayOfNumbers)
Use the numbers in arrayOfNumbers as the bytes. Coerces Numbers to bytes by selecting their least significant eight bits.
ByteString(string, charset)
Convert a string. The ByteString will contain string encoded with charset.
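
The coercion rule used by the constructors above ("selecting their least significant eight bits") can be illustrated with a bitwise mask. This is a minimal sketch; the helper names toByte and toBytes are hypothetical and not part of this proposal.

```javascript
// Hypothetical helper: coerce a Number to a byte by keeping only its
// least significant eight bits.
function toByte(n) {
    return n & 0xFF;
}

// Hypothetical helper: coerce every element of an Array of Numbers.
function toBytes(arrayOfNumbers) {
    return arrayOfNumbers.map(toByte);
}

console.log(toBytes([0, 255, 256, 257, -1])); // [ 0, 255, 0, 1, 255 ]
```

Values outside 0..255 wrap rather than throw, so 256 becomes 0 and -1 becomes 255.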

Prototype Properties

toByteArray() ByteArray
Returns a byte for byte copy in a ByteArray.
Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
toByteArray(sourceCharset, targetCharset) ByteArray
Returns this transcoded from the source charset to the target charset as a ByteArray. The source and target charset are internally cast to String with the String constructor.
toByteString() ByteString
Returns this ByteString.
toByteString(sourceCharset, targetCharset) ByteString
Returns this transcoded from the source charset to the target charset as a ByteString. The source and target charset are internally cast to String with the String constructor.
toString() String
Returns a debug representation like "[ByteString 10]", where 10 is the length of this ByteString.
toString(charset) String
Decodes this ByteString according to the given character set and returns the String built from the corresponding Unicode code points. May throw a ValueError if the byte string is malformed or if any of the code points are out of the implementation's supported range.
toArray() Array
Returns an Array containing the bytes as Numbers.
toSource() String
Returns "require("binary").ByteString([])" for an empty byte string or "require("binary").ByteString([0, 1, 2])" for a byte string of length 3 with bytes 0, 1, and 2.
valueAt(offset_opt) Number
Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
indexOf(values, start_opt, stop_opt) Number
Returns the index of the first occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
lastIndexOf(values, start_opt, stop_opt) Number
Returns the index of the last occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
range(start_opt, stop_opt) Range
Returns a Range object. The value of stop defaults to the length of this. The value of start defaults to 0. Uses this for the Range's ref property.
copy(target, targetStart, start, stop) ByteString
Copies the Number values of each byte from this between start and stop to a target ByteArray or Array at the targetStart offset. If undefined or omitted, stop is presumed to be the length of this. targetStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the target is not a ByteArray or Array. Returns this.
slice(start_opt, stop_opt) ByteString
See Array.prototype.slice
substr(start_opt, length_opt) ByteString
See String.prototype.substr
substring(start_opt, last_opt) ByteString
See String.prototype.substring
concat(...) ByteString
Returns a ByteString composed of itself concatenated with the given ByteString, ByteArray, and Array values. Throws a TypeError if any of the arguments are not a ByteString, ByteArray, or Array of Numbers. Coerces Numbers to bytes by selecting their least significant eight bits.
join(array) ByteString
Returns a ByteString of the ByteStrings, ByteArrays, Arrays of Numbers, or a combination thereof, delimited by this. Arrays of Numbers are converted to ByteArrays. Coerces Numbers to bytes by selecting their least significant eight bits.
split(delimiter_opt, max_opt) Array
Returns an Array of ByteStrings pared from the ByteString between strings of ByteStrings that match the given delimiter for up to the maximum number of delimiters, starting from the left. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteString. The delimiter is internally coerced to a ByteString with the ByteString constructor.
splitRight(delimiter_opt, max_opt) Array
Returns an Array of ByteStrings pared from the ByteString between strings of ByteStrings that match the given delimiter for up to the maximum number of delimiters, starting from the right. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteString. The delimiter is internally coerced to a ByteString with the ByteString constructor.
Content
an alias of Number, indicating that Number is the type of what [[Get]] returns.
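
The search behaviour described for indexOf can be sketched over plain Arrays of Numbers, which stand in here for ByteString or ByteArray contents. The function name indexOfBytes is hypothetical, and the sketch assumes that a match must lie entirely within the half-open interval from start to stop; the proposal does not spell this out.

```javascript
// Hypothetical sketch of indexOf over byte sequences.
// Assumption: a match must fit entirely within [start, stop).
function indexOfBytes(haystack, needle, start, stop) {
    if (typeof needle === "number") needle = [needle]; // a single byte
    if (start === undefined) start = 0;
    if (stop === undefined) stop = haystack.length;
    for (var i = start; i + needle.length <= stop; i++) {
        var matched = true;
        for (var j = 0; j < needle.length; j++) {
            if (haystack[i + j] !== needle[j]) {
                matched = false;
                break;
            }
        }
        if (matched) return i;
    }
    return -1; // no match found
}

console.log(indexOfBytes([1, 2, 3, 2, 3], [2, 3]));    // 1
console.log(indexOfBytes([1, 2, 3, 2, 3], [2, 3], 2)); // 3
console.log(indexOfBytes([1, 2, 3, 2, 3], 3, 0, 2));   // -1
```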

Internal Properties

[[Get]] Number
Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
[[Put]]
throws a TypeError.

Instance Properties

length
The length in bytes. Not [[Writable]], not [[Configurable]], not [[Enumerable]].
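
The attributes required of length map directly onto an ECMAScript 5 property descriptor. The following is only a sketch: FakeByteString is a hypothetical stand-in that models nothing but the length property.

```javascript
// Hypothetical stand-in for a ByteString; only the length attributes
// are modelled here.
function FakeByteString(bytes) {
    Object.defineProperty(this, "length", {
        value: bytes.length,
        writable: false,     // not [[Writable]]
        configurable: false, // not [[Configurable]]
        enumerable: false    // not [[Enumerable]]
    });
}

var sample = new FakeByteString([1, 2, 3]);
var descriptor = Object.getOwnPropertyDescriptor(sample, "length");
console.log(sample.length, descriptor.writable, Object.keys(sample).length); // 3 false 0
```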

ByteArray

A ByteArray is a mutable, flexible (explicitly growable and shrinkable) representation of a C unsigned char (byte) array.

Constructor

A ByteArray function must exist. Calling and constructing a ByteArray both return a new ByteArray instance.

ByteArray() instanceof ByteArray;
new ByteArray() instanceof ByteArray;
ByteArray() instanceof Binary;
new ByteArray() instanceof Binary;
typeof ByteArray() === "object";
typeof new ByteArray() === "object";
ByteArray()
New, empty ByteArray.
ByteArray(length Number, fill_opt Number)
New ByteArray of the given length, with the given fill number at all offsets. The default filler is 0.
ByteArray(byteArray)
Copy byteArray.
ByteArray(byteString)
Copy contents of byteString.
ByteArray(arrayOfBytes)
Use numbers in arrayOfBytes as contents. Coerces Numbers to bytes by selecting their least significant eight bits.
ByteArray(string, charset)
Create a "ByteArray" from a "String", the result being encoded with charset.

Prototype Properties

toByteArray() ByteArray
Returns a byte for byte copy in a ByteArray.
Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
toByteArray(sourceCharset, targetCharset) ByteArray
Returns this transcoded from the source charset to the target charset as a ByteArray. The source and target charset are internally cast to String with the String constructor.
toByteString() ByteString
Returns a byte for byte copy in a ByteString.
Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
toByteString(sourceCharset, targetCharset) ByteString
Returns this transcoded from the source charset to the target charset as a ByteString. The source and target charset are internally cast to String with the String constructor.
toString() String
Returns a debug representation like "[ByteArray 10]", where 10 is the length of this ByteArray.
toString(charset) String
Decodes this according to the given character set and returns the String built from the corresponding Unicode code points. May throw a ValueError if the byte array is malformed or if any of the code points are out of the implementation's supported range.
toArray() Array
Returns an Array containing the bytes as Numbers.
toSource() String
Returns "require("binary").ByteArray([])" for an empty byte array or "require("binary").ByteArray([0, 1, 2])" for a byte array of length 3 with bytes 0, 1, and 2.
valueAt(offset_opt) Number
Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
byteStringAt(offset_opt) ByteString
Returns a unary (length 1) ByteString of the byte at the given offset. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
indexOf(values, start_opt, stop_opt) Number
Returns the index of the first occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
lastIndexOf(values, start_opt, stop_opt) Number
Returns the index of the last occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
range(start_opt, stop_opt) Range
Returns a Range object. The value of stop defaults to the length of this. The value of start defaults to 0. Uses this for the Range's ref property.
copy(target, targetStart, start, stop) ByteArray
Copies the Number values of each byte from this between start and stop to a target ByteArray or Array at the targetStart offset. If undefined or omitted, stop is presumed to be the length of this. targetStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the target is not a ByteArray or Array. Returns this.
copyFrom(source, sourceStart, start, stop) ByteArray
Copies the Number values of each byte from a given ByteArray or ByteString into this from start to stop, reading from the given sourceStart offset. If undefined or omitted, stop is presumed to be the length of this. sourceStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the source is not a ByteArray or ByteString. Returns this.
fill(value, start_opt, stop_opt) ByteArray
Fills each of the contained bytes with the given value (either a unary ByteString, ByteArray, or a Number) from the start offset to the stop offset. start and stop must be numbers, undefined, or omitted. If omitted, start is presumed to be 0. If omitted, stop is presumed to be the length of this ByteArray. If omitted, "value" is presumed to be 0. Returns this.
splice(start_opt, stop_opt, ...values) ByteArray
See Array.prototype.splice
slice(start_opt, stop_opt) ByteArray
See Array.prototype.slice
split(delimiter_opt, max_opt) Array
Returns an Array of ByteArrays pared from the ByteArray between strings of ByteArrays that match the given delimiter for up to the maximum number of delimiters, starting from the left. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteArray. The delimiter is internally coerced to a ByteArray with the ByteArray constructor.
splitRight(delimiter_opt, max_opt) Array
Returns an Array of ByteArrays pared from the ByteArray between strings of ByteArrays that match the given delimiter for up to the maximum number of delimiters, starting from the right. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteArray. The delimiter is internally coerced to a ByteArray with the ByteArray constructor.
forEach(callback[, thisObject])
See Array.prototype.forEach
every(callback[, thisObject])
See Array.prototype.every
some(callback[, thisObject])
See Array.prototype.some
map(callback[, thisObject])
See Array.prototype.map
Content
an alias of Number, indicating that Number is the type of what [[Get]] returns.
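
The argument order of copy(target, targetStart, start, stop) is easy to misread, so here is a rough sketch with a Uint8Array standing in for ByteArray. The function copyBytes is hypothetical, its first parameter plays the role of this, and treating a missing offset as 0 is an assumption of the sketch. Bounds checking is omitted.

```javascript
// Hypothetical sketch of copy(target, targetStart, start, stop).
// `source` plays the role of `this`; Uint8Array stands in for ByteArray.
function copyBytes(source, target, targetStart, start, stop) {
    if (!(target instanceof Uint8Array) && !Array.isArray(target)) {
        throw new TypeError("target must be a byte array or an Array");
    }
    // Assumption: offsets that coerce to NaN are treated as 0.
    targetStart = Number(targetStart) || 0;
    start = Number(start) || 0;
    stop = stop === undefined ? source.length : Number(stop);
    for (var i = start; i < stop; i++) {
        target[targetStart + (i - start)] = source[i];
    }
    return source; // copy returns this
}

var source = Uint8Array.of(10, 20, 30, 40);
var target = new Uint8Array(6);
copyBytes(source, target, 2, 1, 3);
console.log(Array.from(target)); // [ 0, 0, 20, 30, 0, 0 ]
```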

Internal Properties

[[Get]] Number
Returns the Number value of the byte at the given offset.
[[Put]] Number
Sets the byte value at the given offset. Throws a ValueError if the index is beyond the current bounds of the byte array (byte arrays can be grown or shrunk by explicitly assigning a new length). Implicitly coerces the value to a Number and masks it with 0xFF.

Instance Properties

length
The length in bytes. Not [[Configurable]], not [[Enumerable]]. Assigning to the length of a ByteArray causes the array to reallocate its underlying buffer to the given size, copying the original buffer up to the length of the new buffer if it is less, and filling all bytes beyond the length of the original buffer with the value 0.
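
The [[Put]] and length-assignment rules can be modelled as below. This is a sketch under stated assumptions: MiniByteArray and its put and resize methods are hypothetical names, a Uint8Array serves as the underlying buffer, and RangeError stands in for the ValueError named above, which is not a built-in JavaScript error type.

```javascript
// Hypothetical model of a ByteArray's [[Put]] and length assignment.
function MiniByteArray(length) {
    this.buffer = new Uint8Array(length);
}

// [[Put]]: reject out-of-bounds offsets, then mask the value to a byte.
MiniByteArray.prototype.put = function (offset, value) {
    if (offset < 0 || offset >= this.buffer.length) {
        throw new RangeError("offset out of bounds"); // stand-in for ValueError
    }
    this.buffer[offset] = Number(value) & 0xFF;
};

// Length assignment: reallocate, copy what fits, leave the rest as 0.
MiniByteArray.prototype.resize = function (newLength) {
    var next = new Uint8Array(newLength); // zero-filled on allocation
    next.set(this.buffer.subarray(0, Math.min(newLength, this.buffer.length)));
    this.buffer = next;
};

var bytes = new MiniByteArray(2);
bytes.put(0, 0x1FF);
bytes.resize(4);
console.log(Array.from(bytes.buffer)); // [ 255, 0, 0, 0 ]
```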

Range

A range object is a convenience for copying ranges from one ByteArray or ByteString to another ByteArray.

Constructor

new Range({ref, start, stop})
returns a Range object with the respective properties.

Prototype Properties

copy(target, start_opt, stop_opt)
generically calls this.ref.copy(target, this.start, Math.min(this.stop, this.start + target.length - start), start). The default value of stop is the target length.
copyFrom(source, start_opt, stop_opt)
generically calls this.ref.copyFrom(source, this.start, Math.min(this.stop, this.start + source.length - start), start). The default value of stop is the source length.
fill(value)
generically calls this.ref.fill(value, this.start, this.stop)
splice(start_opt, stop_opt, ...values)
generically calls this.ref.splice.apply(this.ref, arguments)
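
As an illustration of how a Range delegates to its ref, the fill method can be sketched as follows. SketchRange is a hypothetical name, and a Uint8Array stands in for the ByteArray held in ref, since its fill(value, start, end) takes arguments in the same order.

```javascript
// Hypothetical Range sketch; only fill is modelled.
function SketchRange(options) {
    this.ref = options.ref;
    this.start = options.start;
    this.stop = options.stop;
}

// Delegates generically to whatever object is stored in ref.
SketchRange.prototype.fill = function (value) {
    return this.ref.fill(value, this.start, this.stop);
};

var buffer = new Uint8Array(5);
new SketchRange({ref: buffer, start: 1, stop: 4}).fill(7);
console.log(Array.from(buffer)); // [ 0, 7, 7, 7, 0 ]
```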

General Requirements

All of the properties specified on prototypes are not [[Enumerable]] on instances.

Any operation that requires encoding, decoding, or transcoding among charsets may throw an error if that charset is not supported by the implementation. All implementations MUST support "us-ascii", "utf-8", and "utf-16".

Charset strings are as defined by IANA http://www.iana.org/assignments/character-sets.

Charsets are case insensitive.

Endianness arguments are case insensitive.
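
Case-insensitive charset handling might be implemented by normalizing names before lookup. This is a sketch only: normalizeCharset and REQUIRED_CHARSETS are hypothetical names, the list holds just the three charsets that every implementation must support, and a plain Error stands in for whatever error an implementation chooses to throw.

```javascript
// The three charsets every implementation must support.
var REQUIRED_CHARSETS = ["us-ascii", "utf-8", "utf-16"];

// Hypothetical helper: cast to String, fold case, then check support.
function normalizeCharset(charset) {
    var name = String(charset).toLowerCase();
    if (REQUIRED_CHARSETS.indexOf(name) === -1) {
        throw new Error("unsupported charset: " + charset);
    }
    return name;
}

console.log(normalizeCharset("UTF-8")); // utf-8
```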

Rationale

The idea of using the particular ByteString and ByteArray types and their respective names originated with Jason Orendorff in the Binary API Brouhaha discussion.

Structures

This proposal is not an object oriented variation on Perl, PHP, Python, or Ruby's pack, unpack, and calcsize with notions of inherent endianness, read/write head position, or intrinsic codec or charset information. The objects described in this proposal are merely for the storage and direct manipulation of strings and arrays of byte data. Some object oriented conveniences are made, but implementing pack, unpack, or an object-oriented analog thereof is left as an exercise for a future proposal of a more abstract type or a 'struct' module (as mentioned by Ionut Gabriel Stan on the list). This goes against most mentioned prior art.

Encoding Methods

This proposal also does not provide separate member functions for any particular subset of the possible charsets, encodings, compression algorithms, or consistent hash digests that might operate on a byte string or array; for example, "toBase64", "toMd5", and "toUtf8" are not specified. Instead, "toString" accepts IANA charset names and radix numbers for charsets and encodings. The intent is that implementations will opt to make this extensible by falling back to an "encodings" module and searching for modules by the same name and calling "encode" or "decode" exports on those modules if they exist. (As supported originally by Robert Schultz, Davey Waterson, Ross Boucher, and tacitly myself, Kris Kowal, on the First proposition thread on the mailing list). This proposal does not address the need for stream objects to support pipelined codecs and hash digests (mentioned by Tom Robinson and Robert Schultz in the same conversation).

[[Get]] and [[Put]]

This proposal also reflects both group sentiment and a pragmatic point about properties. This isn't a decree that properties like "length" should be consistently used throughout the CommonJS APIs. However, given that all platforms support properties at the native level (to host String and Array objects) and that byte strings and arrays will require support at the native level, pursuing client-side interoperability is beyond the scope of this proposal and therefore properties have been specified. (See comments by Kris Zyp about the implementability of properties in all platforms, comments by Davey Waterson from Aptana about the counter-productivity of attempting to support this API in browsers, and support for properties over accessor and mutator functions by Ionut Gabriel Stan and Cameron McCormack on the mailing list).

Future Proofing

This proposal suggests that the specified data types should be exported by a "binary" module. The intent is that eventually ECMAScript will specify native types to replace these, and these new types would be hosted as "primordials", free variables available in all top-level scopes. Because these types are specified to be exported by the "binary" module, there is some ambiguity of whether "instanceof" relationships would be maintained when references are passed among global contexts. It would be valuable for these identities to be preserved. For module systems that share a common global scope, this would suggest that the binary types should be patched into some commonly shared object, like the global scope, only if they do not yet exist. That would permit the first, permissive context to construct the types, and all subsequent sandboxes to share them. Secure sandboxes would have to make other accommodations.
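
The "patch only if absent" idea could look roughly like this. The helper installOnce and the object named shared are hypothetical; shared stands in for whatever commonly shared object a module system exposes.

```javascript
// Hypothetical helper: publish a type on a shared scope object only if
// no earlier context has done so, so that all contexts see one identity.
function installOnce(scope, name, factory) {
    if (!(name in scope)) {
        scope[name] = factory();
    }
    return scope[name];
}

var shared = {}; // stands in for a commonly shared object
var first = installOnce(shared, "ByteString", function () {
    return function ByteString() {};
});
var second = installOnce(shared, "ByteString", function () {
    return function ByteString() {};
});
console.log(first === second, new first() instanceof second); // true true
```

Because the second call reuses the constructor installed by the first, instanceof checks hold across both callers.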

Memory Optimization

In contrast to Maciej Stachowiak's proposal, there are no methods for explicitly managing the mutability of the underlying buffer. Since it is possible for implementations to explicitly use ownership and copy-on-write flags, it is desirable to keep the memory management beneath the surface of JavaScript. For example, with freezing semantics as proposed by Maciej, freezing a mutable byte array could produce an immutable byte string that usurps ownership of the original byte array's buffer, so there would be no need for allocation. However, this would render the original byte array unusable, and all persisting references to that array would be broken and we would have to define failure modes for that object. Alternately, the byte array and byte string could share the buffer. The byte string would be guaranteed to not modify the content of the buffer, and if the owner of the original byte array were to perform a modifying operation on the byte array, the implementation could at that time incur the cost of copying the buffer. Similar optimizations could be applied by all ByteString and ByteArray constructors and conversion functions.
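
The buffer-sharing alternative can be sketched as a copy-on-write flag. Everything here is an assumption made for illustration: SharedBytes, toImmutable, and put are hypothetical names, and a Uint8Array plays the underlying buffer.

```javascript
// Hypothetical copy-on-write sketch. An immutable view and a mutable
// owner share one Uint8Array until the owner is first written to.
function SharedBytes(buffer) {
    this.buffer = buffer;
    this.shared = false;
}

// Produce a read-only view without allocating; mark the buffer as shared.
SharedBytes.prototype.toImmutable = function () {
    this.shared = true;
    var buffer = this.buffer;
    return {
        valueAt: function (offset) { return buffer[offset]; },
        length: buffer.length
    };
};

// A write copies the buffer first if a read-only view may still use it.
SharedBytes.prototype.put = function (offset, value) {
    if (this.shared) {
        this.buffer = this.buffer.slice(); // the copy cost is paid only now
        this.shared = false;
    }
    this.buffer[offset] = value & 0xFF;
};

var array = new SharedBytes(Uint8Array.of(1, 2, 3));
var string = array.toImmutable();
array.put(0, 9);
console.log(string.valueAt(0), array.buffer[0]); // 1 9
```

If the owner is never modified after conversion, no copy is ever made.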

Genericity

In accordance with Daniel Friesen's Binary/C, a high priority in this proposal was duck typing.

a generic way to get a value in the Content type of a ByteArray, ByteString, or String
[[Get]]
a generic way to get the type of what is returned by [[Get]] for ByteArray, ByteString, and String (Number, ByteString, and String respectively)
Content
a generic way to get a ByteString for the value at an offset for either ByteArray or ByteString
byteStringAt(offset) ByteString
a generic way to get a ByteArray for the value at an offset for either ByteArray or ByteString
would have been wasteful to include since byte arrays are meant for assembling large byte buffers through copying data from other binary collections, not for allocation churn.
a generic way to get a Number for the value at an offset for any ByteArray, ByteString, Array, or String
valueAt(offset) Number
a generic way to get an object of length one in the same type that contains just the value at a given offset for any ByteArray, ByteString, Array, or String
slice(offset, offset + 1)

The following methods and properties are interoperable among ByteArray, ByteString, Array, and String:

  • [[Get]]
  • indexOf
  • lastIndexOf
  • slice

The following methods and properties are interoperable among ByteArray, ByteString, and String:

  • [[Get]]
  • Content
  • valueAt
  • join
  • split
  • splitRight
  • indexOf
  • lastIndexOf
  • slice

The following methods and properties are interoperable between ByteArray and ByteString:

  • [[Get]]
  • Content
  • valueAt
  • byteStringAt
  • indexOf
  • lastIndexOf
  • slice
  • copy
  • range

The following methods and properties are interoperable between ByteString and String:

  • [[Get]]
  • Content
  • valueAt
  • indexOf
  • lastIndexOf
  • slice
  • substr
  • substring

The following methods are interoperable between ByteArray and Array:

  • [[Get]]
  • indexOf
  • lastIndexOf
  • slice
  • forEach
  • every
  • some
  • map

Idempotence

Binary/B had a "decodeToString" method and "toString" was required to be distinct, to avoid decoding and encoding hazards. However, in keeping with the existing Number.prototype.toString(radix), for subjective aesthetic reasons having worked with prototypes of both, and because it is easier to remember which way encoding and decoding go with "toString", "toByteString", and "toByteArray" than with "encode" and "decode", this proposal eschews "decodeToString". This has the side effect of making the generic "toByteString" and "toString" methods idempotent for converting to and from a charset. For example, if you receive an object that may be a ByteString or a String, but if it is a String it will need to be converted to a ByteString with a given charset, you can simply call "toByteString" with that charset, even repeatedly. Likewise, if you receive a ByteString or a String and you need it to be a String, you can simply call "toString" with the desired charset, allowing a succession of adapters or decorators to make the conversion at any point along the way.
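
The idempotence described above can be demonstrated with free functions. This is a sketch under assumptions: toByteString and toStringWith are hypothetical stand-ins for the generic methods, Uint8Array stands in for ByteString, and only "utf-8" is handled, through the standard TextEncoder and TextDecoder.

```javascript
// Hypothetical generic conversion: bytes pass through unchanged,
// Strings are encoded. Only utf-8 is supported in this sketch.
function toByteString(value, charset) {
    if (value instanceof Uint8Array) return value; // already bytes
    if (String(charset).toLowerCase() !== "utf-8") {
        throw new Error("this sketch only supports utf-8");
    }
    return new TextEncoder().encode(value);
}

// Hypothetical generic conversion: Strings pass through unchanged,
// bytes are decoded.
function toStringWith(value, charset) {
    if (typeof value === "string") return value; // already text
    if (String(charset).toLowerCase() !== "utf-8") {
        throw new Error("this sketch only supports utf-8");
    }
    return new TextDecoder("utf-8").decode(value);
}

var once = toByteString("h\u00e9llo", "utf-8");
var twice = toByteString(once, "utf-8"); // repeated call is harmless
console.log(once === twice, once.length); // true 6
```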

Miscellaneous

"ByteString" does not implement "toUpperCase" or "toLowerCase" since they are not meaningful without the context of a charset.

Unlike the "Array", the "ByteArray" is not variadic so that its initial length constructor is not ambiguous with its copy constructor.

The Binary/B proposal, at Ash Berlin's recommendation, had split methods on both ByteStrings and ByteArrays that accepted as their optional second argument an object of options specifying not only the number of delimiters to match, but also whether to include the delimiter on the right side of each term, presumably including a terminal delimiter that would obviate an empty collection from being returned as the last value. This has been left as an exercise for a byte reader stream's "readLine(delimiter)" method, so that this proposal's split and splitRight methods may more closely resemble their cousins on existing Strings in JavaScript and other languages.

The "join" methods on "ByteArray" and "ByteString" differ from what you would expect in JavaScript based on Array joins, in a way that will be familiar and probably upsetting coming from Python. The delimiter is the left side of the expression. This is because the Array joining method cannot be practically extended to determine whether to return a String, ByteString, or ByteArray based on the types of all of the values it contains. It is far more practical to multiplex the return type based on the type of the left hand side of the expression.
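
A sketch of the delimiter-on-the-left convention, with plain Arrays of Numbers standing in for the binary types. The function joinBytes is hypothetical; in the proposal the delimiter would be this and the return type would follow the delimiter's type.

```javascript
// Hypothetical sketch of join with the delimiter as the left-hand side.
function joinBytes(delimiter, parts) {
    var result = [];
    parts.forEach(function (part, index) {
        if (index > 0) result = result.concat(delimiter);
        // Coerce Numbers to bytes by keeping their low eight bits.
        result = result.concat(part.map(function (n) { return n & 0xFF; }));
    });
    return result;
}

console.log(joinBytes([0], [[1, 2], [3], [4, 5]]).join(",")); // 1,2,0,3,0,4,5
```

This mirrors Python's delimiter.join(parts) rather than JavaScript's parts.join(delimiter).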

The "Content" property begins with a capital letter to distinguish it as a factory method like the constructor function to which it always refers, whether ByteString, ByteArray, or Number. To date, this remains a point of discussion. Daniel Friesen's proposal Binary/C uses the name "contentConstructor".

Errata

This proposal is a strict subset of Binary/D with a few exceptions: the Content type and return value of [[Get]] for ByteString has been changed to Number; ByteString and ByteArray are object types instead of anticipating future-compatibility with "bytestring" and "bytearray" types; and the Binary type has been reintroduced. Bit types, radix encoding (base8...base64), and consistency shims for existing primordial types have been removed and are possible candidates for extensions to future revisions of this specification.

Relevant Discussions