Binary/E
STATUS: PROPOSAL
This proposal, based on Binary/D, defines types and methods for constructing, manipulating, and transcoding binary data in both mutable and immutable forms. ByteString resembles its immutable text analog, String, and ByteArray resembles its mutable analog, Array. Both types constrain their data to the byte domain and provide a memory-safe interface to underlying contiguous, random-access byte collections.
An overview of the supported types, methods, signatures, and corresponding return types is on Google Docs.
Specification
The "binary" top-level module must export ByteArray, ByteString, and Range.
- Binary
- a non-constructable base type for ByteString and ByteArray
- ByteString
- immutable, byte quantized
- ByteArray
- mutable, explicitly resizable, byte quantized
- Range
- a sub-range wrapper object
Binary
A base type for ByteString and ByteArray. Calling and constructing Binary must throw a TypeError.
ByteString
A ByteString is an immutable, fixed-width representation of a C unsigned char (byte) array that does not implicitly terminate at the first 0-valued byte.
ByteString instances are comparable with the == and === operators based on equal order and respective values of their content.
Constructor
A ByteString function must exist. Both calling and constructing must return a ByteString Object.
ByteString() instanceof ByteString;
new ByteString() instanceof ByteString;
ByteString() instanceof Binary;
new ByteString() instanceof Binary;
typeof ByteString() === "object";
typeof new ByteString() === "object";
- ByteString()
- Construct an empty byte string.
- ByteString(byteString)
- Copies byteString.
- ByteString(byteArray)
- Use the contents of byteArray. For performance, transparently to the layer specified here, the "ByteArray" may relinquish ownership of its underlying byte buffer and switch to a copy-on-write mode.
- ByteString(arrayOfNumbers)
- Use the numbers in arrayOfNumbers as the bytes. Coerces Numbers to bytes by selecting their least significant eight bits.
- ByteString(string, charset)
- Convert a string. The ByteString will contain string encoded with charset.
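The byte coercion rule used by the array constructors can be illustrated in plain JavaScript. This is a minimal sketch, not part of the proposal: toByte and bytesFromNumbers are hypothetical helper names, and a plain Array stands in for the ByteString contents.

```javascript
// Hypothetical helper, not part of the proposal: coerce a Number to a byte
// by keeping only its least significant eight bits.
function toByte(n) {
    return Number(n) & 0xFF;
}

// Hypothetical stand-in for the bytes that ByteString(arrayOfNumbers)
// would hold, represented here as a plain Array of Numbers.
function bytesFromNumbers(arrayOfNumbers) {
    return arrayOfNumbers.map(function (n) {
        return toByte(n);
    });
}

console.log(bytesFromNumbers([0, 255, 256, 257, -1])); // [ 0, 255, 0, 1, 255 ]
```

Values outside 0..255 wrap rather than throw, which is why 256 becomes 0 and -1 becomes 255.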
Prototype Properties
- toByteArray() ByteArray
- Returns a byte for byte copy in a ByteArray.
- Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
- toByteArray(sourceCharset, targetCharset) ByteArray
- Returns this transcoded from the source charset to the target charset as a ByteArray. The source and target charset are internally cast to String with the String constructor.
- toByteString() ByteString
- Returns this ByteString.
- toByteString(sourceCharset, targetCharset) ByteString
- Returns this transcoded from the source charset to the target charset as a ByteString. The source and target charset are internally cast to String with the String constructor.
- toString() String
- Returns a debug representation like "[ByteString 10]", where 10 is the length of this ByteString.
- toString(charset) String
- Decodes this ByteString according to the given character set and returns the String of the corresponding Unicode code points. May throw a ValueError if the byte string is malformed or if any of the code points are out of the implementation's supported range.
- toArray() Array
- Returns an Array containing the bytes as Numbers.
- toSource() String
- Returns "require("binary").ByteString([])" for an empty byte string or "require("binary").ByteString([0, 1, 2])" for a byte string of length 3 with bytes 0, 1, and 2.
- valueAt(offset_opt) Number
- Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
- indexOf(values, start_opt, stop_opt) Number
- Returns the index of the first occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
- lastIndexOf(values, start_opt, stop_opt) Number
- Returns the index of the last occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
- range(start_opt, stop_opt) Range
- Returns a Range object. The value of stop defaults to the length of this. The value of start defaults to 0. Uses this for the Range's ref property.
- copy(target, targetStart, start, stop) ByteString
- Copies the Number values of each byte from this between start and stop to a target ByteArray or Array at the targetStart offset. If undefined or omitted, stop is presumed to be the length of this. targetStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the target is not a ByteArray or Array. Returns this.
- slice(start_opt, stop_opt) ByteString
- See Array.prototype.slice
- substr(start_opt, length_opt) ByteString
- See String.prototype.substr
- substring(start_opt, last_opt) ByteString
- See String.prototype.substring
- concat(...) ByteString
- Returns a ByteString composed of itself concatenated with the given ByteString, ByteArray, and Array values. Throws a TypeError if any of the arguments are not a ByteString, ByteArray, or Array of Numbers. Coerces Numbers to bytes by selecting their least significant eight bits.
- join(array) ByteString
- Returns a ByteString of the ByteStrings, ByteArrays, Arrays of Numbers, or a combination thereof, delimited by this. Arrays of Numbers are converted to ByteArrays. Coerces Numbers to bytes by selecting their least significant eight bits.
- split(delimiter_opt, max_opt) Array
- Returns an Array of ByteStrings pared from the ByteString between strings of ByteStrings that match the given delimiter for up to the maximum number of delimiters, starting from the left. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteString. The delimiter is internally coerced to a ByteString with the ByteString constructor.
- splitRight(delimiter_opt, max_opt) Array
- Returns an Array of ByteStrings pared from the ByteString between strings of ByteStrings that match the given delimiter for up to the maximum number of delimiters, starting from the right. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteString. The delimiter is internally coerced to a ByteString with the ByteString constructor.
- Content
- an alias of Number, indicating that Number is the type of what [[Get]] returns.
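The search behaviour described for indexOf above can be sketched over plain Arrays of byte Numbers. This is an illustration under stated assumptions, not the specified algorithm: indexOfBytes is a hypothetical name, and the sketch assumes a match must lie entirely between start and stop, which the proposal does not spell out.

```javascript
// Hypothetical helper: search `haystack` (an Array of byte Numbers) for
// `values` (a Number or an Array of byte Numbers) between start and stop.
// Assumption: the whole match must fit inside [start, stop).
function indexOfBytes(haystack, values, start, stop) {
    var needle = typeof values === "number" ? [values & 0xFF] : values;
    start = start === undefined ? 0 : start;
    stop = stop === undefined ? haystack.length : stop;
    for (var i = start; i + needle.length <= stop; i++) {
        var matched = true;
        for (var j = 0; j < needle.length; j++) {
            if (haystack[i + j] !== needle[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            return i;
        }
    }
    return -1; // no match found
}

console.log(indexOfBytes([1, 2, 3, 2, 3], [2, 3]));    // 1
console.log(indexOfBytes([1, 2, 3, 2, 3], [2, 3], 2)); // 3
console.log(indexOfBytes([1, 2, 3], 9));               // -1
```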
Internal Properties
- [[Get]] Number
- Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
- [[Put]]
- throws a TypeError.
Instance Properties
- length
- The length in bytes. Not [[Writable]], not [[Configurable]], not [[Enumerable]].
ByteArray
A ByteArray is a mutable, flexible (explicitly growable and shrinkable) representation of a C unsigned char (byte) array.
Constructor
A ByteArray function must exist. Calling and constructing a ByteArray both return a new ByteArray instance.
ByteArray() instanceof ByteArray;
new ByteArray() instanceof ByteArray;
ByteArray() instanceof Binary;
new ByteArray() instanceof Binary;
typeof ByteArray() === "object";
typeof new ByteArray() === "object";
- ByteArray()
- New, empty ByteArray.
- ByteArray(length Number, fill_opt Number)
- New ByteArray of the given length, with the given fill number at all offsets. The default filler is 0.
- ByteArray(byteArray)
- Copy byteArray.
- ByteArray(byteString)
- Copy contents of byteString.
- ByteArray(arrayOfBytes)
- Use numbers in arrayOfBytes as contents. Coerces Numbers to bytes by selecting their least significant eight bits.
- ByteArray(string, charset)
- Create a "ByteArray" from a "String", the result being encoded with charset.
Prototype Properties
- toByteArray() ByteArray
- Returns a byte for byte copy in a ByteArray.
- Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
- toByteArray(sourceCharset, targetCharset) ByteArray
- Returns this transcoded from the source charset to the target charset as a ByteArray. The source and target charset are internally cast to String with the String constructor.
- toByteString() ByteString
- Returns a byte for byte copy in a ByteString.
- Note: for best performance, implementations should transfer ownership of the underlying buffer and flag the original owner to copy-on-write.
- toByteString(sourceCharset, targetCharset) ByteString
- Returns this transcoded from the source charset to the target charset as a ByteString. The source and target charset are internally cast to String with the String constructor.
- toString() String
- Returns a debug representation like "[ByteArray 10]", where 10 is the length of this ByteArray.
- toString(charset) String
- Decodes this according to the given character set and returns the String of the corresponding Unicode code points. May throw a ValueError if the byte array is malformed or if any of the code points are out of the implementation's supported range.
- toArray() Array
- Returns an Array containing the bytes as Numbers.
- toSource() String
- Returns "require("binary").ByteArray([])" for an empty byte array or "require("binary").ByteArray([0, 1, 2])" for a byte array of length 3 with bytes 0, 1, and 2.
- valueAt(offset_opt) Number
- Returns the byte at offset as a Number. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
- byteStringAt(offset_opt) ByteString
- Returns a unary (length 1) ByteString of the byte at the given offset. The offset is coerced internally with the Number constructor. Thus, if it is omitted or passed as undefined, it defaults to getting the value at offset 0.
- indexOf(values, start_opt, stop_opt) Number
- Returns the index of the first occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
- lastIndexOf(values, start_opt, stop_opt) Number
- Returns the index of the last occurrence of a byte or consecutive bytes as represented by a Number, ByteString, or ByteArray of any length, or returns -1 if no match was found. If start and/or stop are specified, only elements between the indexes start and stop are searched. start defaults to 0 and stop defaults to the length of this.
- range(start_opt, stop_opt) Range
- Returns a Range object. The value of stop defaults to the length of this. The value of start defaults to 0. Uses this for the Range's ref property.
- copy(target, targetStart, start, stop) ByteArray
- Copies the Number values of each byte from this between start and stop to a target ByteArray or Array at the targetStart offset. If undefined or omitted, stop is presumed to be the length of this. targetStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the target is not a ByteArray or Array. Returns this.
- copyFrom(source, sourceStart, start, stop) ByteArray
- Copies the Number values of each byte from a given ByteArray or ByteString into this from start to stop at the given sourceStart offset. If undefined or omitted, stop is presumed to be the length of this. sourceStart, start, and stop are internally coerced to Numbers with the Number constructor. Throws a TypeError if the source is not a ByteArray or ByteString. Returns this.
- fill(value, start_opt, stop_opt) ByteArray
- Fills each of the contained bytes with the given value (either a unary ByteString, ByteArray, or a Number) from the start offset to the stop offset. start and stop must be numbers, undefined, or omitted. If omitted, start is presumed to be 0. If omitted, stop is presumed to be the length of this ByteArray. If omitted, "value" is presumed to be 0. Returns this.
- splice(start_opt, stop_opt, ...values) ByteArray
- See Array.prototype.splice
- slice(start_opt, stop_opt) ByteArray
- See Array.prototype.slice
- split(delimiter_opt, max_opt) Array
- Returns an Array of ByteArrays pared from the ByteArray between strings of ByteArrays that match the given delimiter for up to the maximum number of delimiters, starting from the left. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteArray. The delimiter is internally coerced to a ByteArray with the ByteArray constructor.
- splitRight(delimiter_opt, max_opt) Array
- Returns an Array of ByteArrays pared from the ByteArray between strings of ByteArrays that match the given delimiter for up to the maximum number of delimiters, starting from the right. If omitted, the maximum defaults to Infinity. The delimiter defaults to an empty ByteArray. The delimiter is internally coerced to a ByteArray with the ByteArray constructor.
- forEach(callback[, thisObject])
- See Array.prototype.forEach
- every(callback[, thisObject])
- See Array.prototype.every
- some(callback[, thisObject])
- See Array.prototype.some
- map(callback[, thisObject])
- See Array.prototype.map
- Content
- an alias of Number, indicating that Number is the type of what [[Get]] returns.
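The difference between split and splitRight with a maximum can be sketched over plain Arrays of byte Numbers. This is only an illustration: splitBytes and splitBytesRight are hypothetical names, and the sketch refuses an empty delimiter because the proposal does not describe what the default empty delimiter should produce.

```javascript
// Hypothetical helper: split an Array of byte Numbers on a delimiter
// (also an Array of byte Numbers), matching at most `max` delimiters
// while scanning from the left.
function splitBytes(bytes, delimiter, max) {
    if (delimiter.length === 0) {
        throw new RangeError("empty delimiter is not handled in this sketch");
    }
    max = max === undefined ? Infinity : max;
    var parts = [];
    var partStart = 0;
    var i = 0;
    while (i + delimiter.length <= bytes.length && parts.length < max) {
        var matched = true;
        for (var j = 0; j < delimiter.length; j++) {
            if (bytes[i + j] !== delimiter[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            parts.push(bytes.slice(partStart, i));
            i += delimiter.length;
            partStart = i;
        } else {
            i += 1;
        }
    }
    parts.push(bytes.slice(partStart)); // remainder after the last match
    return parts;
}

// Scanning from the right is the same scan over reversed input,
// with every piece (and the list of pieces) reversed back afterwards.
function splitBytesRight(bytes, delimiter, max) {
    var reversedParts = splitBytes(
        bytes.slice().reverse(),
        delimiter.slice().reverse(),
        max
    );
    return reversedParts.map(function (part) {
        return part.reverse();
    }).reverse();
}

console.log(splitBytes([1, 0, 2, 0, 3], [0], 1));      // [ [ 1 ], [ 2, 0, 3 ] ]
console.log(splitBytesRight([1, 0, 2, 0, 3], [0], 1)); // [ [ 1, 0, 2 ], [ 3 ] ]
```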
Internal Properties
- [[Get]] Number
- Returns the Number value of the byte at the given offset.
- [[Put]] Number
- Sets the byte value at the given offset. Throws a ValueError if the index is beyond the current bounds of the byte array (byte arrays can be grown or shrunk by explicitly assigning a new length). Implicitly coerces the value to a Number and masks it with 0xFF.
Instance Properties
- length
- The length in bytes. Not [[Configurable]], not [[Enumerable]]. Assigning to the length of a ByteArray causes the array to reallocate its underlying buffer to the given size, copying the original buffer up to the length of the new buffer if it is less, and filling all bytes beyond the length of the original buffer with the value 0.
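The [[Put]] and length semantics above can be approximated with a small class over a Uint8Array. This is a sketch under assumptions, not an implementation of ByteArray: SketchByteArray, put, and get are hypothetical names, explicit methods stand in for the internal [[Get]] and [[Put]], and RangeError stands in for the ValueError named by the proposal, since JavaScript has no built-in ValueError.

```javascript
// Hypothetical model of the resizing and masking rules described above.
class SketchByteArray {
    constructor(length = 0, fill = 0) {
        this.buffer = new Uint8Array(length).fill(fill & 0xFF);
    }

    get length() {
        return this.buffer.length;
    }

    // Assigning length reallocates: existing bytes are kept up to the new
    // length, and any newly added bytes are zero.
    set length(newLength) {
        const resized = new Uint8Array(newLength); // zero-filled by default
        const kept = Math.min(newLength, this.buffer.length);
        resized.set(this.buffer.subarray(0, kept));
        this.buffer = resized;
    }

    // Stand-in for [[Put]]: never grows the array implicitly.
    put(offset, value) {
        if (offset < 0 || offset >= this.buffer.length) {
            throw new RangeError("offset " + offset + " is out of bounds");
        }
        this.buffer[offset] = Number(value) & 0xFF;
    }

    // Stand-in for [[Get]].
    get(offset) {
        return this.buffer[offset];
    }
}

const bytes = new SketchByteArray(2, 7);
bytes.length = 4;       // grows to 7, 7, 0, 0
bytes.put(3, 0x1FF);    // masked to 255
console.log(Array.from(bytes.buffer)); // [ 7, 7, 0, 255 ]
```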
Range
A range object is a convenience for copying ranges from one ByteArray or ByteString to another ByteArray.
Constructor
- new Range({ref, start, stop})
- returns a Range object with the respective properties.
Prototype Properties
- copy(target, start_opt, stop_opt)
- generically calls this.ref.copy(target, this.start, Math.min(this.stop, this.start + target.length - start), start). The default value of stop is the target length.
- copyFrom(source, start_opt, stop_opt)
- generically calls this.ref.copyFrom(source, this.start, Math.min(this.stop, this.start + source.length - start), start). The default value of stop is the source length.
- fill(value)
- generically calls this.ref.fill(value, this.start, this.stop)
- splice(start_opt, stop_opt, ...values)
- generically calls this.ref.splice.apply(this.ref, arguments)
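The generic delegation in fill and splice can be sketched with a plain Array as the ref, since Array.prototype.fill and Array.prototype.splice accept compatible arguments. This is an illustrative stand-in only: SketchRange is a hypothetical name, and copy and copyFrom are left out because they depend on the ByteArray methods of the same names.

```javascript
// Hypothetical stand-in for Range, covering only fill and splice.
function SketchRange(options) {
    this.ref = options.ref;
    this.start = options.start;
    this.stop = options.stop;
}

// Delegates to the ref's own fill, bounded by the range.
SketchRange.prototype.fill = function (value) {
    return this.ref.fill(value, this.start, this.stop);
};

// Forwards all arguments to the ref's own splice unchanged.
SketchRange.prototype.splice = function () {
    return this.ref.splice.apply(this.ref, arguments);
};

var ref = [1, 2, 3, 4, 5];
var range = new SketchRange({ref: ref, start: 1, stop: 4});
range.fill(0);
console.log(ref); // [ 1, 0, 0, 0, 5 ]
```

Because the methods only call through this.ref, any object with matching fill and splice methods can serve as the ref.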
General Requirements
All of the properties specified on prototypes are not [[Enumerable]] on instances.
Any operation that requires encoding, decoding, or transcoding among charsets may throw an error if that charset is not supported by the implementation. All implementations MUST support "us-ascii", "utf-8", and "utf-16".
Charset strings are as defined by IANA http://www.iana.org/assignments/character-sets.
Charsets are case insensitive.
Endianness arguments are case insensitive.
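One way an implementation might apply the charset requirements is to fold case before looking a name up. This is a minimal sketch: normalizeCharset and REQUIRED_CHARSETS are hypothetical names, and the list contains only the three charsets the proposal requires, whereas a real implementation may support more.

```javascript
// The three charsets every implementation must support, per the text above.
var REQUIRED_CHARSETS = ["us-ascii", "utf-8", "utf-16"];

// Hypothetical helper: coerce to String, fold case, and reject names
// this sketch does not know about.
function normalizeCharset(charset) {
    var name = String(charset).toLowerCase();
    if (REQUIRED_CHARSETS.indexOf(name) === -1) {
        throw new Error("unsupported charset: " + name);
    }
    return name;
}

console.log(normalizeCharset("UTF-8"));    // utf-8
console.log(normalizeCharset("Us-Ascii")); // us-ascii
```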
Rationale
The idea of using the particular ByteString and ByteArray types and their respective names originated with Jason Orendorff in the Binary API Brouhaha discussion.
Structures
This proposal is not an object oriented variation on Perl, PHP, Python, or Ruby's pack, unpack, and calcsize with notions of inherent endianness, read/write head position, or intrinsic codec or charset information. The objects described in this proposal are merely for the storage and direct manipulation of strings and arrays of byte data. Some object oriented conveniences are made, but implementing pack, unpack, or an object-oriented analog thereof is left as an exercise for a future proposal of a more abstract type or a 'struct' module (as mentioned by Ionut Gabriel Stan on the list). This goes against most mentioned prior art.
Encoding Methods
This proposal also does not provide separate member functions for any particular subset of the possible charsets, encodings, compression algorithms, or consistent hash digests that might operate on a byte string or array, for example "toBase64", "toMd5", or "toUtf8" are not specified. Instead, "toString" accepts IANA charset names and radix numbers for charsets and encodings. The intent is that implementations will opt to make this extensible by falling back to an "encodings" module and searching for modules by the same name and calling "encode" or "decode" exports on those modules if they exist. (As supported originally by Robert Schultz, Davey Waterson, Ross Boucher, and tacitly myself, Kris Kowal, on the First proposition thread on the mailing list). This proposal does not address the need for stream objects to support pipelined codecs and hash digests (mentioned by Tom Robinson and Robert Schultz in the same conversation).
[[Get]] and [[Put]]
This proposal also reflects both group sentiment and a pragmatic point about properties. This isn't a decree that properties like "length" should be consistently used throughout the CommonJS APIs. However, given that all platforms support properties at the native level (to host String and Array objects) and that byte strings and arrays will require support at the native level, pursuing client-side interoperability is beyond the scope of this proposal and therefore properties have been specified. (See comments by Kris Zyp about the implementability of properties in all platforms, comments by Davey Waterson from Aptana about the counter-productivity of attempting to support this API in browsers, and support for properties over accessor and mutator functions by Ionut Gabriel Stan and Cameron McCormack on the mailing list).
Future Proofing
This proposal suggests that the specified data types should be exported by a "binary" module. The intent is that eventually ECMAScript will specify native types to replace these, and these new types would be hosted as "primordials", free variables available in all top-level scopes. Because these types are specified to be exported by the "binary" module, there is some ambiguity of whether "instanceof" relationships would be maintained when references are passed among global contexts. It would be valuable for these identities to be preserved. For module systems that share a common global scope, this would suggest that the binary types should be patched into some commonly shared object, like the global scope, only if they do not yet exist. That would permit the first, permissive context to construct the types, and all subsequent sandboxes to share them. Secure sandboxes would have to make other accommodations.
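The "patch only if absent" idea can be sketched as follows. This is an assumption-laden illustration: installBinaryTypes is a hypothetical name, and plain objects stand in for both the shared global object and the constructors.

```javascript
// Hypothetical helper: copy each type onto the shared object only when
// no type of that name is already present, so the first context to run
// defines the identities that later contexts reuse.
function installBinaryTypes(shared, types) {
    Object.keys(types).forEach(function (name) {
        if (!(name in shared)) {
            shared[name] = types[name];
        }
    });
    return shared;
}

var sharedScope = {};
var firstByteString = function ByteString() {};
var secondByteString = function ByteString() {};

installBinaryTypes(sharedScope, {ByteString: firstByteString});
installBinaryTypes(sharedScope, {ByteString: secondByteString});

console.log(sharedScope.ByteString === firstByteString); // true
```

The second call leaves the first constructor in place, which is what keeps "instanceof" checks consistent across contexts that share the object.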
Memory Optimization
In contrast to Maciej Stachowiak's proposal, there are no methods for explicitly managing the mutability of the underlying buffer. Since it is possible for implementations to explicitly use ownership and copy-on-write flags, it is desirable to keep the memory management beneath the surface of JavaScript. For example, with freezing semantics as proposed by Maciej, freezing a mutable byte array could produce an immutable byte string that usurps ownership of the original byte array's buffer, so there would be no need for allocation. However, this would render the original byte array unusable, and all persisting references to that array would be broken and we would have to define failure modes for that object. Alternately, the byte array and byte string could share the buffer. The byte string would be guaranteed to not modify the content of the buffer, and if the owner of the original byte array were to perform a modifying operation on the byte array, the implementation could at that time incur the cost of copying the buffer. Similar optimizations could be applied by all ByteString and ByteArray constructors and conversion functions.
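The shared-buffer alternative described above can be sketched with a copy-on-write flag. This is one possible shape, not the proposal's required design: CowByteArray, toImmutableView, and put are hypothetical names, and a small read-only object stands in for a ByteString.

```javascript
// Hypothetical model of a mutable array that shares its buffer with
// immutable views and copies only when it is next modified.
class CowByteArray {
    constructor(bytes) {
        this.buffer = Uint8Array.from(bytes);
        this.shared = false; // true while an immutable view uses the buffer
    }

    // Hands out a read-only view without allocating a new buffer.
    toImmutableView() {
        this.shared = true;
        const buffer = this.buffer;
        return {
            length: buffer.length,
            valueAt: function (offset) {
                return buffer[offset];
            }
        };
    }

    // The first write after sharing pays for the copy.
    put(offset, value) {
        if (this.shared) {
            this.buffer = this.buffer.slice(); // typed array slice copies
            this.shared = false;
        }
        this.buffer[offset] = value & 0xFF;
    }
}

const mutable = new CowByteArray([1, 2, 3]);
const frozenView = mutable.toImmutableView();
mutable.put(0, 9);
console.log(frozenView.valueAt(0)); // 1
console.log(mutable.buffer[0]);     // 9
```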
Genericity
In accordance with Daniel Friesen's Binary/C, a high priority in this proposal was duck typing.
- a generic way to get a value in the Content type of a ByteArray, ByteString, or String
- [[Get]]
- a generic way to get the type of what is returned by [[Get]] for ByteArray, ByteString, and String (Number, ByteString, and String respectively)
- Content
- a generic way to get a ByteString for the value at an offset for either ByteArray or ByteString
- byteStringAt(offset) ByteString
- a generic way to get a ByteArray for the value at an offset for either ByteArray or ByteString
- would have been wasteful to include since byte arrays are meant for assembling large byte buffers through copying data from other binary collections, not for allocation churn.
- a generic way to get a Number for the value at an offset for any ByteArray, ByteString, Array, or String
- valueAt(offset) Number
- a generic way to get an object of length one in the same type that contains just the value at a given offset for any ByteArray, ByteString, Array, or String
- slice(offset, offset + 1)
The following methods and properties are interoperable among ByteArray, ByteString, Array, and String:
- [[Get]]
- indexOf
- lastIndexOf
- slice
The following methods and properties are interoperable among ByteArray, ByteString, and String:
- [[Get]]
- Content
- valueAt
- join
- split
- splitRight
- indexOf
- lastIndexOf
- slice
The following methods and properties are interoperable between ByteArray and ByteString:
- [[Get]]
- Content
- valueAt
- byteStringAt
- indexOf
- lastIndexOf
- slice
- copy
- range
The following methods and properties are interoperable between ByteString and String:
- [[Get]]
- Content
- valueAt
- indexOf
- lastIndexOf
- slice
- substr
- substring
The following methods are interoperable between ByteArray and Array:
- [[Get]]
- indexOf
- lastIndexOf
- slice
- forEach
- every
- some
- map
Idempotence
Binary/B had a "decodeToString" method and "toString" was required to be distinct, to avoid decoding and encoding hazards. However, in keeping with the existing Number.prototype.toString(radix), for subjective aesthetic reasons having worked with prototypes of both, and because it is easier to remember which way encoding and decoding go with "toString", "toByteString", and "toByteArray" than with "encode" and "decode", this proposal eschews "decodeToString". This has the side effect of making the generic "toByteString" and "toString" methods idempotent for converting to and from a charset. For example, if you receive an object that may be a ByteString or a String, but if it is a String it will need to be converted to a ByteString with a given charset, you can simply call "toByteString" with that charset, even repeatedly. Likewise, if you receive a ByteString or a String and you need it to be a String, you can simply call "toString" with the desired charset, allowing a succession of adapters or decorators to make the conversion at any point along the way.
Miscellaneous
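The idempotence property can be illustrated with the standard TextEncoder and TextDecoder. This is a sketch under assumptions: toBytes and toText are hypothetical free functions standing in for the generic "toByteString" and "toString" methods, a Uint8Array stands in for a ByteString, and only UTF-8 is handled because TextEncoder encodes to UTF-8 only.

```javascript
// Hypothetical stand-in for a generic toByteString("utf-8"):
// bytes pass through untouched, strings are encoded.
function toBytes(value) {
    if (value instanceof Uint8Array) {
        return value;
    }
    return new TextEncoder().encode(String(value));
}

// Hypothetical stand-in for a generic toString("utf-8"):
// strings pass through untouched, bytes are decoded.
function toText(value) {
    if (typeof value === "string") {
        return value;
    }
    return new TextDecoder("utf-8").decode(value);
}

// Repeating either conversion changes nothing after the first call.
var once = toBytes("héllo");
var twice = toBytes(once);
console.log(once === twice);        // true
console.log(toText(toText(twice))); // héllo
```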
"ByteString" does not implement "toUpperCase" or "toLowerCase" since they are not meaningful without the context of a charset.
Unlike the "Array", the "ByteArray" is not variadic so that its initial length constructor is not ambiguous with its copy constructor.
The Binary/B proposal, at Ash Berlin's recommendation, had split methods on both ByteStrings and ByteArrays that accepted as their optional second argument an object of options specifying not only the number of delimiters to match, but also whether to include the delimiter on the right side of each term, presumably including a terminal delimiter that would obviate an empty collection from being returned as the last value. This has been left as an exercise for a byte reader stream's "readLine(delimiter)" method, so that this proposal's split and splitRight methods may more closely resemble their cousins on existing Strings in JavaScript and other languages.
The "join" methods on "ByteArray" and "ByteString" differ from what you would expect in JavaScript based on Array joins, in a way that will be familiar and probably upsetting coming from Python. The delimiter is the left side of the expression. This is because the Array joining method cannot be practically extended to determine whether to return a String, ByteString, or ByteArray based on the types of all of the values it contains. It is far more practical to multiplex the return type based on the type of the left hand side of the expression.
The "Content" property begins with a capital letter to distinguish it as a factory method like the constructor function to which it always refers, whether ByteString, ByteArray, or Number. To date, this remains a point of discussion. Daniel Friesen's proposal Binary/C uses the name "contentConstructor".
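The delimiter-on-the-left shape of join can be sketched over plain Arrays of Numbers. This is an illustration only: joinBytes is a hypothetical free function standing in for delimiter.join(array), and Arrays stand in for the ByteString and ByteArray values.

```javascript
// Hypothetical stand-in for delimiter.join(parts): the delimiter is the
// receiver, and the parts are the argument.
function joinBytes(delimiter, parts) {
    var out = [];
    parts.forEach(function (part, index) {
        if (index > 0) {
            delimiter.forEach(function (n) {
                out.push(n & 0xFF);
            });
        }
        part.forEach(function (n) {
            out.push(n & 0xFF); // keep the least significant eight bits
        });
    });
    return out;
}

console.log(joinBytes([44], [[1, 2], [3], [256]])); // [ 1, 2, 44, 3, 44, 0 ]
```

Because the receiver is the delimiter, its type alone can decide the return type, which is the practical point made above.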
Errata
This proposal is a strict subset of Binary/D with a few exceptions: the Content type and return value of [[Get]] for ByteString has been changed to Number; ByteString and ByteArray are object types instead of anticipating future-compatibility with "bytestring" and "bytearray" types; and the Binary type has been reintroduced. Bit types, radix encoding (base8...base64), and consistency shims for existing primordial types have been removed and are possible candidates for extensions to future revisions of this specification.