Binary/C

From CommonJS Spec Wiki
See also: the Show of hands as well as the Unpacking and Essay portions which were removed.

This proposal was written by Daniel Friesen as an alternative to the Binary/B proposal. It is accompanied by IO/B/Buffer.

This proposal extends the Blob type used by a number of existing server-side JavaScript implementations, together with a Buffer type modelled on Java's StringBuffer/StringBuilder.

Instead of Binary/B's ByteArray, Binary/C defers to IO/B/Buffer to provide buffers.

One goal of this proposal is interoperability between Strings and Blobs. That is, like Binary/B, this proposal aims to permit a class of generic algorithms that can operate on both Blob and String through a generic intersection between their APIs. However, this proposal avoids things that seem counter-intuitive, like putting .charAt on a Blob, a byte collection. Instead, this proposal augments String with a .valueAt method that can be used generically on both Blob and String, and defers to IO/B/Buffer to provide a Buffer system which works on either type.

This proposal and IO/B/Buffer are based on APIs drafted for MonkeyScript (Blob, Buffer).

Terms and reading notes

To avoid confusion and ambiguity these are the basic definitions of terms used within this document.

List
A type which groups a series of items in a specific order.
Sequence
A type of list which manages a list of fixed-unit pieces of data.
These units of data are normally either bytes or characters. "Sequence" is a term which refers generically to Strings, Blobs/ByteStrings, and mutable counterparts such as Buffer.
Array
A type of list which manages a list of items. These items are not related to one another in any way other than their inclusion in the list and do not need to be of the same type.
The key point is that an Array is a loose collection of items; these items do not have any sort of fixed unit to them.
memcopy
Where memcopy is used, it refers to the technique of copying memory as directly as possible from one source to another. At the very least this refers to copying from A to B without creating an intermediate Blob.

Where "as if by" is used in the spec, only the result is meant; the algorithm should not be affected by changes to the class's prototype.
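The "as if by" rule can be illustrated with a minimal sketch. TinyBlob is a hypothetical stand-in invented for this example, not the Blob defined by this proposal. Its toArray() produces the result described in terms of byteCodeAt, but it reads internal storage directly, so replacing the prototype method later does not change what it returns.

```javascript
// TinyBlob is a hypothetical stand-in, not the Blob defined by this proposal.
class TinyBlob {
  #bytes; // internal storage, unreachable from user code

  constructor(bytes) {
    this.#bytes = Uint8Array.from(bytes);
  }

  get length() {
    return this.#bytes.length;
  }

  byteCodeAt(index) {
    return this.#bytes[index];
  }

  // Gives the result "as if by" calling byteCodeAt on each index, but reads
  // the internal storage instead of looking the method up on the prototype.
  toArray() {
    return Array.from(this.#bytes);
  }
}

const tiny = new TinyBlob([1, 2, 3]);

// User code tampers with the prototype after the fact.
TinyBlob.prototype.byteCodeAt = function () {
  return 0;
};

const untouched = tiny.toArray();
console.log(untouched); // [ 1, 2, 3 ]
```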

Differences between a Sequence and an Array

While this may not be the case in lower level languages, JavaScript's API does make a clear distinction between strings and arrays.

Units
A Sequence is built up of a list of single-unit items, whilst an Array is built up of unitless items: the array does nothing but point to objects and contains nothing itself. The sequence "abc" is made up of 3 units { a, b, c }, whilst [1,"asdf",3,{}] is made up of 4 items { 1, "asdf", 3, {} } with no relation to each other and no fixed units: it holds two separate numbers, a 4-unit sequence, and an object which could have an indefinite hierarchy.
Spillover
Depending on whether the type is a Sequence or an Array, functions such as .indexOf may "spill" or "overflow" over multiple items. Sequences spill, while Arrays do not. There is a subtle difference in the API between the two.
  • sequence.indexOf(sequence, [offset]);
  • array.indexOf(item, [offset]);
When using .indexOf on a sequence you give it another sequence. indexOf does not look for just a single item, but for a sequence of items within that sequence. In contrast, when using .indexOf on an array it ONLY looks for a single item, and the search is unaffected by adjacent items.
This is apparent from how "foobarbaz".indexOf("bar"); returns the index of "bar" despite the fact that 'b', 'a', and 'r' are 3 units within this 9-unit sequence (in this case 1 unit being 1 character). In contrast, [1,2,3,4,5].indexOf([2,3,4]); does NOT return the location of the 2, 3, and 4 inside this array. The reason is that indexOf on an array is a single-item operation; it does not spill the lookup over into the following items.
Pushing and Popping
Another point which does not get emphasised because strings are immutable in JavaScript and thus don't need methods to mutate them as Arrays do, is the semantics of .push, .pop, etc...
.pop() and .shift() remove ONE item from an array and return it.
Likewise, given one argument, .push() and .unshift() add ONE item to an array.
The key point here is that [1,2,3].push([4,5,6]); does NOT turn the array into [1,2,3,4,5,6]; it just adds [4,5,6] as a sub-array, giving [1,2,3,[4,5,6]].
You can give multiple arguments to these methods, but then you are no longer working with your lists in the same way.
There is another name which does fit this kind of operation, "Append" (Side note, Wrench.js does add .append to Array). Using [1,2,3].append([4,5,6]); DOES push 4, 5, and 6 onto the array creating the array [1,2,3,4,5,6].
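The spillover and push/append distinctions can be checked with built-ins alone. This short sketch uses only standard String and Array methods; since .append comes from Wrench.js and is not built in, the standard concat stands in for it here (note that concat returns a new array rather than mutating the original).

```javascript
// Sequences "spill": the search argument is itself a sequence of units.
const seqHit = "foobarbaz".indexOf("bar");

// Arrays do not: the argument is treated as ONE item and compared with ===,
// so a freshly created [2, 3, 4] is never found.
const arrMiss = [1, 2, 3, 4, 5].indexOf([2, 3, 4]);
const arrHit = [1, 2, 3, 4, 5].indexOf(3);

// push adds ONE item, even when that item is itself an array.
const pushed = [1, 2, 3];
pushed.push([4, 5, 6]);

// concat gives the "append" behaviour described above.
const appended = [1, 2, 3].concat([4, 5, 6]);

console.log(seqHit, arrMiss, arrHit);        // 3 -1 2
console.log(pushed.length, appended.length); // 4 6
```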

The API

The two tiers of the API for this spec define two new classes, Blob (Flusspferd, Google, and jslibs have all used this name; it is a fairly long-standing name and normally works similarly) and Buffer, and two subclasses of Buffer, StringBuffer and BlobBuffer.

It is up to an implementation whether to make Blob and Buffer native global objects or to seclude them inside a binary module. Whether or not they are made global, if the implementation implements require() then require('binary'); must return an object containing Blob and Buffer as keys, even if the binary module is simply a module containing exports.Blob = Blob; exports.Buffer = Buffer;.

Blob

Blob is the binary counterpart to String; it has a slightly different API but many similarities. A Blob is an immutable representation of a sequence of 8-bit bytes.

Most of the blob methods work on blobish data, rather than flat blobs. This means that the argument is treated as if it were passed through Blob(), thus .indexOf(255); is the same as if you had done .indexOf(Blob(255)), so you do not need to explicitly convert everything into a blob.

Note that unlike String, Blob is not defined as a primitive datatype by ECMA. This means that typeof will never return 'blob', and all blobs will be objects, unlike strings, which are normally primitives. Blob works with and without the new operator and acts the same either way. It is recommended to use the `Blob()` form.
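A rough sketch of the two points above: a constructor that behaves the same with and without new, and "blobish" arguments that are passed through the constructor before use. CallableBlob is a hypothetical name chosen to avoid clashing with any built-in Blob; the string/charset constructor form is left out, and the public `bytes` property is an assumption of this sketch only.

```javascript
// CallableBlob is a hypothetical stand-in for the proposal's Blob.
function CallableBlob(source) {
  if (source instanceof CallableBlob) {
    return source; // Blob(blob) passes the blob through
  }
  if (!(this instanceof CallableBlob)) {
    return new CallableBlob(source); // called without `new`
  }
  const items =
    source === undefined ? [] : Array.isArray(source) ? source : [source];
  for (const n of items) {
    if (!Number.isInteger(n) || n < 0 || n > 255) {
      throw new TypeError("expected an integer in the range 0..255");
    }
  }
  this.bytes = Uint8Array.from(items);
  this.length = this.bytes.length;
}

CallableBlob.prototype.indexOf = function (blobish, offset = 0) {
  // The argument is treated as if it had been passed through the constructor.
  const needle = CallableBlob(blobish).bytes;
  for (let i = offset; i + needle.length <= this.length; i++) {
    if (needle.every((byte, j) => this.bytes[i + j] === byte)) {
      return i;
    }
  }
  return -1;
};

const withNew = new CallableBlob([0, 255, 7]);
const withoutNew = CallableBlob([0, 255, 7]);

console.log(typeof withoutNew);                   // "object"
console.log(withoutNew instanceof CallableBlob);  // true
console.log(withNew.indexOf(255));                // 1
console.log(withNew.indexOf(CallableBlob(255)));  // 1
```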

[new] Blob();
Construct an empty blob
[new] Blob(number);
Construct a single-unit blob, converting the number to a byte. If the item is outside the range 0..255, not a number, or not an integer (has a decimal point), a TypeError should be thrown.
[new] Blob(arrayOfNumbers);
Construct a blob the same length as the array, converting numbers 0..255 into bytes. If any item is outside that range, not a number, or not an integer (has a decimal point), a TypeError should be thrown.
[new] Blob(blob);
Passes the blob through.
[new] Blob(string, toCharset);
Construct a new blob with the binary contents of a string. The string will be encoded from the native UTF-16 charset into the charset specified by the toCharset argument and represented in the new blob in 8-bit bytes.
Blob.fromByteCode(code);
Blob.fromCode(code);
Returns a new blob using a numeric byte code within the range 0..255.
blob.contentConstructor;
Returns Blob to indicate this has binary content.
blob.length;
Returns the length of the blob. This is immutable.
blob.byteCodeAt(index);
blob.codeAt(index); (@level1)
Extracts a single byte from the blob and returns it as an unsigned integer (Number) in the range 0..255.
blob.concat(otherBlob, ...);
Combines the content of multiple blobs together and returns a new blob.
blob.slice(begin, end);
Extracts a section of the blob and returns a new blob containing it as the contents. (This should behave the same as string.slice and array.slice)
blob.indexOf(blob, offset=0);
blob.lastIndexOf(blob, offset=0);
Returns the index within the calling blob object of the first or last (depending on which method is used) occurrence of the specified value, or -1 if not found.
blob.byteAt(index);
blob.valueAt(index);
Extracts a single byte from the blob and returns a new blob object containing only it.
blob.split();
blob.split(separator);
blob.split(separator, limit);
Splits the blob based on a sequence of bytes ({0 0 0 255 0 0} split by 255 would become [{0 0 0}, {0 0}]) and returns an array of blobs. This is the same as string.split except it does not support regular expressions. Like string.split, this supports separators of more than one unit (i.e., you may split {0 0 255 0 0 255 3 0} by the blob {255 0} and get [{0 0}, {0 255 3 0}]).
blob.toBlob([fromCharset, toCharset]);
If passed with no argument returns the same blob.
If passed with two charset arguments transcodes the data from one charset to the other and returns the data as a new blob.
Note that if a single argument is passed to this method it should throw a TypeError to prevent gotchas where someone runs .toBlob(charset) on a blob instead of a string where it is relevant.
blob.toString();
Returns a debug representation like "[Blob length=2]", where 2 is the length of the blob. Alternative debug representations are valid too, as long as (A) this method will never fail, (B) the length is included, (C) It is not only the representation of an implicitly converted string.
blob.toString(fromCharset);
Converts the binary data in the blob from the charset specified by fromCharset to the native UTF-16 charset and returns a new string with that content.
blob.toArray();
Returns an array containing the bytes as numbers as if by [ blob.byteCodeAt(i) for ( i in blob ) ].
blob.toArray(fromCharset);
Returns an array containing the decoded Unicode code points as if by var str = blob.toString(fromCharset); [ str.charCodeAt(i) for ( i in str ) ].
blob.toSource();
This method is optional; it should be included if the interpreter being used supports .toSource() on its various objects and types.
Returns a representation of the blob in the format "(Blob([]))" or "(new Blob([]))". If the blob has content in it the string should contain integers 0..255 representing the blob such that if evaluated (calling the correct Blob function) would return a blob with the same content.
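The sequence-style methods listed above can be sketched roughly as follows. SketchBlob is a hypothetical stand-in for the proposal's Blob; charset handling, the call-without-new form, lastIndexOf, and toSource are omitted for brevity. Returning an empty blob from byteAt for an out-of-range index (mirroring string.charAt) and ignoring an empty separator in split are assumptions of this sketch, not requirements of the text above.

```javascript
// SketchBlob is a hypothetical stand-in for the proposal's Blob.
class SketchBlob {
  #bytes; // private, so the content cannot be mutated from outside

  constructor(source = []) {
    const items =
      source instanceof SketchBlob ? source.toArray()
      : Array.isArray(source) ? source
      : [source];
    for (const n of items) {
      if (!Number.isInteger(n) || n < 0 || n > 255) {
        throw new TypeError("expected an integer in the range 0..255");
      }
    }
    this.#bytes = Uint8Array.from(items);
  }

  get length() { return this.#bytes.length; }

  byteCodeAt(index) { return this.#bytes[index]; }

  // Assumption: out-of-range indexes give an empty blob, like string.charAt.
  byteAt(index) {
    const inRange = index >= 0 && index < this.length;
    return inRange ? new SketchBlob(this.#bytes[index]) : new SketchBlob();
  }

  slice(begin, end) {
    return new SketchBlob(Array.from(this.#bytes.slice(begin, end)));
  }

  concat(...others) {
    const all = [this, ...others.map((other) => new SketchBlob(other))];
    return new SketchBlob(all.flatMap((blob) => blob.toArray()));
  }

  indexOf(blobish, offset = 0) {
    const needle = new SketchBlob(blobish).toArray();
    for (let i = offset; i + needle.length <= this.length; i++) {
      if (needle.every((byte, j) => this.#bytes[i + j] === byte)) return i;
    }
    return -1;
  }

  // An empty separator is not handled here and yields the whole blob.
  split(separator, limit = Infinity) {
    if (separator === undefined) return [this];
    const sep = new SketchBlob(separator);
    const parts = [];
    let start = 0;
    let hit = sep.length > 0 ? this.indexOf(sep, start) : -1;
    while (hit !== -1 && parts.length < limit) {
      parts.push(this.slice(start, hit));
      start = hit + sep.length;
      hit = this.indexOf(sep, start);
    }
    if (parts.length < limit) parts.push(this.slice(start));
    return parts;
  }

  toArray() { return Array.from(this.#bytes); }

  toString() { return `[Blob length=${this.length}]`; }
}

const data = new SketchBlob([0, 0, 255, 0, 0, 255, 3, 0]);
const pieces = data.split([255, 0]).map((blob) => blob.toArray());
console.log(pieces);          // [ [ 0, 0 ], [ 0, 255, 3, 0 ] ]
console.log(String(data));    // "[Blob length=8]"
```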

String extensions

string.contentConstructor;
Returns String to indicate this has text content.
string.toBlob(toCharset);
Converts a UTF-16 string into the specified charset and returns a blob containing that binary data.
string.valueAt(index);
An alias for string.charAt(index);
The point of this prototype is so that (string or blob).valueAt(index); may be used independently of whether the sequence is a string or a blob. This will allow strings to maintain .charAt and blobs to maintain .byteAt without returning unintuitive results while still allowing a method of working abstractly without relying on things like (str or blob)[index] which may not be implemented in some engines.
string.codeAt(index);
An alias for string.charCodeAt(index);
The point of this prototype is so that (string or blob).codeAt(index); may be used independently of whether the sequence is a string or a blob. This will allow strings to maintain .charCodeAt and blobs to maintain .byteCodeAt without returning unintuitive results.
String.fromCode(code);
An alias for String.fromCharCode(code);
The point of this prototype is so that (String or Blob).fromCode(code); may be used on .contentConstructor independently of whether the sequence is a string or a blob. This will allow calls to .codeAt to be returned to blob or string format abstractly.
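As a minimal sketch, the aliases described above could be installed like this. string.toBlob(toCharset) is left out because it depends on a Blob implementation; the helper name defineHidden is an assumption of this example, and the properties are made non-enumerable so that for...in loops are unaffected.

```javascript
// defineHidden is a hypothetical helper, not part of the proposal.
function defineHidden(target, name, value) {
  Object.defineProperty(target, name, {
    value,
    writable: true,
    configurable: true,
    enumerable: false,
  });
}

defineHidden(String.prototype, "contentConstructor", String);
defineHidden(String.prototype, "valueAt", String.prototype.charAt);
defineHidden(String.prototype, "codeAt", String.prototype.charCodeAt);
defineHidden(String, "fromCode", String.fromCharCode);

const text = "abc";
console.log(text.valueAt(1)); // "b"
console.log(text.codeAt(1));  // 98

// Round trip through the constructor reported by the sequence itself.
const roundTrip = text.contentConstructor.fromCode(text.codeAt(1));
console.log(roundTrip);       // "b"
```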

Abstract API

One of the primary focuses was interoperability between Strings and Blobs so that abstract algorithms could be written which work on either strings or blobs.

The entire Buffer API in IO/B/Buffer is designed for this purpose, and the following methods on String and Blob are usable in abstract programming:

  • seq.length;
  • seq.contentConstructor (can be used as seq.contentConstructor() to return an empty seq of the same type)
  • seq.valueAt(idx); // Sequence at index
  • seq.codeAt(idx); // Number at index
  • seq.valueOf(); // Returns the same seq (on a buffer returns the equiv Blob or String)
  • seq.indexOf(seq, [off]); and seq.lastIndexOf(seq, [off]); // finding the location of a subsequence
  • seq.concat(...seq); // combining sequences together
  • seq.slice(begin, end); // extracting a portion of a sequence
  • seq.split(sep, [limit]); // split up a sequence using another sequence as a separator
  • Sequence.fromCode(code); // return a single byte or single character based on a numeric code (Turning seq.codeAt(idx) back into a sequence)

Textual casting

This spec does not define any implementation requirements of its own on the Blob(string, charset); and blob.toString(charset); methods. However, it expects that they will behave as the encodings spec defines with regard to transcoding, character sets, and errors.
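For a feel of what transcoding between a UTF-16 string and 8-bit bytes involves, here is an illustrative sketch using the standard TextEncoder and TextDecoder globals. These are not the encodings spec referred to above, and TextEncoder only produces UTF-8; the helper names stringToBytes and bytesToString are hypothetical, and throwing on malformed input is a choice made for this sketch.

```javascript
// Hypothetical helpers; the names are not from the proposal.
function stringToBytes(string) {
  // TextEncoder always encodes to UTF-8.
  return Array.from(new TextEncoder().encode(string));
}

function bytesToString(bytes, fromCharset) {
  // fatal: true makes malformed input throw instead of being replaced.
  const decoder = new TextDecoder(fromCharset, { fatal: true });
  return decoder.decode(Uint8Array.from(bytes));
}

const encoded = stringToBytes("h\u00e9llo");
console.log(encoded.length);                  // 6 (U+00E9 takes two bytes in UTF-8)
console.log(bytesToString(encoded, "utf-8")); // the original 5-character string
```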

Notes

  • A high priority in this proposal was String/Blob interoperability. While implicit string conversion was avoided, it was important to make sure there was an API which could work abstractly with a sequence of data, ignorant of whether the data was a string or a blob.
    • .valueAt was added to string so that there was a common method for both blobs and strings without implementing a counterintuitive .charAt on blob. Note that as a result you can actually check .charAt vs .byteAt and string will only have .charAt, while blob will only have .byteAt.
  • Some experimentation with .valueOf needs to be done. .valueOf has type hinting (the first argument is a string hint of what type may be converted to, operators like > and < make use of it as well as a few other cases). It would be nice to see if it's possible to use the native < and > operators to compare blobs on their binary order.
  • For now things like .eq/equals, .lt, .gt, etc. have been omitted. Do note that Rhino actually implements .equals on String already. Also, if we do add these things to blobs we should probably implement the same on strings.
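One possible shape for the .valueOf experiment mentioned in the notes, offered only as a sketch: OrderedBytes is a hypothetical stand-in for Blob whose valueOf returns a string with one code unit per byte. Because < and > convert objects to primitives via valueOf and then compare strings by code unit, the result follows binary order. This deliberately differs from the abstract API's seq.valueOf() (which returns the same sequence), and spreading a very large blob into String.fromCharCode would hit argument-count limits, so it is an experiment rather than a recommendation.

```javascript
// OrderedBytes is a hypothetical stand-in for Blob, used only for this experiment.
class OrderedBytes {
  constructor(bytes) {
    this.bytes = Uint8Array.from(bytes);
  }

  // Map each byte to the code unit with the same value, so the built-in
  // string ordering used by < and > matches byte order.
  valueOf() {
    return String.fromCharCode(...this.bytes);
  }
}

const low = new OrderedBytes([1, 2]);
const high = new OrderedBytes([1, 3]);

console.log(low < high);                                         // true
console.log(new OrderedBytes([2]) > new OrderedBytes([1, 255])); // true
console.log(new OrderedBytes([1]) < new OrderedBytes([1, 0]));   // true (prefix sorts first)

// Equality is NOT covered: == on two objects compares identity.
console.log(low == new OrderedBytes([1, 2]));                    // false
```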

Relevant discussion