BARfly Help - BAR Implementation File Reference - Deserialization Critical Steps

  Deserialization Critical Steps

A critical step in deserialization is a categorical action taken as part of a binary file's deserialization procedure in BAR.  The action is manifested in the form of a critical-step method of a construct (a block or structure).

The full deserialization procedure is very complicated, but you don't need to know all of it.  BAR allows you to define only the critical steps that matter the most to you.  There are so many ways to read and interpret a data file that it is easiest to design an implementation file (I.F.) around only the points that make a difference.

Of course, such versatility of the BAR engine means there may be more than one way to construct the critical steps.  This is the intent:  if there is more than one way, it gets you to your solution that much faster.

The critical step-methods of deserialization have the following function names:  Initialization, Validation, BlockSize, Termination, OffsetTweak, and Deserialize.


 Initialization

The Initialization method is called when BAR is considering whether node data match a construct, but before there is any attempt at actually validating the construct.  This function is used to calculate important quantities before any other methods are called on a construct.

Definition of this method is optional.  If the method is not defined, BAR proceeds directly to the Validation method.

The Initialization method has return type void and has no parameters.  Examples of implementations:

void Initialization() { _lh_ptr = this; }; //Saves a pointer to the node; this pointer can be referenced later.

void Initialization() { _count = 0; }; //Resets a global counter to zero; this counter can be used as a way to count iterations before terminating a repeating list.


 Validation

The Validation method provides a means of determining whether or not a block or data structure is of a specific type.  BAR calls this method as a way of checking if the node data forms a "match" with the presumed construct.

Definition of this method is optional.  If the method is not defined, BAR assumes a default validation condition:  true.

Validation methods are invoked during deserialization of a file in order to ensure file format compliance.  Validation methods also distinguish between constructs of different types in lists.

The Validation method has a Boolean return value, with a true condition representing a matched construct and false representing a mismatch.  The function has no parameters.

Examples of implementations:

bool Validation() {
  _bfOffBits = bfOffBits; //Saves state variable.
  return (bfType == 'BM'); //Returns true if the header starts with the two letters 'BM'.
};

bool Validation() {
  _offset = 0; //Resets running offset calculation.
  if (*(long*)BAR_ID != 'BAR1' || version < 1) return false; //Returns false if the header starts with
  //something other than the four characters 'BAR1' or has a version number less than 1.
  return true; //Got past false conditions; return true condition.
};
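
For comparison, the same kind of signature check can be sketched in ordinary C++ outside of BAR.  This is only an illustration:  the function name StartsWithBitmapSignature and its parameters are hypothetical and are not part of BAR.  Unlike an I.F. method, the stand-alone version has to check the available size itself before reading the signature.

```cpp
#include <cstddef>
#include <cstring>

//Hypothetical stand-alone check (not a BAR function):  returns true
//when the buffer begins with the two-byte signature "BM".
bool StartsWithBitmapSignature(const unsigned char *data, std::size_t size) {
  if (size < 2) return false; //Too small to hold the signature.
  return std::memcmp(data, "BM", 2) == 0; //Compare the first two bytes.
}
```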


 BlockSize

The BlockSize method calculates the size of a block, in bytes, bits, or units.  BAR calls this method to size a block.  For this reason, BlockSize is only used when deserializing blocks.  BlockSize is not used when deserializing data structures.

Definition of this method is optional.  If the method is not defined, the block size automatically defaults to the maximum available size.

A BlockSize method has a long integer return value, representing the size calculated.  The function has no parameters.

BlockSize is called immediately after the Validation critical step for blocks.  If Validation fails, BlockSize is not called:  BAR never calls BlockSize on an invalid node.

The return value unit depends on the type of block being sized:

  • For unorganized blocks, the return value represents a unit count.
  • For bit scan blocks, the return value represents a bit count.
  • For organized blocks that are not bit scan blocks, the return value represents a byte count.

BlockSize methods are invoked during deserialization of a file in order to calculate the size of a block.  A BlockSize method can be defined as a direct method of a block, or it can be defined as an indirect method (a child of the block's header structure).  If a header structure has a BlockSize method, this method is used to size the block--not the header structure.

A BlockSize method always has a range of values with which to work whenever it is called.  The global variable maxblocksize represents the largest possible return value, while the global variable minblocksize represents the smallest possible return value.  A critical error results if BlockSize returns a value outside the range of [minblocksize, maxblocksize].

BlockSize only serves to limit a block's size.  It cannot expand a block's size beyond its theoretical upper limits, and it cannot return zero.  Furthermore, the individual deserialization nuances of a block's children might further limit a block's size to a quantity lower than the value BlockSize might return.

Examples of implementations:

long BlockSize() { return (textblock_offset + textblock_size); }; //Return sum of two individual block portion sizes.

long BlockSize() { return ((chunkSize + 8 + 1) & ~1); }; //Return header-provided size, plus additional fields and aligned on a two-byte boundary.

long BlockSize() { return (_num_relocation_items * 4); }; //Return field count scaled by the size of each field.

long BlockSize() { return (maxblocksize / 2); }; //Return half the maximum block size.
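
The two-byte alignment in the second example can be checked with a few sample values.  The sketch below wraps the same expression in a plain C++ function; the name AlignedChunkSize is hypothetical, and chunkSize is passed as a parameter here rather than read from a header field.

```cpp
//Hypothetical wrapper around the expression from the example above:
//adds 8 bytes of additional fields, then rounds up to an even size.
constexpr long AlignedChunkSize(long chunkSize) {
  return (chunkSize + 8 + 1) & ~1L; //Adding 1 and clearing bit 0 rounds up.
}
```

An even chunkSize of 4 gives 12 (4 + 8, already even), while an odd chunkSize of 5 gives 14 (5 + 8 = 13, rounded up to 14).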


 Termination

The Termination method provides a means of determining when a parent block’s node list has no more iterations remaining.  It is called after BlockSize.

Use the Termination method when there is no easy way to size blocks ahead of actually processing their contents.  Repeating node lists can be used to capture a variable number of children in an organized block, but these lists still need to be told when to "stop."  The Termination method informs a parent block that this iteration is the last iteration.

A Termination method has a Boolean return value, with true flagging the end of the list, and false flagging the go-ahead to continue repeating the list.  The function has no parameters.

Termination methods are invoked during deserialization of a file in order to find the last iteration of a node list.  If the method is not defined, BAR assumes a default termination condition:  false.

Examples of implementations:

bool Termination() { return true; }; //Have construct act as a de-facto terminator:  wherever it appears, it terminates the list.

bool Termination() { return (++_current_lump >= _num_lumps); }; //Terminate only after a specific number of iterations.
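
The counting behavior of the second example can be modeled in plain C++.  This is a simplified sketch, not BAR's actual list logic:  the LumpCounter type and the CountIterations function are hypothetical, and the loop assumes that Termination is evaluated once per iteration.

```cpp
//Hypothetical model of a construct that terminates after a fixed count.
struct LumpCounter {
  long current_lump = 0;
  long num_lumps = 0;
  bool Termination() { return ++current_lump >= num_lumps; }
};

//Counts the iterations that run before Termination flags the last one.
long CountIterations(long num_lumps) {
  LumpCounter counter;
  counter.num_lumps = num_lumps;
  long iterations = 0;
  bool last = false;
  do {
    ++iterations;                 //The current iteration is processed...
    last = counter.Termination(); //...and then reports whether it was the last.
  } while (!last);
  return iterations;
}
```

Because the check happens as part of an iteration, this model always runs at least one iteration, even when num_lumps is zero.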


 OffsetTweak (advanced deserialization)

The OffsetTweak method calculates the new absolute position of the parse cursor during deserialization.  Offset tweaking, as it is known in BAR, is an advanced concept that is only designed to be used with nonlinear file formats.

If an OffsetTweak method is defined, it is called after BlockSize and Termination have completed for the node and after deserialization is complete for all of the children of the construct.

An OffsetTweak method has a long integer return value, representing the new absolute position of the parse cursor.  There are also two parameters.  The first parameter is a long integer indicating the default absolute position of the parse cursor; the second parameter is a Boolean value indicating whether the method is invoked from a deserialization operation (true) or serialization operation (false).  Because OffsetTweak is called by both deserialization and serialization procedures, the second parameter is necessary to distinguish one category of critical step from the other.

Offset tweaking is a very advanced way of conducting deserialization, and should not be used unless absolutely necessary.  When the contents of a file are mapped in a nonlinear manner, OffsetTweak methods are quite powerful:  the constructs are linearly organized while deserialized, but nonlinearly organized while serialized.

If the method is not defined, the default absolute position of the parse cursor is passed directly to the return value.  The default position is the absolute byte position just after the last byte of the evaluated construct (forward-moving cursor with no bytes wasted).  Unless BAR is explicitly told to "jump," so to speak, the normal behavior is to proceed with deserialization as if the file were linear.

An OffsetTweak method always has a range of values with which to work whenever it is called.  The global variable totalfilesize represents the largest possible return value, while zero represents the smallest possible return value.  A critical error results if OffsetTweak returns a value outside the range of [0, totalfilesize].
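
The default position and the allowed range can be expressed as two small helpers.  This is a sketch under assumptions:  the function names are hypothetical, and totalfilesize is passed as a parameter here, whereas in BAR it is a global variable.

```cpp
//Hypothetical helper:  the default position is the byte just after
//the last byte of the evaluated construct.
long DefaultCursorPosition(long construct_offset, long construct_size) {
  return construct_offset + construct_size;
}

//Hypothetical helper:  true if a returned position lies within
//the range [0, totalfilesize].
bool IsValidCursorPosition(long new_offset, long totalfilesize) {
  return new_offset >= 0 && new_offset <= totalfilesize;
}
```

Note that totalfilesize itself is a legal return value:  it places the parse cursor just past the last byte of the file.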

Example of implementation:

long OffsetTweak(long old_offset, bool is_deserializing) {
  _directory_offset += 16; //Advance the saved directory offset by 16 bytes.
  _lump_size = lump_size; //Save the lump size as a state variable.

  if (!is_deserializing) return old_offset; //Serialization unsupported

  if (_lump_size == 0) {
    if (_current_lump >= _num_lumps) return old_offset; //Don't move parse cursor if at end
    return _directory_offset; //Move parse cursor to next directory entry
  }
  else {
    return lump_offset; //Move parse cursor to next lump
  }
};


 Deserialize (advanced deserialization)

The Deserialize method performs translation of a serialized chunk of data to its deserialized equivalent.

A Deserialize method has as return value a pointer or reference to the construct in its deserialized form.  The function has no parameters.  The size and type of the deserialized construct need not match the size and type of the serialized chunk of data.  You have a node going in, and just about any node going out.

Deserialize methods are invoked immediately after the OffsetTweak critical step, if it exists.  However, it is only necessary to actually define a Deserialize method if the serialized form of the data is of a very convoluted nature, such as a compressed and/or encrypted bit stream.

If the method is not defined, the contents of the construct undergo no changes as part of deserialization.  In other words, you get as output exactly the node processed as part of all previous critical-step methods.

What makes Deserialize especially powerful is the prospect of creating many from one, or one from many.  You can translate an organized block with many children and sub-children into just a simple data structure.  Conversely, you can translate just a few bytes into a massive organized block with many "tree branches" just within the body of a single function.

Examples of implementation:

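void *Deserialize() {
  char *oldbuffer = (char*)this; //Start of the serialized data.
  long count = totalfilesize; //Bytes left to scan.
  long final_size = totalfilesize; //Starts at the input size; reduced below.
  char *newbuffer;
  char *newcursor;

  while (count) {
    if (*oldbuffer == 13) --final_size; //Each carriage return (ASCII 13) lowers the final size by one.
    ++oldbuffer;
    --count;
  }
  ...
};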

void *Deserialize() {
  ...
  switch (composite_choice) {
    case CTYPE_CHAR:
    case CTYPE_UCHAR:
      newptr = new chardata[total_result_size];
      if (!newptr) { m_Last_Error = alloc_error_str; return 0; }
    break;
    case CTYPE_CHAR_BE:
  ...
};
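
The first example above is cut short after it counts carriage returns.  One plausible reading, and it is only an assumption, is that the method goes on to build a copy of the data with the carriage returns removed.  The stand-alone C++ sketch below shows that kind of translation; the function StripCarriageReturns is hypothetical and is not taken from BAR.  It also illustrates that the deserialized form need not be the same size as the serialized form.

```cpp
#include <string>

//Hypothetical stand-alone translation (not a BAR method):  copies the
//input while dropping every carriage return (ASCII 13).
std::string StripCarriageReturns(const std::string &serialized) {
  std::string result;
  result.reserve(serialized.size()); //The output is never larger than the input.
  for (char c : serialized) {
    if (c != 13) result.push_back(c);
  }
  return result;
}
```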


 Rules governing critical steps

It is quite unusual to have a construct declared in an I.F. with every critical-step method defined.  The usage of critical steps depends on the need for a particular type of calculation at a given point in deserialization.

Still, an I.F. designer might become confused as to what is "actually happening" during the deserialization procedure if one or two critical steps do not appear to work as expected.  The following rules should help the user understand how critical steps fit together.

General order for critical steps is the following:

  • Minimum size test (exit early if too small)
  • Initialization (exit early if critical error)
  • Validation (exit early if invalid)
  • BlockSize (exit early if critical error; only called when deserializing blocks)
  • Termination (exit early if critical error)
  • Organized block body deserialization (exit early if body is invalid; only called when deserializing organized blocks)
  • OffsetTweak (exit early if critical error)
  • Deserialize (exit early if critical error)
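
The early-exit behavior of this sequence can be modeled as a list of steps that runs until one of them fails.  This is a simplified sketch, not BAR's implementation:  the Step type and the RunSteps function are hypothetical, and each step is reduced to a callback that returns false to exit early.

```cpp
#include <functional>
#include <string>
#include <vector>

//Hypothetical model of one step in the sequence.
struct Step {
  std::string name;
  std::function<bool()> run; //Returns false to exit early.
};

//Runs the steps in order.  Returns the name of the first step that
//exits early, or an empty string if every step passes.
std::string RunSteps(const std::vector<Step> &steps) {
  for (const Step &step : steps) {
    if (!step.run()) return step.name; //Later steps are never called.
  }
  return "";
}
```

In this model, a Validation step that returns false prevents every later step from running, which matches the rule that BlockSize is never called on an invalid node.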

If an organized block and its header structure both have the same critical-step method (one direct declaration and one indirect declaration):

  • Initialization for block is called first; Initialization for header structure is called after BlockSize for block.
  • Validation for block is called first; Validation for header structure is called after BlockSize for block.
  • BlockSize for block is called first; BlockSize for header structure is called after BlockSize for block.
  • Termination for header structure is called first; Termination for block is called immediately afterwards.
  • OffsetTweak is only called for header structure.  It is not called for block.
  • Deserialize is only called for header structure.  It is not called for block.

Minimum size test:

The minimum size test is a special pre-deserialization step for each construct.  An unorganized block has a minimum size of 1 unit, a data structure has a minimum size (bits or bytes) equalling its fixed size as declared in the I.F., and an organized block has a minimum size of 1 byte or the size of its header structure, if one exists, whichever one is larger.  Failure to satisfy the minimum size test renders the construct invalid before any critical steps are executed.
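
The three minimum-size rules can be summarized in a short function.  This is a sketch only:  the ConstructKind enumeration, the function names, and the parameters are hypothetical.  The result is in the unit that applies to the construct (units for an unorganized block, bits or bytes for a data structure, bytes for an organized block).

```cpp
#include <algorithm>

enum class ConstructKind { UnorganizedBlock, DataStructure, OrganizedBlock };

//Hypothetical helper.  fixed_size is the declared size of a data
//structure; header_size is the size of an organized block's header
//structure, or 0 if the block has none.
long MinimumSize(ConstructKind kind, long fixed_size, long header_size) {
  switch (kind) {
    case ConstructKind::UnorganizedBlock: return 1; //One unit.
    case ConstructKind::DataStructure:    return fixed_size;
    case ConstructKind::OrganizedBlock:   return std::max(1L, header_size);
  }
  return 0; //Not reached for the kinds listed above.
}

//True if the available space satisfies the minimum size test.
bool PassesMinimumSizeTest(ConstructKind kind, long fixed_size,
                           long header_size, long available) {
  return available >= MinimumSize(kind, fixed_size, header_size);
}
```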

Organized block body deserialization varies depending on the construct type selected as the body:

  • Data structure:  Just one data structure composes the body.  This is deserialized just after the Termination critical step of the parent organized block.
  • Unorganized block:  Just one unorganized block composes the body.  This is deserialized just after the Termination critical step of the parent organized block.
  • Organized block:  Just one organized block composes the body.  This is deserialized just after the Termination critical step of the parent organized block.
  • Decision list:  A choice between possible alternatives.
    • BAR tries to validate the first construct choice in the list, accepting it as the block body if it is valid.  If it is not valid, BAR tries the second choice in the list, and so on.
    • If BAR has gone through the entire list without finding a valid construct, and the decision list is optional, the block body is left empty.
    • If BAR has gone through the entire list without finding a valid construct, and the decision list is NOT optional, the parent block is rendered invalid.
  • Node list:  A sequence of individual child constructs to compose the organized block's node children.  The children can be blocks, structures, or decision lists.
    • If any list member other than an optional decision list is invalid, the parent block is rendered invalid.
    • If the list does not repeat, the block body is composed of just a single node list iteration.
    • If the list does repeat, the block body can be composed of one or more iterations of the node list.
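
The decision list rules above can be sketched as a first-match search.  This is an illustrative model rather than BAR's code:  the DecisionOutcome type and the ResolveDecisionList function are hypothetical, and each choice is reduced to a callback that reports whether the construct is valid.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct DecisionOutcome {
  int chosen;        //Index of the accepted choice, or -1 if the body is left empty.
  bool parent_valid; //False if the parent block is rendered invalid.
};

//Tries each choice in order and accepts the first valid one.
DecisionOutcome ResolveDecisionList(
    const std::vector<std::function<bool()>> &choices, bool optional) {
  for (std::size_t i = 0; i < choices.size(); ++i) {
    if (choices[i]()) return { static_cast<int>(i), true };
  }
  //No valid choice:  an optional list leaves the body empty; a
  //non-optional list renders the parent block invalid.
  return { -1, optional };
}
```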

A repeating node list terminates under the following circumstances:

  • Maximum space for block body is exhausted.  A BlockSize method limits the size of a block, and if the last iteration of a node list takes up all possible space, the node list terminates by default.
  • A Termination method returns true for a child.  A Termination method in one of the parent block's children flags the iteration as the last one.  This further limits the block's size even if a size was explicitly assigned via a BlockSize method.
  • The node list auto-terminates in the event of an invalid iteration.  The "autoterminate" attribute for a node list automatically terminates a list if a single invalid iteration is encountered.  In the event of such an invalid iteration, the partial iteration is discarded, leaving only valid iterations.
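
The three stop conditions can be modeled together in one loop.  This is a rough sketch under stated assumptions, not BAR's implementation:  the Iteration type and the RunRepeatingList function are hypothetical, each iteration is reduced to a size and two flags, and an iteration that does not fit in the remaining space is assumed to count as invalid.

```cpp
#include <vector>

struct Iteration {
  long size;       //Bytes consumed by this iteration.
  bool valid;      //False if a list member in the iteration is invalid.
  bool terminates; //True if a child's Termination method returns true.
};

//Returns the number of iterations kept, or -1 if the parent block is
//rendered invalid.
long RunRepeatingList(const std::vector<Iteration> &iterations,
                      long max_body_size, bool autoterminate) {
  long used = 0;
  long kept = 0;
  for (const Iteration &it : iterations) {
    //Assumption:  an iteration that does not fit counts as invalid.
    bool fits = it.size <= max_body_size - used;
    if (!it.valid || !fits) {
      if (autoterminate && kept > 0) return kept; //Discard the partial iteration.
      return -1; //Includes "autoterminate" with zero valid iterations.
    }
    used += it.size;
    ++kept;
    if (it.terminates) return kept;         //A child flagged the last iteration.
    if (used == max_body_size) return kept; //Space for the body is exhausted.
  }
  return kept; //The model's input ran out before any stop condition.
}
```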

Block minimum size considerations:

  • Unorganized blocks must have at least one unit.  You cannot size an unorganized block to zero units as a way to declare it invalid; you must explicitly flag its invalid condition with the Validation function.
  • Organized blocks can have a zero-length body only if the block has a header structure.  A zero-length body for a block without a header structure yields a zero-length parent organized block, which is not allowed.  Instead of sizing an organized block to zero bytes, you should explicitly flag its invalid condition with the Validation function.
  • The "autoterminate" attribute only works for node lists with at least one iteration.  You cannot use "autoterminate" to create a node list of zero iterations.




BARfly Help Copyright © 2009 Christopher Allen