Search This Blog

Sunday, January 8, 2012

Expression Evaluation

Expression evaluation is one of the most difficult and complex parts of this project, but it's one of the most important.  Most data processing an any program is going to involve expressions, so it is essential that I optimize this process as much as possible.  I created 4 different flow charts approaching this problem from different angles, and think I've got a good handle on how this is going to come together. 

The first way I tried involved a large for loop that would cycle through each character in the expression, and use a state machine to separate the characters into numerals, or variables, or operators.  It used a select case to cycle through the operators, and evaluate them one at a time for left and right operands.  I'm not sure this process would work, as it doesn't handle order of operation very well.

Next idea was to pre-parse the expression before trying to evaluate it.  This would involve a large loop similar to the first idea, but no evaluation would happen in stage one; it would just parse up the terms into numbers, variables, operators, etc.  These terms would be stored in an array that the second phase would then evaluate.  It looked like the second loop would be much more compact, since it would just be a bunch of select case statements over and over.  Order of operations would not be such a problem in this model, as the evaluation routine could loop over the collection of terms, evaluating the highest precedence operators first, and working its way down through highest precedence to the lowest (parenthesis, multiplicatives, additives, relational comparisions, assignment, sequential evaluation, etc).  I liked the way this one was organized, but it would require multiple (many, many actually) loops over the terms, moving them around, combining them, and sorting them, over and over.  This would surely require a lot more processing time than I wanted to devote to this.

I thought about using dynamically compiled C to evaluate the expressions, and did some research into what would be involved in this.  I decided against it as it would require even greater processing time than the previous idea.

Finally, I settled on a single-pass method that uses recursion to evaluate expressions, selectively processing or deferring for the next operand/operator in the expression.  This method seems the most promising, and it's the one I'm going with (unless I think of a better way).

Here's the basic idea for how this works:
1) A starter function is called.  The function takes 3 parameters, the ScriptObject byref(containing variables, structs, enums, datablock, etc), the expression (as a string) byval, and a variable to hold the return value of the evaluation, byref.  The function returns an integer type enum, ErrorCode, that indicates if something went wrong.  Before it does anything, it makes a backup copy of the ScriptObject, including all Variables, structs, enums, datablock, everything.  This allows all processing to be reverted in the event of an error occuring during evaluation.  Variable contents could be left in a very inconsistent state in this case, and the only choice is to revert everything and return an error code.  This will allow easier debugging, since analyzing a total mess of jumbled data is a nightmare.

2) The starter creates a few variables to track the progress of the evaluation, and control the other functions.  Pos is the character position in the Expression where evaluation is currently at.  TermIdx is an auto-increment index assigned to each term as they are processed, for easy identification during processing.  ExpTerm is a structure to contain parsed expression terms.  ThisTerm is a var of type ExpTerm that holds the currently processing term.  LastTerm is a var of type ExpTerm that contains the previous term that was handled (or left operand for operators that take two terms).  RightTerm is a var of type ExpTerm that holds the next term to be processed (or a right operand for operators that take two terms).  For operations that need to look-ahead in the expression to make decisions (for example, declaring variables takes a ThisTerm - DataType, a RightTerm - Variable Name, and needs to look ahead to see if there are array subscripts to choose wrether to make a simple variable or an array), RightTerm2 holds the look-ahead term.  If the RightTerm2 turns out to be something unexpected, or unusable for ThisTerm, RightTerm2 is moved into NextTerm for safe keeping.

3) The Starter calls a function that parses the expression into terms, GetNextTerm.  This function takes 5 parameters, and return an ErrorCode.  Param 1: ScriptObject that contains variables, structs, enums, datablock, etc; Param 2: Expression to be parsed, Param 3: NextTerm, byref, that will be populated with the NextTerm parsed; Param 4: Pos, the position in Expression to resume parsing at; Param 5: TermIdx, the auto-increment index for the retrieved term.  This function loops through characters in the Expression, starting as Pos, ignores whitespace until the first character is found.  It checks the character for operator symbols, and if found, it identifies the operator and stores in NextTerm. 

If the first character found is alphanumeric, it builds a string Symbol until a non-alphanumeric character is reached.  It then checks Symbol to see if it is numeric: if so, it checks if the stop character was a decimal point, and continues to build the number if it was.  Once the number is fully parsed, it checks it against some constants to see what kind of numeral is it, floating point, 32-bit integer, or 64-bit interger, based on size, and decimal place existance.  It creates a Term out of the parsed info and returns.

If the Symbol is non-numeric, it assumes some sort of moniker is present.  It first checks to see if Symbol is a primitive data type: if so, it returns the Term as a DataType.  Next, it searches through the Moniker Index to find out if Symbol is an enum type, struct type, functoin call, or variable.  If enum or struct, it returns the DataType.  If function, it returns a function call Term.  If variable, it grabs the details of the variable from the VarCatalog, and returns Variable Term.

Lastly, if it still hasn't identified Symbol, it returns an UnknownMoniker Term.  This isn't an error condition, it just indicates that this term could be a new variable declaration.  Further evaluation will be required to determine if the Moniker is actually a reserved work and an error should be thrown.

4) Now that the first Term is parsed and ready to process, the Starter invokes the recursive Evaluate function.  This function takes 7 parameters, and returns ErrorCode.  Param 1: ScriptObject byref; Param 2: Expression to evaluate; Param 3: Result of the evaluation, byref; Param 4: Pos, byref; Param 5: TermIdx, byref; Param 6: ThisTerm; Param 7: LastTerm. The function is called with all available info at this point, and LastTerm is assigned Nothing, since there wasn't a last term.

5) Evaluate runs a Select Case against the type of ThisTerm.  For each possible Term type, RightTerm is populated if needed, then RightTerm2 if populated to check for precedence.  If the current term is higher priority than RightTerm2, the current Operation is evaluated and then Evaluate is called recursively; The results of this operation are LeftTerm, RightTerm2 becomes ThisTerm, and all other variables are passed on as well.  The Evaluate returns, its Result is passed on as this recursion's Result, and the function returns.

If RightTerm2 is higher precedence, then Evaluate is called before evaluating ThisTerm; LeftTerm is populated with RightTerm, and ThisTerm is populated with RightTerm2, all other vars are passed on to the recursion.  When Evaluate returns, its Result will contain the results of processing, and those are used as RightTerm.  The Operation ThisTerm is then evaluates against LeftTerm and Result, and the results are returned in Result, and the function returns.

6) After all terms have been evaluated, RightTerm2 will start populating with EmptyTerm, which is the lowest precedence priority, and all operations will complete, and if they were deferred, they will evaluate now.  Then each recursion will get a EmptyTerm when they populate RightTerm2, and eventually all Evaluate recursions will return, and execution will fall back to the Starter.

7) Starter will then analyze the ErrorCode that was returned.  If everything is ok, the returned Result will be packaged into this function's Result parameter, and execution will return to the main code execution body.  All variable assignments will be committed at this point.  If something goes wrong, and the code hasn't indicated it can handle the exception (unhandled exception), then execution is halted so I can analyze the two data blocks (the original and post-mortem versions) and the passed expression and all parsed Terms to find out what went wrong.  If the code indicates it handles exceptions, then ScriptObject will be reverted to the backup and an ErrorCode will be returned.  This allowed the script code to continue without worrying about garbled datablocks.


So, that's the plan.  I think it is a workable solution to the problem.  If anyone has any ideas or suggestions (or potential pitfalls), I'd love to hear them in the comments.

No comments:

Post a Comment