Saturday, August 31, 2013

A General Algorithm for Creating a Tree Structure from a String with Nested Parentheses


Suppose there's a tree structure MyTree, which contains three fields: Integer id, String value, and List<MyTree> children. When constructing a MyTree object with constructor "MyTree(Integer id, String value)", the id parameter should not be null or negative, and the value parameter should contain only English letters or numbers. The toString() public method of MyTree returns a String that contains all available information about the MyTree object and has a pattern like this:


(1:AA(..child 1..)(..child 2..)(..child 3..)...(..child n..))

For example, 

"(1:AA(2:BB)(3:CC(31:wef(52:huEf)(35:weDf(554:Q)))(43:f34w(42:wefw)))(4:ge(43:werw))(5:zer(99:f5)(100:f5)))"

is the tree string of a MyTree object (the return value of the "public String toString()" method in class MyTree) with id 1, value AA, and four MyTree children with id 2, 3, 4, 5 respectively.


The problem is: how to create a tree structure from a string with nested parentheses? In other words, how to convert this kind of tree strings back to objects with MyTree data structure?

In addition, there may be spaces inserted between parentheses, values, ids, and colons that do not damage the validity of the tree string. For example, "(1:AA(65:wete))" and "(  1  : AA(65 :  wete  )  )" are converted to the same MyTree object.


This problem seems quite difficult when first encountered, since the tree structure can't be constructed with just a few regular expressions. The following is my implementation of an algorithm for creating the tree structure from a tree string with nested parentheses. You can download the codes from http://sites.google.com/site/moderntone/MyStringParser.zip






MyTree.java


package com.treeparse;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyTree {

 private Integer id;
 private String value;
 private List<MyTree> children;
 
 public MyTree(Integer id, String value){
  if (id == null || id < 0)
   new Exception("Error: The id should be non-negative.").printStackTrace();
  Pattern pattern = Pattern.compile("[^a-zA-Z0-9]");
  Matcher matcher = pattern.matcher(value);
  if (matcher.find())
   new Exception("Error: The value should contain only English letters or numbers.").printStackTrace();
  this.id = id;
  this.value = value;
  children = new ArrayList<MyTree>();
 }

 public String getValue() {
  return value;
 }
 public void setValue(String value){
  this.value = value;
 }
 
 public int getId(){
  return id;
 }
 public void setId(int id){
  this.id = id;
 }
 
 public List<MyTree> getChildren() {
  return children;
 }

 public MyTree getChild(int i){
  return children.get(i);
 }
 
 public boolean isLeaf() {
  return children.size() == 0;
 }
 
 public static MyTree valueOf(String treeString){
  MyParser mp = new MyParser();
  return mp.parse(treeString);
 }
 
 @Override
 public String toString(){
  StringBuilder builder = new StringBuilder("(" + id + ":" + value);
  for (MyTree child : children)
   builder.append(child);
  return builder.append(")").toString();
 }
}


MyTokenizer.java

package com.treeparse;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyTokenizer {
 
 private String treeString;
 protected boolean isInvalid;
 
 protected String value;
 protected Integer id; 
 protected List<String> childUnits;
 
 public MyTokenizer(String treeString) {
  this.treeString = treeString.trim();
  tokenize();
 }
 
 private void tokenize(){
  try {
   setId();
   setValue();
   setChildUnits();
  } catch (Exception e) {
   System.out.println("MyTokenizer.tokenize() Exception e = " + e);
   isInvalid = true;
   return;
  }
 }
 
 private void setId(){
  try {
   Pattern pattern = Pattern.compile("(?<=\\() *\\d+ *(?=:)"); 
   Matcher matcher = pattern.matcher(treeString);
   if (matcher.find()){
    id = Integer.parseInt(matcher.group().trim());
    return;
   }
  } catch (Exception e) {
   System.out.println("MyTokenizer.setId() Exception e = " + e);
  }
  isInvalid = true;
 }
 
 private int tempIndex;
 private void setValue(){
  try {
   Pattern pattern = Pattern.compile("(?<=:) *[a-zA-Z0-9]+ *[\\(\\)]"); 
   Matcher matcher = pattern.matcher(treeString);
   if (matcher.find()){
    tempIndex = treeString.indexOf(matcher.group()) + matcher.group().length() - 1;
    value = matcher.group().substring(0, matcher.group().length() - 1).trim();
    return;
   }
  } catch (Exception e) {
   System.out.println("MyTokenizer.setValue() Exception e = " + e);
  }
  isInvalid = true;
 }
 

 /**
  * Examine treeString's parentheses from leftParenthesis to treeString.length() - 1
  * @param leftParenthesis
  */
 private void setChildUnits(){
  this.childUnits = new ArrayList<String>();
  int cv = 0;
  Integer tempStart = null;
  for (int i = tempIndex; i < treeString.length() - 1; i ++){
   switch (treeString.charAt(i)) {
   case '(':
    if (++cv == 1)
     tempStart = i;
    break;
   case ')':
    if (--cv == 0){
     childUnits.add(treeString.substring(tempStart, i+1));
     tempStart = null;
    }
    break;
   }
  }
  if (tempStart != null)
   isInvalid = true;
 }
}


MyParser.java

package com.treeparse;

import java.util.List;

public class MyParser {

 public MyParser() {
  
 }

 public MyTree parse(String treeString){
  MyTokenizer tokenizer = new MyTokenizer(treeString);
  if (tokenizer.isInvalid){
   System.err.println("warning: the input is not a valid MyTree string.");
   return null;
  }
  List<String> childUnits = tokenizer.childUnits;
  MyTree myTree = new MyTree(tokenizer.id, tokenizer.value);
  List<MyTree> children = myTree.getChildren();
  for (String childUnit : childUnits)
   children.add(parse(childUnit));
  return myTree;
 }
 
 public static void main(String[] args) {
  String testString = "(1 : AA (2 : BB) (3: wef (56:3t4) (6:4g3))  )";
  testString = "(1:AA(2:BB)(3:CC(31:wef(52:huEf)(35:weDf(554:Q)))(43:f34w(42:wefw)))(4:ge(43:werw))(5:zer(99:f5)(100:f5)))";
  MyTree tree = MyTree.valueOf(testString);
  System.out.println(tree);

  for (MyTree myTree : tree.getChildren()){
   System.out.println(myTree.getId() + " with tree string: " + myTree);
  }
 }
 
}


Key point of the algorithm: 
Identify the children tree strings of the root, and parse them recursively to create the whole tree structure



A  key point of the codes below is in setChildUnits() method in MyTokenizer.java. The method determines tree string's sub-strings that represent the children of the root, and the MyParser parses them recursively until all tree strings as leaves are parsed and the whole tree are constructed. 


The idea behind the method is not complex. The program first identifies the start and end indexes of the sub-string about the part of all children tree strings and then keeps track of the parentheses with the value of (number of left parentheses) - (number of right parentheses), stored in local variable cv. 

If cv is 1 when a left parenthesis is checked, it is the start parenthesis of a child tree string; if cv is 0 when a right parenthesis is checked, it is the end parenthesis of a child tree string. This claim can be easily proved. Suppose the tree string is a valid one, then left parentheses and right parentheses must occur  pairwise.Thus, the cv value of the start parenthesis of any child is 1 + n - n = 1, and cv of the end parenthesis of any child is obviously constantly zero. For all left parentheses beside the start parentheses of children, the cv value is always larger than 1, since 1 + n - m > 1 and m < n; for all right parentheses beside the end parentheses of children, cv is m + 0 with m = 1 or more and thus always larger than 0. With this idea, the start and end parentheses of all children tree strings are identified.


The concept of this implementation can similarly be applied to all kinds of tree structures with multiple fields and tree strings with different patterns. There are three major classes: a tree class, and tokenizer class, and a parser class. The tree class represents the tree structure; the tokenizer classes examines the tree string to obtain information about all fields and the first-generation children; the parser class recursively parses the tree strings until the tree structure is constructed.