Andreas Rozek      

LuaJava Programming Examples

This page contains a loose collection of Lua programming patterns. You may directly copy the source code shown here (and on the following pages) into your program or just use it as a basis for your own development.
 

Please also consider my "Hints for Reading" and the "List of Recent Changes".

HTML Parsing

One of the applications the author uses Lua for is parsing HTML documents. The following sections give a few examples of simple HTML parsing, which may be sufficient if you do not need to build an object model of your HTML document but only deal with specific HTML tags or elements.

Locating HTML Tags (I)

Assuming that a given HTML document has been read into a Lua string as a whole, the following function returns an iterator over every tag found in that document:

local function TagsOf (Argument)
  local Cursor,Length = 1,string.len(Argument);       -- iterate over whole file

  return function ()
    if (Cursor > Length) then return nil; end;            -- end-of-file reached

--**** look for next HTML tag ****

    local first,last = string.find(Argument, "<", Cursor, true);
    if (first == nil) or (first == Length) then Cursor = Length+1; return nil; end;

--**** skip any HTML comments ****

    while (string.sub(Argument, first, first+3) == "<!--") do
      first,last = string.find(Argument, "-->", first+4, true);-- HTML compliant
      if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end;

--**** look for next tag (which might - again - be a comment) ****

      first,last = string.find(Argument, "<", last+1, true);
      if (first == nil) or (first == Length) then Cursor = Length+1; return nil; end;
    end;

--**** locate the end of this tag (according to HTML rules) ****

    last = string.find(Argument, "[>\"']", last+1) or Length+1;
    local Token = string.sub(Argument,last,last);                 -- might be ""
    while (Token == "'") or (Token == '"') do            -- locate end of string
      last = string.find(Argument, Token, last+1, true);
      if (last == nil) then last = Length+1; break; end;         -- be forgiving

      last  = string.find(Argument, "[>\"']", last+1) or Length+1;
      Token = string.sub(Argument,last,last);                     -- might be ""
    end;

    Cursor = last+1;      -- end-of-token (or end-of-file) reached, be forgiving

--**** yield the "coordinates" of the current (probably incomplete) tag ****

    if (Token == ">") and (string.sub(Argument,last-1,last-1) == "/") then
      last = last-1;                                        -- take care of "/>"
    end;

    return first+1,math.min(last-1,Length);       -- without "<" and ">" or "/>"
  end;
end;

When constructed with a string containing a complete HTML document, this iterator returns, upon every invocation, the positions of the first and last character belonging to the next HTML tag (without the leading "<" and the trailing ">" or "/>"). After the whole string has been parsed, the iterator just returns nil.
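The technique behind this iterator, a closure that keeps its scan position in a local variable and searches with plain-text string.find calls, can be tried in isolation. The following is only a minimal sketch: PositionsOf is a hypothetical helper that does not appear in the original code, and it assumes a non-empty search string.

```lua
-- Hypothetical helper (not from the original page): yields the position of
-- every occurrence of a plain-text needle, using the same closure technique.
-- Assumption: Needle is not the empty string (otherwise Cursor never advances).
local function PositionsOf (Haystack, Needle)
  local Cursor = 1;                     -- scan position, kept between calls

  return function ()
    local first,last = string.find(Haystack, Needle, Cursor, true); -- plain
    if (first == nil) then return nil; end;         -- nothing left to find

    Cursor = last+1;                         -- continue behind this match
    return first;
  end;
end;

local Positions = {};
for Position in PositionsOf("<p>a<br/>b</p>", "<") do
  table.insert(Positions, Position);
end;

print(table.concat(Positions, ","));                           --> 1,5,11
```

As in TagsOf, the generic "for" loop ends as soon as the closure returns nil.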

This iterator could be used in a loop like the following:

local HTMLFile = io.open(filename,"r");           -- no error handling right now
  local HTMLContent = HTMLFile:read("*a");                               -- dto.

  for TagBegin,TagEnd in TagsOf(HTMLContent) do
    -- "^" anchors the match at TagBegin, "+" rejects an empty tag name
    local first,last = string.find(HTMLContent, "^[!/]?%w+", TagBegin);
    if (first == nil) or (last > TagEnd) then
      println("(invalid (or missing) token name)");
    else
      println(string.sub(HTMLContent,first,last));
    end;
  end;
HTMLFile:close();

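which prints the names of all (non-comment) HTML tags found in a given file, one per line. Any ordinary text between these tags is ignored. Just insert a filename of your choice into the first line of the code shown above and run it. Note that println is not part of standard Lua and is presumably supplied by the author's environment; in plain Lua, print can be used instead.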
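The name lookup can also be tried without a file. The following sketch wraps it in a hypothetical helper, TagNameIn, which is not part of the original code; the ranges passed in are the ones the iterator shown above is expected to report for the sample string.

```lua
-- Hypothetical helper (not from the original page): returns the tag name
-- found at the start of the range TagBegin..TagEnd, or nil if there is none.
local function TagNameIn (Content, TagBegin, TagEnd)
  local first,last = string.find(Content, "^[!/]?%w+", TagBegin);
  if (first == nil) or (last > TagEnd) then return nil; end;
  return string.sub(Content, first, last);
end;

local Content = "<a href='x'></a>";

-- tag ranges without "<" and ">": (2,11) for the opening, (14,15) for the
-- closing tag
print(TagNameIn(Content,  2, 11));                                 --> a
print(TagNameIn(Content, 14, 15));                                 --> /a
```

Returning nil for a missing name leaves it to the caller to decide how to report invalid tags.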

Please note that the parser shown above is a rather simple one and assumes at least some basic HTML coding discipline: '<' characters within ordinary text will be misinterpreted as the start of an HTML tag and cause the following text to be treated like the attribute definitions of that tag.
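This limitation can be illustrated with the two searches the iterator performs for every tag. The snippet below is only a sketch on a made-up input string; it mimics those searches instead of calling TagsOf itself.

```lua
local Text = "if a < b then <em>stop</em>";

-- the plain-text search for the next tag already hits the stray "<" ...
local first = string.find(Text, "<", 1, true);             -- first == 6

-- ... and the search for the end of that "tag" runs on to the ">" of "<em>"
local last = string.find(Text, "[>\"']", first+1);         -- last == 18

-- everything in between would be reported as the contents of one tag
local Misread = string.sub(Text, first+1, last-1); -- Misread == " b then <em"
print(Misread);
```

A document that writes such characters as "&lt;" does not run into this problem.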

Locating HTML Tags (II)

If more information about a given HTML tag is needed, including the ability to detect (at least some) spurious tags, a more sophisticated approach is required:

local function TagsOf (Argument)
  local Cursor,Length = 1,string.len(Argument);       -- iterate over whole file

  return function ()
    if (Cursor > Length) then return nil; end;            -- end-of-file reached

--**** look for next (potential) HTML tag ****

    local HTMLTag = {};
    repeat                                            -- until HTMLTag[0] exists
      local first,last = string.find(Argument, "<", Cursor, true);
      if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end;

--  **** skip any HTML comments ****

      while (string.sub(Argument, first, first+3) == "<!--") do
        first,last = string.find(Argument, "-->", first+4, true); -- HTML compliant
        if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end

--    **** look for next tag (which might - again - be a comment) ****

        first,last = string.find(Argument, "<", last+1, true);
        if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end
      end;

--  **** look for a valid tag name ****

      first,Cursor = string.find(Argument, "%S", last+1);     -- skip whitespace
      if (first == nil) then Cursor = Length+1; return nil; end;

      first,last = string.find(Argument, "[!/]?%a%w*", Cursor);
      if (first == Cursor) then                          -- valid tag name found
        HTMLTag[0] = string.lower(string.sub(Argument, first,last));

--    **** now look for valid attributes ****

        if (HTMLTag[0] == "!doctype") then     -- parse "doctype" tag as a whole
          first = string.find(Argument, "%S", last+1);        -- skip whitespace
          if (first == nil) then Cursor = Length+1; return HTMLTag; end;

          last = string.find(Argument, "[>\"']", last) or Length+1;
          local Token = string.sub(Argument,last,last);           -- might be ""
          while (Token == "'") or (Token == '"') do      -- locate end of string
            last = string.find(Argument, Token, last+1, true);
            if (last == nil) then last = Length+1; break; end;   -- be forgiving

            last  = string.find(Argument, "[>\"']", last+1) or Length+1;
            Token = string.sub(Argument,last,last);               -- might be ""
          end;

          Cursor = last+1;                          -- position cursor after ">"

          if (Token == ">") and (string.sub(Argument,last-1,last-1) == "/") then
            last = last-1;                   -- take care of "/>" (just in case)
          end;

          HTMLTag[1] = string.gsub(string.sub(Argument,first,last-1), "[%s]+", " ");
        else                                     -- parse an "ordinary" HTML tag
          first = string.find(Argument, "%S", last+1);        -- skip whitespace
          if (first == nil) then Cursor = Length+1; return HTMLTag; end;

          while (string.sub(Argument,first,first) ~= ">") do
            -- "-" is escaped: mixing classes and ranges in a set is undefined
            local start,stop = string.find(Argument, "^[%a_][%w_%-]*", first);
            if (start == nil) then Cursor = first; return HTMLTag; end;

            local Name = string.lower(string.sub(Argument,start,stop));

            first,last = string.find(Argument, "^%s*=", stop+1);     -- find "="
            if (first == nil) then       -- just an attribute, no explicit value
              HTMLTag[Name] = Name;                     -- XHTML-like definition
            else
              HTMLTag[Name] = "";             -- just a first, provisional value

              start,stop = string.find(Argument, "%S", last+1);--skip whitespace
              if (start == nil) then Cursor = Length+1; return HTMLTag; end;

              local Token = string.sub(Argument,start,start);
              if (Token == "'") or (Token == '"') then   -- locate end of string
                stop = string.find(Argument, Token, start+1, true) or Length+1;
                HTMLTag[Name] = string.sub(Argument, start+1,stop-1);
              elseif (Token == ">") then     -- oops, premature end-of-tag found
                stop = start-1;             -- prepare ">" for being found again
              else                 -- look for a non-quoted HTML attribute value
                -- letters, digits, "-", ".", "_" and ":" (digits were missing)
                first,last = string.find(Argument, "^[%w%-%._:]+", start);
                if (first == nil) then       -- invalid HTML value, be forgiving
                  first,last = string.find(Argument, "^[^%s/>]+", start);
                  if (first == nil) then Cursor = start; return HTMLTag; end;
                end;

                HTMLTag[Name],stop = string.sub(Argument, first,last), last;
              end;
            end;

            first = string.find(Argument, "%S", stop+1);      -- skip whitespace
            if (first == nil) then Cursor = Length+1; return HTMLTag; end;
          end;

          Cursor = first+1;                         -- position cursor after ">"
        end;
      end;
    until (HTMLTag[0] ~= nil);       -- wait for a valid tag, skip anything else

    return HTMLTag;                        -- return the completely analyzed tag
  end;
end;

The new code now returns one table per tag with the following contents:

  • item [0] contains the name of the tag (including a potential "!" or "/" prefix),

  • if the tag is a "!doctype" tag, item [1] contains the full doctype definition,

  • otherwise, every other item represents an attribute-value pair of the underlying HTML tag: the item key is the attribute name (converted to lower case) and the item value is the (unmodified) attribute value. HTML attributes without an explicit value assignment are stored with their name as their value, as required by XHTML.
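To make this layout concrete, the following sketch writes down by hand the table the parser is expected to produce for a made-up tag and queries it. IsClosingTag and HasValuelessAttribute are hypothetical helpers, not part of the original code.

```lua
-- Table expected for the made-up input  <INPUT Type="checkbox" CHECKED>
local Tag = { [0] = "input", type = "checkbox", checked = "checked" };

-- Hypothetical helpers (not from the original page)
local function IsClosingTag (Tag)
  return (string.sub(Tag[0], 1, 1) == "/");         -- e.g. "/a" for </a>
end;

local function HasValuelessAttribute (Tag, Name)
  return (Tag[Name] == Name);         -- stored with its own name as value
end;

print(Tag[0]);                                               --> input
print(Tag.type);                                             --> checkbox
print(IsClosingTag(Tag));                                    --> false
print(HasValuelessAttribute(Tag, "checked"));                --> true
```

With this representation, an attribute written without a value cannot be told apart from one written out as checked="checked".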

An example of using this iterator could be:

local HTMLFile = io.open(filename,"r");           -- no error handling right now
  local HTMLContent = HTMLFile:read("*a");                               -- dto.

  for Tag in TagsOf(HTMLContent) do
    print("<", Tag[0]);
      for Attribute,Value in pairs(Tag) do
        if (Attribute ~= 0) then
          if (Attribute == Value) then
            print(" ", Attribute);
          else
            print(" ", Attribute, "=\"", Value, "\"");         -- not fool-proof
          end;
        end;
      end;
    println(">");
  end;
HTMLFile:close();

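which prints all tags found (one per line) in an HTML-like format. Note that this code apparently relies on a print that emits its arguments without separators or a trailing line break, and on a println that ends the line. Standard Lua's print behaves differently (it separates its arguments with tabs and appends a newline), so in plain Lua io.write would be the closer match.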
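Since the traversal order of a Lua table is not specified, the attributes may appear in any order. Where a reproducible result is wanted, the tag can be turned into a string with sorted attribute names first. The following is a minimal sketch under that assumption; TagToString is a hypothetical helper, not part of the original code, and like the loop above it is not fool-proof (attribute values containing double quotes are not escaped).

```lua
-- Hypothetical helper (not from the original page): formats a tag table in
-- an HTML-like way, with attribute names in alphabetical order.
local function TagToString (Tag)
  local Names = {};
  for Name in pairs(Tag) do              -- collect attribute names only,
    if (type(Name) == "string") then     -- skipping the numeric items 0 and 1
      table.insert(Names, Name);
    end;
  end;
  table.sort(Names);

  local Parts = { "<" .. Tag[0] };
  if (Tag[1] ~= nil) then table.insert(Parts, Tag[1]); end;  -- doctype text
  for _,Name in ipairs(Names) do
    if (Tag[Name] == Name) then                   -- attribute without value
      table.insert(Parts, Name);
    else
      table.insert(Parts, Name .. '="' .. Tag[Name] .. '"');
    end;
  end;

  return table.concat(Parts, " ") .. ">";
end;

local Sample = { [0] = "input", type = "checkbox", checked = "checked" };
print(TagToString(Sample));            --> <input checked type="checkbox">
```

Building the string with table.concat keeps the formatting independent of the output functions of the environment.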

Disclaimer

Please, also consider the author's Disclaimer!


http://www.Andreas-Rozek.de/LuaJava/Examples/index_en.html    (last Modification: 21.11.2004)