Andreas Rozek
LuaJava Programming Examples
This page contains a loose collection of Lua programming patterns. You may
directly copy the source code shown here (and on the following pages) into
your program or just use it as a basis for your own development.
Please also consider my "Hints for Reading"
and the "List of Recent Changes".

HTML Parsing

One of the applications the author uses Lua for is the parsing of HTML documents. This section gives a few examples of simple HTML parsing, which may be sufficient if you do not need to build an object model for your HTML document but only deal with specific HTML tags or elements.

Locating HTML Tags (I)

Assuming that a given HTML document has been read into a Lua string as a whole, the following function iterates over every tag found in that document:

local function TagsOf (Argument)
  local Cursor,Length = 1,string.len(Argument); -- iterate over whole file
  return function ()
    if (Cursor > Length) then return nil; end; -- end-of-file reached
    --**** look for next HTML tag ****
    local first,last = string.find(Argument, "<", Cursor, true);
    if (first == nil) or (first == Length) then Cursor = Length+1; return nil; end;
    --**** skip any HTML comments ****
    while (string.sub(Argument, first, first+3) == "<!--") do
      first,last = string.find(Argument, "-->", first+4, true); -- HTML compliant
      if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end;
      --**** look for next tag (which might - again - be a comment) ****
      first,last = string.find(Argument, "<", last+1, true);
      if (first == nil) or (first == Length) then Cursor = Length+1; return nil; end;
    end;
    --**** locate the end of this tag (according to HTML rules) ****
    last = string.find(Argument, "[>\"']", last+1) or Length+1;
    local Token = string.sub(Argument,last,last); -- might be ""
    while (Token == "'") or (Token == '"') do -- locate end of string
      last = string.find(Argument, Token, last+1, true);
      if (last == nil) then last = Length+1; break; end; -- be forgiving
      last = string.find(Argument, "[>\"']", last+1) or Length+1;
      Token = string.sub(Argument,last,last); -- might be ""
    end;
    Cursor = last+1; -- end-of-token (or end-of-file) reached, be forgiving
    --**** yield the "coordinates" of the current (probably incomplete) tag ****
    if (Token == ">") and (string.sub(Argument,last-1,last-1) == "/") then
      last = last-1; -- take care of "/>"
    end;
    return first+1,math.min(last-1,Length); -- without "<" and ">" or "/>"
  end;
end;
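
The quote-aware search for the end of a tag is the least obvious step of the function above, so it may help to try it in isolation. The following is only a minimal sketch: the helper name EndOfTag and the sample string are hypothetical and not part of the original page; the scanning logic is copied from the function above.

```lua
-- Hypothetical helper (not from the original page): returns the index of the
-- ">" closing the tag that starts at Start, ignoring any ">" inside quoted
-- strings. Returns Length+1 if the tag is never closed.
local function EndOfTag (Text, Start)
  local Length = string.len(Text);
  local last = string.find(Text, "[>\"']", Start+1) or Length+1;
  local Token = string.sub(Text, last, last);          -- might be ""
  while (Token == "'") or (Token == '"') do
    last = string.find(Text, Token, last+1, true);     -- matching closing quote
    if (last == nil) then return Length+1; end;        -- unterminated string
    last = string.find(Text, "[>\"']", last+1) or Length+1;
    Token = string.sub(Text, last, last);
  end;
  return last;
end;

local Sample = [[<a title="x > y" href='z'>link</a>]];

-- a plain search stops at the ">" inside the quoted title attribute
print((string.find(Sample, ">", 1, true)));   --> 13
-- the quote-aware scan skips both quoted strings first
print(EndOfTag(Sample, 1));                   --> 26
```

The plain search (fourth argument true) treats ">" literally, whereas the character class "[>\"']" makes the scan stop at whichever comes first: the end of the tag or the start of a quoted string.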

When constructed with a string containing a complete HTML document, TagsOf returns an iterator which, upon every invocation, yields the positions of the first and last character belonging to the next HTML tag (without the leading "<" and the trailing ">" or "/>"). Once the whole string has been parsed, the iterator just returns nil. It could be used in a loop like the following:

local HTMLFile = io.open(filename,"r"); -- no error handling right now
local HTMLContent = HTMLFile:read("*a"); -- ditto
-- note: println is not part of the standard Lua library; substitute print
-- if it is not available in your environment
for TagBegin,TagEnd in TagsOf(HTMLContent) do
  local first,last = string.find(HTMLContent, "[!/]?%w*", TagBegin);
  if (first == nil) or (last > TagEnd) then
    println("(invalid (or missing) token name)");
  else
    println(string.sub(HTMLContent,first,last));
  end;
end;
HTMLFile:close();

which prints all (non-comment) HTML tags found in a given file, one per line. Any ordinary text between these tags is ignored. Just insert a filename of your choice into the first line of the code shown above and run it.

Please note that the parser shown above is a rather simple one and assumes at least some basic HTML coding discipline: '<' characters within ordinary text will be misinterpreted as the start of an HTML tag and cause the following text to be treated like the attribute definitions of that tag.

Locating HTML Tags (II)

If more information about a given HTML tag is needed, including the ability to detect (at least some) spurious tags, a more sophisticated approach is required:

local function TagsOf (Argument)
  local Cursor,Length = 1,string.len(Argument); -- iterate over whole file
  return function ()
    if (Cursor > Length) then return nil; end; -- end-of-file reached
    --**** look for next (potential) HTML tag ****
    local HTMLTag = {};
    repeat -- until HTMLTag[0] exists
      local first,last = string.find(Argument, "<", Cursor, true);
      if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end;
      -- **** skip any HTML comments ****
      while (string.sub(Argument, first, first+3) == "<!--") do
        first,last = string.find(Argument, "-->", first+4, true); -- HTML compliant
        if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end
        -- **** look for next tag (which might - again - be a comment) ****
        first,last = string.find(Argument, "<", last+1, true);
        if (first == nil) or (last == Length) then Cursor = Length+1; return nil; end
      end;
      -- **** look for a valid tag name ****
      first,Cursor = string.find(Argument, "%S", last+1); -- skip whitespace
      if (first == nil) then Cursor = Length+1; return nil; end;
      first,last = string.find(Argument, "[!/]?%a%w*", Cursor);
      if (first == Cursor) then -- valid tag name found
        HTMLTag[0] = string.lower(string.sub(Argument, first,last));
        -- **** now look for valid attributes ****
        if (HTMLTag[0] == "!doctype") then -- parse "doctype" tag as a whole
          first = string.find(Argument, "%S", last+1); -- skip whitespace
          if (first == nil) then Cursor = Length+1; return HTMLTag; end;
          last = string.find(Argument, "[>\"']", last) or Length+1;
          local Token = string.sub(Argument,last,last); -- might be ""
          while (Token == "'") or (Token == '"') do -- locate end of string
            last = string.find(Argument, Token, last+1, true);
            if (last == nil) then last = Length+1; break; end; -- be forgiving
            last = string.find(Argument, "[>\"']", last+1) or Length+1;
            Token = string.sub(Argument,last,last); -- might be ""
          end;
          Cursor = last+1; -- position cursor after ">"
          if (Token == ">") and (string.sub(Argument,last-1,last-1) == "/") then
            last = last-1; -- take care of "/>" (just in case)
          end;
          HTMLTag[1] = string.gsub(string.sub(Argument,first,last-1), "[%s]+", " ");
        else -- parse an "ordinary" HTML tag
          first = string.find(Argument, "%S", last+1); -- skip whitespace
          if (first == nil) then Cursor = Length+1; return HTMLTag; end;
          while (string.sub(Argument,first,first) ~= ">") do
            local start,stop = string.find(Argument, "^[%a_][%w-_]*", first);
            if (start == nil) then Cursor = first; return HTMLTag; end;
            local Name = string.lower(string.sub(Argument,start,stop));
            first,last = string.find(Argument, "^%s*=", stop+1); -- find "="
            if (first == nil) then -- just an attribute, no explicit value
              HTMLTag[Name] = Name; -- XHTML-like definition
            else
              HTMLTag[Name] = ""; -- just a first, provisional value
              start,stop = string.find(Argument, "%S", last+1); -- skip whitespace
              if (start == nil) then Cursor = Length+1; return HTMLTag; end;
              local Token = string.sub(Argument,start,start);
              if (Token == "'") or (Token == '"') then -- locate end of string
                stop = string.find(Argument, Token, start+1, true) or Length+1;
                HTMLTag[Name] = string.sub(Argument, start+1,stop-1);
              elseif (Token == ">") then -- oops, premature end-of-tag found
                stop = start-1; -- prepare ">" for being found again
              else -- look for a non-quoted HTML attribute value
                first,last = string.find(Argument, "^[%w%-._:]+", start); -- letters, digits, "-", ".", "_", ":"
                if (first == nil) then -- invalid HTML value, be forgiving
                  first,last = string.find(Argument, "^[^%s/>]+", start);
                  if (first == nil) then Cursor = start; return HTMLTag; end;
                end;
                HTMLTag[Name],stop = string.sub(Argument, first,last), last;
              end;
            end;
            first = string.find(Argument, "%S", stop+1); -- skip whitespace
            if (first == nil) then Cursor = Length+1; return HTMLTag; end;
          end;
          Cursor = first+1; -- position cursor after ">"
        end;
      end;
    until (HTMLTag[0] ~= nil); -- wait for a valid tag, skip anything else
    return HTMLTag; -- return the completely analyzed tag
  end;
end;
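
The attribute handling in the function above relies on Lua patterns that are anchored with "^" at an explicit start position. The following minimal sketch isolates that idea for a single attribute. The helper name AttributeAt, its return values and the sample text are assumptions made for illustration only; unlike the function above, it accepts any run of non-blank characters as an unquoted value.

```lua
-- Hypothetical helper (not from the original page): parses one attribute
-- starting at index Start and returns its lower-case name, its value and
-- the index of the last character consumed (or nil if no name is found).
local function AttributeAt (Text, Start)
  -- "^" anchors the match at Start instead of searching further ahead
  local first,last = string.find(Text, "^[%a_][%w%-_]*", Start);
  if (first == nil) then return nil; end;
  local Name = string.lower(string.sub(Text, first, last));
  local _,equals = string.find(Text, "^%s*=%s*", last+1);
  if (equals == nil) then return Name, Name, last; end;   -- no explicit value
  local Quote = string.sub(Text, equals+1, equals+1);
  if (Quote == '"') or (Quote == "'") then                -- quoted value
    local stop = string.find(Text, Quote, equals+2, true) or string.len(Text)+1;
    return Name, string.sub(Text, equals+2, stop-1), stop;
  end;
  first,last = string.find(Text, "^[^%s>]+", equals+1);   -- unquoted value
  if (first == nil) then return Name, "", equals; end;
  return Name, string.sub(Text, first, last), last;
end;

local Text = [[HREF = "a b" checked width=100>]];
local Position = 1;
while (string.sub(Text, Position, Position) ~= ">") do
  local Name,Value,Last = AttributeAt(Text, Position);
  if (Name == nil) then break; end;                       -- be forgiving
  print(Name, Value);
  Position = string.find(Text, "%S", Last+1) or string.len(Text)+1;
end;
```

With the standard print function this lists href, checked and width together with their values; an attribute without an explicit value is reported with its own name as value, following the XHTML-like convention used above.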
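
This version of TagsOf returns one table per tag with the following contents (as far as they can be read from the code above):

Tag[0]: the tag name in lower case, including any leading "!" or "/" (e.g. "!doctype" or "/a")
Tag[1]: for a "!doctype" tag only, the remainder of the declaration, with runs of whitespace collapsed into single blanks
Tag[name]: for any other tag, one entry per attribute, indexed by the lower-case attribute name; its value is the attribute value (without surrounding quotes) or, for an attribute without an explicit value, the attribute name itself
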
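An example of using this iterator could be:

local HTMLFile = io.open(filename,"r"); -- no error handling right now
local HTMLContent = HTMLFile:read("*a"); -- ditto
-- note: the standard Lua print appends a newline and separates its arguments
-- with tabs, and println is not a standard function; with the standard
-- library only, io.write would be the closer equivalent of both
for Tag in TagsOf(HTMLContent) do
  print("<", Tag[0]);
  for Attribute,Value in pairs(Tag) do -- pairs() also works where a bare table is no longer accepted
    if (Attribute ~= 0) then
      if (Attribute == Value) then
        print(" ", Attribute);
      else
        print(" ", Attribute, "=\"", Value, "\""); -- not fool-proof
      end;
    end;
  end;
  println(">");
end;
HTMLFile:close();

which prints all tags found (one per line) in an HTML-like format.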
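
The output statement in the loop above is marked as "not fool-proof": a double quote inside an attribute value would break the printed tag, and the order of the attributes is not defined. As a minimal sketch of one way around both issues, the following function builds the text of a tag from a table of the shape returned by TagsOf. The name TagToString, the sorting of attribute names and the use of "&quot;" for embedded double quotes are assumptions of this sketch, not part of the original page.

```lua
-- Hypothetical helper (not from the original page): converts a tag table
-- (tag name at index 0, optional doctype text at index 1, attributes under
-- string keys) into an HTML-like string.
local function TagToString (Tag)
  local Names = {};
  for Name in pairs(Tag) do                    -- collect attribute names only
    if (type(Name) == "string") then table.insert(Names, Name); end;
  end;
  table.sort(Names);                           -- deterministic output order
  local Parts = { "<" .. Tag[0] };
  if (Tag[1] ~= nil) then table.insert(Parts, Tag[1]); end;   -- doctype text
  for _,Name in ipairs(Names) do
    local Value = Tag[Name];
    if (Value == Name) then                    -- attribute without a value
      table.insert(Parts, Name);
    else
      Value = string.gsub(Value, '"', "&quot;");   -- keeps first result only
      table.insert(Parts, Name .. '="' .. Value .. '"');
    end;
  end;
  return table.concat(Parts, " ") .. ">";
end;

print(TagToString({ [0] = "a", href = "x.html", checked = "checked" }));
print(TagToString({ [0] = "img", alt = 'say "hi"' }));
```

Building the complete string first and printing it once also avoids any dependence on how a particular print function separates its arguments.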

Disclaimer

Please also consider the author's Disclaimer!

http://www.Andreas-Rozek.de/LuaJava/Examples/index_en.html (last modification: 21.11.2004)