Class

Tokenizer

A simple HTML tokenizer. It breaks a stream of text into tokens, where each token is a string that represents either a run of text or an HTML element.

The tokenizer currently assumes valid XHTML, which means no stray < or > characters in the text (they must be escaped as &lt; and &gt;).

Usage:

tokenizer = HTML::Tokenizer.new(text)
while token = tokenizer.next
  p token
end

Public Attributes

line: The current line number.
position: The current (byte) position in the text.

Public Methods

new: Create a new Tokenizer for the given text.
next: Return the next token in the sequence, or nil if there are no more tokens in the stream.
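
To make the new/next protocol concrete, here is a minimal sketch of a tokenizer with the same public surface. SketchTokenizer is a hypothetical stand-in written for illustration, not the library's implementation: it assumes a tag is everything from < to the next >, ignores quoted strings and comments, and assumes that line and position describe the start of the token just returned.

```ruby
require 'strscan'

# Hypothetical stand-in for illustration only; not HTML::Tokenizer itself.
class SketchTokenizer
  attr_reader :line, :position

  def initialize(text)
    @scanner = StringScanner.new(text)
    @line = 1          # line on which the last returned token started (assumption)
    @position = 0      # byte offset at which the last returned token started (assumption)
    @current_line = 1  # running line counter
  end

  # Returns the next token as a string, or nil when the input is exhausted.
  def next
    return nil if @scanner.eos?

    @position = @scanner.pos
    @line = @current_line

    token =
      if @scanner.check(/</)
        # A tag runs to the next ">"; an unterminated tag takes the rest of the input.
        @scanner.scan(/<[^>]*>/) || @scanner.scan(/.+/m)
      else
        # Text runs up to, but not including, the next "<".
        @scanner.scan(/[^<]+/)
      end

    @current_line += token.count("\n")
    token
  end
end

tokenizer = SketchTokenizer.new("<p>Hello <b>world</b></p>")
tokens = []
while token = tokenizer.next
  tokens << token
end
p tokens
```

With this input the loop collects six tokens, alternating between tag strings and text strings, and then next returns nil.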

Private Methods

consume_quoted_regions: Skips over quoted strings, so that less-than and greater-than characters within the strings are ignored.
scan_tag: Treat the text at the current position as a tag, and scan it. Supports comments, doctype tags, and regular tags, and ignores less-than and greater-than characters within quoted strings.
scan_text: Scan all text up to the next < character and return it.
update_current_line: Counts the number of newlines in the text and updates the current line accordingly.
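
The quote handling and line counting described above can be sketched as follows. The helper names sketch_scan_tag and sketch_count_lines are hypothetical, and the code is one plausible approach under simple assumptions (comments end at the first -->, doctype tags are treated like regular tags), not the library's actual private methods.

```ruby
require 'strscan'

# Hypothetical helper: scan one tag starting at "<". Quoted attribute values
# are consumed whole, so a ">" inside quotes does not end the tag.
def sketch_scan_tag(scanner)
  # Comments are taken up to the closing "-->", or to end of input if unterminated.
  if scanner.check(/<!--/)
    return scanner.scan(/<!--.*?-->/m) || scanner.scan(/.*/m)
  end

  tag = scanner.getch # consume the opening "<"
  until scanner.eos?
    if (quoted = scanner.scan(/"[^"]*"|'[^']*'/))
      tag << quoted   # skip the whole quoted region
    else
      ch = scanner.getch
      tag << ch
      break if ch == ">"
    end
  end
  tag
end

# Hypothetical helper: advance a line counter by the newlines in a token.
def sketch_count_lines(current_line, text)
  current_line + text.count("\n")
end

scanner = StringScanner.new(%q{<a title="x > y" href='z'>rest})
p sketch_scan_tag(scanner)
p scanner.rest
```

Consuming quoted regions as a unit is what lets the tag scanner stop at the first > that lies outside any attribute value.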