A simple HTML tokenizer. It breaks a stream of text into tokens, each of which is a string representing either a run of text or an HTML tag.
This currently assumes valid XHTML, which means no free < or > characters.
Usage:

    tokenizer = HTML::Tokenizer.new(text)
    while token = tokenizer.next
      p token
    end
| Public Attributes | |
|---|---|
| line | The current line number |
| position | The current (byte) position in the text |
| Public Methods | |
|---|---|
| new | Create a new Tokenizer for the given text. |
| next | Return the next token in the sequence, or nil if there are no more tokens in the stream. |
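
The interface described above can be illustrated with a minimal sketch. It is not the HTML::Tokenizer source: the class name `SketchTokenizer` is hypothetical, the sketch does not handle quoted strings, comments, or doctype tags, and it assumes that `line` and `position` refer to the start of the most recently returned token, which the tables above do not state.

```ruby
require "strscan"

# Hypothetical stand-in for illustration only; not the HTML::Tokenizer source.
class SketchTokenizer
  # Assumption: both attributes describe where the last returned token began.
  attr_reader :line, :position

  def initialize(text)
    @scanner = StringScanner.new(text)
    @line = 1
    @current_line = 1
    @position = 0
  end

  # Returns the next token (a tag or a run of text), or nil at end of input.
  def next
    return nil if @scanner.eos?

    @position = @scanner.pos # byte offset of the token about to be read
    @line = @current_line
    # Try a tag first, then a run of text; a stray "<" falls through to getch.
    token = @scanner.scan(/<[^>]*>/) || @scanner.scan(/[^<]+/) || @scanner.getch
    @current_line += token.count("\n")
    token
  end
end

tokenizer = SketchTokenizer.new("<p>Hello\n<b>world</b></p>")
while (token = tokenizer.next)
  p [tokenizer.line, tokenizer.position, token]
end
```

The extra parentheses around the assignment in the `while` condition only silence Ruby's warning about assignment in a conditional; the loop behaves the same as in the usage shown earlier.
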
| Private Methods | |
|---|---|
| consume_ | Skips over quoted strings, so that less-than and greater-than characters within the strings are ignored. |
| scan_ | Treat the text at the current position as a tag, and scan it. Supports comments, doctype tags, and regular tags, and ignores less-than and greater-than characters within quoted strings. |
| scan_ | Scan all text up to the next < character and return it. |
| update_ | Counts the number of newlines in the text and updates the current line accordingly. |
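
As a rough illustration of two of the behaviors described above, the sketch below scans a regular tag while treating quoted attribute values as opaque, and counts newlines to advance a line number. The helper names `scan_tag_sketch` and `line_after` are hypothetical and are not the library's private methods; comments and doctype tags are deliberately left out.

```ruby
require "strscan"

# Hypothetical helper: scans one regular tag starting at the scanner's current
# position. Quoted regions are consumed whole, so a < or > inside an attribute
# value does not end the tag. Returns nil if the scanner is not at a "<".
def scan_tag_sketch(scanner)
  return nil unless scanner.check(/</)

  tag = scanner.getch.dup
  until scanner.eos?
    if (chunk = scanner.scan(/[^<>"']+/))
      tag << chunk                       # ordinary tag content
    elsif (quoted = scanner.scan(/"[^"]*"|'[^']*'/))
      tag << quoted                      # a complete quoted string
    else
      tag << scanner.getch               # <, >, or an unterminated quote
      break if tag.end_with?(">")
    end
  end
  tag
end

# Hypothetical helper: the line number after consuming `token`.
def line_after(current_line, token)
  current_line + token.count("\n")
end

scanner = StringScanner.new(%q{<a title="x > y" href='a<b'>rest})
p scan_tag_sketch(scanner)
p scanner.rest
```

Consuming each quoted string in a single step is one simple way to keep the tag scanner from stopping at a `>` inside an attribute value; an unterminated quote is passed through one character at a time, so the scan still ends at the next `>` or at end of input.
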