Break the HTML source in str into a series of tokens and return them. Each token is a two-element Array of the form [type, content], where type is either :tag or :text. If this function is called with a block, the type and content of each token are also yielded to it, one token at a time, as they are extracted.
Source Code
# File bluecloth.rb, line 1063
def tokenize_html( str )
  depth = 0
  tokens = []
  @scanner.string = str.dup
  type, token = nil, nil

  until @scanner.empty?
    @log.debug "Scanning from %p" % @scanner.rest

    # Match comments and PIs without nesting
    if (( token = @scanner.scan(MetaTag) ))
      type = :tag

    # Do nested matching for HTML tags
    elsif (( token = @scanner.scan(HTMLTagOpenRegexp) ))
      tagstart = @scanner.pos
      @log.debug " Found the start of a plain tag at %d" % tagstart

      # Start the token with the opening angle
      depth = 1
      type = :tag

      # Scan the rest of the tag, allowing unlimited nested <>s. If
      # the scanner runs out of text before the tag is closed, raise
      # an error.
      while depth.nonzero?

        # Scan either an opener or a closer
        chunk = @scanner.scan( HTMLTagPart ) or
          raise "Malformed tag at character %d: %p" %
            [ tagstart, token + @scanner.rest ]

        @log.debug " Found another part of the tag at depth %d: %p" %
          [ depth, chunk ]

        token += chunk

        # If the last character of the token so far is a closing
        # angle bracket, decrement the depth. Otherwise increment
        # it for a nested tag.
        depth += ( token[-1, 1] == '>' ? -1 : 1 )
        @log.debug " Depth is now #{depth}"
      end

    # Match text segments
    else
      @log.debug " Looking for a chunk of text"
      type = :text

      # Scan forward, always matching at least one character to move
      # the pointer beyond any non-tag '<'.
      token = @scanner.scan_until( /[^<]+/m )
    end

    @log.debug " type: %p, token: %p" % [ type, token ]

    # If a block is given, feed it one token at a time. Add the token to
    # the token list to be returned regardless.
    if block_given?
      yield( type, token )
    end
    tokens << [ type, token ]
  end

  return tokens
end
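The listing above depends on instance state (@scanner, @log) and on the MetaTag, HTMLTagOpenRegexp and HTMLTagPart constants, none of which are defined on this page. To make the depth-counting approach easier to try in isolation, here is a minimal standalone sketch. The name tokenize_html_sketch and the three patterns META_TAG, TAG_OPEN and TAG_PART are hypothetical stand-ins chosen for illustration, not BlueCloth's actual definitions, so results may differ from the library on edge cases.

```ruby
require 'strscan'

# Hypothetical stand-ins for MetaTag, HTMLTagOpenRegexp and HTMLTagPart.
# These are assumptions for illustration, not BlueCloth's real patterns.
META_TAG = /<(?:\?.*?\?|!--.*?--)>/m   # processing instructions and comments
TAG_OPEN = /<[a-z\/!$][^<>]*/i         # "<" plus a tag-like character, up to the next bracket
TAG_PART = /[^<>]*>|<[^<>]*/           # either a closer ending in ">" or a nested opener

# Simplified sketch of the tokenizer: returns [type, content] pairs and
# also yields each pair's two parts when a block is given.
def tokenize_html_sketch(str)
  scanner = StringScanner.new(str.dup)
  tokens  = []

  until scanner.eos?
    if (token = scanner.scan(META_TAG))
      # Comments and PIs are matched whole, without nesting.
      type = :tag
    elsif (token = scanner.scan(TAG_OPEN))
      type     = :tag
      tagstart = scanner.pos
      depth    = 1

      # Keep consuming parts until every "<" has a matching ">".
      while depth.nonzero?
        chunk = scanner.scan(TAG_PART)
        unless chunk
          raise "Malformed tag at character %d: %p" %
                [tagstart, token + scanner.rest]
        end
        token += chunk
        depth += (token[-1, 1] == '>' ? -1 : 1)
      end
    else
      type = :text
      # scan_until also swallows a leading non-tag "<". The getch fallback
      # is an addition in this sketch: it keeps the loop moving when only
      # "<" characters remain and scan_until finds nothing to match.
      token = scanner.scan_until(/[^<]+/m) || scanner.getch
    end

    yield(type, token) if block_given?
    tokens << [type, token]
  end

  tokens
end

p tokenize_html_sketch("Hi <b>there</b>")
# prints [[:text, "Hi "], [:tag, "<b>"], [:text, "there"], [:tag, "</b>"]]
```

With these stand-in patterns, a nested construct such as "<a <b> c>" comes back as a single :tag token, and an unclosed "<a" raises a RuntimeError, mirroring the behaviour described in the comments of the original listing.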