NAME
    XHTML::Util - (alpha software) powerful utilities for common but difficult to nail HTML munging.

VERSION
    0.04

SYNOPSIS
        use strict;
        use warnings;
        use XHTML::Util;
        my $xu = XHTML::Util->new;

        print $xu->enpara("This is naked\n\ntext for making into paragraphs.");
        # <p>This is naked</p>
        #
        # <p>text for making into paragraphs.</p>

        print $xu->enpara("<blockquote>Quotes should probably have paras.</blockquote>", "blockquote");
        # <blockquote>
        # <p>Quotes should probably have paras.</p>
        # </blockquote>

        print $xu->strip_tags('Something.','a');
        # Something.

DESCRIPTION
    This is a set of itches I'm sick of scratching 5 different ways from the Sabbath. Right now it's in alpha mode, so please sample it but don't count on the interface or behavior. Some of the code is fire tested in other places but, as this is a new home and API, it's subject to change. Like they say, release early, release often. Like I say: Release whatever you've got so you'll be embarrassed into making it better.

    You can pass CSS selector expressions to most of the methods. E.g., to only enpara the contents of div tags with a class of "enpara" -- "<div class="enpara">" -- you could do this:

        print $xu->enpara($content, "div.enpara");

    To do the contents of all blockquotes and divs:

        print $xu->enpara($content, "div, blockquote");

METHODS
  new
    Creates a new "XHTML::Util" object.

  strip_tags
    Why you might need this:

        my $post_title = "I <3 kittehs";
        my $blog_link = some_link_maker($post_title);
        print $blog_link;
        # I <3 kittehs

    That ain't legal, so there's no definition for what browsers should do with it. Some browsers sort of tolerate it, some don't. It's never going to be a good user experience.

    What you can do, and I've done successfully for years, is something like this:

        my $post_title = "I <3 kittehs";
        my $safe_title = $xu->strip_tags($post_title, ["a"]);

        # Menu link should only go to the single post page.
        my $menu_view_title = some_link_maker($safe_title);

        # No need to link back to the page you're viewing already.
        my $single_view_title = $post_title;

  remove
    Takes a content block and a CSS selector string. Completely removes the matched nodes, including their content. This differs from "strip_tags", which keeps the child nodes intact and only removes the tag(s) proper.

        my $cleaned = $xu->remove($html, "center, img[src^='http']");

  traverse
    [Not implemented.] Walks the given nodes and executes the given callback.

  translate_tags
    [Not implemented.] Translates one tag to another.

  remove_style
    [Not implemented.] Removes styles from matched nodes.
    To remove all style from a fragment:

        $xu->remove_style($content, "*");

  inline_stylesheets
    [Not implemented.] Moves all linked stylesheet information into inline style attributes. This is useful, for example, when distributing a document fragment like an RSS/Atom feed and having it match its online appearance.

  html_to_xhtml
    [Not implemented.] Upgrades old or broken HTML to valid XHTML.

  validate
    [Not implemented.] Validates a given document or fragment against its claimed DTD or one provided by name.

  enpara
    Adds paragraph markup to naked text. There are many, many implementations of this basic idea out there, as well as many, like Markdown, which do much more. While some are decent, none is really meant to sling arbitrary HTML and get DWIM behavior from places where it's left out; every implementation I've seen either has rigid syntax or has beaucoup failure-prone edge cases. Consider these:

        Is this a paragraph or two?
        I can do HTML when I'm paying attention.
        Or I need to for some reason.
        Oh, I stopped paying attention... What happens here? Or here?
        I'd like to see this in a paragraph so it's legal markup.
        now this should not be touched!
        I meant to do that.

    With "XHTML::Util->enpara" you will get:
        Is this a paragraph
        I can do HTML when I'm paying attention.
        Or I need to for some reason.
        Oh, I stopped paying attention... What happens here? Or here?
        I'd like to see this in a paragraph so it's legal markup.
        now this should not be touched!
        I meant to do that.

  xml_parser
    Don't use unless you read the code and see why/how.

  selector_to_xpath
    This wraps "selector_to_xpath" in HTML::Selector::XPath. Not really meant to be used but exposed in case you want it.

        print $xu->selector_to_xpath("form[name='register'] input[type='password']");
        # //form[@name='register']//input[@type='password']

TO DO
    Finish spec and tests. Get it running solid enough to remove alpha label. Generalize the argument handling. Provide optional setting or methods for returning nodes instead of serialized content. Improve document/head related handling/options.

BUGS AND LIMITATIONS
    All input should be utf8 or at least safe to run Encode::decode_utf8 on. Regular Latin character sets, I suspect, will be fine, but extended sets will probably give garbage or unpredictable results; I'm guessing.

    This module is currently targeted to working with body fragments. You will get fragments back, not documents. I want to expand it to handle both and deal with doc, DTD, head and such, but that's not its primary use case so it won't come first.

    I have used many of these methods and snippets in many projects and I'm tired of recycling them. Some are extremely useful and, at least in the case of "enpara", better than any other implementation I've been able to find in any language.

    That said, a lot of the code herein is not well tested, or at least not well tested in this incarnation. Bug reports and good feedback are adored.

SEE ALSO
    XML::LibXML, HTML::Tagset, HTML::Entities, HTML::Selector::XPath, HTML::TokeParser::Simple, CSS::Tiny.

    CSS W3Schools,