, simple quotes can be used instead of double quotes, at
the moment you cannot escape the quotes (this will be added as soon as I
dig out my copy of Mastering Regular Expressions from its storage box).
The text returned is, as far as I (and Matt Sergeant!) understood from
the XPath spec, the concatenation of all the text in the element, excluding
all markup. Thus to call a handler on the element C<< <p>text <b>bold</b></p> >>
the appropriate condition is C<p[string()="text bold"]>. Note that this is not
exactly conformant to the XPath spec, it just tries to mimic it while
remaining quite concise.
An extension of that notation is C<< gi[string(<child_gi>)="foo"] >> where the
handler will be called if a child of a C<gi> element has a text value of
C<foo>. At the moment only direct children of the C<gi> element are checked.
If you need to test on descendants of the element let me know. The fix is
trivial but would slow down the checks, so I'd like to keep it the way it is.
A B<regexp_condition> is a condition on the content of an element, in the form
C<gi[string()=~ /foo/]>. This is the same as a string condition except that
the text of the element is matched against the regexp. The C<i>, C<m>, C<s> and C<o>
modifiers can be used on the regexp.
The C<< gi[string(<child_gi>)=~ /foo/] >> extension is also supported.
An B<attribute_condition> is a simple condition on an attribute of the
current element, in the form C<gi[@att="val"]> (single quotes can be used
instead of double quotes, you can escape quotes either).
If several attribute_conditions are true for the same element, all the handlers
can be called in turn (in the order in which they were first defined).
If the C<="val"> part is omitted (the condition is then C<gi[@att]>) then
the handler is triggered if the attribute actually exists for the element,
no matter what its value is.
A B<full_path> looks like C<'/doc/section/chapter/title'>, it starts with
a / then gives all the gi's to the element. The handler will be called if
the path to the current element (in the input document) is exactly as
defined by the C<full_path>.
A B<partial_path> is like a full_path except it does not start with a /:
C<'chapter/title'> for example. The handler will be called if the path to
the element (in the input document) ends as defined in the C<partial_path>.
B<WARNING>: (hopefully temporary) at the moment C<string_condition>,
C<regexp_condition> and C<attribute_condition> are only supported on a
simple gi, not on a path.
A B<gi> (generic identifier) is just a tag name.
C<#CDATA> can be used to call a handler for a CDATA section.
A special gi B<_all_> is used to call a function for each element.
The special gi B<_default_> is used to call a handler for each element
that does NOT have a specific handler.
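As a rough illustration of the trigger forms described above, the sketch below registers one handler per form through C<twig_handlers> and records which ones fire. The input document, the element names and the C<@fired> array are made up for the example; handlers fire as each element is closed, so inner elements come first.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical input document, made up for this sketch
my $xml = '<doc><section><title>Intro</title>'
        . '<para type="note">careful</para><para>plain</para></section></doc>';

my @fired;    # names of the triggers that fired, in calling order

my $t = XML::Twig->new(
    twig_handlers => {
        # full path: must match the whole path from the root
        '/doc/section/title' => sub { push @fired, 'full_path';           1 },
        # attribute condition on a simple gi
        'para[@type="note"]' => sub { push @fired, 'attribute_condition'; 1 },
        # partial path: must match the end of the path
        'doc/section'        => sub { push @fired, 'partial_path';        1 },
        # plain gi
        'doc'                => sub { push @fired, 'gi';                  1 },
    },
);
$t->parse($xml);

print join( ', ', @fired), "\n";
```

The second C<para> element has no C<type> attribute, so no handler is called for it.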
The order of precedence to trigger a handler is:
I, I, I,
I, I, longer I, shorter
I, I, I<_default_> .
B<Note>: once a handler has been triggered, if it returns 0 then no other
handler is called, except a C<_all_> handler which will be called anyway.
If a handler returns a true value and other handlers apply, then the next
applicable handler will be called. Repeat, rinse, lather... The exception
to that rule is when the C<do_not_chain_handlers>
option is set, in which case only the first handler will be called.
Note that it might be a good idea to explicitly return a short true value
(like 1) from handlers: this ensures that other applicable handlers are
called even if the last statement for the handler happens to evaluate to
false. This might also speed up the code, as the result of the last
statement does not have to be copied and passed to the code managing handlers.
It can really pay to have 1 instead of a long string returned.
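The chaining rule can be tried out with a small sketch like the one below. It assumes that a partial path takes precedence over a plain gi, as the precedence list above states; the helper C<handlers_called> and the input document are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns the names of the handlers called for <title>.
# $first_rv is the value returned by the more specific handler,
# %opt holds extra options passed to XML::Twig->new.
sub handlers_called {
    my( $first_rv, %opt) = @_;
    my @called;
    my $t = XML::Twig->new(
        twig_handlers => {
            'section/title' => sub { push @called, 'section/title'; $first_rv },
            'title'         => sub { push @called, 'title';         1 },
        },
        %opt,
    );
    $t->parse('<doc><section><title>Intro</title></section></doc>');
    return @called;
}

my @chained = handlers_called( 1);    # first handler returns true
my @stopped = handlers_called( 0);    # first handler returns 0
my @single  = handlers_called( 1, do_not_chain_handlers => 1);

print "true value:            @chained\n";
print "returns 0:             @stopped\n";
print "do_not_chain_handlers: @single\n";
```

Returning an explicit 1 or 0 from each handler, as done here, makes the intent visible instead of relying on the value of the last statement.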
When an element is CLOSED the corresponding handler is called, with 2
arguments: the twig and the element. The twig includes the
document tree that has been built so far, the element is the complete sub-tree
for the element. This means that handlers for inner elements are called before
handlers for outer elements.
C<$_> is also set to the element, so it is easy to write inline handlers like
  para => sub { $_->change_gi( 'p'); }
Text is stored in elements whose gi is #PCDATA (because of mixed content, where
text and sub-elements are interleaved in an element, there is no way to store
the text as just an attribute of the enclosing element).
B<WARNING>: if you have used purge or flush on the twig the element might not
be complete: some of its children might have been entirely flushed or purged,
and the start tag might even have been printed (by C<flush>) already, so changing
its gi might not give the expected result.
More generally, the I<full_path>, I<partial_path> and I<gi> expressions are
evaluated against the input document. This means that even if you have changed
the gi of an element (changing the gi of a parent element from a handler for
example) the change will not impact the expression evaluation. Attributes in
I<attribute_condition> are different though. As the initial value of an attribute
is not stored, the handler will be triggered if the B<current> attribute/value
pair is found when the element end tag is found. Although this can be quite
confusing it should not impact most users, and it allows others to play clever
tricks with temporary attributes. Let me know if this is a problem for you.
=item twig_roots
This argument lets you build the tree only for those elements you are
interested in.
Example:
  my $t= XML::Twig->new( twig_roots => { title => 1, subtitle => 1});
  $t->parsefile( $file);
or, using a path:
  my $t= XML::Twig->new( twig_roots => { 'section/title' => 1});
  $t->parsefile( $file);
The first example returns a twig containing a document including only C<title>
and C<subtitle> elements, as children of the root element.
You can use I, I,
I, I, I, I<_default_> and I<_all_> to
trigger the building of the twig.
I<string_condition> and I<regexp_condition> cannot be used, as the content
of the element, and thus the string, has not yet been parsed when the condition
is checked.
B<Note>: paths are checked against the document. Even if the C<twig_roots> option
is used they will be checked against the full document tree, not the virtual
tree created by XML::Twig.
B<WARNING>: twig_roots elements should NOT be nested, that would hopelessly
confuse XML::Twig ;--(
Note: you can set handlers (twig_handlers) using twig_roots
Example:
  my $t= XML::Twig->new( twig_roots =>
                           { title    => sub { $_[1]->print; },
                             subtitle => \&process_subtitle
                           }
                       );
  $t->parsefile( $file);
=item twig_print_outside_roots
To be used in conjunction with the C<twig_roots> argument. When set to a true
value this will print the document outside of the C<twig_roots> elements.
Example:
  my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                         twig_print_outside_roots => 1,
                       );
  $t->parsefile( $file);
  { my $nb;
    sub number_title
      { my( $twig, $title)= @_;
        $nb++;
        $title->prefix( "$nb ");
        $title->print;
      }
  }
This example prints the document outside of the title element, calls
C<number_title> for each C<title> element, prints it, and then resumes printing
the document. The twig is built only for the C<title> elements.
If the value is a reference to a file handle then the document outside the
C<twig_roots> elements will be output to this file handle:
  open( OUT, ">out_file") or die "cannot open out file out_file:$!";
  my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                         # default output to OUT
                         twig_print_outside_roots => \*OUT,
                       );
  { my $nb;
    sub number_title
      { my( $twig, $title)= @_;
        $nb++;
        $title->prefix( "$nb ");
        $title->print( \*OUT); # you have to print to \*OUT here
      }
  }
=item start_tag_handlers
A hash C<{ expression => \&handler}>. Sets element handlers that are called when
the element is open (at the end of the XML::Parser C<Start> handler). The handlers
are called with 2 params: the twig and the element. The element is empty at
that point, but its attributes have been created.
You can use I, I,
I, I, I, I<_default_> and I<_all_> to trigger
the handler.
I<string_condition> and I<regexp_condition> cannot be used, as the content of
the element, and thus the string, has not yet been parsed when the condition is
checked.
The main uses for those handlers are to change the tag name (you might have to
do it as soon as you find the open tag if you plan to C<flush> the twig at some
point in the element), and to create temporary attributes that will be used
when processing sub-elements with C<twig_handlers>.
You should also use it to change tags if you use C<flush>. If you change the tag
in a regular C<twig_handler> then the start tag might already have been flushed.
B<Note>: C<start_tag> handlers can be called outside of C<twig_roots> if this
argument is used, in this case handlers are called with the following arguments:
C<$t> (the twig), C<$gi> (the gi of the element) and C<%att> (a hash of the
attributes of the element).
If the C<twig_print_outside_roots> argument is also used then the start tag
will be printed if the last handler called returns a C<true> value, if it
does not then the start tag will B<not> be printed (so you can print a
modified string yourself for example).
Note that you can use the L<ignore|/ignore> method in C<start_tag_handlers>
(and only there).
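A minimal sketch of the "temporary attribute" use mentioned above, assuming the whole twig is built (no C<twig_roots>), so the start tag handler receives the twig and the element. The attribute name C<tmp_level>, the C<%seen> hash and the input document are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my %seen;    # what each handler observed

my $t = XML::Twig->new(
    start_tag_handlers => {
        # called on the open tag: attributes are there, content is not
        section => sub {
            my( $twig, $section) = @_;
            my @children = $section->children;
            $seen{att_at_start}      = $section->att('id');
            $seen{children_at_start} = scalar @children;
            $section->set_att( tmp_level => 1);    # temporary attribute
            1;
        },
    },
    twig_handlers => {
        # called when <title> is closed: the parent is already in the tree
        title => sub {
            my( $twig, $title) = @_;
            $seen{level_from_parent} = $title->parent->att('tmp_level');
            1;
        },
    },
);
$t->parse('<doc><section id="s1"><title>Intro</title></section></doc>');

print "$_: $seen{$_}\n" for sort keys %seen;
```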
=item end_tag_handlers
A hash C<{ expression => \&handler}>. Sets element handlers that are called when
the element is closed (at the end of the XML::Parser C<End> handler). The handlers
are called with 2 params: the twig and the gi of the element.
I<twig_handlers> are called when an element is completely parsed, so why have
this redundant option? There is only one use for C<end_tag_handlers>: when using
the C<twig_roots> option, to trigger a handler for an element B<outside> the roots.
It is for example very useful to number titles in a document using nested
sections:
my @no= (0);
my $no;
my $t= XML::Twig->new(
start_tag_handlers =>
{ section => sub { $no[$#no]++; $no= join '.', @no; push @no, 0; } },
twig_roots =>
{ title => sub { $_[1]->prefix( $no); $_[1]->print; } },
end_tag_handlers => { section => sub { pop @no; } },
twig_print_outside_roots => 1
);
$t->parsefile( $file);
Using the C<end_tag_handlers> argument without C<twig_roots> will result in an
error.
=item do_not_chain_handlers
If this option is set to a true value, then only one handler will be called for
each element, even if several satisfy the condition.
Note that the C<_all_> handler will still be called regardless.
=item ignore_elts
This option lets you ignore elements when building the twig. This is useful
in cases where you cannot use C<twig_roots> to ignore elements, for example if
the element to ignore is a sibling of elements you are interested in.
Example:
  my $twig= XML::Twig->new( ignore_elts => { elt => 1 });
  $twig->parsefile( 'doc.xml');
This will build the complete twig for the document, except that all C<elt>
elements (and their children) will be left out.
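For a version that can be run without a C<doc.xml> file, here is a small sketch that parses a string instead. The element names C<keep> and C<b> and the document itself are placeholders invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new( ignore_elts => { elt => 1 });
$twig->parse(
    '<doc><keep>a</keep><elt>ignored <b>too</b></elt><keep>b</keep></doc>'
);

# gi's of the children of the root: the elt element is not in the twig
my @names = map { $_->gi } $twig->root->children;

# descendants of the ignored element are left out as well
my @bold  = $twig->root->descendants('b');

print "children: @names\n";
print "b elements found: ", scalar @bold, "\n";
```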
=item char_handler
A reference to a subroutine that will be called every time C<PCDATA> is found.
=item elt_class
The name of a class used to store elements. This class should inherit from
C<XML::Twig::Elt> (and by default it is C<XML::Twig::Elt>). This option is used
to subclass the element class and extend it with new methods.
This option is needed because during the parsing of the XML, elements are created
by C<XML::Twig>, without any control from the user code.
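A minimal sketch of such a subclass follows. The package name C<My::Elt> and its C<shout> method are hypothetical, chosen only to show that elements created during parsing are blessed into the class given in C<elt_class>.

```perl
use strict;
use warnings;
use XML::Twig;

{
    package My::Elt;                    # hypothetical subclass
    our @ISA = ('XML::Twig::Elt');      # XML::Twig::Elt is loaded by XML::Twig

    # new method, not part of XML::Twig: the text of the element in upper case
    sub shout {
        my $elt = shift;
        return uc $elt->text;
    }
}

my $t = XML::Twig->new( elt_class => 'My::Elt');
$t->parse('<doc><p>hello</p></doc>');

my $p = $t->root->first_child('p');
print ref( $p), ' says ', $p->shout, "\n";
```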
=item keep_atts_order
Setting this option to a true value causes the attribute hash to be tied to
a Tie::IxHash object.
This means that Tie::IxHash needs to be installed for this option to be
available. It also means that the hash keeps its order, so you will get
the attributes in order. This allows outputting the attributes in the same
order as they were in the original document.
=item keep_encoding
This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and
you want to keep it that way, then setting keep_encoding will use the C<Expat>
original_string method for characters, thus keeping the original encoding, as
well as the original entities in the strings.
See the C test file to see what results you can expect from the
various encoding options.
B<WARNING>: if the original encoding is multi-byte then attribute parsing will
be EXTREMELY unsafe under any Perl before 5.6, as it uses regular expressions
which do not deal properly with multi-byte characters. You can specify an
alternate function to parse the start tags with the C<parse_start_tag> option
(see below).
B<WARNING>: this option is NOT used when parsing with the non-blocking parser
(C<parse_start>, C<parse_more>, parse_done methods) which you probably should
not use with XML::Twig anyway as they are totally untested!
=item output_encoding
This option generates an output_filter using C<Encode>, C<Text::Iconv> or
C<Unicode::Map8> and C<Unicode::String>, and sets the encoding in the XML
declaration. This is the easiest way to deal with encodings, if you need
more sophisticated features, look at C<output_filter> below.
=item output_filter
This option is used to convert the character encoding of the output document.
It is passed either a string corresponding to a predefined filter or
a subroutine reference. The filter will be called every time a document or
element is processed by the "print" functions (C<print>, C<sprint>, C<flush>).
Pre-defined filters:
=over 4
=item latin1
uses either C<Encode>, C<Text::Iconv> or C<Unicode::Map8> and C<Unicode::String>
or a regexp (which works only with XML::Parser 2.27), in this order, to convert
all characters to ISO-8859-1 (aka latin1)
=item html
does the same conversion as C<latin1>, plus encodes entities using
C<HTML::Entities> (oddly enough you will need to have HTML::Entities installed
for it to be available). This should only be used if the tags and attribute
names themselves are in US-ASCII, or they will be converted and the output will
not be valid XML any more
=item safe
converts the output to ASCII (US) only plus I<character entities> (C<&#nnn;>)
this should be used only if the tags and attribute names themselves are in
US-ASCII, or they will be converted and the output will not be valid XML any
more
=item safe_hex
same as C<safe> except that the character entities are in hexa (C<&#xnnn;>)
=item encode_convert ($encoding)
Returns a subref that can be used to convert utf8 strings to C<$encoding>.
Uses C<Encode>.
my $conv = XML::Twig::encode_convert( 'latin1');
my $t = XML::Twig->new(output_filter => $conv);
=item iconv_convert ($encoding)
this function is used to create a filter subroutine that will be used to
convert the characters to the target encoding using C<Text::Iconv> (which needs
to be installed, look at the documentation for the module and for the
C<iconv> library to find out which encodings are available on your system)
my $conv = XML::Twig::iconv_convert( 'latin1');
my $t = XML::Twig->new(output_filter => $conv);
=item unicode_convert ($encoding)
this function is used to create a filter subroutine that will be used to
convert the characters to the target encoding using C<Unicode::String>
and C<Unicode::Map8> (which need to be installed, look at the documentation
for the modules to find out which encodings are available on your system)
my $conv = XML::Twig::unicode_convert( 'latin1');
my $t = XML::Twig->new(output_filter => $conv);
=back
The C<text> and C<att> methods do not use the filter, so their
results are always in unicode.
Those predeclared filters are based on subroutines that can be used
by themselves (as C<XML::Twig::foo>).
=over 4
=item html_encode ($string)
Use C<HTML::Entities> to encode a utf8 string
=item safe_encode ($string)
Use either a regexp (perl < 5.8) or C<Encode> to encode non-ascii characters
in the string in C<< &#<nnnn>; >> format
=item safe_encode_hex ($string)
Use either a regexp (perl < 5.8) or C<Encode> to encode non-ascii characters
in the string in C<< &#x<nnnn>; >> format
=item regexp2latin1 ($string)
Use a regexp to encode a utf8 string into latin 1 (ISO-8859-1). Does not
work with Perl 5.8.0!
=back
=item output_text_filter
same as output_filter, except it doesn't apply to the brackets and quotes
around attribute values. This is useful for all filters that could change
the tagging, basically anything that does not just change the encoding of
the output. C<html>, C<safe> and C<safe_hex> are better used with this option.
=item input_filter
This option is similar to C<output_filter> except the filter is applied to
the characters before they are stored in the twig, at parsing time.
=item parse_start_tag
If you use the C<keep_encoding> option then this option can be used to replace
the default parsing function. You should provide a coderef (a reference to a
subroutine) as the argument, this subroutine takes the original tag (given
by XML::Parser::Expat C<original_string()> method) and returns a gi and the
attributes in a hash (or in a list attribute_name/attribute value).
=item expand_external_ents
When this option is used external entities (that are defined) are expanded
when the document is output using "print" functions such as C<print>,
C<sprint>, C<flush> and C<xml_string>.
Note that in the twig the entity will be stored as an element with a
gi 'C<#ENT>', the entity will not be expanded there, so you might want to
process the entities before outputting it.
=item load_DTD
If this argument is set to a true value, C<parse> or C<parsefile> on the twig
will load the DTD information. This information can then be accessed through
the twig, in a C<DTD_handler> for example. This will load even an external DTD.
Note that to do this the module will generate a temporary file in the current
directory. If this is a problem let me know and I will add an option to
specify an alternate directory.
See L for more information
=item DTD_handler
Set a handler that will be called once the doctype (and the DTD) have been
loaded, with 2 arguments, the twig and the DTD.
=item no_prolog
Does not output a prolog (XML declaration and DTD)
=item id
This optional argument gives the name of an attribute that can be used as
an ID in the document. Elements whose ID is known can be accessed through
the elt_id method. id defaults to 'id'.
See C<elt_id>.
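A short sketch of both cases, first with the default C<id> attribute and then with a different attribute name. The attribute name C<name> and the documents are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# default: the attribute named 'id' is used as the ID
my $t = XML::Twig->new;
$t->parse('<doc><p id="intro">first</p><p id="end">last</p></doc>');
my $last = $t->elt_id('end')->text;

# here the (hypothetical) attribute 'name' is used as the ID instead
my $t2 = XML::Twig->new( id => 'name');
$t2->parse('<doc><p name="intro">first</p></doc>');
my $first = $t2->elt_id('intro')->text;

print "$last $first\n";
```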
=item discard_spaces
If this optional argument is set to a true value then spaces are discarded
when they look non-significant: strings containing only spaces are discarded.
This argument is set to true by default.
=item keep_spaces
If this optional argument is set to a true value then all spaces in the
document are kept, and stored as C<PCDATA>.
C<keep_spaces> and C<discard_spaces> cannot be both set.
=item discard_spaces_in
This argument sets C<keep_spaces> to true but will cause the twig builder to
discard spaces in the elements listed.
The syntax for using this argument is:
XML::Twig->new( discard_spaces_in => [ 'elt1', 'elt2']);
=item keep_spaces_in
This argument sets C<discard_spaces> to true but will cause the twig builder to
keep spaces in the elements listed.
The syntax for using this argument is:
XML::Twig->new( keep_spaces_in => [ 'elt1', 'elt2']);
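To see the difference between the default behaviour and C<keep_spaces>, the following sketch parses the same string twice and compares the output of C<sprint> on the root. The input string is invented for the example, and no pretty printing is set.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical document with whitespace-only text between the tags
my $xml = "<doc>\n  <p>text</p>\n</doc>";

# default: strings containing only spaces are discarded
my $discarded = XML::Twig->new->parse( $xml)->root->sprint;

# keep_spaces: all spaces are kept in the twig
my $kept = XML::Twig->new( keep_spaces => 1)->parse( $xml)->root->sprint;

print "default:     $discarded\n";
print "keep_spaces: $kept\n";
```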
=item pretty_print
Set the pretty print method, amongst 'C<none>' (default), 'C<nsgmls>',
'C<nice>', 'C<indented>', 'C<indented_c>', 'C<record>' and 'C<record_c>'
pretty_print formats:
=over 4
=item none
The document is output as one long string, with no line breaks except those
found within text elements
=item nsgmls
Line breaks are inserted in safe places: that is within tags, between a tag
and an attribute, between attributes and before the > at the end of a tag.
This is quite ugly but better than C<none>, and it is very safe, the document
will still be valid (conforming to its DTD).
This is how the SGML parser C<nsgmls> splits documents, hence the name.
=item nice
This option inserts line breaks before any tag that does not contain text (so
elements with textual content are not broken, as the \n would be significant there).
B<WARNING>: this option leaves the document well-formed but might make it
invalid (not conformant to its DTD). If you have elements declared as
then a C element including a C one will be printed as
bar is just pcdata
This is invalid, as the parser will take the line break after the C tag
as a sign that the element contains PCDATA, it will then die when it finds the
C tag. This may or may not be important for you, but be aware of it!
=item indented
Same as C<nice> (and with the same warning) but indents elements according to
their level
=item indented_c
Same as C<indented> but a little more compact: the closing tags are on the
same line as the preceding text
=item record
This is a record-oriented pretty print, that displays data in records, one field
per line (which looks a LOT like C<indented>)
=item record_c
Stands for record compact, one record per line
=back
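The formats are easiest to compare by printing the same document with several of them. The sketch below does that for three styles; the document is made up, and each twig is output right after it is created, which avoids relying on how long a pretty print setting stays in effect.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical document used for every style
my $xml = '<doc><section><title>Intro</title><para>text</para></section></doc>';

my %out;    # style => serialized document
for my $style (qw( none nice indented)) {
    my $t = XML::Twig->new( pretty_print => $style);
    $t->parse( $xml);
    $out{$style} = $t->sprint;
}

print "--- $_\n$out{$_}\n" for qw( none nice indented);
```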
=item empty_tags
Set the empty tag display style ('C<normal>', 'C<html>' or 'C<expand>').
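As a quick illustration, the sketch below outputs the same empty element with each of the three styles known to XML::Twig (C<normal>, C<html> and C<expand>). The document is made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my %out;    # style => serialized document
for my $style (qw( normal html expand)) {
    my $t = XML::Twig->new( empty_tags => $style);
    $t->parse('<doc><br/></doc>');
    $out{$style} = $t->root->sprint;    # output before the next twig is created
}

print "$_: $out{$_}\n" for qw( normal html expand);
```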
=item comments
Set the way comments are processed: 'C<drop>' (default), 'C<keep>' or
'C<process>'
Comments processing options:
=over 4
=item drop
drops the comments, they are not read, nor printed to the output
=item keep
comments are loaded and will appear on the output, they are not
accessible within the twig and will not interfere with processing
though
B<Note>: comments in the middle of a text element such as
  <p>text <!-- comment --> more text</p>
are output at the end of the text:
  <p>text  more text <!-- comment --></p>
=item process
comments are loaded in the twig and will be treated as regular elements
(their C<gi> is C<#COMMENT>) this can interfere with processing if you
expect C<< $elt->{first_child} >> to be an element but find a comment there.
Validation will not protect you from this as comments can happen anywhere.
You can use C<< $elt->first_child( 'gi') >> (which is a good habit anyway)
to get where you want.
Consider using C<process> if you are outputting SAX events from XML::Twig.
=back
=item pi
Set the way processing instructions are processed: 'C<drop>', 'C<keep>'
(default) or 'C<process>'
Note that you can also set PI handlers in the C<twig_handlers> option:
  '?'       => \&handler
  '?target' => \&handler2
The handlers will be called with 2 parameters, the twig and the PI element if
C<pi> is set to C<process>, and with 3, the twig, the target and the data if
C<pi> is set to C<keep>. Of course they will not be called if C<pi> is set to
C<drop>.
If C<pi> is set to C<keep> the handler should return a string that will be used
as-is as the PI text (it should look like "C<< <?target data?> >>" or '' if you
want to remove the PI).
Only one handler will be called, C<?target> or C<?> if no specific handler for
that target is available.
=item map_xmlns
This option is passed a hashref that maps uri's to prefixes. The prefixes in
the document will be replaced by the ones in the map. The mapped prefixes can
(actually have to) be used to trigger handlers, navigate or query the document.
Here is an example:
my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
twig_handlers =>
{ 'svg:circle' => sub { $_->set_att( r => 20) } },
pretty_print => 'indented',
)
->parse( '
'
)
->print;
This will output:
Note that this only works when working on the entire twig. If you use C<twig_roots>
then the namespace processing is not done. I have to keep something to do for the
next release ;--)
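A self-contained sketch of the prefix mapping follows. The input document, including its C<gr> prefix and the attribute values, is invented for the illustration: the handler is triggered through the mapped C<svg> prefix, and the output uses that prefix too.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical document: the SVG namespace is bound to the prefix 'gr'
my $xml = '<doc xmlns:gr="http://www.w3.org/2000/svg">'
        . '<gr:circle cx="10" cy="90" r="10"/></doc>';

my $t = XML::Twig->new(
    map_xmlns     => { 'http://www.w3.org/2000/svg' => 'svg' },
    # the handler uses the mapped prefix, not the one in the document
    twig_handlers => { 'svg:circle' => sub { $_->set_att( r => 20); 1 } },
);
$t->parse( $xml);

my $out = $t->sprint;
print "$out\n";
```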
=item keep_original_prefix
When used with C<map_xmlns> this option will make C<XML::Twig> use the original
namespace prefixes when outputting a document. The mapped prefix will still be used
for triggering handlers and in navigation and query methods.
my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
twig_handlers =>
{ 'svg:circle' => sub { $_->set_att( r => 20) } },
keep_original_prefix => 1,
pretty_print => 'indented',
)
->parse( '