Methods summary
public static
|
#
muteErrorHandler( )
Error-handler that mutes errors, alternative to shut-up operator.
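A minimal sketch of the muting pattern described above, assuming a no-op handler installed with PHP's set_error_handler() and removed with restore_error_handler(). The names muteHandler and callMuted are hypothetical and are not part of this class.

```php
<?php
// Hypothetical stand-in for a muting handler: returning true tells PHP the
// error has been handled, so the default handler does not report it.
function muteHandler($errno, $errstr)
{
    return true;
}

// Hypothetical helper: run $fn with errors muted, then restore the
// previously installed handler even if $fn throws.
function callMuted(callable $fn)
{
    set_error_handler('muteHandler');
    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}

// The user-level warning raised here is swallowed by the handler.
$result = callMuted(function () {
    trigger_error('noisy warning', E_USER_WARNING);
    return 'done';
});
```

Unlike the shut-up operator, the handler stays active for every call made while it is installed, so it should be restored as soon as the noisy call returns.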
|
public static
string
|
#
unsafeIconv( string $in, string $out, string $text )
iconv wrapper which mutes errors, but doesn't work around bugs.
Parameters
- $in
string $in Input encoding
- $out
string $out Output encoding
- $text
string $text The text to convert
Returns
string
|
public static
string
|
#
iconv( string $in, string $out, string $text, integer $max_chunk_size = 8000 )
iconv wrapper which mutes errors and works around bugs.
Parameters
- $in
string $in Input encoding
- $out
string $out Output encoding
- $text
string $text The text to convert
- $max_chunk_size
integer $max_chunk_size Maximum size of each input segment passed to iconv (default 8000)
Returns
string
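The chunking idea behind $max_chunk_size can be sketched as follows. This is an illustration under stated assumptions, not this method's actual implementation: the helper name chunkedIconv is hypothetical, the input is assumed to be UTF-8, and chunk sizes are counted in bytes.

```php
<?php
// Hypothetical helper: convert $text in pieces of at most $maxChunk bytes,
// backing up so that no UTF-8 sequence is cut in half. Assumes UTF-8 input.
function chunkedIconv($in, $out, $text, $maxChunk = 8000)
{
    $len = strlen($text);
    $result = '';
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $maxChunk, $len);
        // If the byte right after the chunk is a continuation byte
        // (10xxxxxx), the cut falls inside a character: move it back.
        while ($end < $len && $end > $pos && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $pos) {
            // No boundary found (malformed input or a tiny chunk size):
            // fall back to a raw cut.
            $end = min($pos + $maxChunk, $len);
        }
        // Errors are muted here with the shut-up operator for brevity.
        $piece = @iconv($in, $out, substr($text, $pos, $end - $pos));
        if ($piece === false) {
            return false;
        }
        $result .= $piece;
        $pos = $end;
    }
    return $result;
}
```

For example, chunkedIconv('UTF-8', 'ISO-8859-1', $text, 5) converts a run of two-byte characters four bytes at a time, because a cut at byte 5 would land inside a character.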
|
public static
string
|
#
cleanUTF8( string $str, boolean $force_php = false )
Cleans a UTF-8 string for well-formedness and SGML validity
It will parse according to UTF-8 and return a valid UTF8 string, with
non-SGML codepoints excluded.
Parameters
- $str
string $str The string to clean
- $force_php
boolean $force_php
Returns
string
Note
Just for reference, the non-SGML code points are 0 to 31 and 127 to 159,
inclusive. However, we allow code points 9, 10 and 13, which are the tab, line
feed and carriage return respectively. Code points 128 and above map to
multibyte UTF-8 representations.
Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi), at <http://iki.fi/hsivonen/php-utf8/>,
under the LGPL license. Notes on what changed are inside, but in general, the
original code transformed UTF-8 text into an array of integer Unicode
codepoints. Understandably, transforming that back to a string would be somewhat
expensive, so the function was modded to directly operate on the string.
However, this discourages code reuse, and the logic enumerated here would be
useful for any function that needs to be able to understand UTF-8 characters. As
of right now, only smart lossless character encoding converters would need that,
and I'm probably not going to implement them. Once again, PHP 6 should solve all
our problems.
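The code point ranges listed in the note can be illustrated with a short sketch. It is deliberately narrower than cleanUTF8(): the hypothetical helper stripNonSgml rejects malformed input instead of repairing it, and it relies on PCRE's /u modifier rather than on the byte-level parser described above.

```php
<?php
// Hypothetical helper, narrower than cleanUTF8(): returns null for
// malformed UTF-8 instead of repairing it.
function stripNonSgml($str)
{
    // An empty pattern with the /u flag only matches valid UTF-8 subjects.
    if (preg_match('//u', $str) !== 1) {
        return null;
    }
    // Remove code points 0-8, 11, 12, 14-31 and 127-159;
    // tab (9), line feed (10) and carriage return (13) are kept.
    return preg_replace(
        '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u',
        '',
        $str
    );
}
```

Code points 128 to 159 are two-byte sequences in UTF-8, which is why the pattern is written in terms of code points with /u rather than raw bytes.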
|
public static
|
|
public static
boolean
|
|
public static
string
|
|
public static
string
|
#
convertFromUTF8( string $str, HTMLPurifier_Config $config, HTMLPurifier_Context $context )
Converts a string from UTF-8 based on configuration.
Parameters
Returns
string
Note
Currently, this is a lossy conversion: characters that cannot be expressed in
the target encoding are omitted.
|
public static
string
|
#
convertToASCIIDumbLossless( string $str )
Lossless (character-wise) conversion of HTML to ASCII
Parameters
- $str
string $str UTF-8 string to be converted to ASCII
Returns
string ASCII-encoded string with non-ASCII characters entity-ized
Note
Uses decimal numeric entities since they are best supported.
This is a DUMB function: it has no concept of keeping character entities that
the projected character encoding can allow. We could possibly implement a smart
version but that would require it to also know which Unicode codepoints the
charset supported (not an easy task).
Similar in spirit to cleanUTF8(), but it assumes that $str is already well-formed UTF-8.
Warning
Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you
disagree with this license fudgery, implement it yourself.
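As a rough illustration of the conversion described above (not the adapted MediaWiki code), the sketch below replaces every non-ASCII character of a well-formed UTF-8 string with a decimal numeric entity. The helper name toAsciiEntities is hypothetical, and the code point is decoded by hand so that no extension beyond PCRE is assumed.

```php
<?php
// Hypothetical helper: assumes $str is well-formed UTF-8.
function toAsciiEntities($str)
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        // $m[0] is one multibyte character; split it into byte values.
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);
        // Lead byte: keep only the payload bits after the length marker
        // (5 bits for 2-byte, 4 for 3-byte, 3 for 4-byte sequences).
        $cp = $bytes[0] & (0xFF >> ($n + 1));
        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; $i++) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $cp . ';';
    }, $str);
}
```

Like the documented method, this is a dumb conversion: every non-ASCII character becomes an entity, whether or not a target encoding could have represented it directly.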
|
public static
integer
|
#
testIconvTruncateBug( )
glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza
correctly. In particular, rather than ignore characters, it will return an
EILSEQ after consuming some number of characters, and expect you to restart
iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and
returned the fragment, so as a result you would see iconv mysteriously
truncating output. We can work around this by manually chopping our input into
segments of about 8000 characters, as long as PHP ignores the error code. If PHP
starts paying attention to the error code, iconv becomes unusable.
Returns
integer Error code indicating severity of bug.
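One way to probe for the behaviour described above is sketched here, assuming the iconv extension is loaded. The helper probeIconvTruncation and its string labels are hypothetical; the documented method returns an integer code whose values are not listed on this page.

```php
<?php
// Hypothetical probe: returns a label instead of the library's integer code.
function probeIconvTruncation()
{
    // One non-ASCII character (Greek alpha) followed by 9000 ASCII bytes,
    // which is longer than the 8000-character segment size mentioned above.
    $input = "\xCE\xB1" . str_repeat('a', 9000);
    $r = @iconv('UTF-8', 'ASCII//IGNORE', $input);
    if ($r === false) {
        return 'unusable';   // PHP honoured the error and returned nothing
    }
    $len = strlen($r);
    if ($len < 9000) {
        return 'truncates';  // output was silently cut short
    }
    if ($len > 9000) {
        return 'unexpected'; // the alpha was not dropped
    }
    return 'ok';             // alpha dropped, all 9000 ASCII bytes kept
}
```

The result depends on the iconv implementation and PHP version in use, so the probe is meant to be run once and its outcome used to decide whether chunked conversion is needed.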
|
public static
Array
|
#
testEncodingSupportsASCII( string $encoding, boolean $bypass = false )
This expensive function tests whether or not a given character encoding
supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and
require special processing. Variable width encodings shouldn't ever fail.
Parameters
- $encoding
string $encoding Encoding name to test, as per iconv format
- $bypass
boolean $bypass Whether or not to bypass the precompiled arrays.
Returns
Array mapping UTF-8 characters to their corresponding ASCII characters, which can be
used to "undo" any overzealous iconv action.
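A simplified sketch of such a test, assuming the iconv extension: it decodes each printable ASCII byte from the candidate encoding and records the cases where the result is not the same character. The helper name asciiMismatches is hypothetical, and the documented method may apply additional checks and uses precompiled arrays unless $bypass is set.

```php
<?php
// Hypothetical helper: returns UTF-8 character => ASCII byte for every
// printable ASCII byte that $encoding decodes to something else.
function asciiMismatches($encoding)
{
    $ret = array();
    for ($i = 0x20; $i <= 0x7E; $i++) {
        $c = chr($i);
        // Treat the single byte as text in $encoding and decode it to UTF-8.
        $decoded = @iconv($encoding, 'UTF-8', $c);
        if ($decoded !== false && $decoded !== '' && $decoded !== $c) {
            $ret[$decoded] = $c;
        }
    }
    return $ret;
}
```

An empty array means the encoding agrees with ASCII on all printable bytes. The loop calls iconv once per byte, which gives a sense of why the documented method is described as expensive.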
|