Multibyte String Functions

References

Multibyte character encoding schemes and their related issues are fairly complicated, and are beyond the scope of this documentation. Please refer to the following URLs and other resources for further information regarding these topics.

Unicode materials

» http://www.unicode.org/

Japanese/Korean/Chinese character information

» http://examples.oreilly.com/cjkvinfo/doc/cjk.inf

Table of Contents

mb_check_encoding — Check if the string is valid for the specified encoding
mb_convert_case — Perform case folding on a string
mb_convert_encoding — Convert character encoding
mb_convert_kana — Convert "kana" from one form to another ("zen-kaku", "han-kaku" and more)
mb_convert_variables — Convert character code in variable(s)
mb_decode_mimeheader — Decode string in MIME header field
mb_decode_numericentity — Decode HTML numeric string reference to character
mb_detect_encoding — Detect character encoding
mb_detect_order — Set/Get character encoding detection order
mb_encode_mimeheader — Encode string for MIME header
mb_encode_numericentity — Encode character to HTML numeric string reference
mb_encoding_aliases — Get aliases of a known encoding type
mb_ereg_match — Regular expression match for multibyte string
mb_ereg_replace — Replace regular expression with multibyte support
mb_ereg_search_getpos — Returns start point for next regular expression match
mb_ereg_search_getregs — Retrieve the result from the last multibyte regular expression match
mb_ereg_search_init — Setup string and regular expression for a multibyte regular expression match
mb_ereg_search_pos — Returns position and length of a matched part of the multibyte regular expression for a predefined multibyte string
mb_ereg_search_regs — Returns the matched part of a multibyte regular expression
mb_ereg_search_setpos — Set start point of next regular expression match
mb_ereg_search — Multibyte regular expression match for predefined multibyte string
mb_ereg — Regular expression match with multibyte support
mb_eregi_replace — Replace regular expression with multibyte support ignoring case
mb_eregi — Regular expression match ignoring case with multibyte support
mb_get_info — Get internal settings of mbstring
mb_http_input — Detect HTTP input character encoding
mb_http_output — Set/Get HTTP output character encoding
mb_internal_encoding — Set/Get internal character encoding
mb_language — Set/Get current language
mb_list_encodings — Returns an array of all supported encodings
mb_output_handler — Callback function converts character encoding in output buffer
mb_parse_str — Parse GET/POST/COOKIE data and set global variable
mb_preferred_mime_name — Get MIME charset string
mb_regex_encoding — Returns current encoding for multibyte regex as string
mb_regex_set_options — Set/Get the default options for mbregex functions
mb_send_mail — Send encoded mail
mb_split — Split multibyte string using regular expression
mb_strcut — Get part of string
mb_strimwidth — Get truncated string with specified width
mb_stripos — Finds position of first occurrence of a string within another, case insensitive
mb_stristr — Finds first occurrence of a string within another, case insensitive
mb_strlen — Get string length
mb_strpos — Find position of first occurrence of string in a string
mb_strrchr — Finds the last occurrence of a character in a string within another
mb_strrichr — Finds the last occurrence of a character in a string within another, case insensitive
mb_strripos — Finds position of last occurrence of a string within another, case insensitive
mb_strrpos — Find position of last occurrence of a string in a string
mb_strstr — Finds first occurrence of a string within another
mb_strtolower — Make a string lowercase
mb_strtoupper — Make a string uppercase
mb_strwidth — Return width of string
mb_substitute_character — Set/Get substitution character
mb_substr_count — Count the number of substring occurrences
mb_substr — Get part of string

25 User Contributed Notes:

johannesponader at dontspamme dot googlemail dot co
17.10.2010 18:46


Please note that when migrating code to handle UTF-8 encoding, not only the functions mentioned here are useful, but also the function htmlentities() has to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there could be some more functions that need adjustments like this.
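
A minimal sketch of the adjustment described above, assuming the input is valid UTF-8. The sample string is made up for illustration. Passing the encoding explicitly lets htmlentities() treat each multibyte sequence as a single character:

```php
<?php
// Hypothetical sample input, assumed to be valid UTF-8: "café <b>"
$var = "caf\xC3\xA9 <b>";

// The third argument tells htmlentities() how to decode the input bytes.
$escaped = htmlentities($var, ENT_COMPAT, "UTF-8");

echo $escaped, "\n"; // caf&eacute; &lt;b&gt;
```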

phpnet at rcpt dot at
19.08.2010 16:46


<?php
/**
 * Multibyte safe version of trim()
 * Always strips whitespace characters (those equal to \s)
 *
 * @author Peter Johnson
 * @email phpnet@rcpt.at
 * @param $string The string to trim
 * @param $chars Optional list of chars to remove from the string ( as per trim() )
 * @param $chars_array Optional array of preg_quote'd chars to be removed
 * @return string
 */
// Declared as a plain function here; prefix it with "public static" when placing it inside a class.
function mb_trim( $string, $chars = "", $chars_array = array() )
{
    for( $x=0; $x<iconv_strlen( $chars ); $x++ ) $chars_array[] = preg_quote( iconv_substr( $chars, $x, 1 ) );
    $encoded_char_list = implode( "|", array_merge( array( "\s","\t","\n","\r", "\0", "\x0B" ), $chars_array ) );

    $string = mb_ereg_replace( "^($encoded_char_list)*", "", $string );
    $string = mb_ereg_replace( "($encoded_char_list)*$", "", $string );
    return $string;
}
?>

mt at mediamedics dot nl
17.12.2009 14:52


A multibyte one-to-one alternative for the str_split function (http://php.net/manual/en/function.str-split.php):



<?php
    function mb_str_split($string, $split_length = 1){
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        $split_length = ($split_length <= 0) ? 1 : $split_length;
        $mb_strlen = mb_strlen($string, 'utf-8');
        $array = array();

        // Advance by $split_length characters on each pass
        for($i = 0; $i < $mb_strlen; $i += $split_length){
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
?>

peter AT(no spam) dezzignz dot com
30.10.2009 1:26


The function trim() has not failed me so far in my multibyte applications, but in case one needs a truly multibyte function, here it is. The nice thing is that the character to remove can be whitespace or any other specified character, even a multibyte character.



<?php

// multibyte string split
function mbStringToArray ($str) {
    // compare against '' so that the string "0" is not treated as empty
    if ((string) $str === '') return false;
    $len = mb_strlen($str);
    $array = array();
    for ($i = 0; $i < $len; $i++) {
        $array[] = mb_substr($str, $i, 1);
    }
    return $array;
}

// removes $rem at both ends
function mb_trim ($str, $rem = ' ') {
    if ((string) $str === '') return false;
    // convert to array
    $arr = mbStringToArray($str);
    $len = count($arr);
    // left side
    for ($i = 0; $i < $len; $i++) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
    }
    // right side
    for ($i = $len-1; $i >= 0; $i--) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
    }
    // convert to string
    return implode ('', $arr);
}

?>

roydukkey at roydukkey dot com
23.10.2009 6:31


This would be one way to create a multibyte substr_replace function





<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
}
?>

sakai at d4k dot net
26.06.2009 14:46


I hope this mb_str_replace will work for arrays.  Please use mb_internal_encoding() beforehand, if you need to change the encoding.



Thanks to marc at ermshaus dot org for the original.



<?php

if(!function_exists('mb_str_replace')) {

    function mb_str_replace($search, $replace, $subject) {

        if(is_array($subject)) {
            $ret = array();
            foreach($subject as $key => $val) {
                $ret[$key] = mb_str_replace($search, $replace, $val);
            }
            return $ret;
        }

        foreach((array) $search as $key => $s) {
            if($s == '') {
                continue;
            }
            $r = !is_array($replace) ? $replace : (array_key_exists($key, $replace) ? $replace[$key] : '');
            $pos = mb_strpos($subject, $s);
            while($pos !== false) {
                $subject = mb_substr($subject, 0, $pos) . $r . mb_substr($subject, $pos + mb_strlen($s));
                $pos = mb_strpos($subject, $s, $pos + mb_strlen($r));
            }
        }

        return $subject;

    }

}

?>

mitgath at gmail dot com
30.04.2009 15:26


According to http://bugs.php.net/bug.php?id=21317, here is the missing function:

<?php
function mb_str_pad ($input, $pad_length, $pad_string, $pad_style, $encoding="UTF-8") {
    // str_pad() counts bytes, so add the difference between byte length and character length
    return str_pad($input,
        strlen($input)-mb_strlen($input,$encoding)+$pad_length, $pad_string, $pad_style);
}
?>

Ben XO
17.11.2008 2:14


PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.





<?php
    /**
     * Trim characters from either (or both) ends of a string in a way that is
     * multibyte-friendly.
     *
     * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
     * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
     * course, the added bonus that you can put unicode characters in the charlist.
     *
     * We are using a PCRE character-class to do the trimming in a unicode-aware
     * way, so we must escape ^, \, - and ] which have special meanings here.
     * As you would expect, a single \ in the charlist is interpreted as
     * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
     * you can ignore this detail.
     *
     * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
     * because they can be extremely useful when dealing with UCS. '\pZ', for example,
     * matches every 'separator' character defined in Unicode, including non-breaking
     * and zero-width spaces.
     *
     * It doesn't make sense to have two or more of the same character in a character
     * class, therefore we interpret a double \ in the character list to mean a
     * single \ in the regex, allowing you to safely mix normal characters with PCRE
     * special classes.
     *
     * *Be careful* when using this bonus feature, as PHP also interprets backslashes
     * as escape characters before they are even seen by the regex. Therefore, to
     * specify '\\s' in the regex (which will be converted to the special character
     * class '\s' for trimming), you will usually have to put *4* backslashes in the
     * PHP code - as you can see from the default value of $charlist.
     *
     * @param string
     * @param charlist list of characters to remove from the ends of this string.
     * @param boolean trim the left?
     * @param boolean trim the right?
     * @return String
     */
    function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
    {
        $both_ends = $ltrim && $rtrim;

        $char_class_inner = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
            array( '\\\\\\0', '\\' ),
            $charlist
        );

        $work_horse = '[' . $char_class_inner . ']+';
        $ltrim && $left_pattern = '^' . $work_horse;
        $rtrim && $right_pattern = $work_horse . '$';

        if($both_ends)
        {
            $pattern_middle = $left_pattern . '|' . $right_pattern;
        }
        elseif($ltrim)
        {
            $pattern_middle = $left_pattern;
        }
        else
        {
            $pattern_middle = $right_pattern;
        }

        return preg_replace("/$pattern_middle/usSD", '', $string);
    }
?>

marc at ermshaus dot org
4.10.2008 0:05


A small correction to patrick at hexane dot org's mb_str_replace function. The original function does not work as intended in case $replacement contains $needle.



<?php
function mb_str_replace($needle, $replacement, $haystack)
{
    $needle_len = mb_strlen($needle);
    $replacement_len = mb_strlen($replacement);
    $pos = mb_strpos($haystack, $needle);
    while ($pos !== false)
    {
        $haystack = mb_substr($haystack, 0, $pos) . $replacement
                . mb_substr($haystack, $pos + $needle_len);
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
    }
    return $haystack;
}
?>

patrick at hexane dot org
27.06.2008 17:18


I wonder why there isn't a mb_str_replace().  Here's one for now:



function mb_str_replace( $needle, $replacement, $haystack ) {
  $needle_len = mb_strlen($needle);
  $pos = mb_strpos( $haystack, $needle);
  while (!($pos ===false)) {
    $front = mb_substr( $haystack, 0, $pos );
    $back  = mb_substr( $haystack, $pos + $needle_len);
    $haystack = $front.$replacement.$back;
    $pos = mb_strpos( $haystack, $needle);
  }
  return $haystack;
}

Smelly
26.04.2007 7:09


Below is some code to output a UTF-8 encoded CSV in a way understandable by Excel. It requires iconv instead of mbstring.



header("Content-type: application/octet-stream");
header("Content-Transfer-Encoding: binary");
header("Content-Disposition: attachment; filename=report.xls");

// assume $tmpString contains UTF-8 encoded CSV:
$tmpString =  iconv ( 'UTF-8', 'UTF-16LE//IGNORE', $tmpString );

print chr(255).chr(254).$tmpString;

chris at maedata dot com
25.04.2007 6:50


The opposite of what Eugene Murai wrote in a previous comment is true when importing/uploading a file. For instance, if you export an Excel spreadsheet using the Save As Unicode Text option, you can use the following to convert it to UTF-8 after uploading:



//Convert file to UTF-8 in case Windows mucked it up

$file = explode( "\n", mb_convert_encoding( trim( file_get_contents( $_FILES['file']['tmp_name'] ) ), 'UTF-8', 'UTF-16' ) );

mdoocy at u dot washington dot edu
14.03.2007 19:30


Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
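
One way to work around the cost described above is to decode the string into an array of characters once and then index the array. The sketch below assumes UTF-8 input; the helper name utf8_chars() and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper: decode a UTF-8 string into an array of characters in one pass.
function utf8_chars($str)
{
    // The u modifier makes PCRE treat the subject as UTF-8;
    // the empty pattern splits between every character.
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$text  = "na\xC3\xAFve \xE6\x97\xA5\xE6\x9C\xAC"; // "naïve 日本" in UTF-8
$chars = utf8_chars($text);

// Array access is constant time, so this loop avoids calling
// mb_substr($text, $i, 1) for every index.
for ($i = 0, $n = count($chars); $i < $n; $i++) {
    echo $i, ': ', $chars[$i], "\n";
}
```

The trade-off is memory: the array holds one PHP string per character, so this suits strings that are visited many times rather than very large inputs read once.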

motin at demomusic dot nu
16.02.2007 14:24


Follow-up to my last note from 2007-jan-20: http://se2.php.net/manual/en/function.mb-strlen.php#72979

That note shows the correct way of simulating single-byte strlen, as well as some pitfalls to watch out for when developing in an environment where mbstring.func_overload is enabled.

motin at demomusic dot nu
20.01.2007 2:12


As peter dot albertsson at spray dot se already pointed out, overloading strlen may break code that handles binary data and relies upon strlen for bytelengths. 



The problem occurs when a file is filled with a string using fwrite in the following manner:



$len = strlen($data);

fwrite($fp, $data, $len);



fwrite takes the number of bytes as its third parameter, but mb_strlen returns the number of characters in the string. Since a multibyte character can be more than one byte long, the last characters of $data never get written to the file.



After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found. 



I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:



/**
* PHP single-byte functions simulation (not successful)
*
* Usage: sb_string(functionname, arg1, arg2, etc);
* Example: sb_string("strlen", "tuöéä"); returns 8 (should...)
*/
function sb_string() {

  $arguments = func_get_args();

  $func_overloading = ini_get("mbstring.func_overload");

  ini_set("mbstring.func_overload", 0);

  $ret = call_user_func_array(array_shift($arguments), $arguments);

  ini_set("mbstring.func_overload", $func_overloading);

  return $ret;

}
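
For the fwrite() case described above, one commonly used approach is to ask mbstring for the length in the '8bit' encoding, which counts bytes rather than characters. The sketch below assumes mbstring is loaded; the helper name byte_length() and the sample data are made up for illustration, and an in-memory stream stands in for a real file.

```php
<?php
// Hypothetical helper: byte length of a string. The '8bit' encoding
// treats every byte as one character, so the result is a byte count.
function byte_length($data)
{
    return mb_strlen($data, '8bit');
}

$data = "tu\xC3\xB6\xC3\xA9\xC3\xA4"; // "tuöéä" in UTF-8: 5 characters, 8 bytes

$fp = fopen('php://memory', 'w+');
fwrite($fp, $data, byte_length($data)); // the third argument is a byte count
rewind($fp);
$written = stream_get_contents($fp);
fclose($fp);

echo byte_length($data), "\n";        // 8
echo mb_strlen($data, 'UTF-8'), "\n"; // 5
```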

pdezwart .at. snocap
10.10.2006 20:28


If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE
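
A short sketch of that conversion, assuming the source string is UTF-8. The sample value is made up for illustration. Each character becomes two bytes, low byte first:

```php
<?php
// Hypothetical sample input, assumed to be UTF-8: "Aé"
$utf8 = "A\xC3\xA9";

// UCS-2LE: two bytes per character, little-endian, no byte order mark.
$bytes = mb_convert_encoding($utf8, 'UCS-2LE', 'UTF-8');

echo bin2hex($bytes), "\n"; // 4100e900
```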

hayk at mail dot ru
17.08.2006 21:36


Since PHP 5.1.0 and PHP 4.4.2 there is an Armenian ArmSCII-8 (ArmSCII-8, ArmSCII8, ARMSCII-8, ARMSCII8) encoding available.

daniel at softel dot jp
24.07.2006 13:41


Note that although "multi-byte" hints at total internationalization, the mb_ API was designed by a Japanese person to support the Japanese language.



Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.



It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.



I don't mean any disrespect to the mb_ API because I'm using it everyday and I appreciate its usefulness, but maybe a better name would be the jp_ API.

Aardvark
13.03.2006 20:37


Since not all hosted services currently support the multi-byte function set, it may still be necessary to process Unicode strings using standard single-byte functions. The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html - shows by example how to do this. While this only covers UTF-8, the standard PHP function "iconv" allows conversion into and out of UTF-8 if strings need to be input or output in other encodings.

peter kehl
9.03.2006 17:34


UTF-16LE solution for CSV for Excel by Eugene Murai works well:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');



However, Excel on Mac OS X then doesn't identify columns properly and puts each whole row in its own cell. To fix that, use the TAB character "\t" as the CSV delimiter rather than a comma or colon.



You may also want to use HTTP encoding header, such as

header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
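
Putting the pieces of this note together, a minimal sketch might look like the following. It assumes the rows are already UTF-8; the helper name excel_unicode_text() and the sample rows are made up for illustration, and fields containing tabs or line breaks are not handled.

```php
<?php
// Hypothetical helper: build tab-delimited text as UTF-16LE with a leading BOM.
function excel_unicode_text(array $rows)
{
    $lines = array();
    foreach ($rows as $row) {
        // Sketch only: fields that contain tabs or newlines are not quoted.
        $lines[] = implode("\t", $row);
    }
    $utf8 = implode("\r\n", $lines) . "\r\n";

    // BOM FF FE, followed by the little-endian UTF-16 data.
    return chr(255) . chr(254) . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$out = excel_unicode_text(array(
    array('Name', 'City'),
    array("Zo\xC3\xAB", "K\xC3\xB6ln"), // "Zoë", "Köln"
));

// In a web context, send the Content-type header before printing $out.
```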

15.08.2005 4:24


Get the string's size in octets when mbstring.func_overload is set to 2:

<?php
function str_sizeof($string) {
    // Without the u modifier PCRE matches byte by byte; the s modifier lets "." match newlines as well
    return count(preg_split("`.`s", $string)) - 1 ;
}
?>

In reply to peter albertsson: once you have the octet size of your data, you can access each octet with
$string[0] ... $string[$size-1], since the [ operator is not multibyte-aware.

peter dot albertsson at spray dot se
21.05.2005 12:43


Setting mbstring.func_overload = 2 may break your applications that deal with binary data.



After having set mbstring.func_overload = 2 and  mbstring.internal_encoding = UTF-8 I can't even read a binary file and print/echo it to output without corrupting it.

nzkiwi at NOSPAMmte dot biglobe dot ne dot jp
14.04.2005 1:37


A friend has pointed out that the entry "mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page appears to be wrong: above Example 4 it says that "There is no way to control HTTP input character conversion from PHP script. To disable HTTP input character conversion, it has to be done in php.ini".

Also, the table shows the defaults for old PHP versions:

;; Disable HTTP Input conversion
mbstring.http_input = pass

*BUT* (for PHP 4.3.0 or higher)

;; Disable HTTP Input conversion
mbstring.encoding_translation = Off

Eugene Murai
24.02.2005 7:20


PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put a BOM in front of the UTF-16LE string.



Example:



$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');

Geoffrey
1.02.2005 9:59


For Windows users, php_mbstring can be added as follows:

If you have downloaded the "short" version of PHP (php-4.3.10-installer.exe), download the full version (php-4.3.10-Win32.zip).

Unzip it, find php_mbstring.dll in f:\php-4.3.10-Win32\extensions, and copy it across to your php\extensions directory.

Use Notepad to open your PHP.INI.

Change the extension_dir line to read
extension_dir = "e:\php\extensions\"  (or whatever your directory is called)

Remove the semicolon on the line
 ; extension=php_mbstring.dll

Save PHP.INI and restart PHP.
