PHP documentation: Perform a global regular expression match (preg_match_all)

preg_match_all

(PHP 4, PHP 5)

preg_match_all - Perform a global regular expression match

Description

int preg_match_all ( string $pattern , string $subject , array &$matches [, int $flags = PREG_PATTERN_ORDER [, int $offset = 0 ]] )

Searches subject for all matches of the regular expression given in pattern and stores them in matches in the order specified by flags.

After the first match is found, each subsequent search continues from the end of the last match.
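
A minimal sketch of this behaviour (the pattern and subject are illustrative assumptions, not part of the manual): because scanning resumes at the end of the previous match, overlapping occurrences are not reported.

```php
<?php
// 'aaaa' contains three overlapping occurrences of 'aa' (at offsets 0, 1
// and 2), but only the two non-overlapping ones are found.
$count = preg_match_all('/aa/', 'aaaa', $matches);
echo $count, "\n";
print_r($matches[0]);
```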

Parameters

pattern

The pattern to search for, as a string.

subject

The input string.

matches

A multidimensional array of all matches, ordered according to flags.

flags

Can be a combination of the following flags (note that it makes no sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER):

PREG_PATTERN_ORDER

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $ausgabe, PREG_PATTERN_ORDER);
echo $ausgabe[0][0] . ", " . $ausgabe[0][1] . "\n";
echo $ausgabe[1][0] . ", " . $ausgabe[1][1] . "\n";
?>

The above example will output:

<b>example: </b>, <div align=left>this is a test</div>
example: , this is a test

So $ausgabe[0] contains an array of strings that matched the full pattern, and $ausgabe[1] contains an array of strings enclosed by tags.

PREG_SET_ORDER

Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on.

<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=\"left\">this is a test</div>",
    $ausgabe, PREG_SET_ORDER);
echo $ausgabe[0][0] . ", " . $ausgabe[0][1] . "\n";
echo $ausgabe[1][0] . ", " . $ausgabe[1][1] . "\n";
?>

The above example will output:

<b>example: </b>, example:
<div align="left">this is a test</div>, this is a test

PREG_OFFSET_CAPTURE

If this flag is passed, the offset of each match within the subject string is returned along with the match. Note that this changes the value of matches into an array in which every element is an array consisting of the matched string at index 0 and its position in subject at index 1.

If no order flag is given, PREG_PATTERN_ORDER is assumed.
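
As a small illustration of the array shape PREG_OFFSET_CAPTURE produces (the pattern and subject are assumptions chosen for this sketch):

```php
<?php
$count = preg_match_all('/o+/', 'foo boo', $matches, PREG_OFFSET_CAPTURE);

// Each entry is array(matched string, byte offset within the subject).
foreach ($matches[0] as $m) {
    echo $m[0], ' at ', $m[1], "\n";
}
```

With the default PREG_PATTERN_ORDER, the pairs for the full pattern sit in $matches[0]; a subpattern's pairs would sit in $matches[1], and so on.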

offset

Normally, the search starts at the beginning of the subject string. The optional parameter offset can be used to specify a different position, in bytes, from which to start the search.

Note:

Using offset is not equivalent to passing substr($zeichenkette, $versatz) to preg_match_all() in place of the subject string, because pattern can contain assertions such as ^, $ or (?<=x). See preg_match() for examples.
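
The difference can be sketched with a pattern anchored by ^ (the subject string below is an assumption made for illustration):

```php
<?php
$subject = 'xabc';

// With offset 1, ^ still refers to the real start of $subject, so
// nothing matches.
$withOffset = preg_match_all('/^abc/', $subject, $m1, PREG_PATTERN_ORDER, 1);

// With substr() the string handed to PCRE itself starts at "abc", so
// the anchored pattern matches once.
$withSubstr = preg_match_all('/^abc/', substr($subject, 1), $m2);

var_dump($withOffset, $withSubstr);
```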

Return Values

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
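
Since 0 and FALSE are both falsy, a strict comparison is needed to tell "no match" from "error". A minimal sketch, using a deliberately malformed pattern to provoke the error case (the @ operator only hides the resulting warning):

```php
<?php
// No digits in the subject: zero matches, but no error.
$none = preg_match_all('/\d+/', 'no digits here', $m);

// An unterminated group makes pattern compilation fail.
$error = @preg_match_all('/(/', 'abc', $m);

var_dump($none === 0);
var_dump($error === false);
```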

Changelog

Version  Description
5.2.2    Named subpatterns now accept the syntaxes (?<name>) and (?'name') as well as (?P<name>). Previous versions accepted only (?P<name>).
4.3.3    Added the offset parameter.
4.3.0    Added the PREG_OFFSET_CAPTURE flag.

Examples

Example #1 Getting all phone numbers out of some text.

<?php
preg_match_all("/\(?  (\d{3})?  \)?  (?(1)  [\-\s] ) \d{3}-\d{4}/x",
               "Call 555-1212 or 1-800-555-1212", $telefon);
?>
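
The manual shows no output for this example. The following sketch repeats the call with an English subject string (an assumption) and prints the full-pattern matches, to make the effect of the conditional subpattern (?(1) ...) visible: a separator is only required when an area code was captured.

```php
<?php
preg_match_all('/\(?  (\d{3})?  \)?  (?(1)  [\-\s] ) \d{3}-\d{4}/x',
               'Call 555-1212 or 1-800-555-1212', $telefon);

// $telefon[0] holds the full matches, $telefon[1] the optional area codes.
print_r($telefon[0]);
```

The leading "1-" of the second number is not part of the match, because the pattern has no provision for a country prefix.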

Example #2 Find matching HTML tags (greedy)

<?php
// The \\2 is an example of backreferencing. It tells pcre that the
// regular expression must match whatever was found for the second set of
// parentheses, which is ([\w]+) in this case.
// The extra backslash is required because the string is in double quotes.
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $treffer, PREG_SET_ORDER);

foreach ($treffer as $wert) {
  echo "matched: " . $wert[0] . "\n";
  echo "part 1: " . $wert[1] . "\n";
  echo "part 2: " . $wert[2] . "\n";
  echo "part 3: " . $wert[3] . "\n";
  echo "part 4: " . $wert[4] . "\n\n";
}
?>

The above example will output:

matched: <b>bold text</b>
part 1: <b>
part 2: b
part 3: bold text
part 4: </b>

matched: <a href=howdy.html>click me</a>
part 1: <a href=howdy.html>
part 2: a
part 3: click me
part 4: </a>

Example #3 Using named subpatterns

<?php

$str = <<<FOO
a: 1
b: 2
c: 3
FOO;

preg_match_all('/(?P<name>\w+): (?P<zahl>\d+)/', $str, $treffer);

/* The following also works as of PHP 5.2.2 (PCRE 7.0); the form above
 * is nevertheless recommended for backwards compatibility. */
// preg_match_all('/(?<name>\w+): (?<zahl>\d+)/', $str, $treffer);

print_r($treffer);

?>

The above example will output:

Array
(
    [0] => Array
        (
            [0] => a: 1
            [1] => b: 2
            [2] => c: 3
        )

    [name] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )

    [1] => Array
        (
            [0] => a
            [1] => b
            [2] => c
        )

    [zahl] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )

    [2] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )

)

See Also

  • preg_match() - Perform a regular expression match
  • preg_replace() - Perform a regular expression search and replace
  • preg_split() - Split a string by a regular expression
  • preg_last_error() - Returns the error code of the last PCRE regex execution


18 user notes:
buuh
6.12.2010 11:03
if you want to extract all {token}s from a string:

<?php
$pattern = "/{[^}]*}/";
$subject = "{token1} foo {token2} bar";
preg_match_all($pattern, $subject, $matches);
print_r($matches);
?>

output:

Array
(
    [0] => Array
        (
            [0] => {token1}
            [1] => {token2}
        )

)
no at bo dot dy
8.09.2010 20:23
For parsing queries with entities use:

<?php
preg_match_all("/(?:^|(?<=\&(?![a-z]+\;)))([^\=]+)=(.*?)(?:$|\&(?![a-z]+\;))/i",
    $s, $m, PREG_SET_ORDER);
?>
avengis at gmail dot com
23.09.2009 11:25
The following function works with almost any complex xml/xhtml string

<?php
/**
* Find and close unclosed xml tags
**/
function close_tags($text) {
    $patt_open  = "%((?<!</)(?<=<)[\s]*[^/!>\s]+(?=>|[\s]+[^>]*[^/]>)(?!/>))%";
    $patt_close = "%((?<=</)([^>]+)(?=>))%";
    if (preg_match_all($patt_open, $text, $matches))
    {
        $m_open = $matches[1];
        if (!empty($m_open))
        {
            preg_match_all($patt_close, $text, $matches2);
            $m_close = $matches2[1];
            if (count($m_open) > count($m_close))
            {
                $m_open = array_reverse($m_open);
                // Count how often each tag name is closed
                $c_tags = array_count_values($m_close);
                foreach ($m_open as $tag) {
                    // Start missing counters at 0 to avoid undefined-index notices
                    if (!isset($c_tags[$tag])) $c_tags[$tag] = 0;
                    if ($c_tags[$tag]-- <= 0) $text .= '</' . $tag . '>';
                }
            }
        }
    }
    return $text;
}
?>
royaltm75 at gmail dot com
13.09.2009 23:44
I have received complaints that my html2a() code (see below) doesn't work in some cases.
The problem, however, lies not with the algorithm or the procedure, but with the PCRE recursion stack limits.

If you use recursive PCRE (?R) you should remember to increase those two ini settings:

ini_set('pcre.backtrack_limit', 10000000);
ini_set('pcre.recursion_limit', 10000000);

But be warned: (from php.ini)

;Please note that if you set this value to a high number you may consume all
;the available process stack and eventually crash PHP (due to reaching the
;stack size limit imposed by the Operating System).

I have written this example mainly to demonstrate the power of the PCRE LANGUAGE, not the power of its implementation :)

But if you like it, use it, at your own risk of course.
elyknosrac at gmail dot com
19.07.2009 0:51
Using preg_match_all I made a pretty handy function.

<?php

function reg_smart_replace($pattern, $replacement, $subject, $replacementChar = '$$$', $limit = -1)
{
    if (!$pattern || !$subject || !$replacement) { return false; }

    preg_match_all($pattern, $subject, $matches);

    // Only the full-pattern matches are needed; drop duplicates so that
    // the same text is not wrapped twice, then honour $limit if given
    $found = array_unique($matches[0]);
    if ($limit > -1) {
        $found = array_slice($found, 0, $limit);
    }
    foreach ($found as $match) {
        // Plain string replacement; the original relied on ereg_replace(),
        // which is deprecated and treats the match as a regular expression
        $rep = str_replace($replacementChar, $match, $replacement);
        $subject = str_replace($match, $rep, $subject);
    }

    return $subject;
}
?>

This function can turn blocks of text into clickable links or whatever.  Example:

<?php
reg_smart_replace(EMAIL_REGEX, '<a href="mailto:$$$">$$$</a>', $description);
?>
will turn all email addresses into actual links.

Just put $$$ wherever the text found by the regex should appear.  If you can't use $$$, pass a different placeholder as the 4th parameter, $replacementChar.
ad
1.04.2009 15:18
I have made up a simple function to extract a phone number from a string.

I am not sure how good it is, but it works.

It keeps only the digits 0-9 and the characters "-", " ", "(", ")", "." and "+". As far as I know these are the most widely used characters in a phone number.

<?php
function clean_phone_number($phone) {
    if (!empty($phone)) {
        // Collect every character that may appear in a phone number
        preg_match_all('/[0-9\(\)+.\- ]/s', $phone, $cleaned);
        // Join the single-character matches back into one string
        $ready = implode('', $cleaned[0]);
        // Accept only results of a plausible length
        if (mb_strlen($ready) > 4 && mb_strlen($ready) <= 25) {
            return $ready;
        }
    }
    return false;
}
?>
royaltm75 at NOSPAM dot gmail dot com
21.02.2009 11:55
The power of pregs is limited only by your *imagination* :)
I wrote this html2a() function using preg recursive match (?R) which provides quite safe and bulletproof html/xml extraction:
<?php
function html2a ( $html ) {
  if ( !preg_match_all( '
@
\<\s*?(\w+)((?:\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?)\>
((?:(?>[^\<]*)|(?R))*)
\<\/\s*?\\1(?:\b[^\>]*)?\>
|\<\s*(\w+)(\b(?:\'[^\']*\'|"[^"]*"|[^\>])*)?\/?\>
@uxis', $html = trim($html), $m, PREG_OFFSET_CAPTURE | PREG_SET_ORDER) )
    return $html;
  $i = 0;
  $ret = array();
  foreach ($m as $set) {
    if ( strlen( $val = trim( substr($html, $i, $set[0][1] - $i) ) ) )
      $ret[] = $val;
    $val = $set[1][1] < 0
      ? array( 'tag' => strtolower($set[4][0]) )
      : array( 'tag' => strtolower($set[1][0]), 'val' => html2a($set[3][0]) );
    if ( preg_match_all( '
/(\w+)\s*(?:=\s*(?:"([^"]*)"|\'([^\']*)\'|(\w+)))?/usix
', isset($set[5]) && $set[2][1] < 0
  ? $set[5][0]
  : $set[2][0]
  , $attrs, PREG_SET_ORDER ) ) {
      foreach ($attrs as $a) {
        $val['attr'][$a[1]]=$a[count($a)-1];
      }
    }
    $ret[] = $val;
    $i = $set[0][1]+strlen( $set[0][0] );
  }
  $l = strlen($html);
  if ( $i < $l )
    if ( strlen( $val = trim( substr( $html, $i, $l - $i ) ) ) )
      $ret[] = $val;
  return $ret;
}
?>

Now let's try it with this example: (there are some really nasty xhtml compliant bugs, but ... we shouldn't worry)

<?php
$html = <<<EOT
some leftover text...
     < DIV class=noCompliant style = "text-align:left;" >
... and some other ...
< dIv > < empty>  </ empty>
  <p> This is yet another text <br  >
     that wasn't <b>compliant</b> too... <br   />
     </p>
 <div class="noClass" > this one is better but we don't care anyway </div ><P>
    <input   type= "text"  name ='my "name' value  = "nothin really." readonly>
end of paragraph </p> </Div>   </div>   some trailing text
EOT;

$a = html2a($html);
//now we will make some neat html out of it
echo a2html($a);

function a2html ( $a, $in = "" ) {
  if ( is_array($a) ) {
    $s = "";
    foreach ($a as $t)
      if ( is_array($t) ) {
        $attrs="";
        if ( isset($t['attr']) )
          foreach( $t['attr'] as $k => $v )
            $attrs.=" {$k}=".( strpos( $v, '"' )!==false ? "'$v'" : "\"$v\"" );
        $s.= $in."<".$t['tag'].$attrs.( isset( $t['val'] ) ? ">\n".a2html( $t['val'], $in."  " ).$in."</".$t['tag'] : "/" ).">\n";
      } else
        $s.= $in.$t."\n";
  } else {
    $s = empty($a) ? "" : $in.$a."\n";
  }
  return $s;
}
?>
This produces:
some leftover text...
<div class="noCompliant" style="text-align:left;">
  ... and some other ...
  <div>
    <empty>
    </empty>
    <p>
      This is yet another text
      <br/>
      that wasn't
      <b>
        compliant
      </b>
      too...
      <br/>
    </p>
    <div class="noClass">
      this one is better but we don't care anyway
    </div>
    <p>
      <input type="text" name='my "name' value="nothin really." readonly="readonly"/>
      end of paragraph
    </p>
  </div>
</div>
some trailing text
meaneye at mail dot com
15.10.2008 11:56
Recently I had to write a search engine in Hebrew and ran into a huge number of problems. My data was stored in a MySQL table with utf8_bin encoding.

So, to be able to write Hebrew into a utf8 table you need to do
<?php
$prepared_text = addslashes(utf8_encode($text));
?>

But then I had to find out whether some word exists in the stored text. This is where I got stuck. A simple preg_match would not find the text, since Hebrew doesn't work that easily. I tried /u and who knows what else.

The solution was somewhat logical and simple...
<?php
$db_text = bin2hex(stripslashes(utf8_decode($db_text)));
$word = bin2hex($word);

$found = preg_match_all("/($word)+/i", $db_text, $matches);
?>

I used preg_match_all since it returns the number of occurrences, so I could sort the search results according to that.

Hope someone finds this useful!
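
For comparison, here is a sketch of counting occurrences directly with the /u modifier. It assumes that both the text and the search word are valid UTF-8 from end to end, which may not have held for the data described above (it had been passed through utf8_encode()); the sample strings are made up for illustration.

```php
<?php
$text = 'שלום עולם, שלום לכולם';
$word = 'שלום';

// preg_quote() escapes any regex metacharacters in the word; the /u
// modifier makes PCRE treat pattern and subject as UTF-8.
$count = preg_match_all('/' . preg_quote($word, '/') . '/u', $text, $matches);

echo $count, "\n";
```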
MonkeyMan
7.10.2008 10:25
Here is a way to match everything on the page, performing an action for each match as you go. I had used this idiom in other languages, where its use is customary, but in PHP it seems to be not quite as common.

<?php
function custom_preg_match_all($pattern, $subject)
{
    $offset = 0;
    $match_count = 0;
    while (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $offset))
    {
        // Increment counter
        $match_count++;

        // Get byte offset and byte length (assuming single byte encoded)
        $match_start = $matches[0][1];
        $match_length = strlen($matches[0][0]);

        // (Optional) Transform $matches to the format it is usually set as (without PREG_OFFSET_CAPTURE set)
        $new_matches = array();
        foreach ($matches as $k => $match) $new_matches[$k] = $match[0];
        $matches = $new_matches;

        // Your code here
        echo "Match number $match_count, at byte offset $match_start, $match_length bytes long: ".$matches[0]."\r\n";

        // Update offset to the end of the match; step at least one byte
        // forward so that an empty match cannot stall the loop
        $offset = $match_start + max($match_length, 1);
    }

    return $match_count;
}
?>

Note that the offsets returned are byte values (not necessarily number of characters) so you'll have to make sure the data is single-byte encoded. (Or have a look at paolo mosna's strByte function on the strlen manual page).
I'd be interested to know how this method performs speedwise against using preg_match_all and then recursing through the results.
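
For comparison, the same per-match processing can be sketched with a single preg_match_all() call and a loop over the result sets. This says nothing about relative speed; the subject and pattern are assumptions chosen for illustration.

```php
<?php
$subject = 'one two three';
preg_match_all('/\w+/', $subject, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$lines = array();
foreach ($sets as $n => $set) {
    // $set[0] is array(full match, byte offset)
    list($text, $start) = $set[0];
    $lines[] = sprintf('Match number %d, at byte offset %d, %d bytes long: %s',
        $n + 1, $start, strlen($text), $text);
}
echo implode("\n", $lines), "\n";
```

The trade-off is memory: preg_match_all() materialises every match before the loop starts, whereas the preg_match() loop above holds only one match at a time.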
sledge NOSPAM
19.06.2008 22:46
Perhaps you want to find the positions of all anchor tags.  This function returns a two-dimensional array holding the starting and ending position of each tag.

<?php
function getTagPositions($strBody)
{
    // Define the debug constants only once, with quoted names
    if (!defined('DEBUG')) define('DEBUG', false);
    if (!defined('DEBUG_FILE_PREFIX')) define('DEBUG_FILE_PREFIX', "/tmp/findlinks_");

    preg_match_all("/<[^>]+>(.*)<\/[^>]+>/U", $strBody, $strTag, PREG_PATTERN_ORDER);
    $intOffset = 0;
    $intIndex = 0;
    $intTagPositions = array();

    foreach ($strTag[0] as $strFullTag) {
        if (DEBUG == true) {
            $fhDebug = fopen(DEBUG_FILE_PREFIX.time(), "a");
            fwrite($fhDebug, $strFullTag."\n");
            fwrite($fhDebug, "Starting position: ".strpos($strBody, $strFullTag, $intOffset)."\n");
            fwrite($fhDebug, "Ending position: ".(strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag))."\n");
            fwrite($fhDebug, "Length: ".strlen($strFullTag)."\n\n");
            fclose($fhDebug);
        }
        $intTagPositions[$intIndex] = array('start' => (strpos($strBody, $strFullTag, $intOffset)), 'end' => (strpos($strBody, $strFullTag, $intOffset) + strlen($strFullTag)));
        $intOffset += strlen($strFullTag);
        $intIndex++;
    }
    return $intTagPositions;
}

$strBody = 'I have lots of <a href="http://my.site.com">links</a> on this <a href="http://my.site.com">page</a> that I want to <a href="http://my.site.com">find</a> the positions.';

$strBody = strip_tags(html_entity_decode($strBody), '<a>');
$intTagPositions = getTagPositions($strBody);
print_r($intTagPositions);

/*****
Output:

Array (
    [0] => Array (
        [start] => 15
        [end] => 53 )
    [1] => Array (
        [start] => 62
        [end] => 99 )
    [2] => Array (
        [start] => 115
        [end] => 152 )
 )
*****/
?>
spambegone at cratemedia dot com
21.04.2008 8:39
I found simpleXML to be useful only in cases where the XML was extremely small, otherwise the server would run out of memory (I suspect there is a memory leak or something?). So while searching for alternative parsers, I decided to try a simpler approach. I don't know how this compares with cpu usage, but I know it works with large XML structures. This is more a manual method, but it works for me since I always know what structure of data I will be receiving.

Essentially I just preg_match() unique nodes to find the values I am looking for, or I preg_match_all to find multiple nodes. This puts the results in an array and I can then process this data as I please.

I was unhappy though, that preg_match_all() stores the data twice (requiring twice the memory), one array for all the full pattern matches, and one array for all the sub pattern matches. You could probably write your own function that overcame this. But for now this works for me, and I hope it saves someone else some time as well.

// SAMPLE XML
<RETS ReplyCode="0" ReplyText="Operation Successful">
  <COUNT Records="14" />
  <DELIMITER value="09" />
  <COLUMNS>PropertyID</COLUMNS>
  <DATA>521897</DATA>
  <DATA>677208</DATA>
  <DATA>686037</DATA>
</RETS>

<?PHP

// SAMPLE FUNCTION
function parse_xml($xml) {

    // GET DELIMITER (single instance)
    $match_res = preg_match('/<DELIMITER value ?= ?"(.*)" ?\/>/', $xml, $matches);
    if (!empty($matches[1])) {
        $results["delimiter"] = chr($matches[1]);
    } else {
        // DEFAULT DELIMITER
        $results["delimiter"] = "\t";
    }
    unset($match_res, $matches);

    // GET MULTIPLE DATA NODES (multiple instances)
    $results["data_count"] = preg_match_all("/<DATA>(.*)<\/DATA>/", $xml, $matches);
    // GET MATCHES OF SUB PATTERN, DISCARD THE REST
    $results["data"] = $matches[1];
    unset($match_res, $matches);

    // UNSET XML TO SAVE MEMORY (should unset outside the function as well)
    unset($xml);

    // RETURN RESULTS ARRAY
    return $results;
}

?>
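
One possible way around the duplicate storage mentioned above is to assert the surrounding tags with lookbehind and lookahead instead of capturing a subpattern. This is only a sketch under the assumption that the node values contain no nested tags; the sample data is taken from the XML shown earlier in this note.

```php
<?php
$xml = "<DATA>521897</DATA>\n<DATA>677208</DATA>\n<DATA>686037</DATA>";

// The lookbehind and lookahead check for the tags without consuming them,
// so $matches[0] already holds the bare values and no second array for a
// subpattern is built.
$count = preg_match_all('/(?<=<DATA>).*?(?=<\/DATA>)/', $xml, $matches);

print_r($matches);
```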
bruha
4.03.2008 9:13
To count the length of a UTF-8 string, I use

$count = preg_match_all("/[[:print:]\pL]/u", $str, $pockets);

where
[:print:] - printing characters, including space
\pL - UTF-8 Letter
/u - UTF-8 string
other unicode character properties on http://www.pcre.org/pcre.txt
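
A related sketch that counts every character rather than only printable characters and letters: with /u, "." matches one whole UTF-8 character, and /s lets it match newlines as well. The sample string is an assumption.

```php
<?php
$str = "Grüße, мир";

// One match per UTF-8 character
$count = preg_match_all('/./su', $str, $m);

echo $count, ' characters, ', strlen($str), " bytes\n";
```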
dolbegraeb
29.01.2008 1:30
Please note that the function by "mail at SPAMBUSTER at milianw dot de" can produce invalid xhtml in some cases. I think I used it the right way, but my result looks something like this:

<img src="./img.jpg" alt="nice picture" />foo foo foo foo </img>

Correct me if I'm wrong.
I'll fix that when there's time. -.-
mr davin
12.07.2007 23:57
<?php
// Returns an array of strings where the start and end are found
function findinside($start, $end, $string) {
    preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $string, $m);
    return $m[1];
}

$start = "mary has";
$end = "lambs.";
$string = "mary has 6 lambs. phil has 13 lambs. mary stole phil's lambs. now mary has all the lambs.";

$out = findinside($start, $end, $string);

print_r ($out);

/* Results in
(
    [0] =>  6
    [1] =>  all the
)
*/
?>
phektus at gmail dot com
27.06.2007 8:22
If you'd like to include DOUBLE QUOTES on a regular expression for use with preg_match_all, try ESCAPING THRICE, as in: \\\"

For example, the pattern:
'/<table>[\s\w\/<>=\\\"]*<\/table>/'

Should be able to match:
<table>
<row>
<col align="left" valign="top">a</col>
<col align="right" valign="bottom">b</col>
</row>
</table>
.. with all there is under those table tags.

I'm not really sure why this is so, but I tried just the double quote and one or even two escape characters and it won't work. In my frustration I added another one and then it's cool.
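
A sketch that may explain the observation: inside a single-quoted PHP string a double quote needs no escaping at all, and the triple-escaped form still matches only because it adds a harmless literal backslash to the character class. The sample markup is the one from this note.

```php
<?php
$table = <<<EOT
<table>
<row>
<col align="left" valign="top">a</col>
<col align="right" valign="bottom">b</col>
</row>
</table>
EOT;

// A bare double quote inside the single-quoted pattern
$plain = preg_match_all('/<table>[\s\w\/<>="]*<\/table>/', $table, $m1);

// The triple-escaped form: PHP turns \\ into one backslash and leaves \"
// untouched, so PCRE sees \\" (a literal backslash plus a double quote)
$escaped = preg_match_all('/<table>[\s\w\/<>=\\\"]*<\/table>/', $table, $m2);

var_dump($plain, $escaped);
```

In a double-quoted PHP string the quote does have to be escaped once (\") so that it does not end the string.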
chuckie
6.12.2006 15:20
This function converts byte offsets into (UTF-8) character offsets, regardless of whether you use the /u modifier:

<?php

function mb_preg_match_all($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = PREG_PATTERN_ORDER, $pn_offset = 0, $ps_encoding = NULL) {
  // WARNING! - All this function does is to correct offsets, nothing else:
  //
  if (is_null($ps_encoding))
    $ps_encoding = mb_internal_encoding();

  $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
  $ret = preg_match_all($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

  if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
    // Use distinct loop variables; the original reused $ha_match for both levels
    foreach ($pa_matches as &$ha_group)
      foreach ($ha_group as &$ha_match)
        $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
    unset($ha_group, $ha_match); // break the references
  }
  //
  // (code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

  return $ret;
}

?>
phpnet at sinful-music dot com
20.02.2006 9:53
Here's some fleecy code to 1. validate RFC2822 conformity of address lists and 2. extract the address specification (the part commonly known as 'email'). I wouldn't suggest using it for input form email checking, but it might be just what you want for other email applications. I know it can be optimized further, but that part I'll leave up to you nutcrackers. The total length of the resulting regex is about 30000 bytes. That is because it accepts comments. You can remove that by setting $cfws to $fws, and it shrinks to about 6000 bytes. Conformity checking strictly follows RFC2822. Have fun and email me if you have any enhancements!

<?php
function mime_extract_rfc2822_address($string)
{
        //rfc2822 token setup
        $crlf           = "(?:\r\n)";
        $wsp            = "[\t ]";
        $text           = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
        $quoted_pair    = "(?:\\\\$text)";
        $fws            = "(?:(?:$wsp*$crlf)?$wsp+)";
        $ctext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F" .
                          "!-'*-[\\]-\\x7F]";
        $comment        = "(\\((?:$fws?(?:$ctext|$quoted_pair|(?1)))*" .
                          "$fws?\\))";
        $cfws           = "(?:(?:$fws?$comment)*(?:(?:$fws?$comment)|$fws))";
        //$cfws           = $fws; //an alternative to comments
        $atext          = "[!#-'*+\\-\\/0-9=?A-Z\\^-~]";
        $atom           = "(?:$cfws?$atext+$cfws?)";
        $dot_atom_text  = "(?:$atext+(?:\\.$atext+)*)";
        $dot_atom       = "(?:$cfws?$dot_atom_text$cfws?)";
        $qtext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!#-[\\]-\\x7F]";
        $qcontent       = "(?:$qtext|$quoted_pair)";
        $quoted_string  = "(?:$cfws?\"(?:$fws?$qcontent)*$fws?\"$cfws?)";
        $dtext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!-Z\\^-\\x7F]";
        $dcontent       = "(?:$dtext|$quoted_pair)";
        $domain_literal = "(?:$cfws?\\[(?:$fws?$dcontent)*$fws?]$cfws?)";
        $domain         = "(?:$dot_atom|$domain_literal)";
        $local_part     = "(?:$dot_atom|$quoted_string)";
        $addr_spec      = "($local_part@$domain)";
        $display_name   = "(?:(?:$atom|$quoted_string)+)";
        $angle_addr     = "(?:$cfws?<$addr_spec>$cfws?)";
        $name_addr      = "(?:$display_name?$angle_addr)";
        $mailbox        = "(?:$name_addr|$addr_spec)";
        $mailbox_list   = "(?:(?:(?:(?<=:)|,)$mailbox)+)";
        $group          = "(?:$display_name:(?:$mailbox_list|$cfws)?;$cfws?)";
        $address        = "(?:$mailbox|$group)";
        $address_list   = "(?:(?:^|,)$address)+";

        //output length of string (just so you see how f**king long it is)
        echo(strlen($address_list) . " ");

        //apply expression
        preg_match_all("/^$address_list$/", $string, $array, PREG_SET_ORDER);

        return $array;
}
?>
mnc at u dot nu
3.02.2006 7:05
PREG_OFFSET_CAPTURE always seems to provide byte offsets, rather than character position offsets, even when you are using the unicode /u modifier.
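
A short sketch of this observation (the sample string is an assumption): "ü" occupies two bytes in UTF-8, so the letter "b" is the third character but starts at byte offset 3, not 2.

```php
<?php
preg_match_all('/b/u', 'aüb', $matches, PREG_OFFSET_CAPTURE);

// Reported offset is counted in bytes, even with the /u modifier
echo $matches[0][0][1], "\n";
```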



The PHP manual text and comments are covered by the Creative Commons Attribution 3.0 License © the PHP Documentation Group