Regular expression with foreign characters.
                            
I am working on a search engine that handles utf-8 encoded text in any language.
Everything is working so far: the search term is received from the user, passed on to the database, and matching rows are returned to the browser, in utf-8 all the way.
Typing certain foreign characters such as ñ, é and ô also matches n, e and o, and vice versa (using MySQL's LIKE operator).
The problem appears when I try to highlight the search terms in the resulting page.
This is done using PHP's preg_replace function, and in this case ñ only matches ñ, not n, just as é matches é but not e, and so on. As a result, some of the rows found won't have anything highlighted.
Is there a way to make the regex insensitive to these differences (similar to the way the i modifier makes it case-insensitive, i.e. n also matches N)?
I have tried using the u modifier (for utf-8) but it did not seem to have any effect.
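To make the mismatch concrete, here is a minimal sketch of what I am seeing. The sample text and variable names are made up for illustration, and it assumes the script file itself is saved as utf-8 with precomposed accented characters:

```php
<?php
// Illustrative sample text (an assumption, not real search data).
$text = "Ponce de Léon";

// The i and u modifiers fold case, but they do not fold accents.
$plain    = preg_match('/leon/iu', $text); // plain "e" does not match "é"
$accented = preg_match('/léon/iu', $text); // accented term matches, case aside

var_dump($plain);    // int(0)
var_dump($accented); // int(1)
```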
Please help me here!
Jakob
            
Status: Open    Apr 19, 2007 - 11:47 AM
PHP, regex, regexp, regular expressions, UTF-8, web development

2 answers
    May 19, 2007 - 04:21 AM
    From other forums and experts I've learned that there is no way to do what I wanted with a simple regex modifier.
However, I've found a solution that is not very complicated.
Here is a detailed description of my PHP approach:
// First I make a string of characters grouped together, which should be treated as equivalent
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";
// The groups are split into an array and each group is processed
$equiv = explode(",", $equiv);
// Regex metacharacters in the search term are escaped first, so they are matched literally
$term = preg_quote($term, "/");
foreach ($equiv as $e)
{
    // If any character of a group is found in my search term, it is replaced by the
    // entire group (in [] brackets) before matching the search term against the search result text
    // I use the /u modifier because my document is utf-8 encoded
    $term = preg_replace("/[$e]/iu", "[$e]", $term);
}
// The modified search term will now match similar terms of the search result text $str
// and wrap them in a 'highlighting' tag (the tag did not survive in this post,
// so <b> is used here only as a stand-in)
$str = preg_replace("/$term/iu", "<b>$0</b>", $str);
Example:
- term = "leon"
- "leon" will not match "léon"
- therefore "leon" will be substituted with "l[eéèêë]on"
- "l[eéèêë]on" will now match "léon"
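The steps in this example can also be wrapped into two small functions. This is only a sketch under some assumptions: the function names `accentInsensitivePattern` and `highlight` are hypothetical, the `<b>` tag is a placeholder for whatever highlighting markup the page uses, the type declarations need PHP 7 or later, and the equivalence groups are the ones listed above, so characters outside those groups are still matched literally.

```php
<?php
// Hypothetical helper: turns a search term into an accent-tolerant regex.
function accentInsensitivePattern(string $term): string
{
    // Equivalence groups taken from the answer above.
    $groups = explode(",", "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç");

    // Escape regex metacharacters typed by the user before adding brackets.
    $pattern = preg_quote($term, "/");

    foreach ($groups as $group) {
        // Replace any member of a group with a character class of the whole group.
        $pattern = preg_replace("/[" . $group . "]/iu", "[" . $group . "]", $pattern);
    }

    return "/" . $pattern . "/iu";
}

// Hypothetical helper: wraps every match of $term in $text in a placeholder tag.
function highlight(string $text, string $term): string
{
    // An empty term would match at every position, so leave the text alone.
    if ($term === "") {
        return $text;
    }
    return preg_replace(accentInsensitivePattern($term), '<b>$0</b>', $text);
}

echo accentInsensitivePattern("leon"), "\n"; // /l[eéèêë][oòóôõö][nñ]/iu
echo highlight("Ponce de Léon", "leon"), "\n"; // Ponce de <b>Léon</b>
```

Note that every letter belonging to a group gets expanded, not just the e, which is why the o and n of "leon" also become character classes. The sketch does not HTML-escape the text; that would have to happen separately before highlighting.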
Hope it's useful :-)
Jakob
        
    
 
 
                            
        
    May 19, 2007 - 04:22 AM
    Oops, I see that some of my characters were not displayed correctly by quomon, so please ignore the garbled parts. The line should read:
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";
The stray numbers should have been shown as foreign characters.
        
    
 
 
                
             
         