Regular expression with foreign characters.
I am working on a search engine handling utf-8 encoded text in any language.
Everything is working so far: the search term is received from the user, passed on to the database, and matching rows are returned to the browser, in UTF-8 all the way.
Typing certain foreign characters such as ñ, é and ô also matches n, e and o, and vice versa (using MySQL's LIKE operator).
The problem appears when I try to highlight the search terms in the resulting page.
This is done using PHP's preg_replace function, and there ñ only matches ñ, not n, just as é matches é but not e, and so on. As a result, some of the rows found have nothing highlighted.
Is there a way to make the regex insensitive to these differences (similar to the way the i modifier makes it case-insensitive, so that n also matches N)?
I have tried using the u modifier (for utf-8) but it did not seem to have any effect.
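The mismatch can be reproduced with a short script. This is only a sketch: the sample string and variable names are made up for illustration and are not taken from the search engine itself.

```php
<?php
// Sample text, made up for illustration.
$text = "El niño de León";

// The u modifier tells PCRE that pattern and subject are UTF-8;
// it does not make accented and unaccented letters equivalent.
$plain    = preg_match('/nino/iu', $text);  // 0: "n" does not match "ñ"
$accented = preg_match('/niño/iu', $text);  // 1: the exact accented form matches

var_dump($plain, $accented);
```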
Please help me here!
Jakob
Status:
Open Apr 19, 2007 - 11:47 AM
PHP, regex, regexp, regular expressions, UTF-8, web development
2 answers
Answers
May 19, 2007 - 04:21 AM
From other forums and experts I've learned that there is no way to do what I wanted with a simple regex modifier.
However, I've found a solution that is not very complicated.
Here is a detailed description of my PHP approach:
// First I make a string of character groups; the characters within a group should be treated as equivalent
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";
// Escape regex metacharacters in the search term so they are matched literally
$term = preg_quote($term, "/");
// The groups are split into an array and each group is processed
$equiv = explode(",", $equiv);
foreach ($equiv as $e)
{
    // If any character of a group is found in my search term, it is replaced by the
    // entire group (in [] brackets) before matching the search term against the search result text
    // I use the /u modifier because my document is utf-8 encoded
    $term = preg_replace("/[$e]/iu", "[$e]", $term);
}
// The modified search term will now match similar terms of the search result text $str
// and wrap them in a 'highlighting' tag (the tag itself was lost when posting; <b> is a stand-in)
$str = preg_replace("/$term/iu", "<b>$0</b>", $str);
Example:
- term = "leon"
- "leon" will not match "léon"
- therefore "leon" will be substituted with "l[eéèêë]on"
- "l[eéèêë]on" will now match "léon"
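The same idea can be wrapped in a function, as a rough sketch. The function name `highlightTerm`, the `<mark>` tag and the empty-term guard are assumptions added for illustration; they are not part of the approach as originally posted. The term is passed through `preg_quote` first so that characters such as `+` or `.` in user input are matched literally.

```php
<?php
// Hypothetical helper: the name, the <mark> tag and the guard below are
// illustrative choices, not taken from the original post.
function highlightTerm(string $term, string $text): string
{
    if ($term === '') {
        return $text; // an empty pattern would match at every position
    }

    // Characters within one group are treated as equivalent.
    $groups = ['aàáâãäå', 'eéèêë', 'iìíîï', 'oòóôõö', 'uùúûü', 'yýÿ', 'nñ', 'cç'];

    // Escape regex metacharacters so the term is matched literally.
    $pattern = preg_quote($term, '/');

    foreach ($groups as $group) {
        // Replace any member of the group with a class containing the whole group.
        $pattern = preg_replace("/[$group]/iu", "[$group]", $pattern);
    }

    // $0 is the matched text as it appears in $text, accents included.
    return preg_replace("/$pattern/iu", '<mark>$0</mark>', $text);
}

echo highlightTerm('leon', 'Visit Léon and LEON.');
// prints: Visit <mark>Léon</mark> and <mark>LEON</mark>.
```

Because the replacement uses `$0`, the text keeps its original spelling and only gains the surrounding tag.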
Hope it's useful :-)
Jakob
May 19, 2007 - 04:22 AM
Oops, I see that some of my characters were not rendered correctly by Quomon, so please ignore any stray numbers in this line. It should read:
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";
The numbers should have been shown as accented characters (í and ý).