Question Details    

   Question

Time: 14:47 - Apr 19, 2007     Asked by: jgivoni      Status: Answered      Points: 250   

Regular expression with foreign characters.

I am working on a search engine handling utf-8 encoded text in any language.
Everything is working so far: the search term is received from the user, passed on to the database, and matching rows are returned to the browser, in utf-8 all the way.

Typing certain foreign characters such as ñ, é and ô also matches any n, e and o, and vice versa (using MySQL's LIKE operator).

The problem appears when I try to highlight the search terms in the resulting page.
This is done using PHP's preg_replace function, and in this case ñ only matches ñ, not n; likewise é matches é but not e, and so on. The result is that some of the rows found won't have anything highlighted.

Is there a way to make the regex insensitive to these differences (in the same way that the i modifier makes it case-insensitive, i.e. n also matches N)?
I have tried using the u modifier (for utf-8) but it did not seem to have any effect.
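
A minimal sketch of the behaviour described above, assuming the script file itself is saved as utf-8. The sample strings are only illustrations; the point is that the i modifier folds case and the u modifier makes the pattern and subject be read as utf-8, but neither one treats é as equal to e.

```php
<?php
// /i folds case, /u switches on utf-8 handling; neither folds accents.
var_dump(preg_match('/leon/iu', 'LEON')); // int(1): only the case differs
var_dump(preg_match('/leon/iu', 'léon')); // int(0): é is not treated as e

// What /u does change: é counts as one character instead of two bytes.
var_dump(preg_match('/l.on/u', 'léon'));  // int(1)
var_dump(preg_match('/l.on/', 'léon'));   // int(0)
```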

Please help me here!

Jakob

Answer Discussion
From other forums and experts I've learned that there is no way to do what I wanted with a simple regex modifier.
However, I've found a solution that is not very complicated.

Here is a detailed description of my PHP approach:

// First I make a string of characters grouped together, which should be treated as equivalent
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";

// The groups are split into an array and each group is processed
$equiv = explode(",", $equiv);
foreach ($equiv as $e)
{
    // If any character of a group is found in my search term, it is replaced by the
    // entire group (in [] brackets) before matching the search term against the search result text
    // I use the /u modifier because my document is utf-8 encoded
    $term = preg_replace("/[$e]/iu", "[$e]", $term);
}

// The modified search term will now match similar terms of the search result text $str
// and wrap them in a 'highlighting' tag
$str = preg_replace("/$term/iu", "<span class='highlight'>$0</span>", $str);

Example:
- term = "leon"
- "leon" will not match "léon"
- therefore the e in "leon" will be substituted with its group, giving "l[eéèêë]on" (the o and n are expanded in the same way)
- "l[eéèêë]on" will now match "léon"
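
The steps above can be wrapped into a pair of functions. This is only a sketch under a few assumptions: the function names accent_insensitive_pattern and highlight are hypothetical, the script is saved as utf-8, and the call to preg_quote is an addition that is not in the code above (it keeps characters such as . or ? in the user's search term from being read as regex syntax).

```php
<?php
// Hypothetical helper: turns a search term into an accent-tolerant pattern.
function accent_insensitive_pattern(string $term): string
{
    $groups = ['aàáâãäå', 'eéèêë', 'iìíîï', 'oòóôõö', 'uùúûü', 'yýÿ', 'nñ', 'cç'];

    // Assumption: escape regex metacharacters in the user's input first.
    $pattern = preg_quote($term, '/');

    foreach ($groups as $g) {
        // The groups share no characters, so a later pass never rewrites
        // a bracket expression inserted by an earlier one.
        $pattern = preg_replace("/[$g]/iu", "[$g]", $pattern);
    }
    return "/$pattern/iu";
}

// Hypothetical helper: wraps every match of $term in $text in a highlight tag.
function highlight(string $text, string $term): string
{
    if ($term === '') {
        return $text; // an empty pattern would match at every position
    }
    return preg_replace(
        accent_insensitive_pattern($term),
        "<span class='highlight'>$0</span>",
        $text
    );
}

echo highlight('Léon met leon', 'leon'), "\n";
// <span class='highlight'>Léon</span> met <span class='highlight'>leon</span>
```

For the term "leon" the generated pattern is /l[eéèêë][oòóôõö][nñ]/iu, so both spellings in the sample text are wrapped.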

Hope it's useful :-)
Jakob

Expert: jgivoni      Date: May 19, 2007      Time: 07:21

Oops, I see that some of my characters were not rendered correctly by Quomon when I posted this; numeric codes were shown in place of some of the foreign characters. For reference, the line is meant to read:
$equiv = "aàáâãäå,eéèêë,iìíîï,oòóôõö,uùúûü,yýÿ,nñ,cç";
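
As a small illustration of how such numeric codes map back to characters, here is a sketch using PHP's html_entity_decode. The code 237 comes from the post as it was displayed; 253 is an assumption about the code that was cut off in the y group.

```php
<?php
// Numeric character references name Unicode code points.
// 237 is í; 253 (assumed here for the truncated code) is ý.
echo html_entity_decode('i&#237; y&#253;', ENT_QUOTES, 'UTF-8'), "\n"; // ií yý
```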

Expert: jgivoni      Date: May 19, 2007      Time: 07:22

Question Answered

This question has been answered, and points have been awarded to the following experts:

jgivoni: 250
