Javascript RegExp + Word boundaries + unicode characters

I am building a search feature and plan to add JavaScript autocomplete to it. I am from Finland, so I have to deal with the Finnish special characters ä, ö and å.

When the user types text into the search input field, I try to match the text against my data.

Here is a simple example that does not work correctly if the user types, for example, "ää". The same happens with "äl".

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Does not work
var searchterm = "äl";

// does not work
//var searchterm = "ää";

// Works
//var searchterm = "wi";

if ( new RegExp("\\b"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

http://jsfiddle.net/7TsxB/

So how can I get the ä, ö and å characters to work with JavaScript regex?

I think I should use Unicode code points, but how should I do that? The codes for those characters are: [\u00C4,\u00E4,\u00C5,\u00E5,\u00D6,\u00F6]

=> ÄäÅåÖö

Answers:

Answer

This question is old, but I think I found a better solution for word boundaries in regular expressions with Unicode letters. Using the XRegExp library, you can implement a valid \b boundary by expanding this:

XRegExp('(?=^|$|[^\\p{L}])')

The resulting expression is more than 4000 characters long, but it seems to perform quite well.

Some explanation: (?= ) is a zero-length lookahead that looks for the beginning or end of the string or a non-letter Unicode character. The most important thing is the lookahead, because \b doesn't capture anything: it is simply true or false.

Answer

I would recommend using XRegExp when you have to work with a specific set of Unicode characters. The author of this library has mapped all kinds of regional character sets, which makes working with different languages easier.

Answer

My idea is to search with codes representing the Finnish letters

new RegExp("\\b"+asciiOnly(searchterm), "gi").test(asciiOnly(title))

My original idea was to use plain encodeURI but the % sign seemed to interfere with the regexp.

http://jsfiddle.net/7TsxB/5/

I wrote a crude function that uses encodeURI to encode every character with a code over 128, removing its % signs and adding 'QQ' at the beginning. It is not the best marker, but I couldn't get a non-alphanumeric marker to work.
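The function itself is only in the fiddle, so the following is a hypothetical reconstruction based on the description above, not the author's actual code. It assumes that every non-ASCII character is replaced by 'QQ' followed by its percent-encoded bytes with the % signs stripped.

```javascript
// Hypothetical reconstruction of asciiOnly(); the original is only in the fiddle.
function asciiOnly(str) {
  // Array.from iterates by code point, so each character is encoded as a whole.
  return Array.from(str, function (ch) {
    if (ch.codePointAt(0) < 128) return ch;
    // encodeURI("ä") gives "%C3%A4"; drop the % signs and prefix the QQ marker.
    return "QQ" + encodeURI(ch).replace(/%/g, "");
  }).join("");
}

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Both sides are encoded the same way, so \b only ever sees ASCII word characters.
var found = new RegExp("\\b" + asciiOnly("äl"), "i").test(asciiOnly(title));
var notFound = new RegExp("\\b" + asciiOnly("mä"), "i").test(asciiOnly(title));
```

Note that the i flag no longer helps with the encoded letters: "Ä" and "ä" encode to different byte sequences, so lowercasing both strings before encoding would be needed for case-insensitive matching.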

Answer

I had a similar problem, but I had to replace an array of terms. None of the solutions I found worked when two terms were next to each other in the text (because their boundaries overlapped). So I had to use a slightly modified approach:

var text = "Ještě. že; \"už\" à. Fürs, 'anlässlich' že že že.";
var terms = ["à","anlässlich","Fürs","už","Ještě", "že"];
var replaced = [];
var order = 0;
// Wrap each term in explicit delimiter groups, because \b does not work with these letters.
for (var i = 0; i < terms.length; i++) {
    terms[i] = "(^\|[ \n\r\t.,;'\"\+!?-])(" + terms[i] + ")([ \n\r\t.,;'\"\+!?-]+\|$)";
}
var re = new RegExp(terms.join("|"), "");
// Replace one match per pass so that shared delimiters are re-examined each time.
while (true) {
    var replacedString = "";
    text = text.replace(re, function replacer(match){
        var beginning = match.match("^[ \n\r\t.,;'\"\+!?-]+");
        if (beginning == null) beginning = "";
        var ending = match.match("[ \n\r\t.,;'\"\+!?-]+$");
        if (ending == null) ending = "";
        replacedString = match.replace(beginning,"");
        replacedString = replacedString.replace(ending,"");
        replaced.push(replacedString);
        return beginning+"{{"+order+"}}"+ending;
    });
    if (replacedString == "") break;
    order += 1;
}

See the code in a fiddle: http://jsfiddle.net/antoninslejska/bvbLpdos/1/

The regular expression is inspired by: http://breakthebit.org/post/3446894238/word-boundaries-in-javascripts-regular

I can't say that I find the solution elegant...

Answer

What you are looking for is the Unicode word boundaries standard:

http://unicode.org/reports/tr29/tr29-9.html#Word_Boundaries

There is a JavaScript implementation here (unicodejs.wordbreak.js):

https://github.com/wikimedia/unicodejs
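As a rough illustration of segmenting by Unicode word boundaries, here is a minimal sketch that uses the built-in Intl.Segmenter instead of unicodejs. This is a different tool from the library linked above and requires a reasonably modern browser or Node.js; the helper names wordsOf and anyWordStartsWith are made up for this example.

```javascript
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Hypothetical helper: split text into words using the engine's word segmentation.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

// Hypothetical helper: true when any word in the text starts with the term.
function anyWordStartsWith(term, text) {
  const needle = term.toLowerCase();
  return wordsOf(text, "fi").some((w) => w.toLowerCase().startsWith(needle));
}
```

For the autocomplete case in the question, anyWordStartsWith("äl", title) would replace the \b-based regex test entirely.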

Answer

There appears to be a problem with the word boundary \b in regular expressions when the word starts with a character outside the ASCII range.

Instead of using \b, try using (?:^|\\s)

var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";
// Now matches
var searchterm = "äl";

// Also matches
//var searchterm = "ää";

// Still matches
//var searchterm = "wi";

if ( new RegExp("(?:^|\\s)"+searchterm, "gi").test(title) ) {
    $("#result").html("Match: ("+searchterm+"): "+title);
} else {
    $("#result").html("nothing found with term: "+searchterm);   
}

Breakdown:

(?: parentheses () form a capture group in regex. Parentheses that start with a question mark and colon ?: form a non-capturing group; they just group the terms together

^ the caret symbol matches the beginning of a string

| the bar is the "or" operator.

\s matches whitespace (appears as \\s in the string because we have to escape the backslash)

) closes the group

So instead of using \b, which matches word boundaries and doesn't work for unicode characters, we use a non-capturing group which matches the beginning of a string OR whitespace.

Answer

The \b assertion in JavaScript RegExp is really only useful with plain ASCII text. \b is shorthand for the boundary between the \w and \W sets, or between \w and the beginning or end of the string. These character sets only take ASCII "word" characters into account: \w is equal to [a-zA-Z0-9_] and \W is the negation of that class.

This makes the RegEx character classes largely useless for dealing with any real language.

\s should work for what you want to do, provided that search terms are only delimited by whitespace.

Answer

I noticed something really weird with \b when using Unicode:

/\bo/.test("pop"); // false (obviously)
/\bä/.test("päp"); // true (what..?)

/\Bo/.test("pop"); // true
/\Bä/.test("päp"); // false (what..?)

It appears that meaning of \b and \B are reversed, but only when used with non-ASCII Unicode? There might be something deeper going on here, but I'm not sure what it is.

In any case, it seems that the word boundary is the issue, not the Unicode characters themselves. Perhaps you should just replace \b with (^|[\s\\/_&-]), as that seems to work correctly; the hyphen goes last in the class so that it is not read as a range. (Make your list of symbols more comprehensive than mine, though.)
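A minimal sketch of that delimiter-based replacement for \b, applied to the title from the question. The helper name startsWord is made up here, and the delimiter list is only an example that would need extending for real input.

```javascript
var title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Hypothetical helper: true when `term` appears at the start of the text
// or directly after one of the listed delimiter characters.
function startsWord(term, text) {
  // Regex source: (?:^|[\s\\/_&-])  with the hyphen last so it stays literal.
  var prefix = "(?:^|[\\s\\\\/_&-])";
  // Assumption: `term` contains no regex metacharacters; escape it otherwise.
  return new RegExp(prefix + term, "i").test(text);
}

var a = startsWord("äl", title); // "älkää" follows a space
var b = startsWord("ää", title); // "ääkköstesti" follows a space
var c = startsWord("kk", title); // "kk" only occurs inside a word
```

Because the prefix is a real match rather than a zero-width assertion, the matched text includes the delimiter, which matters if you later highlight or replace the match.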

Answer

\b is a shortcut for the transition between a letter and a non-letter character, or vice-versa.

Updating and improving on max_masseti's answer:

Since ES2018, regular expressions with the /u flag support Unicode property escapes: \p{L} represents any Unicode letter, and \P{L} (notice the uppercase P) represents anything but a letter. ES2018 also added lookbehind assertions. (The /u flag itself dates from ES2015.)

EDIT: Previous version was incomplete.

As such:

const text = 'A Fé, o Império, e as terras viciosas';

text.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u);

// ['A', ' ', 'Fé', ', ', 'o', ' ', 'Império', ', ', 'e', ' ', 'as', ' ', 'terras', ' ', 'viciosas']

We're using a lookbehind (?<=...) to find a letter and a lookahead (?=...) to find a non-letter, or vice versa.
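The same lookaround idea can be applied to the prefix search from the question. This is a minimal sketch, assuming an engine with lookbehind and Unicode property escapes; the helper name matchesWordStart is made up for illustration.

```javascript
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Hypothetical helper: true when `term` occurs at the start of a word.
function matchesWordStart(term, text) {
  // (?<!\p{L}) asserts that no Unicode letter immediately precedes the term.
  // The u flag is required for \p{L}; i makes the match case-insensitive.
  // Assumption: `term` contains no regex metacharacters; escape it otherwise.
  return new RegExp("(?<!\\p{L})" + term, "iu").test(text);
}

const startsWithAl = matchesWordStart("äl", title); // "älkää"
const startsWithAa = matchesWordStart("ää", title); // "ääkköstesti"
const startsWithMa = matchesWordStart("mä", title); // only inside "tämä"
```

A negative lookbehind is used instead of (?<=^|\P{L}) because it covers the start of the string without needing a separate alternative.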

Answer

The correct answer to the question is given by andrefs. I will only rewrite it more clearly, after putting all required things together.

For ASCII text, you can use \b for matching a word boundary both at the start and the end of a pattern. When using Unicode text, you need to use 2 different patterns for doing the same:

  • Use (?<=^|\P{L}) for matching the start or a word boundary before the main pattern.
  • Use (?=\P{L}|$) for matching the end or a word boundary after the main pattern.
  • Additionally, pass the i flag (together with the u flag, which \P{L} requires) to make all those matchings case-insensitive. JavaScript does not accept a leading (?i) inside the pattern.

So the resulting answer is: (?<=^|\P{L})xxx(?=\P{L}|$) with the flags iu, where xxx is your main pattern. This would be the equivalent of \bxxx\b with the i flag for ASCII text.

For your code to work, you now need to do the following:

  • Assign to your variable "searchterm", the pattern or words you want to find.
  • Escape the variable's contents. For example, replace '\' with '\\' and also do the same for any reserved special character of regex, like '\^', '\$', '\/', etc. Check here for a question on how to do this.
  • Insert the variable's contents to the pattern above, in the place of "xxx", by simply using the string.replace() method.
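The steps above can be sketched as follows, assuming JavaScript's i and u flags are used rather than an inline modifier. The helper names escapeRegExp and wholeWordRegex are made up for this example, and the pattern is built by concatenation instead of string.replace().

```javascript
const title = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

// Escape every regex syntax character so the search term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a whole-word, case-insensitive, Unicode-aware regex.
function wholeWordRegex(searchterm) {
  return new RegExp(
    "(?<=^|\\P{L})" + escapeRegExp(searchterm) + "(?=\\P{L}|$)",
    "iu"
  );
}

const wholeWord = wholeWordRegex("tämä").test(title);    // a complete word
const prefixOnly = wholeWordRegex("äl").test(title);     // only a prefix of "älkää"
const upperCase = wholeWordRegex("ÄLKÄÄ").test(title);   // differs only in case
```

For an autocomplete that should match word prefixes, the trailing (?=\P{L}|$) part can simply be left out.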

Answer

Despite the fact that the issue seems to be 8 years old, I ran into a similar problem (I had to match Cyrillic letters) not long ago. I spent a whole day on this and could not find any appropriate answer here on StackOverflow. So, to save others a lot of effort, I'd like to share my solution.

Yes, \b word boundary works only with Latin letters (Word boundary: \b):

Word boundary \b doesn’t work for non-Latin alphabets

The word boundary test \b checks that there should be \w on the one side from the position and "not \w" – on the other side. But \w means a Latin letter a-z (or a digit or an underscore), so the test doesn’t work for other characters, e.g. Cyrillic letters or hieroglyphs.

Yes, JavaScript's RegExp implementation has only limited Unicode support.

So I tried implementing my own word boundary with support for non-Latin characters. To make the word boundary work with Cyrillic characters only, I created this regular expression:

new RegExp(`(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,'gi')

Where \u0400-\u04ff is a range of Cyrillic characters provided in the table of codes. It is not an ideal solution, however, it works properly in most cases.

To make it work in your case, you just have to pick up an appropriate range of codes from the list of Unicode characters.

To try out my example run the code snippet below.

function getMatchExpression(cyrillicSearchValue) {
  return new RegExp(
    `(?<![\u0400-\u04ff])${cyrillicSearchValue}(?![\u0400-\u04ff])`,
    'gi',
  );
}

// Illustrative Cyrillic sample sentence (stand-in text).
const sentence = 'Привет, мир! Это простой пример поиска слова в тексте';

console.log(sentence.match(getMatchExpression('пример')));
// expected output: ["пример"]


console.log(sentence.match(getMatchExpression('при')));
// expected output: null
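
Adapting the same pattern to the question's Finnish text might look like the sketch below. The letter set (ASCII letters plus ä, ö, å in both cases) and the helper name getFinnishMatchExpression are assumptions for this example, not part of the original snippet.

```javascript
// Assumption: these characters cover every letter that can occur in the text.
const FINNISH_LETTER = "[a-zA-Z\u00C4\u00E4\u00C5\u00E5\u00D6\u00F6]";

// Hypothetical helper: match searchValue only when no letter touches it on either side.
function getFinnishMatchExpression(searchValue) {
  return new RegExp(
    `(?<!${FINNISH_LETTER})${searchValue}(?!${FINNISH_LETTER})`,
    "gi"
  );
}

const finnishTitle = "this is simple string with finnish word tämä on ääkköstesti älkää ihmetelkö";

const wholeWordHit = finnishTitle.match(getFinnishMatchExpression("tämä"));
const partialHit = finnishTitle.match(getFinnishMatchExpression("äl"));
```

Dropping the trailing negative lookahead turns this into a word-prefix match, which is closer to what an autocomplete needs.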
