Regex word matching for Chinese and Japanese characters

I know the pattern for detecting whether a string consists of Chinese characters, but that's not what I need. I need to check whether given characters are found in a string.

const words_found = (words, values) => 
 words.some(word => 
   values.match(new RegExp(word + '\\b', 'i'))
)

words_found(['james'], 'my name is james') // true

but it fails for Chinese characters:

words_found(['??'], '?????????') // false

Answers:

Answer

\b only matches at a boundary between a word character (\w) and a non-word character (\W). In JavaScript, \w covers only ASCII letters, digits, and the underscore, so every character in '?????????' counts as a non-word character, and there is no word boundary anywhere inside it or at its end. That is why '??' followed by \b never matches '?????????'. For Chinese words, a simple substring match is usually enough.
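A minimal sketch of the substring approach, assuming a case-insensitive match is still wanted. The helper name wordsFound and the sample strings are illustrative, not taken from the question:

```javascript
// Hypothetical helper: plain substring test, no regex and no \b involved.
const wordsFound = (words, value) =>
  words.some(word => value.toLowerCase().includes(word.toLowerCase()));

// Sample strings are made up for illustration.
console.log(wordsFound(['中文'], '我会说中文和日语'));  // true
console.log(wordsFound(['james'], 'my name is James')); // true
console.log(wordsFound(['韩语'], '我会说中文和日语'));  // false
```

Note that this also drops the boundary check for Latin words, so 'jam' would be found inside 'james'.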

Answer

Read the documentation for word boundaries.

A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.

where "word character" is something that matches \w (in JavaScript, ASCII letters, digits, and the underscore), and "non-word character" is something that matches \W.

Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between ? and ?, because both are non-word characters; similarly, there is no word boundary between ?? and ??, because both ? and ? are non-word characters.

With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what the concept of "word" means, and therefore no concept of "word character" or "word boundary". There are libraries which people have worked on for years, involving machine learning, to try to break text into meaningful word-like segments, and they all do it in a slightly different way. The relevant question here is why you think you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of your \\b that is forcing the match to occur right before a word boundary? What case are you trying to exclude?
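As one illustration of such segmentation, recent JavaScript engines ship the built-in Intl.Segmenter. The sketch below assumes an engine that supports it; the 'zh' locale and the sample text are assumptions, and the exact split depends on the engine's dictionary data, so no particular output is shown:

```javascript
// Assumes Intl.Segmenter is available (recent browsers and Node.js versions).
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });

// Sample text is made up for illustration.
const text = '我会说中文和日语';

// Keep only the word-like segments and extract their text.
const segments = [...segmenter.segment(text)]
  .filter(part => part.isWordLike)
  .map(part => part.segment);

console.log(segments); // the split varies between engines
```

Two engines may legitimately disagree on where the "words" are, which is the point made above.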

Using Unicode regexp properties

However, you may be able to use the new Unicode regexp character class escapes in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use

new RegExp(`${word}(?=$|\\P{Letter})`, "u")

Roughly speaking, this translates into "find the word, but only if it is followed (using look-ahead, the (?= part) by either end-of-string ($) or a character which does not have the Unicode property Letter". The "u" flag enables Unicode processing.

Of course, this will not help you find ?? as a "word" inside ?????????, because the following character ? falls into the Unicode class "Letter", and so will not match \P{Letter}.
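A small sketch of the look-ahead pattern in use. The helper names and sample strings are illustrative; the backslash before P is doubled because the pattern is built from a string, and the word is escaped so that regex metacharacters in it are matched literally:

```javascript
// Escape regex metacharacters so the word is matched literally.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical helper: the word must be followed by end-of-string or a non-letter.
const foundBeforeNonLetter = (word, value) =>
  new RegExp(`${escapeRegExp(word)}(?=$|\\P{Letter})`, 'iu').test(value);

console.log(foundBeforeNonLetter('中文', '我会说中文'));         // true: end of string
console.log(foundBeforeNonLetter('中文', '我会说中文。'));       // true: '。' is punctuation
console.log(foundBeforeNonLetter('中文', '我会说中文和日语'));   // false: '和' is a letter
console.log(foundBeforeNonLetter('james', 'my name is james')); // true
```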

By the way, to match any "non-word" symbol in Unicode, you can use (again with the "u" flag):

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
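To see how this class differs from the ASCII-only \W, here is a short sketch. The constant name NON_WORD and the sample characters are illustrative:

```javascript
// The character class from above, compiled with the "u" flag (the name is hypothetical).
const NON_WORD = /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u;

console.log(NON_WORD.test('\u00e9')); // false: 'é' is alphabetic
console.log(NON_WORD.test('中'));     // false: Han characters are alphabetic too
console.log(NON_WORD.test('。'));     // true: punctuation
console.log(NON_WORD.test(' '));      // true: whitespace

// For comparison, the ASCII-only \W treats 'é' as a non-word character.
console.log(/\W/.test('\u00e9'));     // true
```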
