Russian Input For Word Count
Solution 1:
The \b notation is defined in terms of “word boundaries”, but with “word” meaning a sequence of ASCII letters, digits, and underscores (the characters matched by \w), so it cannot be used for Russian texts. A simple approach is to count sequences of Cyrillic letters instead; the range from U+0400 to U+0481 covers the Cyrillic letters used in Russian. So you could replace the lines
var matches = this.value.match(/\b/g);
wordCounts[this.id] = matches ? matches.length / 2 : 0;
by the lines
var matches = this.value.match(/[\u0400-\u0481]+/g);
wordCounts[this.id] = matches ? matches.length : 0;
You should perhaps treat a hyphen as corresponding to a letter (and therefore add \- inside the brackets), so that a hyphenated compound would be counted as one word, but this is debatable (is e.g. “жили-были” two words or one?).
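To compare the two ways of counting hyphenated compounds, here is a minimal sketch. The function name countCyrillicWords and its second parameter are hypothetical, not part of the original page. Note that the hyphen-aware pattern is a slightly stricter variation on the suggestion above: it only accepts a hyphen between two runs of letters, so a free-standing dash is not counted as a word.

```javascript
// Hypothetical helper for illustration; not from the original code.
function countCyrillicWords(text, countHyphenatedAsOne) {
  // U+0400-U+0481 is the range suggested in the answer above.
  // The hyphen-aware variant requires letters on both sides of each hyphen.
  var pattern = countHyphenatedAsOne
    ? /[\u0400-\u0481]+(?:-[\u0400-\u0481]+)*/g
    : /[\u0400-\u0481]+/g;
  var matches = text.match(pattern);
  // match() returns null when nothing matches, so fall back to 0.
  return matches ? matches.length : 0;
}

var sample = "Жили-были дед да баба";
console.log(countCyrillicWords(sample, false)); // 5
console.log(countCyrillicWords(sample, true));  // 4
```

In the page's handler, this would presumably be called as wordCounts[this.id] = countCyrillicWords(this.value, true); which of the two counts is right depends on how you want compounds treated.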
Solution 2:
The problem is in your regex: \b only recognizes boundaries around ASCII word characters, so it doesn't find word boundaries in Cyrillic text.
Try changing this:
var matches = this.value.match(/\b/g);
To this:
var matches = this.value.match(/[^\s\.\!\?]+/g);
and see if that gives a result for Cyrillic input. If it works then you no longer need to divide by 2 to get the word count.
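As a rough check of this second approach, the sketch below wraps the suggested pattern in a function; the name countTokens is hypothetical. It counts every run of characters that are not whitespace, a period, an exclamation mark, or a question mark, so it works for any script, but other punctuation is not stripped: a comma stays attached to its word and a free-standing dash is counted as a word of its own.

```javascript
// Hypothetical helper for illustration; not from the original code.
function countTokens(text) {
  // Same character class as in the answer above; inside brackets
  // the dot, "!" and "?" need no escaping.
  var matches = text.match(/[^\s.!?]+/g);
  // match() returns null when nothing matches, so fall back to 0.
  return matches ? matches.length : 0;
}

console.log(countTokens("Привет, мир! Как дела?")); // 4
console.log(countTokens("да - нет"));               // 3 (the dash is counted)
```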