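Note

This article assumes some familiarity with both C++ and Java.

Background

Sometimes we want to compare two strings without regard to case. For example, in that setting we treat “BanAna” and “baNaNA” as equivalent.

One approach is:

1. Convert both strings to lowercase (or both to uppercase);

2. Check whether the two converted strings are identical.

Here is a C++ example: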

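// C++ example that we often use

#include <ctype.h>    // declares ::tolower and ::toupper
#include <algorithm>  // std::transform
#include <iostream>
#include <string>
using namespace std;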

bool testIgnoreCase(string str1, string str2){
    transform(str1.begin(),str1.end(),str1.begin(),::tolower);
    transform(str2.begin(),str2.end(),str2.begin(),::tolower);

    //Or
    //transform(str1.begin(),str1.end(),str1.begin(),::toupper);
    //transform(str2.begin(),str2.end(),str2.begin(),::toupper);

    cout<<str1<<" "<<str2<<endl;//apple apple
    return str1 == str2;
}

int main()
{
    string str1 = "ApplE";
    string str2 = "apPle";
    cout<<testIgnoreCase(str1,str2);//1
    return 0;
}

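The code above converts both strings to lowercase and then compares them. Converting both to uppercase first works just as well.

It looks like the job is done.

But is this approach really rigorous?

Consider the following two examples: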

//C++ example1

bool testIgnoreCase(string str1, string str2){
    transform(str1.begin(),str1.end(),str1.begin(),::tolower);
    transform(str2.begin(),str2.end(),str2.begin(),::tolower);

    //Or
    //transform(str1.begin(),str1.end(),str1.begin(),::toupper);
    //transform(str2.begin(),str2.end(),str2.begin(),::toupper);

    cout<<str1<<" "<<str2<<endl;//ı i
    return str1 == str2;
}

int main()
{
    string str1 = "ı";// Unicode code point 305 (U+0131), outside the ASCII range
    string str2 = "I";// the ordinary uppercase letter I
    cout<<testIgnoreCase(str1,str2);//0
    return 0;
}

//C++ example2

bool testIgnoreCase(string str1, string str2){
    //transform(str1.begin(),str1.end(),str1.begin(),::tolower);
    //transform(str2.begin(),str2.end(),str2.begin(),::tolower);

    //Or
    transform(str1.begin(),str1.end(),str1.begin(),::toupper);
    transform(str2.begin(),str2.end(),str2.begin(),::toupper);

    cout<<str1<<" "<<str2<<endl;//İ I
    return str1 == str2;
}

int main()
{
    string str1 = "İ";// Unicode code point 304 (U+0130), outside the ASCII range
    string str2 = "i";// the ordinary lowercase letter i
    cout<<testIgnoreCase(str1,str2);//0
    return 0;
}

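As the two examples show, neither converting everything to lowercase nor converting everything to uppercase before comparing is rigorous. The main reason is that we did not account for characters outside the ASCII range.

The examples above involve four characters in total: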

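i   the ordinary lowercase letter i, ASCII = 105
I   the ordinary uppercase letter I, ASCII = 73
ı   dotless lowercase i, Unicode = 305
İ   uppercase I with a dot above, Unicode = 304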
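To make "outside the ASCII range" concrete, here is a small Java sketch that prints the UTF-8 bytes of the four characters. It assumes the C++ strings in the examples hold UTF-8, which the article does not state; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Returns the UTF-8 encoding of s as space-separated hex bytes.
    static String hexBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes("i"));      // 69
        System.out.println(hexBytes("I"));      // 49
        System.out.println(hexBytes("\u0131")); // C4 B1  (dotless i, two bytes)
        System.out.println(hexBytes("\u0130")); // C4 B0  (I with dot above, two bytes)
    }
}
```

Under that assumption, each of the two non-ASCII characters occupies two bytes, both above 0x7F, so a conversion that looks at one byte at a time has no single-byte letter to map them to.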

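This raises a question: do these four characters belong to one family? In other words, are they really meant to be treated as equivalent? If they are not, then the behavior above is not a problem at all.

The answer depends on the language, and this is where the two languages differ:

In Java, the case conversions among them are related as follows: toLowerCase maps İ (304) to i (105), toUpperCase maps ı (305) to I (73), and i and I convert to each other as usual.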
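The relationships among the four characters in Java can be checked directly with Character.toUpperCase and Character.toLowerCase. The sketch below is only a quick probe; the class name and the describe helper are invented for this illustration.

```java
import java.util.Locale;

class CaseMappingDemo {
    // Formats a char together with its one-to-one upper- and lowercase mappings.
    static String describe(char c) {
        return String.format(Locale.ROOT, "U+%04X upper=U+%04X lower=U+%04X",
                (int) c,
                (int) Character.toUpperCase(c),
                (int) Character.toLowerCase(c));
    }

    public static void main(String[] args) {
        char[] family = {'i', 'I', '\u0131', '\u0130'};
        for (char c : family) {
            System.out.println(describe(c));
        }
        // U+0069 upper=U+0049 lower=U+0069
        // U+0049 upper=U+0049 lower=U+0069
        // U+0131 upper=U+0049 lower=U+0131
        // U+0130 upper=U+0130 lower=U+0069
    }
}
```

Note that ı reaches I only through toUpperCase, and İ reaches i only through toLowerCase, so neither direction alone ties all four characters together.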

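In C++, on the other hand, ::tolower and ::toupper do not treat these characters as equivalent. This means that even if you write the following (convert to lowercase first and, if the strings still differ, convert to uppercase and compare again; doing uppercase first and lowercase second is the same idea):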

bool testIgnoreCase(string str1, string str2){
    transform(str1.begin(),str1.end(),str1.begin(),::tolower);
    transform(str2.begin(),str2.end(),str2.begin(),::tolower);
    if(str1 == str2) {
        return true;
    }
    transform(str1.begin(),str1.end(),str1.begin(),::toupper);
    transform(str2.begin(),str2.end(),str2.begin(),::toupper);
    return str1 == str2;
}

it will not help at all.

So how does Java implement IgnoreCase?

Look at the source of Java's equalsIgnoreCase() function:

//Java

    
public boolean equalsIgnoreCase(String anotherString) {
    return (this == anotherString) ? true
            : (anotherString != null)
            && (anotherString.value.length == value.length)
            && regionMatches(true, 0, anotherString, 0, value.length);
}

public boolean regionMatches(boolean ignoreCase, int toffset,
        String other, int ooffset, int len) {
    char ta[] = value;
    int to = toffset;
    char pa[] = other.value;
    int po = ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0)
            || (toffset > (long)value.length - len)
            || (ooffset > (long)other.value.length - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion.  So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}

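As you can see, Java's case-insensitive comparison first converts the characters to uppercase; for characters that still differ, it then converts those uppercase forms to lowercase and compares again. This amounts to an extra layer of protection.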
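The per-character logic quoted above can be reproduced in a few lines to see which step rescues which pair. This is a simplified sketch of the same checks, not the JDK code itself; the class and helper names are hypothetical.

```java
class IgnoreCaseDemo {
    // Same three checks as the loop body of regionMatches with ignoreCase == true.
    static boolean charsMatchIgnoreCase(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        if (u1 == u2) {
            return true;
        }
        return Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    // The single-conversion approaches, for comparison.
    static boolean upperOnly(char c1, char c2) {
        return Character.toUpperCase(c1) == Character.toUpperCase(c2);
    }

    static boolean lowerOnly(char c1, char c2) {
        return Character.toLowerCase(c1) == Character.toLowerCase(c2);
    }

    public static void main(String[] args) {
        char dotlessI = '\u0131'; // ı
        char dottedI = '\u0130';  // İ
        System.out.println(lowerOnly(dotlessI, 'I'));            // false
        System.out.println(upperOnly(dotlessI, 'I'));            // true
        System.out.println(upperOnly(dottedI, 'i'));             // false
        System.out.println(lowerOnly(dottedI, 'i'));             // true
        System.out.println(charsMatchIgnoreCase(dotlessI, 'I')); // true
        System.out.println(charsMatchIgnoreCase(dottedI, 'i'));  // true
        System.out.println("\u0131".equalsIgnoreCase("I"));      // true
        System.out.println("\u0130".equalsIgnoreCase("i"));      // true
    }
}
```

Each single conversion fails for one of the two pairs, while the uppercase-then-lowercase sequence handles both.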

Digging deeper, let us first look at lowercase conversion and its lower-level implementation:

  1 int toLowerCase(int ch) {
  2     int mapChar = ch;
  3     int val = getProperties(ch);
  4 
  5     if ((val & 0x00020000) != 0) {
  6       if ((val & 0x07FC0000) == 0x07FC0000) {
  7         switch(ch) {
  8           // map the offset overflow chars
  9         case 0x0130 : mapChar = 0x0069; break;
 10         case 0x2126 : mapChar = 0x03C9; break;
 11         case 0x212A : mapChar = 0x006B; break;
 12         case 0x212B : mapChar = 0x00E5; break;
 13           // map the titlecase chars with both a 1:M uppercase map
 14           // and a lowercase map
 15         case 0x1F88 : mapChar = 0x1F80; break;
 16         case 0x1F89 : mapChar = 0x1F81; break;
 17         case 0x1F8A : mapChar = 0x1F82; break;
 18         case 0x1F8B : mapChar = 0x1F83; break;
 19         case 0x1F8C : mapChar = 0x1F84; break;
 20         case 0x1F8D : mapChar = 0x1F85; break;
 21         case 0x1F8E : mapChar = 0x1F86; break;
 22         case 0x1F8F : mapChar = 0x1F87; break;
 23         case 0x1F98 : mapChar = 0x1F90; break;
 24         case 0x1F99 : mapChar = 0x1F91; break;
 25         case 0x1F9A : mapChar = 0x1F92; break;
 26         case 0x1F9B : mapChar = 0x1F93; break;
 27         case 0x1F9C : mapChar = 0x1F94; break;
 28         case 0x1F9D : mapChar = 0x1F95; break;
 29         case 0x1F9E : mapChar = 0x1F96; break;
 30         case 0x1F9F : mapChar = 0x1F97; break;
 31         case 0x1FA8 : mapChar = 0x1FA0; break;
 32         case 0x1FA9 : mapChar = 0x1FA1; break;
 33         case 0x1FAA : mapChar = 0x1FA2; break;
 34         case 0x1FAB : mapChar = 0x1FA3; break;
 35         case 0x1FAC : mapChar = 0x1FA4; break;
 36         case 0x1FAD : mapChar = 0x1FA5; break;
 37         case 0x1FAE : mapChar = 0x1FA6; break;
 38         case 0x1FAF : mapChar = 0x1FA7; break;
 39         case 0x1FBC : mapChar = 0x1FB3; break;
 40         case 0x1FCC : mapChar = 0x1FC3; break;
 41         case 0x1FFC : mapChar = 0x1FF3; break;
 42 
 43         case 0x023A : mapChar = 0x2C65; break;
 44         case 0x023E : mapChar = 0x2C66; break;
 45         case 0x10A0 : mapChar = 0x2D00; break;
 46         case 0x10A1 : mapChar = 0x2D01; break;
 47         case 0x10A2 : mapChar = 0x2D02; break;
 48         case 0x10A3 : mapChar = 0x2D03; break;
 49         case 0x10A4 : mapChar = 0x2D04; break;
 50         case 0x10A5 : mapChar = 0x2D05; break;
 51         case 0x10A6 : mapChar = 0x2D06; break;
 52         case 0x10A7 : mapChar = 0x2D07; break;
 53         case 0x10A8 : mapChar = 0x2D08; break;
 54         case 0x10A9 : mapChar = 0x2D09; break;
 55         case 0x10AA : mapChar = 0x2D0A; break;
 56         case 0x10AB : mapChar = 0x2D0B; break;
 57         case 0x10AC : mapChar = 0x2D0C; break;
 58         case 0x10AD : mapChar = 0x2D0D; break;
 59         case 0x10AE : mapChar = 0x2D0E; break;
 60         case 0x10AF : mapChar = 0x2D0F; break;
 61         case 0x10B0 : mapChar = 0x2D10; break;
 62         case 0x10B1 : mapChar = 0x2D11; break;
 63         case 0x10B2 : mapChar = 0x2D12; break;
 64         case 0x10B3 : mapChar = 0x2D13; break;
 65         case 0x10B4 : mapChar = 0x2D14; break;
 66         case 0x10B5 : mapChar = 0x2D15; break;
 67         case 0x10B6 : mapChar = 0x2D16; break;
 68         case 0x10B7 : mapChar = 0x2D17; break;
 69         case 0x10B8 : mapChar = 0x2D18; break;
 70         case 0x10B9 : mapChar = 0x2D19; break;
 71         case 0x10BA : mapChar = 0x2D1A; break;
 72         case 0x10BB : mapChar = 0x2D1B; break;
 73         case 0x10BC : mapChar = 0x2D1C; break;
 74         case 0x10BD : mapChar = 0x2D1D; break;
 75         case 0x10BE : mapChar = 0x2D1E; break;
 76         case 0x10BF : mapChar = 0x2D1F; break;
 77         case 0x10C0 : mapChar = 0x2D20; break;
 78         case 0x10C1 : mapChar = 0x2D21; break;
 79         case 0x10C2 : mapChar = 0x2D22; break;
 80         case 0x10C3 : mapChar = 0x2D23; break;
 81         case 0x10C4 : mapChar = 0x2D24; break;
 82         case 0x10C5 : mapChar = 0x2D25; break;
 83         case 0x10C7 : mapChar = 0x2D27; break;
 84         case 0x10CD : mapChar = 0x2D2D; break;
 85         case 0x1E9E : mapChar = 0x00DF; break;
 86         case 0x2C62 : mapChar = 0x026B; break;
 87         case 0x2C63 : mapChar = 0x1D7D; break;
 88         case 0x2C64 : mapChar = 0x027D; break;
 89         case 0x2C6D : mapChar = 0x0251; break;
 90         case 0x2C6E : mapChar = 0x0271; break;
 91         case 0x2C6F : mapChar = 0x0250; break;
 92         case 0x2C70 : mapChar = 0x0252; break;
 93         case 0x2C7E : mapChar = 0x023F; break;
 94         case 0x2C7F : mapChar = 0x0240; break;
 95         case 0xA77D : mapChar = 0x1D79; break;
 96         case 0xA78D : mapChar = 0x0265; break;
 97         case 0xA7AA : mapChar = 0x0266; break;
 98           // default mapChar is already set, so no
 99           // need to redo it here.
100           // default       : mapChar = ch;
101         }
102       }
103       else {
104         int offset = val << 5 >> (5+18);
105         mapChar = ch + offset;
106       }
107     }
108     return mapChar;
109 }

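In this source, getProperties fetches the property bits of the character, and the code then takes the appropriate action for each case. For our example, line 9 of the source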

case 0x0130 : mapChar = 0x0069; break;

converts İ (304) to i (105). Note that the code uses hexadecimal: 0x0130 is 304 and 0x0069 is 105.

Now look at uppercase conversion:

  1 int toUpperCase(int ch) {
  2     int mapChar = ch;
  3     int val = getProperties(ch);
  4 
  5     if ((val & 0x00010000) != 0) {
  6       if ((val & 0x07FC0000) == 0x07FC0000) {
  7         switch(ch) {
  8           // map chars with overflow offsets
  9         case 0x00B5 : mapChar = 0x039C; break;
 10         case 0x017F : mapChar = 0x0053; break;
 11         case 0x1FBE : mapChar = 0x0399; break;
 12           // map char that have both a 1:1 and 1:M map
 13         case 0x1F80 : mapChar = 0x1F88; break;
 14         case 0x1F81 : mapChar = 0x1F89; break;
 15         case 0x1F82 : mapChar = 0x1F8A; break;
 16         case 0x1F83 : mapChar = 0x1F8B; break;
 17         case 0x1F84 : mapChar = 0x1F8C; break;
 18         case 0x1F85 : mapChar = 0x1F8D; break;
 19         case 0x1F86 : mapChar = 0x1F8E; break;
 20         case 0x1F87 : mapChar = 0x1F8F; break;
 21         case 0x1F90 : mapChar = 0x1F98; break;
 22         case 0x1F91 : mapChar = 0x1F99; break;
 23         case 0x1F92 : mapChar = 0x1F9A; break;
 24         case 0x1F93 : mapChar = 0x1F9B; break;
 25         case 0x1F94 : mapChar = 0x1F9C; break;
 26         case 0x1F95 : mapChar = 0x1F9D; break;
 27         case 0x1F96 : mapChar = 0x1F9E; break;
 28         case 0x1F97 : mapChar = 0x1F9F; break;
 29         case 0x1FA0 : mapChar = 0x1FA8; break;
 30         case 0x1FA1 : mapChar = 0x1FA9; break;
 31         case 0x1FA2 : mapChar = 0x1FAA; break;
 32         case 0x1FA3 : mapChar = 0x1FAB; break;
 33         case 0x1FA4 : mapChar = 0x1FAC; break;
 34         case 0x1FA5 : mapChar = 0x1FAD; break;
 35         case 0x1FA6 : mapChar = 0x1FAE; break;
 36         case 0x1FA7 : mapChar = 0x1FAF; break;
 37         case 0x1FB3 : mapChar = 0x1FBC; break;
 38         case 0x1FC3 : mapChar = 0x1FCC; break;
 39         case 0x1FF3 : mapChar = 0x1FFC; break;
 40 
 41         case 0x023F : mapChar = 0x2C7E; break;
 42         case 0x0240 : mapChar = 0x2C7F; break;
 43         case 0x0250 : mapChar = 0x2C6F; break;
 44         case 0x0251 : mapChar = 0x2C6D; break;
 45         case 0x0252 : mapChar = 0x2C70; break;
 46         case 0x0265 : mapChar = 0xA78D; break;
 47         case 0x0266 : mapChar = 0xA7AA; break;
 48         case 0x026B : mapChar = 0x2C62; break;
 49         case 0x0271 : mapChar = 0x2C6E; break;
 50         case 0x027D : mapChar = 0x2C64; break;
 51         case 0x1D79 : mapChar = 0xA77D; break;
 52         case 0x1D7D : mapChar = 0x2C63; break;
 53         case 0x2C65 : mapChar = 0x023A; break;
 54         case 0x2C66 : mapChar = 0x023E; break;
 55         case 0x2D00 : mapChar = 0x10A0; break;
 56         case 0x2D01 : mapChar = 0x10A1; break;
 57         case 0x2D02 : mapChar = 0x10A2; break;
 58         case 0x2D03 : mapChar = 0x10A3; break;
 59         case 0x2D04 : mapChar = 0x10A4; break;
 60         case 0x2D05 : mapChar = 0x10A5; break;
 61         case 0x2D06 : mapChar = 0x10A6; break;
 62         case 0x2D07 : mapChar = 0x10A7; break;
 63         case 0x2D08 : mapChar = 0x10A8; break;
 64         case 0x2D09 : mapChar = 0x10A9; break;
 65         case 0x2D0A : mapChar = 0x10AA; break;
 66         case 0x2D0B : mapChar = 0x10AB; break;
 67         case 0x2D0C : mapChar = 0x10AC; break;
 68         case 0x2D0D : mapChar = 0x10AD; break;
 69         case 0x2D0E : mapChar = 0x10AE; break;
 70         case 0x2D0F : mapChar = 0x10AF; break;
 71         case 0x2D10 : mapChar = 0x10B0; break;
 72         case 0x2D11 : mapChar = 0x10B1; break;
 73         case 0x2D12 : mapChar = 0x10B2; break;
 74         case 0x2D13 : mapChar = 0x10B3; break;
 75         case 0x2D14 : mapChar = 0x10B4; break;
 76         case 0x2D15 : mapChar = 0x10B5; break;
 77         case 0x2D16 : mapChar = 0x10B6; break;
 78         case 0x2D17 : mapChar = 0x10B7; break;
 79         case 0x2D18 : mapChar = 0x10B8; break;
 80         case 0x2D19 : mapChar = 0x10B9; break;
 81         case 0x2D1A : mapChar = 0x10BA; break;
 82         case 0x2D1B : mapChar = 0x10BB; break;
 83         case 0x2D1C : mapChar = 0x10BC; break;
 84         case 0x2D1D : mapChar = 0x10BD; break;
 85         case 0x2D1E : mapChar = 0x10BE; break;
 86         case 0x2D1F : mapChar = 0x10BF; break;
 87         case 0x2D20 : mapChar = 0x10C0; break;
 88         case 0x2D21 : mapChar = 0x10C1; break;
 89         case 0x2D22 : mapChar = 0x10C2; break;
 90         case 0x2D23 : mapChar = 0x10C3; break;
 91         case 0x2D24 : mapChar = 0x10C4; break;
 92         case 0x2D25 : mapChar = 0x10C5; break;
 93         case 0x2D27 : mapChar = 0x10C7; break;
 94         case 0x2D2D : mapChar = 0x10CD; break;
 95           // ch must have a 1:M case mapping, but we
 96           // can't handle it here. Return ch.
 97           // since mapChar is already set, no need
 98           // to redo it here.
 99           //default       : mapChar = ch;
100         }
101       }
102       else {
103         int offset = val  << 5 >> (5+18);
104         mapChar =  ch - offset;
105       }
106     }
107     return mapChar;
108 }

When converting ı (305), the program reaches line 103:

int offset = val  << 5 >> (5+18);

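This, together with the subtraction on the next line (mapChar = ch - offset), converts it to I (73); the decoded offset must therefore be 232, since 305 - 232 = 73.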
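The shift expression is easier to follow with concrete numbers. The sketch below builds a made-up properties word (getProperties is internal to the JDK, so the real value for ı is not shown here) that stores a signed 9-bit offset in bits 18 to 26, then decodes it with the same expression. The class and the makeProps helper are hypothetical.

```java
class OffsetDecodeDemo {
    // Same expression as in the excerpt: the left shift drops the 5 bits above
    // the offset field, the arithmetic right shift drops the 18 bits below it
    // and sign-extends the 9-bit result.
    static int decodeOffset(int val) {
        return val << 5 >> (5 + 18);
    }

    // Hypothetical helper: packs an offset into bits 18..26 and sets the
    // 0x00010000 flag that the toUpperCase excerpt tests first.
    static int makeProps(int offset) {
        return ((offset & 0x1FF) << 18) | 0x00010000;
    }

    // The non-special branch of the toUpperCase excerpt.
    static int toUpperViaOffset(int ch, int val) {
        if ((val & 0x00010000) == 0) {
            return ch;
        }
        return ch - decodeOffset(val);
    }

    public static void main(String[] args) {
        int val = makeProps(232);                          // assumed offset for ı
        System.out.println(decodeOffset(val));             // 232
        System.out.println(toUpperViaOffset(0x0131, val)); // 73, i.e. 'I'
        System.out.println(decodeOffset(makeProps(-7)));   // -7, sign is preserved
    }
}
```

The special switch branch is taken only when all nine offset bits are set, which matches the 0x07FC0000 mask test; an offset of 232 does not trigger it.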

With that, the examples above work correctly in Java.

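Summary

For Java:

     1. For characters in the ASCII table, the traditional approach (converting only to uppercase or only to lowercase) works without problems;

     2. To cover more character sets, extra care is needed: one more conversion and comparison must be added. Besides the characters listed in this article, other characters have similar issues.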
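To get a feel for which other characters are affected, one rough probe is to scan the Basic Multilingual Plane for characters whose uppercase form does not survive an upper, lower, upper round trip. For such a character, an uppercase-only comparison against its own lowercase form fails. This is only a sketch with invented names, and the exact list depends on the Unicode version shipped with the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;

class RoundTripScan {
    // True when upper(c) lowercases to a char whose uppercase is a different char.
    static boolean upperCaseIsUnstable(char c) {
        char u = Character.toUpperCase(c);
        char l = Character.toLowerCase(u);
        return Character.toUpperCase(l) != u;
    }

    // Collects all such code points in the BMP (U+0000..U+FFFF).
    static List<Integer> scanBmp() {
        List<Integer> found = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (upperCaseIsUnstable((char) cp)) {
                found.add(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (int cp : scanBmp()) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

The output includes U+0130 (İ) from this article, along with others such as U+212A (Kelvin sign, which lowercases to k) and U+2126 (Ohm sign, which lowercases to ω). The dotless ı does not appear, because it fails on the lowercase-only side instead.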

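For C++:

     1. For characters in the ASCII table, the traditional approach (converting only to uppercase or only to lowercase) works without problems;

     2. C++ handles characters outside the ASCII table differently from Java. Since I could not see the source of tolower, I did not analyze it further; readers who know more are welcome to leave a comment.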

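Afterword

This article touches on the concepts of “equivalence” and “equality” without distinguishing them precisely; see Effective C++ for details.

Related discussion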

https://stackoverflow.com/questions/15518731/understanding-logic-in-caseinsensitivecomparator

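Copyright notice: This is an original article by xiaoxi666, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://www.cnblogs.com/xiaoxi666/p/9535084.html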