Cyrillic and Unicode

Converting Cyrillic names into ASCII

July 23, 2017

Daria had problems exporting the matrix representation of blockmodeling results (and clustering dendrograms) because the EPS files do not support Unicode (Cyrillic) labels. One solution would be to implement SVG export in Pajek. A quicker solution is to transliterate the Russian names into the Latin alphabet. In R this can be done with the function stri_trans_general from the package stringi, which gives access to the ICU Unicode text transforms.

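> # read the 106 lines of the network file that hold the quoted names
> F <- readLines("BM.net")[3:108]
> Encoding(F) <- "UTF-8"
> # split each line at the double quotes; the name is the second field
> L <- strsplit(F,'\"')
> df <- data.frame(matrix(unlist(L),nrow=106,byrow=TRUE),stringsAsFactors=FALSE)
> N <- df$X2
> Encoding(N) <- "UTF-8"
> library(stringi)
> # transliterate to Latin, then decompose, strip the diacritics and recompose
> R <- stri_trans_general(N,"cyrillic-latin;nfd;[:nonspacing mark:] remove;nfc")
> write.csv(R,"BMlatin.nam",row.names=FALSE)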

Now we manually copy the names from BMlatin.nam into BM.net.
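
The manual copy could probably be avoided by rewriting the quoted labels inside the lines of BM.net directly. The following is a minimal sketch, not part of the original workflow: the helper name transliterate_labels and the output file name BMlatin.net are hypothetical, and it assumes that each vertex line holds its label as the first double-quoted string on the line.

```r
library(stringi)

# Hypothetical helper: transliterate the first double-quoted label on each
# line and leave everything else (numbers, coordinates, lines without a
# label) untouched.
transliterate_labels <- function(lines) {
  m <- regexpr('"[^"]*"', lines)      # position of the quoted label, -1 if none
  labels <- regmatches(lines, m)      # labels of the lines that have one
  regmatches(lines, m) <- stri_trans_general(
    labels, "cyrillic-latin;nfd;[:nonspacing mark:] remove;nfc")
  lines
}

# Assumed usage (file names are placeholders):
# net <- readLines("BM.net", encoding = "UTF-8")
# writeLines(transliterate_labels(net), "BMlatin.net", useBytes = TRUE)
```

Lines without a quoted string pass through unchanged, so the whole file can be processed in one call instead of selecting the line range by hand.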

Check:

> tail(N)
[1] "ГОМЗИН А"      "НЕДУМОВ Я"     "IVANOV I"      "АСТРАХАНЦЕВ Н"
[5] "ТРИПУТИНА В"   "МАКАГОНОВА Н" 
> tail(R)
[1] "GOMZIN A"      "NEDUMOV A"     "IVANOV I"      "ASTRAHANCEV N"
[5] "TRIPUTINA V"   "MAKAGONOVA N" 
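
Note that НЕДУМОВ Я came out as NEDUMOV A in the check above. A likely explanation, stated here as an assumption about the ICU rules and not something these notes confirm, is that cyrillic-latin maps Я to an A carrying a diacritic, which the "[:nonspacing mark:] remove" step then strips. The sketch below compares the two variants so that such collisions (Я and А both ending up as A) can be spotted; the variable names are illustrative.

```r
library(stringi)

ya <- "\u042f"   # Cyrillic capital letter Я
a  <- "\u0410"   # Cyrillic capital letter А

# Transliteration only, diacritics kept
kept <- stri_trans_general(c(ya, a), "cyrillic-latin")

# Full chain as used above, diacritics stripped
stripped <- stri_trans_general(
  c(ya, a), "cyrillic-latin;nfd;[:nonspacing mark:] remove;nfc")

# TRUE means two different Cyrillic letters collapsed to the same Latin output
collision <- stripped[1] == stripped[2]
```

If distinct names must stay distinct, it may be safer to check the transliterated vector for duplicates, for example with anyDuplicated(), before copying it back.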

Problems with conversion of character Ь

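August 1, 2017

The soft sign Ь is transliterated to the modifier letter prime ʹ (U+02B9, code point 697), which is not an ASCII character. We replace it with the ASCII apostrophe (code 39).
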
> N[44]
[1] "ЗОРЬКИНА К"
> R[44]
[1] "ZORʹKINA K"
> utf8ToInt(R[44])
 [1]  90  79  82 697  75  73  78  65  32  75
> T <- sapply(R,function(w)gsub(intToUtf8(697),"'",w),USE.NAMES=FALSE)
> T[44]
[1] "ZOR'KINA K"
> utf8ToInt(T[44])
 [1] 90 79 82 39 75 73 78 65 32 75
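
Since gsub() is vectorised, the sapply() wrapper is not strictly needed, and a final check can flag any name that still contains a non-ASCII character. A small sketch using base R only; the vector translit stands in for R from the transcript and the other names are illustrative.

```r
# Stand-in for the transliterated vector R from the transcript
prime <- intToUtf8(697)                       # modifier letter prime, U+02B9
translit <- c("GOMZIN A", paste0("ZOR", prime, "KINA K"))

# gsub() works on the whole vector at once; fixed = TRUE matches literally
ascii <- gsub(prime, "'", translit, fixed = TRUE)

# Indices of names that still contain a code point outside ASCII
non_ascii <- which(vapply(ascii,
                          function(s) any(utf8ToInt(s) > 127L),
                          logical(1), USE.NAMES = FALSE))
```

An empty non_ascii means every name can be written safely to the EPS export; otherwise the listed indices point at the names that need another replacement rule.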

ru/unicode.txt · Last modified: 2017/08/01 16:35 by vlado