This shows you the differences between two versions of the page.
— |
notes:net:cyr [2015/07/16 22:11] (current) vlado created |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Cyrillic ====== | ||
+ | <code> | ||
+ | Subject: matrix description | ||
+ | From: "Maria Safonova" <safonovam@yandex.ru> | ||
+ | Date: Sat, August 10, 2013 10:59 | ||
+ | </code> | ||
+ | |||
+ | |||
+ | ... Matrix represents 2-mode network, | ||
+ | |||
+ | Separator is semicolon (;), | ||
+ | |||
+ | 8304 vertices of the first mode (columns, their labels are v0001 - v8304) – are | ||
+ | citing articles in Russian journals on ethnology and sociology. | ||
+ | |||
+ | 89636 vertices of the second mode (rows, their labels are numbers from 1 to 89636) - | ||
+ | are cited papers. As the titles are long complete bibliographical descriptions, my | ||
+ | colleagues replaced them by numbers. But in case you might wish to have a look at them, I | ||
+ | attach paper titles in a separate file. ... | ||
+ | |||
+ | |||
+ | The task was twofold: to convert the CSV matrix into the corresponding two-mode network in Pajek format, and to convert the list of titles, which also contains Cyrillic text, into a Unicode file. | ||
+ | |||
+ | |||
+ | |||
+ | ===== CSV2Pajek ===== | ||
+ | |||
+ | <code> | ||
+ | # CSV2Pajek | ||
+ | # by Vladimir Batagelj, August 10, 2013 | ||
+ | |||
+ | setwd("C:/temp/pajek") | ||
+ | cat("*** CSV2Pajek",date(),"\n") | ||
+ | inp <- file("2_mode_cite_id.csv","r") | ||
+ | date() | ||
+ | n1 <- 0; repeat{L <- readLines(inp,n=1); if(length(L)==0) break; n1 <- n1+1} | ||
+ | close(inp) | ||
+ | n1 <- n1-1 # subtract header line | ||
+ | cat("n1 = ",n1,"\n") | ||
+ | date() | ||
+ | inp <- file("2_mode_cite_id.csv","r") | ||
+ | net <- file("2_mode_cite_id.net","w") | ||
+ | date() | ||
+ | L <- readLines(inp,n=1) | ||
+ | S <- unlist(strsplit(L,";")) | ||
+ | n2 <- length(S)-1 # subtract unit number field | ||
+ | cat("n2 = ",n2,"\n") | ||
+ | n3 <- n1-1 | ||
+ | cat("% *** CSV2Pajek",date(),"\n",file=net) | ||
+ | cat("*vertices",n1+n2,n1,"\n",file=net) | ||
+ | cat(paste(1:n1,paste('"u',1:n1,'"\n',sep='')),file=net) | ||
+ | cat(paste((n1+1):(n1+n2),paste('"v',1:n2,'"\n',sep='')),file=net) # one label per column (n2 of them) | ||
+ | cat("*arcs\n",file=net) | ||
+ | repeat{ | ||
+ | L <- readLines(inp,n=1) | ||
+ | if(length(L)==0) break | ||
+ | S <- as.integer(unlist(strsplit(L,";"))) | ||
+ | u <- S[1] | ||
+ | for(i in 2:length(S)){if(S[i]>0) cat(u,n3+i,"\n",file=net)} | ||
+ | } | ||
+ | close(inp); close(net) | ||
+ | cat("finished\n") | ||
+ | date() | ||
+ | </code> | ||
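For comparison, the same conversion can be sketched in Python. This is a minimal illustration, not the script that was actually run: the function name `csv_to_pajek` and the tiny sample matrix are hypothetical, the column labels are taken from the header line (the R script generates them instead), and rows are numbered by position rather than by the id stored in the first field.

```python
import io


def csv_to_pajek(csv_text, sep=";"):
    """Turn a 0/1 incidence matrix in CSV form into Pajek two-mode .net text.

    Assumed layout: the first line is a header whose first field names the
    row-id column; every later line holds a row id followed by one cell per
    column.
    """
    lines = [ln for ln in csv_text.splitlines() if ln.strip()]
    col_labels = lines[0].split(sep)[1:]          # labels of the column vertices
    rows = [ln.split(sep) for ln in lines[1:]]
    n1, n2 = len(rows), len(col_labels)

    out = io.StringIO()
    # Two-mode header: total number of vertices, then size of the first set.
    out.write(f"*vertices {n1 + n2} {n1}\n")
    for k, row in enumerate(rows, start=1):
        out.write(f'{k} "u{row[0]}"\n')
    for j, label in enumerate(col_labels, start=1):
        out.write(f'{n1 + j} "{label}"\n')
    out.write("*arcs\n")
    for k, row in enumerate(rows, start=1):
        for j, cell in enumerate(row[1:], start=1):
            if cell.strip() and int(cell) > 0:    # keep only non-zero cells
                out.write(f"{k} {n1 + j}\n")
    return out.getvalue()


sample = "id;v0001;v0002\n1;1;0\n2;0;1\n3;1;1\n"
print(csv_to_pajek(sample))
```

For the three-row sample this yields five vertices (three rows, two columns) and the arcs `1 4`, `2 5`, `3 4`, `3 5`. Building the whole text in memory is fine for a sketch; for a matrix of the size described above, writing line by line to a file, as the R script does, is the safer choice.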
+ | |||
+ | ===== Titles into Unicode ===== | ||
+ | |||
+ | |||
+ | <code> | ||
+ | > tit <- file("titles.csv","r") | ||
+ | > L <- readLines(tit,n=10) | ||
+ | > close(tit) | ||
+ | > S <- L[8] | ||
+ | > S | ||
+ | [1] "7;[×lerëlânecé D.] Ðrnnócälícl î nâîáîäíuo oóäîclnnâro n îdcnrícle ... | ||
+ | > Encoding(S) | ||
+ | [1] "unknown" | ||
+ | > N <- charToRaw(S) | ||
+ | > N | ||
+ | [1] 37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d 20 d0 e0 f1 f1 f3 e6 | ||
+ | ... | ||
+ | > I <- as.integer(N) | ||
+ | > I | ||
+ | [1] 55 59 91 215 229 234 224 235 229 226 241 234 232 233 32 207 46 93 | ||
+ | ... | ||
+ | </code> | ||
+ | I assumed that the text used some 8-bit encoding. I downloaded some Cyrillic fonts, and after changing the font to ''CourierCTT'' ([[http://www.fontstock.com/softdl/Courier_C.zip|Courier_C.zip]] from | ||
+ | [[http://www.aatseel.org/resources/fonts/windows_cyrillic.htm|Windows cyrillic fonts]]) the text was displayed in Cyrillic characters: | ||
+ | <code> | ||
+ | "7;[Чекалевский П.] Рассуждение о свободных художествах с описанием некоторых | ||
+ | </code> | ||
+ | Comparing | ||
+ | <code> | ||
+ | [1] " 7 ; [ × l e r ë l â n e c é D . ] | ||
+ | 7 ; [ Ч е к а л е в с к и й П . ] | ||
+ | [1] 37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d | ||
+ | [1] 55 59 91 215 229 234 224 235 229 226 241 234 232 233 32 207 46 93 | ||
+ | </code> | ||
+ | with the table in Czyborra's [[http://czyborra.com/charsets/cyrillic.html|The Cyrillic Charset Soup]] I confirmed that the encoding is ''CP1251'', also known as ''Windows-1251''. | ||
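The same check can be done without installing fonts by decoding the raw bytes under several candidate encodings and seeing which one gives readable text. The sketch below uses the first 18 bytes from the R session above; the list of candidate codecs is my own choice of common 8-bit Cyrillic encodings, not something taken from the original notes.

```python
# First bytes of the title line, as printed by charToRaw() above.
raw = bytes.fromhex("37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d")

# Candidate 8-bit Cyrillic encodings (an assumed shortlist).
candidates = ["cp1251", "koi8_r", "iso8859_5", "cp866"]

decoded = {enc: raw.decode(enc, errors="replace") for enc in candidates}
for enc, text in decoded.items():
    print(f"{enc:10s} {text}")
```

Only the `cp1251` line reads as `7;[Чекалевский П.]`; the other codecs produce different strings, so the comparison still has to be judged by eye.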
+ | |||
+ | From [[http://docs.python.org/3/howto/unicode.html|Python/How to]] we learn: ... | ||
+ | The Unicode character U+FEFF is used as a byte-order mark (BOM). | ||
+ | In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files... | ||
+ | |||
+ | <code> | ||
+ | >>> BOM = '\uFEFF' | ||
+ | >>> print(ord(BOM)) | ||
+ | 65279 | ||
+ | </code> | ||
+ | Now it is easy to produce a Unicode (UTF-8) version of the titles: | ||
+ | <code> | ||
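+ | # Decode CP1251 and write UTF-8 with a leading BOM. | ||
+ | # errors="replace" turns bytes undefined in CP1251 (e.g. 0x98) into U+FFFD; | ||
+ | # "surrogateescape" on the input alone would make the UTF-8 write fail on them. | ||
+ | BOM = '\uFEFF' | ||
+ | with open("titles.csv", "r", encoding="Windows-1251", errors="replace") as inp, \ | ||
+ |      open("titles.utf8", "w", encoding="UTF-8") as out: | ||
+ |     out.write(BOM) | ||
+ |     for L in inp: | ||
+ |         out.write(L) | ||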
+ | </code> | ||
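Instead of writing the BOM by hand, the `utf-8-sig` codec mentioned in the quoted HOWTO can add it on output and strip it on input. A small sketch of that round trip follows; the temporary file and the one-line sample are illustrative only.

```python
import os
import tempfile

text = "7;[Чекалевский П.]\n"
path = os.path.join(tempfile.mkdtemp(), "titles.utf8")

# Writing with utf-8-sig puts the three BOM bytes EF BB BF at the start.
with open(path, "w", encoding="utf-8-sig") as out:
    out.write(text)

with open(path, "rb") as f:
    raw = f.read()

# Reading with utf-8-sig drops the BOM; plain utf-8 keeps it as U+FEFF.
with open(path, "r", encoding="utf-8-sig") as inp:
    with_sig = inp.read()
with open(path, "r", encoding="utf-8") as inp:
    plain = inp.read()

print(raw[:3].hex(), with_sig == text, plain.startswith("\ufeff"))
```

Reading with `utf-8-sig` is also safe for files that have no BOM, which makes it a reasonable default when the origin of a file is unknown.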
+ | |||
+ | ===== Fonts ===== | ||
+ | |||
+ | |||
+ | [[http://www.alanwood.net/unicode/fonts.html|Unicode fonts for Windows]]/[[http://www.alanwood.net/downloads/index.html|Code2000]],[[http://www.smashingmagazine.com/2006/10/11/17-more-free-quality-fonts/|19]], [[http://dejavu-fonts.org/wiki/Main_Page|DejaVu]], [[http://www.fontsaddict.com/font/bitstream-cyberbit.html|Bitstream cyberbit]], [[http://www.iis.ru/cyrillic/about/addlinks.en.html|Cyrillic]]. | ||
+ | |||
+ | |||
+ | |||
+ | ===== Cyrillic and LaTeX ===== | ||
+ | |||
+ | Last night Dacar asked me over the phone how to include Cyrillic text in LaTeX. | ||
+ | |||
+ | ==== Sat, August 17, 2013 18:08 ==== | ||
+ | |||
+ | Vladimir Batagelj wrote: | ||
+ | |||
+ | Here are the notes from the Wednesday seminar at which I presented [[http://vlado.fmf.uni-lj.si/seminar/14apr10/default.htm|Xe(La)TeX]]. | ||
+ | |||
+ | The \fbox{ } around the declarations at the start of the example is (probably) superfluous. You can type Cyrillic using "Character Map" (Accessories / System Tools), which is rather awkward. Another option is a dedicated Unicode editor, for example | ||
+ | [[http://www.babelstone.co.uk/Software/BabelPad.html|BabelPad]]. | ||
+ | For testing I have prepared a short file, which I attach. | ||
+ | <code> | ||
+ | \documentclass[a4paper]{article} | ||
+ | \usepackage{fontspec} | ||
+ | \usepackage{xunicode} | ||
+ | \usepackage{xltxtra} | ||
+ | |||
+ | \setmainfont{Arial Unicode MS} | ||
+ | \begin{document} | ||
+ | Ljubljana | ||
+ | |||
+ | Љубљана | ||
+ | \end{document} | ||
+ | </code> | ||
+ | You have to process it with XeLaTeX, which should be in the /bin subdirectory of your TeX installation. | ||
+ | Two more links: http://www.xelatex.org/ and http://www.texts.io/support/0002/ . | ||
+ | |||
+ | ==== Wed, August 21, 2013 12:37 ==== | ||
+ | |||
+ | <code> | ||
+ | Subject: Re: Cirilica v LaTeXu | ||
+ | From: "France Dacar" <France.Dacar@ijs.si> | ||
+ | Date: Wed, August 21, 2013 12:37 | ||
+ | To: vladimir.batagelj@fmf.uni-lj.si | ||
+ | </code> | ||
+ | |||
+ | Oi Vlado oi: | ||
+ | |||
+ | All these solutions are in the style of "A je to!": if you want an extra sign in Russian Cyrillic | ||
+ | on your mailbox, you first have to tear down the house, put up a brand-new | ||
+ | mailbox on a granite candelabrum, and then rebuild the house around the mailbox. | ||
+ | I only want to write two or three references in Russian (four or five lines out of two thousand); | ||
+ | I do not intend to write whole articles or monographs in Russian... Surely something | ||
+ | like this exists somewhere for exactly this purpose: | ||
+ | <code> | ||
+ | \usepackage{porussky} | ||
+ | |||
+ | ... | ||
+ | |||
+ | \begin{thebibliography}{9999} | ||
+ | \bibitem{GantTM} | ||
+ | \begin{porussky} | ||
+ | Feliks Rubimovi\ch\ Gantmaher, \textit{Teori\ya\ Matric}, ... | ||
+ | \end{porussky} | ||
+ | \end{thebibliography} | ||
+ | </code> | ||
+ | If I understand correctly, the AMS once had something similar for exactly this purpose: writing | ||
+ | a few references to Russian sources (there was even a choice between Cyrillic and | ||
+ | transliteration into Latin script). But then they rolled up their sleeves and put together | ||
+ | total support for writing in Cyrillic in LaTeX... Ugh. | ||
+ | |||
+ | -- France | ||
+ | |||
+ | ==== Wed, August 21, 2013 18:49 ==== | ||
+ | <code> | ||
+ | Subject: Re: Cirilica v LaTeXu | ||
+ | From: "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> | ||
+ | Date: Wed, August 21, 2013 18:49 | ||
+ | To: "France Dacar" <France.Dacar@ijs.si> | ||
+ | </code> | ||
+ | |||
+ | I googled a bit more. | ||
+ | It seems it also works with pdfLaTeX. A sample is in the attached files. | ||
+ | <code> | ||
+ | \documentclass{article} | ||
+ | \usepackage[utf8]{inputenc} | ||
+ | \usepackage[T2A]{fontenc} | ||
+ | \begin{document} | ||
+ | This text is in Russian: проверка. | ||
+ | \end{document} | ||
+ | </code> | ||
+ | You will have to get used to Unicode, though. You will manage to type | ||
+ | those few Cyrillic characters somehow. | ||
+ | |||
+ | The AMS solution no longer seems to be supported; it is supposed to be replaced | ||
+ | by the babel package, which (for me) works for Greek but not for Russian. | ||
+ | See the two cirilica3 files. | ||
+ | <code> | ||
+ | \documentclass[a4paper,12pt]{article} | ||
+ | |||
+ | \usepackage[UTF8]{inputenc} | ||
+ | \usepackage[T1,T2A]{fontenc} | ||
+ | \usepackage[russian,greek,english]{babel} | ||
+ | |||
+ | \begin{document} | ||
+ | |||
+ | The last language listed will be the active (or default) one. | ||
+ | The others can be chosen for large blocks: | ||
+ | |||
+ | \selectlanguage{russian} | ||
+ | |||
+ | Горбачёв | ||
+ | |||
+ | \selectlanguage{greek} | ||
+ | |||
+ | Ellhnik`o ke`imeno. | ||
+ | |||
+ | \selectlanguage{english} | ||
+ | |||
+ | You can also insert short pieces of text in arbitrary languages, | ||
+ | even within paragraphs of a different language: | ||
+ | |||
+ | The capital of Russia is \foreignlanguage{russian}{Moskva} | ||
+ | and the capital of Greece is \foreignlanguage{greek}{Ajhna.} | ||
+ | |||
+ | \end{document} | ||
+ | </code> | ||
+ | Both PDFs were produced with pdfLaTeX. | ||
+ | |||
+ | Regards, Vlado | ||
+ | |||
+ | ==== Thu, August 22, 2013 12:21 ==== | ||
+ | |||
+ | <code> | ||
+ | Subject: Re: Cirilica v LaTeXu | ||
+ | From: "France Dacar" <France.Dacar@ijs.si> | ||
+ | Date: Thu, August 22, 2013 12:21 | ||
+ | To: vladimir.batagelj@fmf.uni-lj.si | ||
+ | </code> | ||
+ | |||
+ | Oi Vlado oi: | ||
+ | |||
+ | \foreignlanguage{russian}{...} is what I need: see test1.tex | ||
+ | and test1.pdf (the OT2 option in \usepackage[T1,OT2]{fontenc} is essential). | ||
+ | There is a table in pmcyr.pdf, on page 4. Simple as pie. | ||
+ | No need to conjure with Unicode or UTF-8. (I use the MiKTeX LaTeX environment.) | ||
+ | <code> | ||
+ | \documentclass{article} | ||
+ | \usepackage[T1,OT2]{fontenc} | ||
+ | \usepackage[russian,english]{babel} | ||
+ | \begin{document} | ||
+ | This text is in Russian: \foreignlanguage{russian}{proverka}. | ||
+ | |||
+ | The capital of Russia is \foreignlanguage{russian}{Moskva}. | ||
+ | |||
+ | \foreignlanguage{russian}{Druzhba} means friendship. | ||
+ | |||
+ | \foreignlanguage{russian}{Feliks Ruvimovich Gantmaher, \textit{Teoriya Matric}}. | ||
+ | \end{document} | ||
+ | </code> | ||
+ | |||
+ | Why was this so hard to dig up? This was already my third or fourth attempt. | ||
+ | How much I have waded up and down the internet... | ||
+ | |||
+ | The babel manual (babel.pdf, 450 pages(!)) says not a word about this, | ||
+ | namely how \foreignlanguage{russian} works. | ||
+ | Is there a humanly written User Manual/Tutorial for babel anywhere? | ||
+ | Whatever babel.pdf is, a user-friendly user manual it certainly is not. | ||
+ | |||
+ | -- France | ||
+ | |||
+ | ==== Thu, August 22, 2013 17:16 ==== | ||
+ | <code> | ||
+ | Subject: Še ena tabela za cirilico v OT2 | ||
+ | From: "France Dacar" <France.Dacar@ijs.si> | ||
+ | Date: Thu, August 22, 2013 17:16 | ||
+ | To: "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> | ||
+ | </code> | ||
+ | |||
+ | Oi Vlado oi: | ||
+ | |||
+ | He who seeks shall find (if he knows what he is seeking). | ||
+ | Attached is another [[http://herbert.the-little-red-haired-girl.org/dvi/pdf/cyrillic.pdf|table]] of \commands and ligatures for Cyrillic under OT2. | ||
+ | |||
+ | I also attach an example, test2, showing how to get by without babel. | ||
+ | <code> | ||
+ | \documentclass[12pt]{article} | ||
+ | |||
+ | \usepackage[OT2,T1]{fontenc} | ||
+ | \newcommand{\cyr}{\fontencoding{OT2}\fontfamily{wncyr}\selectfont} | ||
+ | \setlength{\parindent}{0pt} | ||
+ | |||
+ | \begin{document} | ||
+ | This text is in Russian: {\cyr proverka}. | ||
+ | |||
+ | The capital of Russia is {\cyr Moskva}. | ||
+ | |||
+ | {\cyr Druzhba} means friendship. | ||
+ | |||
+ | {\cyr Feliks Ruvimovich Gantmaher, | ||
+ | \textit{Teo\-riya~Ma\-tric} \textit{Teo\-riya~Ma\-tric} | ||
+ | \textit{Teo\-riya~Ma\-tric}}. | ||
+ | |||
+ | Compare {\cyr <sovetskii0>} (false ligature \texttt{ts}) | ||
+ | and {\cyr <sovet{}skii0>} (with \texttt{t\{\}s}). | ||
+ | \end{document} | ||
+ | </code> | ||
+ | Without babel there is no automatic hyphenation, but since I need all this | ||
+ | for only a few references, I can mark the hyphenation points myself if and when needed. | ||
+ | |||
+ | -- France | ||
+ | |||