notes:net:cyr [2015/07/16 22:11] (current)
vlado created
 +====== Cyrillic ======
  
 +<code>
 +Subject:   matrix description 
 +From:   "Maria Safonova" <safonovam@yandex.ru> 
 +Date:   Sat, August 10, 2013 10:59 
 +</code> 
 + 
 + 
 +... Matrix represents 2-mode network, 
 +
 +Separator is semicolon (;),
 +
 +8304 vertices of the first mode (columns, their labels are v0001 - v8304) – are
 +citing articles in Russian journals on ethnology and sociology.  
 +
 +89636 vertices of the second mode (rows, their labels are numbers from 1 to 89636) -
 +are cited papers. As the titles are long complete bibliographical descriptions, my
 +colleagues replaced them by numbers. But in case you might wish to have a look at them, I
 +attach paper titles in a separate file. ...
 +
 +
 +The task was to convert the matrix in CSV format into the corresponding two-mode network in Pajek's format, and to convert the list of titles, which also contains Cyrillic text, into a Unicode file.
 + 
 + 
 +
 +===== CSV2Pajek =====
 +
 +<code>
 +# CSV2Pajek
 +# by Vladimir Batagelj, August 10, 2013 
 +
 +setwd("C:/temp/pajek")
 +cat("*** CSV2Pajek",date(),"\n")
 +inp <- file("2_mode_cite_id.csv","r")
 +date()
 +n1 <- 0; repeat{L <- readLines(inp,n=1); if(length(L)==0) break; n1 <- n1+1}
 +close(inp)
 +n1 <- n1-1  # subtract header line
 +cat("n1 = ",n1,"\n")
 +date()
 +inp <- file("2_mode_cite_id.csv","r")
 +net <- file("2_mode_cite_id.net","w")
 +date()
 +L <- readLines(inp,n=1)
 +S <- unlist(strsplit(L,";"))
 +n2 <- length(S)-1   # subtract unit number field
 +cat("n2 = ",n2,"\n")
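 +n3 <- n1-1   # column field i (i >= 2) maps to vertex n1+(i-1) = n3+i
 +cat("% *** CSV2Pajek",date(),"\n",file=net)
 +cat("*vertices",n1+n2,n1,"\n",file=net)
 +# one vertex line per row (u...) and per column (v...); sep='' avoids leading blanks
 +cat(paste(1:n1,' "u',1:n1,'"\n',sep=''),sep='',file=net)
 +cat(paste((n1+1):(n1+n2),' "v',1:n2,'"\n',sep=''),sep='',file=net)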
 +cat("*arcs\n",file=net)
 +repeat{
 +  L <- readLines(inp,n=1)
 +  if(length(L)==0) break
 +  S <- as.integer(unlist(strsplit(L,";")))
 +  u <- S[1]
 +  for(i in 2:length(S)){if(S[i]>0) cat(u,n3+i,"\n",file=net)}
 +}
 +close(inp); close(net)
 +cat("finished\n")
 +date()
 +</code>
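For comparison, the same conversion can be sketched in Python. This is only an illustration, not the script that was actually used: the function name ''csv_to_pajek'', the header field ''id'' and the tiny sample matrix are assumptions. Unlike the R script it reads the input in a single pass, keeps the nonzero positions in memory, and numbers the rows by their position rather than by the value in the first field.

```python
import io

def csv_to_pajek(src, dst, sep=";"):
    """Write a Pajek two-mode network for a 0/1 matrix with a header row.

    src and dst are text file objects; the first field of every line
    is assumed to hold the row identifier.
    """
    header = src.readline().rstrip("\n").split(sep)
    col_labels = header[1:]              # drop the row-id column name
    rows = []                            # (row label, nonzero column positions)
    for line in src:
        line = line.strip()
        if not line:
            continue
        fields = line.split(sep)
        hits = [j for j, x in enumerate(fields[1:], start=1) if int(x) > 0]
        rows.append((fields[0], hits))
    n1, n2 = len(rows), len(col_labels)
    dst.write(f"*vertices {n1 + n2} {n1}\n")
    for i, (label, _) in enumerate(rows, start=1):
        dst.write(f'{i} "u{label}"\n')
    for j, label in enumerate(col_labels, start=1):
        dst.write(f'{n1 + j} "{label}"\n')
    dst.write("*arcs\n")
    for i, (_, hits) in enumerate(rows, start=1):
        for j in hits:
            dst.write(f"{i} {n1 + j}\n")

# hypothetical 2 x 3 sample in the same layout as the real matrix
sample = "id;v0001;v0002;v0003\n1;1;0;1\n2;0;1;0\n"
out = io.StringIO()
csv_to_pajek(io.StringIO(sample), out)
pajek_text = out.getvalue()
print(pajek_text)
```

For the real data, ''src'' and ''dst'' would be files opened on ''2_mode_cite_id.csv'' and ''2_mode_cite_id.net''.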
 +
 +===== Titles into Unicode =====
 +
 + 
 +<code>
 +> tit <- file("titles.csv","r")
 +> L <- readLines(tit,n=10)
 +> close(tit)
 +> S <- L[8]
 +> S
 +[1] "7;[×lerëlânecé D.] Ðrnnócälícl î nâîáîäíuo oóäîclnnâro n îdcnrícle ...
 +> Encoding(S)
 +[1] "unknown"
 +> N <- charToRaw(S)
 +> N
 +  [1] 37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d 20 d0 e0 f1 f1 f3 e6
 +...
 +> I <- as.integer(N)
 +> I
 +  [1]  55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93
 +...
 +</code>
 +I assumed that the text is encoded in some 8-bit encoding. I downloaded some Cyrillic fonts, and after changing the font to ''CourierCTT'' ([[http://www.fontstock.com/softdl/Courier_C.zip|Courier_C.zip]] from
 +[[http://www.aatseel.org/resources/fonts/windows_cyrillic.htm|Windows cyrillic fonts]]) the text was displayed using Cyrillic characters:
 +<code>
 +"7;[Чекалевский П.] Рассуждение о свободных художествах с описанием некоторых 
 +</code>
 +Comparing
 +<code>
 +[1] "   7   ;   [   ×   l   e   r   ë   l   â   n   e   c   é       D   .   ] 
 +        7   ;   [   Ч   е   к   а   л   е   в   с   к   и   й       П   .   ] 
 +[1]    37  3b  5b  d7  e5  ea  e0  eb  e5  e2  f1  ea  e8  e9  20  cf  2e  5d 
 +[1]    55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93
 +</code>
 +with the table in Czyborra's [[http://czyborra.com/charsets/cyrillic.html|The Cyrillic Charset Soup]] I confirmed that the encoding is ''CP1251'' or ''Windows-1251''.
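The same check can be done without fonts or tables by decoding the raw bytes with several candidate encodings. The sketch below uses the 18 bytes printed by ''charToRaw'' above; the list of candidates is my own choice of common 8-bit Cyrillic encodings, not something taken from the original investigation.

```python
# the first 18 bytes of the title line, as shown by charToRaw(S)
raw = bytes.fromhex("373b5bd7e5eae0ebe5e2f1eae8e920cf2e5d")
expected = "7;[Чекалевский П.]"

# candidate 8-bit encodings (an assumed shortlist)
candidates = ["cp1251", "koi8_r", "cp866", "iso8859_5", "latin-1"]
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    mark = "<-- match" if text == expected else ""
    print(f"{name:10} {text} {mark}")

matches = [name for name, text in decoded.items() if text == expected]
```

Only ''cp1251'' reproduces the readable title; the other candidates give different characters for the same bytes.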
 +
 +From [[http://docs.python.org/3/howto/unicode.html|Python/How to]] we learn: ...
 +The Unicode character U+FEFF is used as a byte-order mark (BOM). 
 +In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files...
 +
 +<code>
 +>>> BOM = '\uFEFF'
 +>>> print(ord(BOM))
 +65279
 +</code>
 +Now it is easy to get the Unicode version of the titles:
 +<code>
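 +# cp1251 leaves byte 0x98 undefined; errors="replace" turns it into U+FFFD
 +# instead of a surrogate that the UTF-8 writer would refuse to encode
 +with open("titles.csv", "r", encoding="Windows-1251", errors="replace") as inp, \
 +     open("titles.utf8", "w", encoding="UTF-8") as out:
 +    out.write('\uFEFF')   # BOM, marks the file as UTF-8
 +    for L in inp: out.write(L)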
 +</code>
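As the quoted how-to mentions, Python's ''utf-8-sig'' codec handles the BOM by itself, so writing it by hand is optional. The following sketch shows the round trip on a temporary file; the sample line and the temporary path are assumptions made for the illustration.

```python
import os
import tempfile

text = "7;[Чекалевский П.]\n"
path = os.path.join(tempfile.mkdtemp(), "titles.utf8")

# writing with utf-8-sig puts the BOM at the start of the file
with open(path, "w", encoding="utf-8-sig") as out:
    out.write(text)

with open(path, "rb") as f:
    raw = f.read()                 # raw bytes, BOM included

# reading with utf-8-sig skips the BOM ...
with open(path, "r", encoding="utf-8-sig") as f:
    back = f.read()

# ... while plain utf-8 keeps it as the first character
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

print(raw[:3], repr(back), repr(with_bom[0]))
```

Reading such a file with plain ''utf-8'' leaves U+FEFF in front of the first title, which is easy to overlook when the first field is later parsed as a number.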
 +
 +===== Fonts =====
 +
 +
 +[[http://www.alanwood.net/unicode/fonts.html|Unicode fonts for Windows]]/[[http://www.alanwood.net/downloads/index.html|Code2000]],[[http://www.smashingmagazine.com/2006/10/11/17-more-free-quality-fonts/|19]], [[http://dejavu-fonts.org/wiki/Main_Page|DejaVu]], [[http://www.fontsaddict.com/font/bitstream-cyberbit.html|Bitstream cyberbit]], [[http://www.iis.ru/cyrillic/about/addlinks.en.html|Cyrillic]].
 +
 +
 +
 +===== Cyrillic and LaTeX =====
 +
 +Last night Dacar asked me over the phone how to include Cyrillic text in LaTeX.
 +
 +==== 17/08/2013 18:08 ====
 +
 +Vladimir Batagelj wrote:
 +
 +Here are the notes from the Wednesday seminar where I presented [[http://vlado.fmf.uni-lj.si/seminar/14apr10/default.htm|Xe(La)TeX]].
 +
 +The \fbox{ } around the declarations at the beginning of the example is (probably) unnecessary. You can type Cyrillic using "Character map" (Accessories / System tools), which is rather awkward. Another option is a dedicated Unicode editor, for example
 +[[http://www.babelstone.co.uk/Software/BabelPad.html|BabelPad]].
 +For testing I have prepared a short file for you, which I attach.
 +<code>
 +\documentclass[a4paper]{article}
 +\usepackage{fontspec}
 +\usepackage{xunicode}
 +\usepackage{xltxtra}
 +
 +\setmainfont{Arial Unicode MS}
 +\begin{document}
 +Ljubljana
 +
 +Љубљана
 +\end{document}
 +</code>
 +You have to process it with the XeLaTeX program, which should be in the /bin subdirectory of your TeX installation.
 +Two more addresses: http://www.xelatex.org/ and http://www.texts.io/support/0002/ .
 +
 +==== Wed, August 21, 2013 12:37 ====
 +
 +<code>
 +Subject:   Re: Cyrillic in LaTeX 
 +From:   "France Dacar" <France.Dacar@ijs.si> 
 +Date:   Wed, August 21, 2013 12:37 
 +To:   vladimir.batagelj@fmf.uni-lj.si 
 +</code> 
 + 
 +Oi Vlado oi:
 +
 +All these solutions are in the style of "A je to!": if you want an extra sign in Russian Cyrillic
 +on your mailbox, you first have to tear down the house, put up a brand-new
 +mailbox on a granite lamppost, and then rebuild the house around the mailbox.
 +I only want to write two or three references in Russian (four or five lines out of two thousand);
 +I do not intend to write whole articles or monographs in Russian...  Surely something
 +like this exists somewhere for exactly this purpose:
 +<code>
 +\usepackage{porussky}
 +
 +...
 +
 +\begin{thebibliography}{9999}
 +\bibitem{GantTM}
 +\begin{porussky}
 +Feliks Ruvimovi\ch\ Gantmaher, \textit{Teori\ya\ Matric}, ...
 +\end{porussky}
 +\end{thebibliography}
 +</code>
 +If I understand correctly, AMS once had something similar for exactly this purpose: writing
 +a few references to Russian sources (there was even a choice between Cyrillic and
 +a transliteration into Latin script).  But then they rolled up their sleeves and put together
 +total support for writing in Cyrillic in LaTeX...  Yuck.
 +
 +-- France
 +
 +==== Wed, August 21, 2013 18:49 ====
 +<code>
 +Subject:   Re: Cyrillic in LaTeX 
 +From:   "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> 
 +Date:   Wed, August 21, 2013 18:49 
 +To:   "France Dacar" <France.Dacar@ijs.si> 
 +</code>
 +
 +I googled a bit more.
 +It seems it also works with PdfLaTeX. A sample is in the attached files.
 +<code>
 +\documentclass{article}
 +\usepackage[utf8]{inputenc}
 +\usepackage[T2A]{fontenc}
 +\begin{document}
 +This text is in Russian: проверка.
 +\end{document}
 +</code>
 +You will have to get used to Unicode, though. You will manage
 +to type those few Cyrillic characters somehow.
 +
 +The AMS solution seems to be no longer supported; it is supposed to be
 +replaced by the babel package, which works (for me) for Greek but not for Russian.
 +See the two cirilica3 files.
 +<code>
 +\documentclass[a4paper,12pt]{article}
 +
 +\usepackage[UTF8]{inputenc}
 +\usepackage[T1,T2A]{fontenc}
 +\usepackage[russian,greek,english]{babel}
 +
 +\begin{document}
 +
 +The last language listed will be the active (or default) one.
 +The others can be chosen for large blocks:
 +
 +\selectlanguage{russian}
 +
 +Горбачёв
 +
 +\selectlanguage{greek}
 +
 +Ellhnik`o ke`imeno.
 +
 +\selectlanguage{english}
 +
 +You can also insert short pieces of text in arbitrary languages,
 +even within paragraphs of a different language:
 +
 +The capital of Russia is \foreignlanguage{russian}{Moskva}
 +and the capital of Greece is \foreignlanguage{greek}{Ajhna.}
 +
 +\end{document}
 +</code>
 +Both PDFs were produced with PdfLaTeX.
 +
 +Regards,  Vlado
 +
 +==== Thu, August 22, 2013 12:21 ====
 +
 +<code>
 +Subject:   Re: Cyrillic in LaTeX 
 +From:   "France Dacar" <France.Dacar@ijs.si> 
 +Date:   Thu, August 22, 2013 12:21 
 +To:   vladimir.batagelj@fmf.uni-lj.si 
 +</code>
 +
 +Oi Vlado oi:
 +
 +\foreignlanguage{russian}{...} is what I need: see test1.tex
 +and test1.pdf (the OT2 option in \usepackage[T1,OT2]{fontenc} is important).
 +There is a table in pmcyr.pdf, on page 4.  Dead simple.
 +No need for wizardry with Unicode or UTF-8.  (I use the MiKTeX LaTeX environment.)
 +<code>
 +\documentclass{article}
 +\usepackage[T1,OT2]{fontenc}
 +\usepackage[russian,english]{babel}
 +\begin{document}
 +This text is in Russian: \foreignlanguage{russian}{proverka}.
 +
 +The capital of Russia is \foreignlanguage{russian}{Moskva}.
 +
 +\foreignlanguage{russian}{Druzhba} means friendship.
 +
 +\foreignlanguage{russian}{Feliks Ruvimovich Gantmaher, \textit{Teoriya Matric}}.
 +\end{document}
 +</code>
 +
 +Why was this so hard to dig up?  This was already my third or fourth attempt.
 +How much I have waded up and down the internet...
 +
 +The babel manual (babel.pdf, 450 pages(!)) says not a word about this,
 +namely how \foreignlanguage{russian} works.
 +Is there a humanly written User Manual/Tutorial for babel anywhere?
 +Whatever babel.pdf is, it is certainly not a user-friendly user manual.
 +
 +-- France
 +
 +==== Thu, August 22, 2013 17:16 ====
 +<code>
 +Subject:   Another table for Cyrillic in OT2 
 +From:   "France Dacar" <France.Dacar@ijs.si> 
 +Date:   Thu, August 22, 2013 17:16 
 +To:   "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> 
 +</code>
 +
 +Oi Vlado oi:
 +
 +He who seeks shall find (if he knows what he is looking for).
 +Attached is another [[http://herbert.the-little-red-haired-girl.org/dvi/pdf/cyrillic.pdf|table]] of \commands and ligatures for Cyrillic under OT2.
 +
 +I also attach an example, test2, showing how to get by without babel.
 +<code>
 +\documentclass[12pt]{article}
 +
 +\usepackage[OT2,T1]{fontenc}
 +\newcommand{\cyr}{\fontencoding{OT2}\fontfamily{wncyr}\selectfont}
 +\setlength{\parindent}{0pt}
 +
 +\begin{document}
 +This text is in Russian: {\cyr proverka}.
 +
 +The capital of Russia is {\cyr Moskva}.
 +
 +{\cyr Druzhba} means friendship.
 +
 +{\cyr Feliks Ruvimovich Gantmaher,
 +        \textit{Teo\-riya~Ma\-tric} \textit{Teo\-riya~Ma\-tric} 
 +        \textit{Teo\-riya~Ma\-tric}}.
 +
 +Compare {\cyr <sovetskii0>} (false ligature \texttt{ts})
 +and {\cyr <sovet{}skii0>} (with \texttt{t\{\}s}).
 +\end{document}
 +</code>
 +Without babel there is no automatic hyphenation, but since I need all this
 +only for a few references, I can indicate the hyphenation myself if and when necessary.
 +
 +-- France
 +
notes/net/cyr.txt · Last modified: 2015/07/16 22:11 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported