Cyrillic

Subject:   matrix description 
From:   "Maria Safonova" <safonovam@yandex.ru> 
Date:   Sat, August 10, 2013 10:59 

… Matrix represents 2-mode network,

Separator is semicolon (;),

8304 vertices of the first mode (columns, their labels are v0001 - v8304) – are citing articles in Russian journals on ethnology and sociology.

89636 vertices of the second mode (rows, their labels are numbers from 1 to 89636) - are cited papers. As the titles are long complete bibliographical descriptions, my colleagues replaced them by numbers. But in case you might wish to have a look at, I attach paper titles in a separate file. …

The task was to convert the matrix in CSV format into the corresponding two-mode network in Pajek's format, and to convert the list of titles, which also contains Cyrillic text, into a Unicode file.

CSV2Pajek

# CSV2Pajek
# by Vladimir Batagelj, August 10, 2013 

setwd("C:/temp/pajek")
cat("*** CSV2Pajek",date(),"\n")
inp <- file("2_mode_cite_id.csv","r")
date()
n1 <- 0; repeat{L <- readLines(inp,n=1); if(length(L)==0) break; n1 <- n1+1}
close(inp)
n1 <- n1-1  # subtract header line
cat("n1 = ",n1,"\n")
date()
inp <- file("2_mode_cite_id.csv","r")
net <- file("2_mode_cite_id.net","w")
date()
L <- readLines(inp,n=1)
S <- unlist(strsplit(L,";"))
n2 <- length(S)-1   # subtract unit number field
cat("n2 = ",n2,"\n")
n3 <- n1-1
cat("% *** CSV2Pajek",date(),"\n",file=net)
cat("*vertices",n1+n2,n1,"\n",file=net)
cat(paste(1:n1,paste('"u',1:n1,'"\n',sep='')),file=net)
cat(paste((n1+1):(n1+n2),paste('"v',1:n2,'"\n',sep='')),file=net)
cat("*arcs\n",file=net)
repeat{
  L <- readLines(inp,n=1)
  if(length(L)==0) break
  S <- as.integer(unlist(strsplit(L,";")))
  u <- S[1]
  for(i in 2:length(S)){if(S[i]>0) cat(u,n3+i,"\n",file=net)}
}
close(inp); close(net)
cat("finished\n")
date()
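
For comparison, here is a minimal Python sketch of the same conversion. It is not the script that was actually used; the function name `csv_to_pajek` and the three-column sample are hypothetical. It assumes the first CSV line is a header whose first field heads the row-label column, and it numbers the row vertices by position rather than by the value in the first field.

```python
import csv
import io


def csv_to_pajek(csv_text: str, sep: str = ";") -> str:
    """Convert a two-mode matrix given as CSV text into Pajek .net text.

    Assumption: line 1 is a header (first field = row-id column) and every
    following line starts with the row label.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text), delimiter=sep) if r]
    header, data = rows[0], rows[1:]
    col_labels = header[1:]               # labels of the column mode
    n1, n2 = len(data), len(col_labels)   # row mode first, then column mode

    out = [f"*vertices {n1 + n2} {n1}"]
    out += [f'{i} "{row[0]}"' for i, row in enumerate(data, start=1)]
    out += [f'{n1 + j} "{lab}"' for j, lab in enumerate(col_labels, start=1)]
    out.append("*arcs")
    for i, row in enumerate(data, start=1):
        for j, cell in enumerate(row[1:], start=1):
            if cell.strip() and float(cell) > 0:   # skip empty and zero cells
                out.append(f"{i} {n1 + j}")
    return "\n".join(out) + "\n"


# Hypothetical miniature of the real matrix: 2 rows, 3 columns.
sample = ";v0001;v0002;v0003\n1;1;0;1\n2;0;1;0\n"
print(csv_to_pajek(sample), end="")
```

The sketch keeps the whole text in memory, which is fine for a small sample. For a matrix of the size described above, a streaming version like the R script (count the rows first, then read line by line) is probably the safer choice.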

Titles into Unicode

> tit <- file("titles.csv","r")
> L <- readLines(tit,n=10)
> close(tit)
> S <- L[8]
> S
[1] "7;[×ĺęŕëĺâńęčé Ď.] Đŕńńóćäĺíčĺ î ńâîáîäíűő őóäîćĺńňâŕő ń îďčńŕíčĺě ...
> Encoding(S)
[1] "unknown"
> N <- charToRaw(S)
> N
  [1] 37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d 20 d0 e0 f1 f1 f3 e6
...
> I <- as.integer(N)
> I
  [1]  55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93
...

I assumed that the text was encoded in some 8-bit encoding. I downloaded some Cyrillic fonts, and after I changed the font to CourierCTT (Courier_C.zip from Windows Cyrillic fonts) the text was displayed in Cyrillic characters:

"7;[Чекалевский П.] Рассуждение о свободных художествах с описанием некоторых 

Comparing

[1] "   7   ;   [   ×   ĺ   ę   ŕ   ë   ĺ   â   ń   ę   č   é       Ď   .   ] 
        7   ;   [   Ч   е   к   а   л   е   в   с   к   и   й       П   .   ] 
[1]    37  3b  5b  d7  e5  ea  e0  eb  e5  e2  f1  ea  e8  e9  20  cf  2e  5d 
[1]    55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93

with the table in Czyborra's The Cyrillic Charset Soup, I confirmed that the encoding is CP1251, also known as Windows-1251.
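
The same comparison can be sketched in Python: decode the bytes shown above with a few candidate Cyrillic codecs and see which one yields readable text. The candidate list is an assumption made for illustration; only the byte values come from the R session.

```python
# First 25 bytes of the title line, copied from the charToRaw() output above.
raw = bytes.fromhex(
    "37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d 20 d0 e0 f1 f1 f3 e6"
)

# Candidate 8-bit Cyrillic encodings (an illustrative selection, not exhaustive).
candidates = ["cp1251", "koi8_r", "iso8859_5", "cp866"]

# errors="replace" keeps the loop going even if a byte is undefined in a codec.
decoded = {name: raw.decode(name, errors="replace") for name in candidates}
for name, text in decoded.items():
    print(f"{name:10s} {text}")
```

Only the `cp1251` row reproduces the beginning of the title as displayed with the Cyrillic font.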

From the Python Unicode HOWTO we learn: … The Unicode character U+FEFF is used as a byte-order mark (BOM). In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files…

>>> BOM = '\uFEFF'
>>> print(ord(BOM))
65279
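
A short sketch of what the quoted passage means in practice: the plain `utf-8` codec keeps the BOM as the first character, while `utf-8-sig` skips it when present and is harmless when it is absent. The sample word is arbitrary.

```python
import codecs

text = "Чекалевский"
with_bom = "\ufeff".encode("utf-8") + text.encode("utf-8")

# In UTF-8 the BOM is the three-byte sequence EF BB BF.
assert with_bom[:3] == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

plain = with_bom.decode("utf-8")          # BOM stays as the first character
stripped = with_bom.decode("utf-8-sig")   # BOM is skipped
no_bom = text.encode("utf-8").decode("utf-8-sig")  # no BOM present: unchanged

print(len(plain), len(stripped), stripped == text, no_bom == text)
```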

Now it is easy to get the Unicode version of the titles:

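# Read the titles as Windows-1251 and write them as UTF-8 with a leading BOM.
# errors="replace" turns the one byte CP1251 leaves undefined (0x98) into U+FFFD;
# with "surrogateescape" such a byte would make the strict UTF-8 write fail.
BOM = '\uFEFF'
with open("titles.csv", "r", encoding="Windows-1251", errors="replace") as inp, \
     open("titles.utf8", "w", encoding="UTF-8") as out:
    out.write(BOM)
    for L in inp:
        out.write(L)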

Fonts

Cyrillic and LaTeX

Last night Dacar asked me over the phone how to include Cyrillic text in LaTeX.

17/08/2013 18:08

Vladimir Batagelj wrote:

Here are the notes from the Wednesday seminar, where I presented Xe(La)TeX.

The \fbox{ } around the declarations at the start of the example is (probably) superfluous. You can type Cyrillic using “Character Map” (Accessories / System Tools), which is rather awkward. Another option is a dedicated Unicode editor, for example BabelPad. For testing I have prepared a short file, which I attach.

\documentclass[a4paper]{article}
\usepackage{fontspec}
\usepackage{xunicode}
\usepackage{xltxtra}

\setmainfont{Arial Unicode MS}
\begin{document}
Ljubljana

Љубљана
\end{document}

You have to process it with the XeLaTeX program; it should be in the /bin subdirectory of your TeX installation. Two more addresses: http://www.xelatex.org/ and http://www.texts.io/support/0002/ .
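
If neither Character Map nor a Unicode editor is at hand, the few Cyrillic characters needed can also be produced from their Unicode names. The following is only a sketch of that idea, using Python's standard `unicodedata` module; it rebuilds the word from the test file above.

```python
import unicodedata

# Unicode character names for the letters of the sample word.
names = [
    "CYRILLIC CAPITAL LETTER LJE",
    "CYRILLIC SMALL LETTER U",
    "CYRILLIC SMALL LETTER BE",
    "CYRILLIC SMALL LETTER LJE",
    "CYRILLIC SMALL LETTER A",
    "CYRILLIC SMALL LETTER EN",
    "CYRILLIC SMALL LETTER A",
]

word = "".join(unicodedata.lookup(name) for name in names)
code_points = [f"U+{ord(c):04X}" for c in word]
print(word, code_points)
```

The resulting string can be pasted into a UTF-8 encoded .tex file.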

Wed, August 21, 2013 12:37

Subject:   Re: Cyrillic in LaTeX 
From:   "France Dacar" <France.Dacar@ijs.si> 
Date:   Wed, August 21, 2013 12:37 
To:   vladimir.batagelj@fmf.uni-lj.si 

Oi Vlado oi:

All these solutions are in the style of “A je to!”: if you want an additional sign in Russian Cyrillic on your mailbox, you first have to tear down the house, put up a brand-new mailbox on a granite candelabrum, and then rebuild the house around the mailbox. I only want to write two or three references in Russian (four or five lines out of two thousand); I do not intend to write whole articles or monographs in Russian… Surely something like this exists somewhere for exactly this purpose:

\usepackage{porussky}

...

\begin{thebibliography}{9999}
\bibitem{GantTM}
\begin{porussky}
Feliks Ruvimovi\ch\ Gantmaher, \textit{Teori\ya\ Matric}, ...
\end{porussky}
\end{thebibliography}

If I understand correctly, the AMS once had something similar for exactly this purpose: writing a few references to Russian sources (there was even a choice between Cyrillic and a transliteration into the Latin alphabet). But then they rolled up their sleeves and put together total support for writing in Cyrillic in LaTeX… Yuck.

– France

Wed, August 21, 2013 18:49

Subject:   Re: Cyrillic in LaTeX 
From:   "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> 
Date:   Wed, August 21, 2013 18:49 
To:   "France Dacar" <France.Dacar@ijs.si> 

I googled a bit more. It looks like it also works with pdfLaTeX. A sample is in the attached files.

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}
\begin{document}
This text is in Russian: проверка.
\end{document}

You will have to get used to Unicode, though. You will manage to type those few Cyrillic characters somehow.

The AMS solution no longer seems to be supported; it is supposed to have been replaced by the babel package, which works for Greek (on my machine) but not for Russian. See the two cirilica3 files.

\documentclass[a4paper,12pt]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1,T2A]{fontenc}
\usepackage[russian,greek,english]{babel}

\begin{document}

The last language listed will be the active (or default) one.
The others can be chosen for large blocks:

\selectlanguage{russian}

Горбачёв

\selectlanguage{greek}

Ellhnik`o ke`imeno.

\selectlanguage{english}

You can also insert short pieces of text in arbitrary languages,
even within paragraphs of a different language:

The capital of Russia is \foreignlanguage{russian}{Moskva}
and the capital of Greece is \foreignlanguage{greek}{Ajhna.}

\end{document}

Both PDFs were produced with pdfLaTeX.

Regards, Vlado

Thu, August 22, 2013 12:21

Subject:   Re: Cyrillic in LaTeX 
From:   "France Dacar" <France.Dacar@ijs.si> 
Date:   Thu, August 22, 2013 12:21 
To:   vladimir.batagelj@fmf.uni-lj.si 

Oi Vlado oi:

\foreignlanguage{russian}{…} is what I need: see test1.tex and test1.pdf (the OT2 option in \usepackage[T1,OT2]{fontenc} is important). There is a table in pmcyr.pdf, on page 4. Simple as pie. No need to conjure with Unicode or UTF-8. (I use the MiKTeX LaTeX environment.)

\documentclass{article}
\usepackage[T1,OT2]{fontenc}
\usepackage[russian,english]{babel}
\begin{document}
This text is in Russian: \foreignlanguage{russian}{proverka}.

The capital of Russia is \foreignlanguage{russian}{Moskva}.

\foreignlanguage{russian}{Druzhba} means friendship.

\foreignlanguage{russian}{Feliks Ruvimovich Gantmaher, \textit{Teoriya Matric}}.
\end{document}

Why was this so hard to dig up? This was already my third or fourth attempt. How much wading up and down the internet I have done…

The babel manual (babel.pdf, 450 pages(!)) says not a word about this, namely how \foreignlanguage{russian} works. Is there a humanly written User Manual/Tutorial for babel anywhere? Whatever babel.pdf is, a user-friendly user manual it certainly is not.

– France

Thu, August 22, 2013 17:16

Subject:   Another table for Cyrillic in OT2 
From:   "France Dacar" <France.Dacar@ijs.si> 
Date:   Thu, August 22, 2013 17:16 
To:   "Vladimir Batagelj" <vladimir.batagelj@fmf.uni-lj.si> 

Oi Vlado oi:

Seek and you shall find (if you know what you are looking for). Attached is another table of \commands and ligatures for Cyrillic under OT2.

I also attach an example, test2, showing how to get by without babel.

\documentclass[12pt]{article}

\usepackage[OT2,T1]{fontenc}
\newcommand{\cyr}{\fontencoding{OT2}\fontfamily{wncyr}\selectfont}
\setlength{\parindent}{0pt}

\begin{document}
This text is in Russian: {\cyr proverka}.

The capital of Russia is {\cyr Moskva}.

{\cyr Druzhba} means friendship.

{\cyr Feliks Ruvimovich Gantmaher,
        \textit{Teo\-riya~Ma\-tric} \textit{Teo\-riya~Ma\-tric} 
        \textit{Teo\-riya~Ma\-tric}}.

Compare {\cyr <sovetskii0>} (false ligature \texttt{ts})
and {\cyr <sovet{}skii0>} (with \texttt{t\{\}s}).
\end{document}

Without babel there is no automatic hyphenation, but since I need all this for only a few references, I can mark the hyphenation points myself if and when necessary.

– France
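
When a reference is already available in Cyrillic, the Latin input expected by OT2 could in principle be generated mechanically. The Python sketch below is only an illustration under stated assumptions: the table `OT2` is partial (the digraphs are the ones that occur in the examples of this thread, plus the obvious one-to-one letters; the tables in the attachments have more entries), and the names `OT2`, `LIGATURES`, `needs_break` and `to_ot2` are hypothetical. It inserts `{}` wherever two adjacent letters would otherwise be fused into a false ligature, as in the `t{}s` case above.

```python
# Partial Cyrillic -> OT2 input table (assumption: not the complete OT2 table).
OT2 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i0", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "ch",
    "я": "ya",
}
# Input sequences that OT2 fuses into one letter (only those seen in the thread).
LIGATURES = ("zh", "ch", "ts", "ya", "i0")


def needs_break(prev: str, nxt: str) -> bool:
    """True if the end of prev plus the start of nxt would form a ligature."""
    for lig in LIGATURES:
        for k in range(1, len(lig)):
            if prev.endswith(lig[:k]) and nxt.startswith(lig[k:]):
                return True
    return False


def to_ot2(text: str) -> str:
    """Transliterate Cyrillic text into OT2 input, guarding false ligatures."""
    out = []
    prev = ""
    for ch in text:
        low = ch.lower()
        if low in OT2:
            tok = OT2[low].capitalize() if ch != low else OT2[low]
        elif ch.isalpha():
            raise ValueError(f"no mapping for {ch!r} in this partial table")
        else:
            tok = ch                      # spaces, punctuation, digits
        if prev and needs_break(prev.lower(), tok.lower()):
            out.append("{}")              # empty group blocks the ligature
        out.append(tok)
        prev = tok
    return "".join(out)


for word in ["проверка", "Москва", "Дружба", "советский"]:
    print(word, "->", to_ot2(word))
```

Letters missing from the table raise a `ValueError` instead of being guessed, so the table has to be extended from the OT2 documentation before the function is used on real references.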

notes/net/cyr.txt · Last modified: 2015/07/16 22:11 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported