
Subject:   matrix description 
From:   "Maria Safonova" <> 
Date:   Sat, August 10, 2013 10:59 

… The matrix represents a 2-mode network.

The separator is a semicolon (;).

The 8304 vertices of the first mode (columns, labeled v0001 - v8304) are citing articles in Russian journals on ethnology and sociology.

The 89636 vertices of the second mode (rows, labeled with numbers from 1 to 89636) are cited papers. Because the titles are long, complete bibliographical descriptions, my colleagues replaced them with numbers. In case you wish to have a look at them, I attach the paper titles in a separate file. …

The task was to convert the matrix in CSV format into the corresponding two-mode network in Pajek's format, and to convert the list of titles, which also contains text in Cyrillic, into a Unicode file.


# CSV2Pajek
# by Vladimir Batagelj, August 10, 2013 

cat("*** CSV2Pajek",date(),"\n")
# first pass: count the lines to get the number of rows
inp <- file("2_mode_cite_id.csv","r")
n1 <- 0; repeat{L <- readLines(inp,n=1); if(length(L)==0) break; n1 <- n1+1}
close(inp)
n1 <- n1-1  # subtract header line
cat("n1 = ",n1,"\n")
# second pass: write one link for every positive matrix entry
inp <- file("2_mode_cite_id.csv","r")
net <- file("","w")
L <- readLines(inp,n=1)   # header line with the column labels
S <- unlist(strsplit(L,";"))
n2 <- length(S)-1   # subtract unit number field
cat("n2 = ",n2,"\n")
n3 <- n1-1   # field i of a row is column vertex n1+(i-1) = n3+i
cat("% *** CSV2Pajek",date(),"\n",file=net)
repeat{
  L <- readLines(inp,n=1)
  if(length(L)==0) break
  S <- as.integer(unlist(strsplit(L,";")))
  u <- S[1]   # row number taken from the first field
  for(i in 2:length(S)){if(S[i]>0) cat(u,n3+i,"\n",file=net)}
}
close(inp); close(net)
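
For comparison, here is a minimal Python sketch of the same two-pass idea. It is not the author's program: the function name `csv_to_pajek`, the toy file names, and the `*vertices` / `*arcs` header lines with quoted labels are assumptions about what a complete Pajek file would contain, and rows are numbered by position rather than by their first field.

```python
import os
import tempfile

def csv_to_pajek(csv_path, net_path, sep=";"):
    """Convert a 0/1 matrix in CSV form to a two-mode Pajek file (sketch)."""
    # Pass 1: read the column labels and collect the row labels.
    with open(csv_path, "r", encoding="ascii") as inp:
        col_labels = inp.readline().rstrip("\n").split(sep)[1:]
        row_labels = [line.split(sep, 1)[0] for line in inp if line.strip()]
    n1, n2 = len(row_labels), len(col_labels)

    # Pass 2: write the vertex list, then one link per positive cell.
    with open(csv_path, "r", encoding="ascii") as inp, \
         open(net_path, "w", encoding="ascii") as net:
        inp.readline()  # skip the header line
        # Assumed header: total number of vertices, then size of the row mode.
        net.write(f"*vertices {n1 + n2} {n1}\n")
        for k, label in enumerate(row_labels + col_labels, start=1):
            net.write(f'{k} "{label}"\n')
        net.write("*arcs\n")  # assumption: links are written as arcs
        data_lines = (line for line in inp if line.strip())
        for u, line in enumerate(data_lines, start=1):
            cells = line.rstrip("\n").split(sep)[1:]
            for j, cell in enumerate(cells, start=1):
                if int(cell) > 0:
                    # column j becomes vertex n1 + j, as in the R code
                    net.write(f"{u} {n1 + j}\n")

# Tiny made-up matrix: 2 rows (cited papers) by 3 columns (citing articles).
work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, "toy.csv")
net_path = os.path.join(work_dir, "toy.net")
with open(csv_path, "w", encoding="ascii") as f:
    f.write("id;v0001;v0002;v0003\n1;0;1;0\n2;1;0;1\n")
csv_to_pajek(csv_path, net_path)
with open(net_path, "r", encoding="ascii") as f:
    pajek_text = f.read()
print(pajek_text)
```

Only the row labels are kept in memory; the matrix itself is streamed line by line, which matters for a file with 89636 rows and 8304 columns.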

Titles into Unicode

> tit <- file("titles.csv","r")
> L <- readLines(tit,n=10)
> close(tit)
> S <- L[8]
> S
[1] "7;[×lerëlânecé D.] Ðrnnócälícl î nâîáîäíuo oóäîclnnâro n îdcnrícle ...
> Encoding(S)
[1] "unknown"
> N <- charToRaw(S)
> N
  [1] 37 3b 5b d7 e5 ea e0 eb e5 e2 f1 ea e8 e9 20 cf 2e 5d 20 d0 e0 f1 f1 f3 e6
> I <- as.integer(N)
> I
  [1]  55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93

I assumed that the text uses some 8-bit encoding. I downloaded some Cyrillic fonts, and after changing the font to CourierCTT (from the Windows Cyrillic fonts) the text was displayed in Cyrillic characters:

"7;[Чекалевский П.] Рассуждение о свободных художествах с описанием некоторых 


[1] "   7   ;   [   ×   l   e   r   ë   l   â   n   e   c   é       D   .   ] 
        7   ;   [   Ч   е   к   а   л   е   в   с   к   и   й       П   .   ] 
[1]    37  3b  5b  d7  e5  ea  e0  eb  e5  e2  f1  ea  e8  e9  20  cf  2e  5d 
[1]    55  59  91 215 229 234 224 235 229 226 241 234 232 233  32 207  46  93

Using the table in Czyborra's The Cyrillic Charset Soup, I confirmed that the encoding is CP1251, also known as Windows-1251.
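
The same check can be sketched in Python by decoding the first bytes of the title (taken from the hex dump above) with several common 8-bit Cyrillic codecs. The list of candidate codecs is my own choice for illustration; only the CP1251 result matches the text shown in the Cyrillic font.

```python
# First 18 bytes of the title line, copied from the hex dump above.
raw = bytes.fromhex("373b5bd7e5eae0ebe5e2f1eae8e920cf2e5d")

# Candidate 8-bit Cyrillic encodings (an illustrative selection).
candidates = ["cp1251", "koi8_r", "iso8859_5", "cp866"]

decoded = {}
for name in candidates:
    decoded[name] = raw.decode(name)
    print(f"{name:10s} {decoded[name]}")

cp1251_text = decoded["cp1251"]
```

Only one of the candidates yields a readable Russian name, which is the same conclusion as the one reached with the font and the table.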

From the Python Unicode HOWTO we learn: … The Unicode character U+FEFF is used as a byte-order mark (BOM). In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files…

>>> BOM = '\uFEFF'
>>> print(ord(BOM))
65279
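
A small sketch of what the quoted passage means in practice: the BOM is three bytes in UTF-8, the plain `utf-8` codec keeps it as the first character, and `utf-8-sig` drops it. The sample word is taken from the LaTeX examples further down; the variable names are only illustrative.

```python
BOM = "\ufeff"

# In UTF-8 the BOM character becomes the three bytes EF BB BF.
bom_bytes = BOM.encode("utf-8")

# A UTF-8 byte string that starts with a BOM.
data = bom_bytes + "проверка".encode("utf-8")

plain = data.decode("utf-8")         # the BOM survives as the first character
stripped = data.decode("utf-8-sig")  # the BOM is skipped

print(bom_bytes.hex())
print(len(plain), len(stripped))
```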

Now it is easy to get the Unicode version of the titles:

# re-encode titles.csv from Windows-1251 to UTF-8, with a BOM at the start
BOM = '\uFEFF'
with open("titles.csv", "r", encoding="Windows-1251", errors="surrogateescape") as inp, \
     open("titles.utf8", "w", encoding="UTF-8") as out:
    out.write(BOM)   # announce that the file is UTF-8
    for L in inp:
        out.write(L)
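
One caveat worth sketching: byte 0x98 is unassigned in Windows-1251, so with `errors="surrogateescape"` it is read as a lone surrogate, which a strict UTF-8 writer then refuses. The in-memory example below uses made-up input and hypothetical helper names; it shows that failure and one possible alternative (`errors="replace"` together with the `utf-8-sig` codec, which writes the BOM itself). Whether replacing bad bytes is acceptable depends on the data.

```python
# Made-up input: one real title fragment plus a line with the unassigned byte 0x98.
sample = "7;[Чекалевский П.]\n".encode("cp1251") + b"\x98\n"

def reencode_surrogateescape(raw):
    """Decode as in the script above, then encode strictly to UTF-8."""
    text = raw.decode("cp1251", errors="surrogateescape")
    return text.encode("utf-8")  # raises if text contains lone surrogates

def reencode_with_replacement(raw):
    """Alternative sketch: map undecodable bytes to U+FFFD, write BOM via utf-8-sig."""
    text = raw.decode("cp1251", errors="replace")
    return text.encode("utf-8-sig")

try:
    reencode_surrogateescape(sample)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

converted = reencode_with_replacement(sample)
print(strict_failed)
print(converted[:3].hex())
```

If the titles file contains no undefined bytes, both variants produce the same text.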


Cyrillic and LaTeX

Last night Dacar asked me over the phone how to include Cyrillic text in LaTeX.

17/08/2013 18:08

Vladimir Batagelj wrote:

Here are the notes from the Wednesday seminar where I presented Xe(La)TeX.

The \fbox{ } around the declarations at the beginning of the example is (probably) superfluous. You can type Cyrillic using “Character map” (Accessories / System tools), which is rather awkward. Another option is a dedicated Unicode editor, for example BabelPad. For testing I have prepared a short file, which I attach.


\setmainfont{Arial Unicode MS}


You have to process it with XeLaTeX; it should be in the /bin subdirectory of TeX. Two more addresses: and .

Wed, August 21, 2013 12:37

Subject:   Re: Cyrillic in LaTeX 
From:   "France Dacar" <> 
Date:   Wed, August 21, 2013 12:37 

Oi Vlado oi:

All these solutions are in the style of “A je to!”: if you want an additional label in Russian Cyrillic on your mailbox, you first have to tear down the house, put up a brand-new mailbox on a granite pedestal, and then rebuild the house around the mailbox. I only want to write two or three references in Russian (four or five lines out of two thousand); I do not intend to write whole articles or monographs in Russian… Surely something like this exists somewhere for exactly this purpose:



Feliks Ruvimovi\ch\ Gantmaher, \textit{Teori\ya\ Matric}, ...

If I understand correctly, AMS once had something similar for exactly this purpose: writing a couple of references to Russian sources (there was even a choice between Cyrillic and a transliteration into Latin script). But then they rolled up their sleeves and put together total support for writing in Cyrillic in LaTeX… Yuck.

– France

Wed, August 21, 2013 18:49

Subject:   Re: Cyrillic in LaTeX 
From:   "Vladimir Batagelj" <> 
Date:   Wed, August 21, 2013 18:49 
To:   "France Dacar" <> 

I googled a bit more. It looks like it also works with PdfLaTeX. A sample is in the attached files.

This text is in Russian: проверка.

You will have to get used to Unicode, though. You will manage to conjure up those few Cyrillic characters somehow.

The AMS solution no longer seems to be supported; it is supposed to have been replaced by the babel package, which works for Greek (on my machine) but not for Russian. See the two cirilica3 files.




The last language listed will be the active (or default) one.
The others can be chosen for large blocks:




Ellhnik`o ke`imeno.


You can also insert short pieces of text in arbitrary languages,
even within paragraphs of a different language:

The capital of Russia is \foreignlanguage{russian}{Moskva}
and the capital of Greece is \foreignlanguage{greek}{Ajhna.}


Both PDFs were produced with PdfLaTeX.

Regards, Vlado

Thu, August 22, 2013 12:21

Subject:   Re: Cyrillic in LaTeX 
From:   "France Dacar" <> 
Date:   Thu, August 22, 2013 12:21 

Oi Vlado oi:

\foreignlanguage{russian}{…} is what I need: see test1.tex and test1.pdf (the important part is the OT2 option in \usepackage[T1,OT2]{fontenc}). There is a table in pmcyr.pdf, on page 4. Simple as beans. No need for any magic with Unicode or UTF-8. (I use the MiKTeX LaTeX distribution.)

This text is in Russian: \foreignlanguage{russian}{proverka}.

The capital of Russia is \foreignlanguage{russian}{Moskva}.

\foreignlanguage{russian}{Druzhba} means friendship.

\foreignlanguage{russian}{Feliks Ruvimovich Gantmaher, \textit{Teoriya Matric}}.

Why was this so hard to dig up? This was already my third or fourth attempt. How much wading up and down the internet I have done…

The babel manual (babel.pdf, 450 pages(!)) says not a word about this, namely how \foreignlanguage{russian} works. Is there a humanely written User Manual/Tutorial for babel anywhere? Whatever babel.pdf is, it is certainly not a user-friendly user manual.

– France

Thu, August 22, 2013 17:16

Subject:   Another table for Cyrillic in OT2 
From:   "France Dacar" <> 
Date:   Thu, August 22, 2013 17:16 
To:   "Vladimir Batagelj" <> 

Oi Vlado oi:

He who seeks shall find (if he knows what he is looking for). Attached is another table of \commands and ligatures for Cyrillic under OT2.

I also attach an example, test2, showing how to get by without babel.



This text is in Russian: {\cyr proverka}.

The capital of Russia is {\cyr Moskva}.

{\cyr Druzhba} means friendship.

{\cyr Feliks Ruvimovich Gantmaher,
        \textit{Teo\-riya~Ma\-tric}}.

Compare {\cyr <sovetskii0>} (false ligature \texttt{ts})
and {\cyr <sovet{}skii0>} (with \texttt{t\{\}s}).
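
For a handful of references, the Latin input can also be produced mechanically. The sketch below is a hypothetical helper (`cyr_to_ot2` is my own name) that covers only lowercase letters whose codes appear in the examples above, plus a few single-letter ones that I assume follow the same pattern, and capitals with a one-letter code. It inserts `{}` wherever two neighbouring letters would otherwise form a ligature, as in the `t{}s` case. The authoritative mapping is the table in pmcyr.pdf, so treat this as an illustration, not a complete transliterator.

```python
# Partial OT2 input codes (lowercase only); an assumed subset, see pmcyr.pdf.
OT2_CODES = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i0", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "ch",
    "я": "ya",
}
# Two-letter inputs that the font turns into a single Cyrillic letter.
LIGATURES = {"ts", "zh", "ch", "sh", "ya"}

def cyr_to_ot2(text):
    """Transliterate Cyrillic text to OT2 Latin input (partial sketch)."""
    out = []
    prev = ""  # code of the previous transliterated letter, if any
    for ch in text:
        code = OT2_CODES.get(ch)
        # Capitals are handled only when their code is a single letter.
        if code is None and len(OT2_CODES.get(ch.lower(), "")) == 1:
            code = OT2_CODES[ch.lower()].upper()
        if code is None:
            out.append(ch)  # anything else passes through unchanged
            prev = ""
            continue
        # Break a false ligature across the letter boundary with {}.
        if prev and (prev[-1] + code[0]).lower() in LIGATURES:
            out.append("{}")
        out.append(code)
        prev = code
    return "".join(out)

for word in ["проверка", "Москва", "Дружба", "советский"]:
    print(word, "->", cyr_to_ot2(word))
```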

Without babel there is no automatic hyphenation, but since I need all this only for a few references, I can mark the hyphenation points myself if and when needed.

– France

