© Springer Science+Business Media New York 2014
Reda Alhajj and Jon Rokne Encyclopedia of Social Network Analysis and Mining 10.1007/978-1-4614-6170-8_298

Network Data File Formats

Jernej Bodlaj
(1)
Abelium, d.o.o., Ljubljana, Slovenia
 

Synonyms

File representation of network data; Network file; Network file format; Network representation and file

Glossary

API
Application Programming Interface
GraphML
Graph Markup Language
JSON
JavaScript Object Notation
NA
Network Analysis
NFF
Network File Format
URL
Uniform Resource Locator
XML
Extensible Markup Language

Definition

Network

A network is a mathematical structure composed of an underlying graph and data assigned to its nodes or links.
Formally, a network is given by \({\mathcal N} = ({\mathcal V}, {\mathcal L}, {\mathcal P}, {\mathcal W})\), where \({\mathcal G} = ({\mathcal V}, {\mathcal L})\) is a graph, \({\mathcal V}\) is a set of nodes, and \({\mathcal L} = {\mathcal A} \cup {\mathcal E}\) is a set of links composed of directed links or arcs \({\mathcal A}\) and undirected links or edges \({\mathcal E}\). Their sizes are usually denoted by \(n = |{\mathcal V}|\) and \(m = |{\mathcal L}|\). \({\mathcal P}\) is a set of node value functions or attributes \(p : {\mathcal V} \to A\), and \({\mathcal W}\) is a set of link value functions or weights \(\omega : {\mathcal L} \to B\). Attributes of nodes \({\mathcal P}\) and attributes of links \({\mathcal W}\) can be measured in different (numerical, ordinal, or nominal) scales.

Network File Format: NFF

In computers, data are stored in files, which are essentially sequences of bits. A file format specifies how information is organized inside the file. A Network File Format (NFF) is a file format designed specifically for storing network-related data.

Introduction

In this essay we focus on how to store network information on a storage device in the form of a file. Many established ways of encoding network-related data exist. They are called Network File Formats (NFFs). NFFs, and file formats in general, reflect various requirements based on their intended use, such as human readability, portability, generality, space requirements, and the speed of encoding, decoding, reading, and writing.
We present a basic division of NFFs and expose some major aspects one should be aware of when dealing with them. Then we compare some of the established NFFs in terms of those aspects and describe some important NFFs in more detail. Next we give general advice on how to convert between NFFs and display the possible conversions between known NFFs. We also show a table of network analysis-related tools and their support of established NFFs.

Network Data File Formats

Types of NFFs

We expose two points of view on NFFs. The first is user oriented in terms of network construction and maintenance, and the second is computer oriented in terms of computer resource requirements. Both intertwine to some extent.
NA usually begins with data acquisition and the construction of networks. To store the network data, we have to select an appropriate format. If the amount of data is small enough, the network can be constructed by hand, either in a text editor or in a spreadsheet program like MS Excel. When the amount of data is larger, but still manually acquired, we can manage it within a database. In other cases, when the data are already available in digital form, we may resort to automated processes: a direct conversion, network collection/construction tools, a custom program, etc. A good example is collecting data from the Internet.
In computer terms, a network representation depends on network size, intended usage, storage location, etc. It also depends on network density, network element attributes, additional data, the desired visual properties of the network, etc. For example, the space consumption and the speed of reading and writing of a given format are important when we work with very large networks. In the end it is a matter of compromise between user-friendliness and the required hardware resources. While tighter, optimized formats are faster to read and require less space and processing, they are harder for the user to read and edit without special tools.
We distinguish three major groups of NFFs. Text formats focus on readability and portability. XML (Extensible Markup Language) formats are a subgroup of text formats that focuses more on portability, adaptability, and scalability. The third group consists of binary formats, which focus on performance, compactness, and optimization. However, they are unreadable by users, and special software is required to handle them. Compressed formats may be considered binary as well, but they are rare. In addition to these, a database may also be considered a format for representing networks. Network data transfer using the clipboard is supported by only a few applications and is meant for transfers of small amounts of the most basic network data. Moreover, many custom formats are defined as APIs to data in various Internet databases, which can be used as a source of network data.

Text-Based NFFs

Text file formats are stored as text that is readable by humans. As long as their structure is known, special software is not needed to read and edit them; a standard text editor suffices. For practical reasons, these NFFs are the most popular. XML-based formats form a subgroup of text-based formats. Besides the advantages of text formats, XML formats are the most flexible in terms of incorporating new features. As they all follow the basic XML syntax, writing programs to handle them is simple. They are often used as an interchange format between various programs. The drawback of text-based formats is that they are more prone to errors introduced by manual editing, they require more processing power to encode and decode, and they take more space.

Binary NFFs

The majority of NA software supports the export and import of at least one text format, and many tools also support some form of binary format, but binary formats are mainly proprietary and tool specific. Binary formats usually aim at smaller files and at storing data without conversion, which results in faster reading and writing. A wasteful XML representation can easily require ten times more space, or worse, than a tightly packed binary representation of the same data. As a compromise, some NA tools can read XML files directly from ZIP-compressed files; at the cost of more processing, space consumption can thus be reduced considerably even for XML NFFs. In general, however, when space becomes an issue, for instance when dealing with huge networks, which by definition can no longer be stored in computer memory as a whole, a binary format provides the only practical way to manipulate them, both in terms of storage, which today could take terabytes of disk space, and in terms of processing, which is usually also much more efficient with binary formats. Binary formats are usually closely related to the data representation in computer memory, so transferring the file content into memory requires little, if any, computationally demanding parsing.
The drawback of binary formats is their lack of human readability. Outside dedicated software tools, we cannot do much with them. They are impractical to work with, and without a complete specification they tend to be quite incomprehensible. They also adapt poorly: incorporating new features into a binary format while maintaining backward compatibility can be extremely challenging. Binary formats mainly serve as software-specific, optimized intermediate solutions. The main input and output data formats are usually text based.
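To make the space comparison in this section concrete, the following minimal sketch stores the same three weighted links once as fixed-size binary records and once as XML text. The record layout (two 32-bit node ids and a 32-bit float weight) and the XML element and attribute names are assumptions chosen for illustration; they do not correspond to any particular NFF.

```python
import struct

# Hypothetical record layout: source id, target id, weight (little-endian, 12 bytes).
RECORD = struct.Struct("<IIf")


def pack_links(links):
    """Store each (source, target, weight) triple as one fixed-size binary record."""
    return b"".join(RECORD.pack(u, v, w) for u, v, w in links)


def unpack_links(data):
    """Read the records back; no text parsing is involved."""
    return [tuple(record) for record in RECORD.iter_unpack(data)]


def links_as_xml(links):
    """Write the same links as verbose XML text (illustrative element names)."""
    rows = [f'<edge source="n{u}" target="n{v}" weight="{w}"/>' for u, v, w in links]
    return "<graph>\n" + "\n".join(rows) + "\n</graph>\n"


links = [(1, 2, 1.5), (2, 3, 2.0), (3, 1, 0.25)]
binary = pack_links(links)
xml_text = links_as_xml(links)
print(len(binary), len(xml_text.encode("utf-8")))
```

The binary form always takes 12 bytes per link in this layout, whereas the size of the XML form depends on the tag names and on how the numbers are written. The actual ratio for a real format will differ.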

Database Network Representations

Some NA applications enable users to import and export data through database connection adapters. This applies especially to corporate environments, where many users work on the same data, accessible from a central database. Huge networks may also be provided in this way. These formats are actually database schemas, which describe the structure of network data stored in a database and managed by server software that gives client applications access through a connection string, for example, Advanced Visualization and Analysis (Tom Sawyer Perspectives). Although the schemas of data in various databases may differ, we group them together and refer to them simply as the database format.

API to Online Services: Providing Network-Related Data

Some online services like Amazon, YouTube, and Google Data provide clients with an API to their data. The API is an interface which provides programmatic access to the service database. It is usually implemented as a web service whose description is provided in WSDL (Web Services Description Language) or a similar schema. In other words, the web service description specifies the transfer format between the server and a client. The base-level data are usually exchanged in XML or, alternatively, in JSON, but the inner structure depends on the API implementation. Both serve as exchange formats for data serialized into structured objects for transmission over the Internet. In most cases, however, the user can avoid learning the structure of the exchange format, as client-side code can usually be generated automatically from web service descriptions. The user can therefore access data through function calls, without paying attention to the underlying exchange format.

Aspects of NFFs

Graph Data Representation in Files

To represent a network in a file, we have to select a suitable representation of the sets \({\mathcal V}\) and \({\mathcal L}\) from the network definition. The set of nodes \({\mathcal V}\) can either be listed explicitly or be constructed implicitly from the specification of links (a dispersed representation). Links (the set \({\mathcal L}\)) are usually given by a set of node pairs. As a link is defined, its nodes are defined simultaneously on the fly. A problem with implicit node definition is how to specify isolated nodes, as they are not encountered in any link definition. If nodes are represented by successive numbers, for example, only their count needs to be given in advance, and isolated nodes do not need to be defined additionally. (This representation could also be treated as explicit.) An advantage of explicit over implicit definition is that if the same set of nodes is used with various sets of links (to define several networks on the same set of nodes), we do not need to redefine the nodes every time, especially when they have embedded attributes. Attribute embedding is described in section “GraphViz (DOT).”
Nodes are represented by their unique keys (identifiers). Identifiers can be either artificial, such as the successive numbers already mentioned, or native, such as node names or attribute values, as long as they are distinct among all nodes. The standard approach is to define links by a sequence of pairs of their end-node identifiers. This is common in text formats, especially in XML formats, where nodes are represented by XML elements which contain identifiers. If the network contains only undirected links, the order of the link end nodes is insignificant; if it contains only directed links, the order is significant. If the network contains both directed and undirected links, links can be defined in two separate sections of a file (one for undirected and one for directed links). Alternatively, they can be defined in a single section, but then they have to be tagged with attributes which specify their direction. The first approach is common in plain-text formats and less so in XML formats, where the second is more frequent. As with the direction attribute, links can have weight attributes assigned as well.
Links can also be defined in two more compact forms. They can be specified by sequences of identifiers of nodes which lie on a path in the network. Alternatively, they can be specified by star structures, each defined by the identifier of a common central node followed by a list of identifiers of the nodes adjacent to it. The first approach is useful for networks which contain many long disjoint paths, and the second for denser networks which contain many stars, that is, nodes with many adjacent nodes. In the worst case, both approaches degenerate into the standard form. They are rarer and are hardly ever found in XML formats, as space consumption is usually not an issue there. With all three approaches, nodes can be defined implicitly: every time a link is added, the corresponding end nodes are added to the set of nodes, if not already present.
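The path and star forms, together with implicit node definition, can be sketched in a few lines of Python. The function names and the in-memory layout (lists of identifiers) are illustrative assumptions, not part of any specific format.

```python
def links_from_paths(paths):
    """Expand each path [v0, v1, ..., vk] into links (v0, v1), ..., (v(k-1), vk)."""
    return [(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)]


def links_from_stars(stars):
    """Expand each star (centre, [n1, n2, ...]) into links (centre, n1), (centre, n2), ..."""
    return [(centre, nb) for centre, neighbours in stars for nb in neighbours]


def implicit_nodes(links):
    """Collect nodes on the fly, in order of first appearance in the links."""
    nodes, seen = [], set()
    for u, v in links:
        for x in (u, v):
            if x not in seen:
                seen.add(x)
                nodes.append(x)
    return nodes


path_links = links_from_paths([["a", "b", "c", "d"]])
star_links = links_from_stars([("a", ["e", "f"])])
nodes = implicit_nodes(path_links + star_links)
```

A path of k + 1 identifiers yields k links, and a star with k neighbours yields k links, so both forms store roughly half as many identifiers as the pair list when paths are long or stars are large. An isolated node never appears in `nodes`, which is the limitation of implicit definition noted above.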
Another representation of a network is by its adjacency matrix. It is suitable for relatively small or very dense networks. For dense networks, a matrix representation is the most space efficient. It is essentially a list of n lists of n values. The value in the j-th position of the i-th list represents the weight of the link (arc) (i, j). When the network consists of edges only, a triangular matrix representation can be used, as the matrix is symmetric. Some NFFs also allow the matrix diagonal to be omitted. Most text and binary NFFs support the matrix representation.
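As a minimal sketch of the triangular representation, assuming a symmetric matrix stored as a list of lists and 0-based indices, the lower triangle can be extracted and expanded again as follows. The function names are hypothetical.

```python
def to_lower_triangle(matrix, keep_diagonal=True):
    """Keep row i up to column i, or up to column i - 1 when the diagonal is omitted."""
    extra = 1 if keep_diagonal else 0
    return [row[: i + extra] for i, row in enumerate(matrix)]


def from_lower_triangle(rows, diagonal_value=0):
    """Rebuild the full symmetric matrix; missing diagonal cells get diagonal_value."""
    n = len(rows)
    full = [[diagonal_value] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, weight in enumerate(row):
            full[i][j] = weight
            full[j][i] = weight
    return full


full = [[0, 2, 0],
        [2, 0, 5],
        [0, 5, 0]]
triangle = to_lower_triangle(full, keep_diagonal=False)
restored = from_lower_triangle(triangle)
```

Without the diagonal, the triangle holds n(n − 1)/2 values instead of n², here 3 instead of 9.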
A special evolutionary network representation uses a stream of structure-changing events, such as adding, deleting, or changing attributes of network elements. This approach is suitable when we want to follow network dynamics. A typical example is the Graph Stream format (The DGS File Format Specification). A stream that contains only grouped additions of nodes and grouped additions of edges, where the order of elements in each group is unimportant, results in two lists: a list of nodes and a list of edges. This is a common way of specifying network elements in most text formats.

Attributes of Network Nodes and Links

The information about attributes of network’s elements either can be attached to the corresponding elements or provided in separate tables.

Storing Clustering, Partitions, Permutations, and Hierarchies

A nonempty subset \(C \subseteq {\mathcal V}\) is called a cluster (group, class). A nonempty set of clusters \(\mathbf{C} = \{C_i\}\) forms a clustering.
A clustering \(\mathbf{C} = \{C_i\}\) is a partition if and only if \(\bigcup \mathbf{C} = \bigcup_i C_i = {\mathcal V}\) and for \(i \neq j\) the corresponding clusters are disjoint, \(C_i \cap C_j = \emptyset\).
A clustering \(\mathbf{C} = \{C_i\}\) is a permutation if and only if \(\mathbf{C}\) is a partition and \(|C_i| = 1\) for every \(i\).
A clustering \(\mathbf{C} = \{C_i\}\) is a hierarchy if and only if \(C_i \cap C_j \in \{\emptyset, C_i, C_j\}\) for all \(i, j\).
A hierarchy \(\mathbf{C} = \{C_i\}\) is complete if and only if \(\bigcup \mathbf{C} = {\mathcal V}\) and is basic if for all \(\upsilon \in \bigcup \mathbf{C}\) also \(\{\upsilon\} \in \mathbf{C}\).
Clusterings, partitions, permutations, and hierarchies can all be stored in the form of an integer vector. While the values in a vector describing a general clustering may not have any specific meaning, the values in a partition vector identify clusters: \(p[\upsilon] = i\) means that vertex \(\upsilon\) belongs to cluster \(C_i\). In a permutation, \(p[\upsilon] = i\) means that vertex \(\upsilon\) is in the \(i\)-th position. In a hierarchy, each element points to its parent; the root is usually marked with a negative value, with 0, or with some other predefined value.
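The integer-vector encodings can be illustrated with a short Python sketch. It assumes 0-based node and cluster numbers and uses −1 as the root marker; actual formats may number from 1 or mark the root differently, and the function names are hypothetical.

```python
def clusters_from_partition(p):
    """p[v] = i means that node v belongs to cluster C_i; returns {i: [nodes]}."""
    clusters = {}
    for v, i in enumerate(p):
        clusters.setdefault(i, []).append(v)
    return clusters


def is_permutation(p):
    """A permutation vector uses every position 0..n-1 exactly once."""
    return sorted(p) == list(range(len(p)))


def path_to_root(parent, v, root_marker=-1):
    """Follow the parent pointers of a hierarchy vector from v up to the root."""
    path = [v]
    while parent[path[-1]] != root_marker:
        path.append(parent[path[-1]])
    return path


partition = [0, 1, 0, 2, 1]      # nodes 0 and 2 share cluster 0, and so on
hierarchy = [-1, 0, 0, 1, 1]     # node 0 is the root; nodes 3 and 4 hang under node 1
clusters = clusters_from_partition(partition)
ancestors = path_to_root(hierarchy, 3)
```

All three structures occupy exactly one integer per node, which is why they are convenient to store next to the network.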

Support of Temporal Networks

Temporal or dynamic networks (Batagelj 2009) are networks with time as an additional component. They change their structure and property values in time. Using temporal networks we can represent their evolution.
A temporal network \({\mathcal N}_T = ({\mathcal V}, {\mathcal L}, {\mathcal P}, {\mathcal W}, {\mathcal T})\) is obtained if the time \({\mathcal T}\) is attached to an ordinary network. \({\mathcal T}\) is a set of time points. In a temporal network, nodes \(\upsilon \in {\mathcal V}\) and links \(l \in {\mathcal L}\) are not necessarily present or active in all time points. If a link \(l(u, \upsilon)\) is active in time point \(t\), then its end nodes \(u\) and \(\upsilon\) should also be active in time \(t\). The network consisting of the links and nodes active in time \(t \in {\mathcal T}\) is denoted by \({\mathcal N}(t)\) and is called the time slice in time point \(t\). The notion of time slice can be extended to time intervals.
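A time slice can be computed directly from activity intervals. The sketch below assumes that every node and link carries a list of closed intervals (start, end) of integer time points; this storage layout and the function names are illustrative assumptions.

```python
def active(intervals, t):
    """True if time point t lies in one of the closed intervals (start, end)."""
    return any(start <= t <= end for start, end in intervals)


def time_slice(node_times, link_times, t):
    """Return the nodes and links active at t; a link needs both end nodes active."""
    nodes = {v for v, intervals in node_times.items() if active(intervals, t)}
    links = {
        (u, v)
        for (u, v), intervals in link_times.items()
        if active(intervals, t) and u in nodes and v in nodes
    }
    return nodes, links


node_times = {"a": [(1, 5)], "b": [(1, 3)], "c": [(4, 9)]}
link_times = {("a", "b"): [(2, 3)], ("a", "c"): [(4, 5)], ("b", "c"): [(3, 4)]}
slice_3 = time_slice(node_times, link_times, 3)
slice_4 = time_slice(node_times, link_times, 4)
```

The link (b, c) is listed as active in time points 3 and 4, but nodes b and c are never active together, so it appears in neither slice. The extra membership test enforces the consistency condition stated above even when the stored data violate it.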

Support of Two-Mode (Bipartite) Networks

In a two-mode network (Batagelj 2009) \({\mathcal N} = (({\mathcal V}_1, {\mathcal V}_2), {\mathcal L}, {\mathcal P}, {\mathcal W})\), the set of nodes \({\mathcal V} = {\mathcal V}_1 \cup {\mathcal V}_2\) consists of two disjoint sets of nodes \({\mathcal V}_1\) and \({\mathcal V}_2\), and all the links from \({\mathcal L}\) have one end node in \({\mathcal V}_1\) and the other in \({\mathcal V}_2\).
A two-mode network can also be described by a rectangular matrix \(A = [a_{u\upsilon}]_{{\mathcal V}_1 \times {\mathcal V}_2}\): \(a_{u\upsilon} = \begin{cases} w_{u\upsilon} & (u,\upsilon) \in {\mathcal L} \\ 0 & (u,\upsilon) \notin {\mathcal L} \end{cases}\)
where \(w_{u\upsilon}\) is the weight of link \((u, \upsilon)\).
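The rectangular matrix can be built from a weighted link list as in the following sketch. The node names and the function name are made up for illustration.

```python
def two_mode_matrix(rows, cols, weighted_links):
    """Build A with A[u][v] = w for every link (u, v, w) and 0 elsewhere."""
    row_index = {u: i for i, u in enumerate(rows)}
    col_index = {v: j for j, v in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for u, v, w in weighted_links:
        matrix[row_index[u]][col_index[v]] = w
    return matrix


papers = ["p1", "p2", "p3"]          # first mode (hypothetical names)
authors = ["ann", "bob"]             # second mode (hypothetical names)
links = [("p1", "ann", 1), ("p2", "ann", 1), ("p2", "bob", 1), ("p3", "bob", 1)]
matrix = two_mode_matrix(papers, authors, links)
```

Rows correspond to the first mode and columns to the second, so the matrix has 3 × 2 cells rather than the 5 × 5 cells of a one-mode adjacency matrix over all nodes.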
Examples: a network of papers × authors, businessmen × boards of directors, etc. Also the network of formats and tools supporting them in section “An Overview of NFFs and Related NA Software” is a two-mode network.
n-mode networks also exist, but they are rarely used in practice. An n-mode network can be specified using an additional partition of nodes – the mode partition.

Support of Multiple or Multi-relational Networks

Multiple (multiplex) or multi-relational networks (de Nooy et al. 2012) are networks which have only one set of nodes and different sets of links on them, \({\mathcal L} = ({\mathcal L}_1, {\mathcal L}_2, \ldots, {\mathcal L}_r)\). Examples of such networks are the lines of a transportation system in a city (stations, bus lines), semantic networks (words, semantic relations: synonymy, antonymy, hyponymy, meronymy), etc.

Support for Hypergraphs

A hypergraph (Berge 1989) is a generalization of a graph in which an edge can link any number of nodes. Formally, a hypergraph \({\mathcal H}\) is a pair \({\mathcal H} = ({\mathcal V}, {\mathcal E}_h)\), where \({\mathcal V}\) is a set of nodes and \({\mathcal E}_h\) is a set of nonempty subsets of \({\mathcal V}\) called hyperedges or simply edges. Therefore, \({\mathcal E}_h\) is a subset of \({\mathcal P}({\mathcal V}) \setminus \{\emptyset\}\), where \({\mathcal P}({\mathcal V})\) is the power set of \({\mathcal V}\). In some formats, generalized networks based on hypergraphs are supported. Hypergraphs can be represented by their incidence matrix, which is essentially a two-mode network, hyperedges × nodes; we do not consider this as NFF support for hypergraphs.

Support of Comments

An important aspect of text formats is their ability to incorporate comments. Comments can also be embedded in binary formats, but there they are specifically encoded and accessible only in compatible software. Since XML has built-in comment syntax, all XML-based formats support comments, and they may appear almost anywhere in an XML file. Comments are also supported in most plain-text formats, but their location and the way of specifying them vary. The more compact text formats allow them, if at all, only in predefined sections of a file and in a limited form (no line breaks, for instance), while the more extensive ones allow them in a more general form and anywhere.

Multiple Structures Within a Single File or Joined by a Project File

A network is described by its underlying graph structure and various data structures in the form of property tables, partitions, clusters, hierarchies, etc. In some formats each data structure is stored in a separate file. Saving each structure in its own file is fast and allows us to keep many versions of the same structure while losing minimal space. This is convenient while working on the problem, but later we tend to forget what each file is for. The approach requires users to be well organized and to keep separate documentation about their work. Other formats allow the user to put many networks and other complementary data together and store them in a single file. This may be practical when we want to keep our work as a whole, although saving everything into a single file might take more time. XML-based formats are the most appropriate for the task because of their natural tree structure. An additional level (envelope) usually suffices to join the content of many XML files into a single file, without much additional effort to adapt the software to handle joint files. Another, and probably the best, option is a hybrid approach which keeps the advantages of both the single-file approach and multiple disjoint files. A project file, which joins together all the other files in an organized way, is added to the network data files. The user may then edit separate files by hand, for instance, or modify all of them at the same time inside a dedicated analysis environment when working on the problem.

Graphical Layouts and Style

To represent a network in a picture, we usually need additional information and/or a graphical arrangement of the network, the so-called layout (Batagelj 2009). Many formats provide the option to incorporate a single static layout of a network, typically in the form of the arrangement and appearance of network elements, given by coordinates for every node, colors of nodes and links, shapes of nodes, etc. On the other hand, only a few formats enable the user to include multiple layouts, that is, a style in a limited form. For example, in GraphML (The GraphML File Format) one can specify many layouts for the same network. A style defines an arbitrary number of layouts in a parametric or even programmatic fashion. While no established format supports this kind of style definition, most XML formats are flexible enough to be extended to enable it without the need to modify existing software.

Network Composition

Some networks are obtained by combining smaller subnetworks often from a fixed collection of “bricks” or by transformations replacing parts of the network with other parts (inductive classes of graphs). There also exist several decompositions of networks. The information about the composition of a network can be used in more efficient algorithms for some problems. An attempt to describe networks by their composition was made in NetML (Batagelj and Mrvar 1995).

Metadata Support

Metadata are data about the data, which describe the main data content in some way. They help the user to manage files and may also be used to optimize the flow of analysis. Examples are the creator or author of the network data, the time and date of creation, the sources of data, the means of creation, legal rights, etc. Although formats may store metadata in the form of comments, XML formats and some other formats provide genuine support for metadata.

Saving General Network Properties

Some formats enable the user to store general network properties determined by the user or obtained from computations. Examples are the information whether the network is planar or scale-free, the maximum degree of nodes, the number of connected components, etc. The idea is to save the results of time-consuming algorithms that do not take too much space. These data can be used later to speed up other algorithms. They can also be used by search procedures in collections of networks.

Aspects of Some Popular NFFs

In Table 1 we present an overview of basic properties (aspects) of some popular NFFs. The properties are explained here. By multiple structures in a single file, we mean at least the possibility of incorporating more than one network into a single file. By references to other files or URLs, we mean that it is possible to include data from other sources, that is, files on local media or on the Internet. Matrix representation indicates that a matrix can also be used to represent the network. Structure event support indicates another way of specifying networks: incrementally, with network-modifying events. While vectors, partitions, permutations, and hierarchies are autonomous structures, attributes are bound to individual elements of the network. However, vectors, partitions, permutations, and hierarchies usually fit the network. Attributes in time can have time-dependent values. While the category of multi-relational networks points out formats which can have multiple sets of links, support for relations means that only one set of links is present, but a relation attribute can be assigned to each link. A relation attribute is usually defined in the same way as any other link attribute. Subcomponents indicates the possibility to specify subnetworks in the format, using a partition or some dedicated mechanism.
Unicode support is required when one wants to use special characters in stored names. The majority of text formats support UTF-8, which is one of the Unicode encodings. XML files support Unicode and other encodings. Newer versions of binary formats usually support Unicode as well.
Support for namespaces designates the possibility to put each structure into a desired namespace to enable a selective way of combining structures, for example, to specify whether a name denotes the same or different nodes in two given networks. Human friendliness is a “measure” of how hard it is for newcomers to get familiar with the format, how much work they have to put in to manually construct a reasonably complex structure, how hard it is to read the format when manually drawing a network on a sheet of paper based on the data in a file, or, for instance, whether real names can be used right away in network element definitions instead of only abstract ids. The estimated space efficiency shows how efficient a format is in comparison with others in terms of the space taken by the same set of common data. If a format from the “good” category takes one unit of space, a format in the unmarked category takes ten units or more.

Most Widely Used NFFs

GraphML

One of the most popular graph (network) file interchange formats is GraphML (The GraphML File Format). It is a comprehensive and easy-to-use file format. It consists of a language core to describe the structural properties of a graph and a flexible extension mechanism to add application-specific data. Its main features include support of directed, undirected, and mixed graphs, hypergraphs, hierarchical graphs, and graphical representations, references to external data, application-specific attribute data, and lightweight parsers.
Network Data File Formats, Table 1
Various aspects of some popular NFFs
Formats (table columns): TXT, CSV; Graph Stream; GraphViz; Dynet Markup Language; Guess; Graph Exchange; Graph Modelling Language; Graph Markup Language; Graph Exchange XML; Pajek PAJ; SoNIA; Tulip; UCINET DL; Extensible Graph Markup, Modeling
Format properties (table rows): Multiple structures in single file; References to other files or URLs; Matrix representation; Structure event support; Arbitrary vectors; Attributes; Partitions, permutations; Hierarchy representation; Temporal (dynamic) networks; Attributes in time; Two-mode networks; Multi-relational networks; Support for relations; Hypergraph representation; Comment support; Layout or style support; Subcomponents; Support for namespaces; Unicode support; Human friendliness; Space efficiency
Human friendliness (surviving marks in source order, column alignment not recoverable): ✓✓✓, ✓✓, ✓✓, ✓✓, ✓✓✓, ✓✓, ✓✓✓, ✓✓, ✓✓, ✓✓, ✓✓, ✓✓, ✓✓
Space efficiency (surviving marks in source order, column alignment not recoverable): ✓✓, ✓✓, ✓✓, ✓✓, ✓✓✓, ✓✓✓, ✓✓✓, ✓✓✓
Unlike most other NFFs, GraphML is based on XML and therefore ideally suited as a common denominator for all kinds of services generating, archiving, or processing networks.

Pajek NET

Another widely used NFF is the Pajek NET format (Mrvar and Batagelj). It is text based, and since Pajek was developed for the analysis of large networks, it tries to be compact yet clearly readable. The underlying graph can be described as a link list, a list of lists of neighbors, or an adjacency matrix (Stein et al. 2009; Mehlhorn and Sanders 2008). Nodes/links may have various attributes such as label, color, shape/pattern, size/width, position/value, etc. To save space, groups of successively defined nodes or links may have their attributes defined only once. The NET format supports two types of descriptions of temporal networks. The first is based on time intervals of presence, in which the node/link is active in the network, and the second on events (show edge, add node, change node property, etc.). The Pajek NET format also supports multi-relational networks, two-mode networks, p-graph representation of genealogies, and Petri nets. A line preceded by % is a comment.
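As a rough illustration, the sketch below writes a small mixed network as a link list in NET style. It assumes only the most basic subset of the format: a `*Vertices n` section with 1-based vertex numbers and quoted labels, followed by `*Arcs` and `*Edges` sections with one `source target weight` line per link. The helper name is hypothetical, and the Pajek manual remains the authority on the full syntax.

```python
def write_pajek_net(labels, arcs=(), edges=()):
    """Return NET-style text: vertices numbered from 1, then arcs and edges."""
    number = {label: i for i, label in enumerate(labels, start=1)}
    lines = ["% example network (comment line)", f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in number.items()]
    if arcs:
        lines.append("*Arcs")      # directed links
        lines += [f"{number[u]} {number[v]} {w}" for u, v, w in arcs]
    if edges:
        lines.append("*Edges")     # undirected links
        lines += [f"{number[u]} {number[v]} {w}" for u, v, w in edges]
    return "\n".join(lines) + "\n"


net_text = write_pajek_net(
    ["a", "b", "c"],
    arcs=[("a", "b", 1.5)],
    edges=[("b", "c", 2)],
)
print(net_text)
```

Directed and undirected links go into separate sections, which is the two-section approach for mixed networks described earlier in this entry.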

UCINET DL

UCINET DL format (Borgatti et al. 2002) is the most common file format used by UCINET software. It is text based and is defined by three main sub-formats:
  • Full matrix format is the default for the DL file extension and is very similar to the adjacency matrix described in section “Adjacency Matrix (TXT, CSV).”
  • Link (edge) list: for sparse networks it is often more convenient to enter just the pairs of nodes that are linked, as described in section “Link List (TXT, CSV).” In the link list format, each line of data is an ordered pair of nodes, optionally followed by a value indicating the strength of the relationship.
  • Node list: in the node list format, edges are defined by the name or number of the source node at the beginning of each line, followed by the list of names or numbers of the adjacent nodes. This format is more compact than the link list, as the source node in every line is given only once.
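The relation between the node list and the link list can be sketched as a simple expansion. The snippet handles only the data lines (whitespace-separated names) and deliberately ignores the DL header, whose keywords are not covered here; the function name is hypothetical.

```python
def node_list_to_link_list(lines):
    """Expand 'source n1 n2 ...' lines into (source, n1), (source, n2), ... pairs."""
    links = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        source, targets = fields[0], fields[1:]
        links.extend((source, target) for target in targets)
    return links


node_list = ["a b c d", "b c", "e"]
link_list = node_list_to_link_list(node_list)
```

The first line names its source once for three links, whereas the equivalent link list repeats it three times. A line with a source but no neighbours, like the last one, produces no link.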

Adjacency Matrix (TXT, CSV)

A frequently used NFF is the representation of a network by its adjacency matrix (Stein et al. 2009; Mehlhorn and Sanders 2008). Each matrix row is represented as a line of its values separated by spaces, tabs, or some other delimiter (a comma or semicolon in CSV, comma-separated values). With some additional information provided by the user, .TXT or .CSV adjacency matrix files may contain alternating rows and/or columns of link attributes and header rows/columns of node attributes. The main advantage of this format over other formats is its simplicity and its efficiency for dense networks. It is not appropriate for large sparse networks, as it requires too much space.
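A minimal reader for such a file might look as follows. It assumes, as the user-provided "additional information", that the first row and the first column hold node labels and that a zero cell means no link; other layouts would need different handling.

```python
import csv
import io


def read_adjacency_csv(text, delimiter=","):
    """Return (labels, links) from a CSV matrix with a header row and header column."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    labels = [name.strip() for name in rows[0][1:]]
    links = []
    for row in rows[1:]:
        source = row[0].strip()
        for target, cell in zip(labels, row[1:]):
            weight = float(cell)
            if weight != 0:               # zero is assumed to mean "no link"
                links.append((source, target, weight))
    return labels, links


sample = ",a,b,c\na,0,1,0\nb,0,0,2.5\nc,3,0,0\n"
labels, links = read_adjacency_csv(sample)
```

The sample matrix has nine cells but only three links, which illustrates why the format becomes wasteful for large sparse networks.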

Link List (TXT, CSV)

The link list is the simplest NFF and is popular in many NA applications. A network is represented by a list of links, one per row, each described by the pair of its end nodes, given by their indices or names. Graph nodes are often specified implicitly. Link attributes can simply be added at the end of the row. Node attributes can also be defined, but only if nodes are specified explicitly in a separate list. The user has to provide some additional information in this case. The simplicity of the format is its main benefit.
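A reader for a whitespace-separated link list can be sketched as below. The optional third column is taken as the link weight with an assumed default of 1.0, and an optional explicit node list is accepted so that isolated nodes are not lost; both conventions are assumptions made for this example.

```python
def read_link_list(lines, explicit_nodes=None):
    """Parse 'u v [weight]' rows; nodes not listed explicitly are added on the fly."""
    nodes = list(explicit_nodes or [])
    seen = set(nodes)
    links = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # skip blank or malformed rows
        u, v = fields[0], fields[1]
        weight = float(fields[2]) if len(fields) > 2 else 1.0
        for x in (u, v):
            if x not in seen:
                seen.add(x)
                nodes.append(x)
        links.append((u, v, weight))
    return nodes, links


nodes, links = read_link_list(["a b 2", "b c"], explicit_nodes=["d"])
```

Node d takes part in no link, so it survives only because it was listed explicitly.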

Graph Modeling Language (GML)

There are many different programs that work with graphs and networks, but almost all of them use their own file format. As a consequence, exchanging networks between different programs requires conversion. Simple tasks like exchange of data, externally reproducible results, or a common benchmark suite are much harder than necessary. GML was developed to overcome these problems. It supports attaching arbitrary information to graphs, nodes, and edges and is therefore able to emulate almost any other format.
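GML stores everything as nested key–value lists, which is what makes attaching arbitrary information straightforward. The following sketch serializes such a structure, assuming the common `graph [ node [ id … ] edge [ source … target … ] ]` layout; it does no escaping of special characters in strings, and the `note` key is a made-up custom attribute.

```python
def gml_lines(key, value, indent=0):
    """Serialize a key with a scalar value or a nested list of (key, value) pairs."""
    pad = "  " * indent
    if isinstance(value, list):           # nested block: key [ ... ]
        lines = [f"{pad}{key} ["]
        for sub_key, sub_value in value:
            lines += gml_lines(sub_key, sub_value, indent + 1)
        lines.append(f"{pad}]")
        return lines
    if isinstance(value, str):            # strings are quoted
        return [f'{pad}{key} "{value}"']
    return [f"{pad}{key} {value}"]        # numbers are written as they are


graph = [
    ("directed", 1),
    ("node", [("id", 1), ("label", "a")]),
    ("node", [("id", 2), ("label", "b"), ("note", "custom")]),
    ("edge", [("source", 1), ("target", 2), ("weight", 0.5)]),
]
gml_text = "\n".join(gml_lines("graph", graph))
print(gml_text)
```

Because a reader can simply skip keys it does not know, application-specific data such as `note` can travel with the network without breaking other tools.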

GraphViz (DOT)

DOT (Graphviz) is a plain-text graph description language. It is a simple way of describing graphs that both humans and computer programs can use. The main features of the DOT language are undirected and directed graphs, which may have arbitrary node and link attributes. Attributes are listed in square brackets after a node or link statement. For example, the link a -> b can be given the attribute color equal to blue by writing a -> b [color = blue], and node b can be given the same attribute by writing b [color = blue]. Using attributes one can also define a visual layout. Single-line and multiline comments are supported.
A distinct feature of the DOT language is the ability to specify links in a more compact way than with the more common adjacency list (Stein et al. 2009; Mehlhorn and Sanders 2008) approach. For example, the links a -> b and b -> c between nodes a, b, and c can be specified by the single statement a -> b -> c. The nodes are defined at the same time.
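The compact chain notation can be expanded back into individual links with a few lines of Python. This is only a sketch under simplifying assumptions: it handles a single statement made of bare node names joined by -> or --, and it ignores attribute lists, quoted names, subgraphs, and the rest of the DOT grammar. The function name is invented for the example.

```python
import re


def expand_chain(statement):
    """Expand 'a -> b -> c' or 'a -- b -- c' into consecutive node pairs."""
    statement = statement.strip().rstrip(";")
    operators = set(re.findall(r"->|--", statement))
    if len(operators) != 1:
        raise ValueError("expected exactly one kind of link operator")
    nodes = [name.strip() for name in re.split(r"->|--", statement)]
    if any(not name for name in nodes):
        raise ValueError("empty node name in chain")
    # Pair every node with its successor in the chain.
    return list(zip(nodes, nodes[1:]))


links = expand_chain("a -> b -> c;")
print(links)
```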

Conversion Between Formats

Users may find themselves with a network file in a format that is, at least seemingly, not compatible with the program they intend to use. The first step is to consult the program documentation about the supported input formats. If that does not help, they may check the program’s wiki, web page or forum, Google, mailing lists, etc., to see whether an appropriate plug-in or script for the task exists. Many programs have custom import functions and options for specific import plug-ins, and/or are script enabled. If those features are not present, the next step is to make a conversion, either with an appropriate converter program, by hand, or programmatically. A small number of dedicated converter programs exist. They can be found in Fig. 1 on the left, denoted by C. Another way is to use general NA programs: we open the provided file and save or export it in the required format. Sometimes more than one conversion step is needed (see Figs. 2 and 3). The most reliable, but unfortunately the most demanding and potentially most time-consuming, way is to do the conversion with a custom program or by hand.
Network Data File Formats, Fig. 1
A matrix of NFFs and related NA software
Network Data File Formats, Fig. 2
Two-mode network of NFFs vs. network-related software. The color code of nodes is the same as in Fig. 1. Arcs leaving format nodes and entering software nodes represent software input formats, and those leaving software nodes and entering format nodes represent software output formats. The size of a node is proportional to the number of arcs entering or leaving it (a similar diagram was designed earlier by Mark Round)
Network Data File Formats, Fig. 3
A matrix of network formats that can all be converted into each other using the tools listed in this essay. The shade of a cell shows how many tools have to be used to transform the row format into the column format: black, one tool (excluding squares on the matrix diagonal); dark gray, two tools; light gray, three tools; white, four or more tools
The majority of file formats are text based. In many instances text formats are quite similar. In extreme cases, the only difference between two text formats is in the separator characters and/or the “command lines” delimiting different parts (node list, link list) of the data. The conversion between them may then be performed in a text editor. For more complex text formats, including XML formats, one should consider a text editor with support for regular expressions before writing a custom conversion program. XML editors also exist. Before writing a custom program, we should first look for libraries that support our formats and try to combine them into a program that performs the conversion. If they are not available, our last resort is coding, but even there some programming languages are better suited than others. Python, for instance, is a useful language, as it provides fast development, supports string and XML-related routines, and is easy to set up. A disadvantage of Python is that it can be slow and memory consuming when working with very large datasets. For a more efficient conversion, Java, C#, or even C might be considered.
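As an example of such a custom conversion, the following Python sketch turns a comma-separated link list into a Pajek NET file, where the “command lines” *Vertices and *Arcs (or *Edges) delimit the node list and the link list. The function name, the default weight of 1 for rows without a third column, and the sample data are assumptions of this example. Node names containing double quotes are not handled.

```python
import csv
import io


def link_list_to_pajek(csv_text, directed=True):
    """Convert 'source,target[,weight]' rows into Pajek NET text."""
    index = {}   # node name -> 1-based Pajek vertex number
    links = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank rows
        source, target = row[0].strip(), row[1].strip()
        weight = row[2].strip() if len(row) > 2 else "1"  # assumed default
        for name in (source, target):
            index.setdefault(name, len(index) + 1)
        links.append((index[source], index[target], weight))
    out = [f"*Vertices {len(index)}"]
    out += [f'{number} "{name}"' for name, number in index.items()]
    out.append("*Arcs" if directed else "*Edges")
    out += [f"{u} {v} {w}" for u, v, w in links]
    return "\n".join(out) + "\n"


# Hypothetical input with two weighted links.
pajek = link_list_to_pajek("ana,bob,3\nbob,cid,1\n")
print(pajek)
```

The conversion amounts to renumbering the nodes, inserting the section headers, and changing the separator, which is why simple cases can also be handled with a text editor and regular expressions.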

File Formats for Network Output

2D Image Formats

NA tools, especially visually oriented ones, enable the user to export snapshots of the network in the form of an image. Images are often treated as the end product of the analysis and show the resulting network in the most informative way. Images, however, cannot be (easily) translated back into the original network automatically. We distinguish two major groups of image formats. The first is a group of raster image formats, where the image is represented by a matrix of pixels. The second group consists of vector formats, where predefined parameterized graphical elements are used to construct the image, e.g., circle, line, polygon, and spline. While raster formats are good for complex, dispersed, blurred, and photography-like representations, vector formats are useful for sharp, crisp, and more detailed representations. From the user’s point of view, the major difference between raster and vector images is in resolution. It is fixed for raster images and arbitrary for vector images. Common examples of raster formats are Portable Network Graphics, PNG; Joint Photographic Experts Group, JPEG; bitmap, BMP; Tagged Image File Format, TIFF; etc. Examples of vector formats are Scalable Vector Graphics, SVG; Encapsulated PostScript, EPS; PostScript, PS; etc. Some vector formats may also contain raster elements. For example, SVG can link external raster images with vector-based graphics. Formats that allow embedding of raster and vector elements in a single file are the Portable Document Format, PDF; CorelDraw documents, CDR; Adobe Illustrator documents, AI; etc.
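To make the notion of parameterized graphical elements concrete, the Python sketch below writes a small network as an SVG image made of line and circle elements. The node coordinates are assumed to be given (no layout is computed), and the function name, colors, sizes, and sample data are choices made only for this illustration.

```python
def network_to_svg(positions, links, size=200, radius=6):
    """positions: {node: (x, y)}, links: [(source, target)]."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    # Draw links first so that the node circles are painted on top of them.
    for source, target in links:
        (x1, y1), (x2, y2) = positions[source], positions[target]
        parts.append(f'  <line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>')
    for x, y in positions.values():
        parts.append(f'  <circle cx="{x}" cy="{y}" r="{radius}" fill="steelblue"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical triangle with hand-picked coordinates.
svg = network_to_svg(
    {"a": (40, 160), "b": (160, 160), "c": (100, 40)},
    [("a", "b"), ("b", "c"), ("c", "a")],
)
print(svg)
```

The file records only element parameters such as end points, centers, and radii, so the picture can be rendered at any resolution. The node names, however, are already lost in this output, which illustrates why an image cannot easily be translated back into the original network.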

3D Model Formats

Some visually oriented NA tools can export networks as 3D models. Molecules viewed as networks are a good example where export in 3D is convenient. In contrast to 2D network representations, 3D models of networks are almost universally vector based. Well-known 3D formats are 3D version of scalable vector graphics, SVG; Virtual Reality Modeling Language, VRML, and its successor X3D; AutoCAD, DXF; 3D Studio, 3DS; and others. In chemistry and biology, Kinemages, KIN, and MDL Molfile are more popular.

An Overview of NFFs and Related NA Software

Conversion Possibilities Between Formats

We will try to answer the following question: “Is it possible to transform a given format into another one using any of the existing tools?” Figure 3 displays the matrix for a subset of 56 out of the 86 network formats from Fig. 1. These 56 formats are all convertible into one another using tools from Fig. 1. The 30 excluded formats comprise two pairs of mutually convertible proprietary formats, of m Draw and Mind Manager, and the 26 remaining formats that cannot be converted into any other format. From the matrix in Fig. 3, we can see that the majority of file formats can be converted into one another in at most two steps. With few exceptions, all formats can be transformed in three steps. Under the assumption that we have access to all tools, it is relatively easy to convert almost any format into a desired format. Note, however, that some conversions are only partial. Certain formats are more general than others, and conversion to a more specialized format may omit some data. For instance, the conversion from GEDCOM or from the Molecule MOL format into the Pajek NET format is complete (with respect to graph structure), whereas the opposite is not the case (not every network can be represented as a genealogy or as a molecule). This could be the reason why a unified converter, able to perform the majority of conversions, does not exist (Fig. 3).
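The number of conversion steps between two formats can be computed by a breadth-first search over the formats, where one step means running one tool that reads the current format and writes another. The Python sketch below shows the idea. The tool names and their input and output formats are invented placeholders and do not reproduce the data of Fig. 1.

```python
from collections import deque


def conversion_path(tools, source, target):
    """Breadth-first search for a shortest chain of (tool, format) steps.

    tools: {tool name: (set of input formats, set of output formats)}
    Returns a list of steps, [] if source == target, or None if unreachable.
    """
    # Build format -> [(tool, output format)] from every tool's inputs and outputs.
    steps = {}
    for tool, (inputs, outputs) in tools.items():
        for fmt_in in inputs:
            for fmt_out in outputs:
                if fmt_in != fmt_out:
                    steps.setdefault(fmt_in, []).append((tool, fmt_out))
    queue = deque([source])
    came_from = {source: None}   # format -> (previous format, tool used)
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            path = []
            while came_from[fmt] is not None:
                previous, tool = came_from[fmt]
                path.append((tool, fmt))
                fmt = previous
            return path[::-1]
        for tool, fmt_out in steps.get(fmt, []):
            if fmt_out not in came_from:
                came_from[fmt_out] = (fmt, tool)
                queue.append(fmt_out)
    return None


# Hypothetical tools; names and capabilities are made up for the example.
TOOLS = {
    "tool_A": ({"CSV"}, {"NET"}),
    "tool_B": ({"NET", "GML"}, {"GML", "DOT"}),
    "tool_C": ({"DOT"}, {"SVG"}),
}
path = conversion_path(TOOLS, "CSV", "SVG")
print(path)
```

Such a search only answers whether a chain of tools exists and how long it is. It says nothing about whether each step is complete, so a path found this way may still lose data along the way.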

Remarks

In constructing the overview tables for this essay, we tried to be as exact as possible about the characteristics of NFFs, but some evaluations may still be wrong. We intend to keep the tables updated in the online supplement (Essay Supplement).

Future Directions

As formats have many different features, a universal and total conversion is impossible. Despite this fact, a decent conversion tool could be developed. For reasonably small networks, it could even be available as an online service. On the other hand, as computers are becoming faster and disk capacities are growing, the NFF evolution may slowly turn towards the standardization of a universal network representation language, embedded in some common NFF, most probably XML based.
Graphical languages, used to programmatically describe a visual layout of data, have been evolving recently. For example, D3 (Data-Driven Documents) uses a declarative programming approach to enable data-driven visualization. Graphical languages can be used for the style definition of network-related data as well. For now, no NFF allows storing style information in this way, and a standardized approach is even further away. In addition, many network-related graphical elements are still to be invented.

Acknowledgments

This work has been partially financed by Slovenian Research Agency (ARRS) within the EUROCORES Programme EUROGIGA (project GReGAS) of the European Science Foundation (ARRS grant number N1–0011) and by the European Union, European Social Fund.

Cross-References

References
Batagelj V (2009) Visualization of complex networks. Encyclopedia of complexity and systems science. Springer, Berlin, Heidelberg, pp 1253-1268
Batagelj V (2009) Social network analysis, large-scale. Encyclopedia of complexity and systems science. Springer, Berlin, Heidelberg, pp 8245-8265
Berge C (1989) Hypergraphs: combinatorics of finite sets. North-Holland Mathematical Library, Amsterdam
Borgatti SP, Everett MG, Freeman LC (2002) UCINET 6 for Windows: software for social network analysis. User’s guide, Harvard, Massachusetts, USA
de Nooy W, Mrvar A, Batagelj V (2012) Exploratory social network analysis with Pajek, 2nd edn. Cambridge University Press, Cambridge
Mehlhorn K, Sanders P (2008) Data structures and algorithms, the basic toolbox. Springer-Verlag, Berlin, Heidelberg
Stein C, Cormen TH, Leiserson CE, Rivest RL (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge, Massachusetts
Web References
Batagelj V, Mrvar A (1995) Towards NetML Network Markup Language. In: International social network conference, London. http://vlado.fmf.uni-lj.si/pub/networks/netml/snetml.pdf
Data-Driven Documents. http://d3js.org/
Extensible Markup Language (XML). http://www.w3.org/TR/REC-xml/
Mark Round. MDRound@qinetiq.com
Mrvar A, Batagelj V, Pajek – program for large network analysis. http://pajek.imfm.si
Recommended Reading
For a detailed inspection of figures from this essay, look at the online supplement (Essay Supplement).