RC1. Develop a generalized framework for the analysis of complex data, producing more detailed/precise results from aggregated data than traditional methods.
RC2. Develop common data representation models to allow for integrated approaches in the analysis of aggregate, complex data, and for the use of different software programs for the data analysis.
RC3. Address the issues of data privacy, particularly on the internet and in official statistics, by proposing criteria for data aggregation and subsequent analysis, and provide the resulting methodology to relevant stakeholders, in particular National Statistical Offices (NSOs) and industrial partners.
RC4. Develop visual analytics tools for aggregated complex data, including symbolic data, compositional data, and functional data.
RC5. Develop appropriate methodology for the aggregation of large surveys and their analysis at the macro level, as well as for the combination of independent surveys, which is only possible at the aggregate level, and provide the resulting methodology to relevant stakeholders, e.g. NSOs.
RC6. Develop appropriate methodology, relying on different combined approaches, for the aggregation and analysis of sensor data, internet flow data, complex network data, and for the spatial and/or temporal analysis of complex data, appearing in the form of data streams, symbolic data, and compositional data.
RC7. Exploit open big data sources (The GDELT Project, Bike Share Data Systems, Bureau of Transportation Statistics, the European Social Survey (ESS), etc.) and provide tools to transform them into rich data frames or networks.
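As an illustration of the last point, the sketch below shows one way raw trip records from an open bike-share feed could be aggregated into a "rich data frame", with one row per station pair and an interval-valued duration cell. The station names, record layout, and field names are invented for the example; they are not part of any actual feed specification.

```python
from collections import defaultdict

# Hypothetical raw trip records (origin, destination, duration in minutes),
# as they might come from an open bike-share feed.
trips = [
    ("S1", "S2", 12.5), ("S1", "S2", 9.0), ("S1", "S2", 15.2),
    ("S2", "S3", 30.1), ("S2", "S3", 27.4),
]

# Aggregate into a rich data frame: one row per station pair, holding the
# trip count and the duration interval [min, max] as a complex-valued cell.
by_pair = defaultdict(list)
for origin, dest, duration in trips:
    by_pair[(origin, dest)].append(duration)

rich = {pair: {"n": len(ds), "duration": (min(ds), max(ds))}
        for pair, ds in by_pair.items()}

print(rich[("S1", "S2")])  # {'n': 3, 'duration': (9.0, 15.2)}
```

The same grouping step generalises directly from intervals to histograms or distributions per cell, which is where the complex data types discussed below come in.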
The objectives are the elaboration of the concept of complex data and a theoretical discussion of the role and properties of the aggregation process as a way of obtaining complex data. Tasks: T1 - Identification of data summarization/aggregation models and study of their properties. T2 - Definition of criteria for data aggregation. T3 - Extension of the collection of complex data types and foundations of complex data analysis.
ABCData aims at defining criteria and guidelines concerning methodology for the analysis of complex data, addressing the questions of which approaches should be considered for a particular problem at hand and how to combine different methods. Moreover, focus will be put on the process of data aggregation, in order to define criteria both on the form of aggregation and on its granularity. This may be approached in two ways. One is to capture as much of the information in the original dataset as possible while restricting the aggregate (complex) dataset to a fixed number of elements; this is a matter of minimising within-symbol variability and maximising between-symbol variability. The other is to aggregate with a specific model in mind, aiming to achieve certain characteristics of the model parameters (unbiasedness, highest precision, etc.) or of its predictive qualities. Existing approaches are certainly sub-optimal in this respect, and in this Action effort will be put into defining appropriate criteria and designing more efficient aggregation techniques.
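The within-/between-symbol criterion can be sketched as follows, assuming the simplest case where each group of microdata is aggregated into an interval-valued symbol. All data and group names here are invented; real criteria would of course weigh these two quantities against each other rather than just report them.

```python
from statistics import mean, pvariance

# Hypothetical microdata: observations grouped by the unit they belong to.
micro = {
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 11.0, 12.0],
    "C": [20.0, 22.0, 24.0],
}

# One possible aggregation: each group becomes an interval-valued symbol [min, max].
symbols = {g: (min(v), max(v)) for g, v in micro.items()}

# Within-symbol variability: average variance of the values inside each symbol.
within = mean(pvariance(v) for v in micro.values())

# Between-symbol variability: variance of the symbol centres (interval midpoints).
centres = [(lo + hi) / 2 for lo, hi in symbols.values()]
between = pvariance(centres)

print(symbols)  # {'A': (1.0, 3.0), ...}
print(within, between)
```

A "good" aggregation in the first sense described above is one where, for a fixed number of symbols, `within` is small relative to `between`, i.e. the symbols are internally homogeneous but mutually well separated.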
For the storage and exchange of such data we will develop a special JSON-based format, RDF (rich data format). We will establish a repository of rich data sets as a test bed for the developed rich-data analysis methods. The newly developed methods will be implemented in R, Python, or Julia. The RDF format will also enable the interoperability of the developed software.
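To make the idea concrete, a record in such a JSON-based format might look like the sketch below. The field names and structure are purely illustrative assumptions (the RDF specification itself is yet to be defined); the point is only that complex-valued cells such as intervals, distributions, and histograms serialise naturally to JSON.

```python
import json

# Hypothetical rich-data record; all field names are illustrative only.
record = {
    "unit": "PT11",  # e.g. a NUTS region code
    "variables": {
        "age": {"type": "interval", "value": [18, 95]},
        "education": {
            "type": "distribution",
            "value": {"basic": 0.35, "secondary": 0.45, "tertiary": 0.20},
        },
        "income": {
            "type": "histogram",
            "breaks": [0, 10000, 30000, 60000],
            "counts": [120, 310, 85],
        },
    },
}

text = json.dumps(record)
assert json.loads(text) == record  # serialisation round-trips losslessly
```

Because any JSON library in R, Python, or Julia can parse such records, a format along these lines would give the interoperability mentioned above without tying the methods to a single implementation language.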
An overview of the main ideas and results in this field.
T1 - Overview of complex data types and aggregation models; proposal of a complex data framework (a theoretical basis for RDF); overview of the literature and study of the data aggregation process; a workshop on this topic after the first 6 months; deliverable: a report. T2 - On the basis of T1, recommendations for data aggregation will be prepared; deliverable: a document with recommendations. T3 - Research on complex data types and approaches to complex data analysis.
The temporal and spatial dimensions may also be included, for example in an analysis of traffic patterns in a network with aggregated information about the bike rides between stations in a bike-sharing system (such as Santander Cycles in London, Citi Bike in New York, or Capital Bikeshare in Washington DC). Another example is a regionalization problem: clustering of complex data (for example, describing population pyramids for NUTS regions or world countries) such that the obtained clusters form contiguous regions with respect to the units' neighbouring relation; the network determines a relational constraint. This is only the tip of the iceberg of possibilities. How to present and visualize complex data together with relations (network links), and how to explore such data, are the first questions to be tackled. Summaries of networks, such as density or centrality measures, are currently used to compare and evaluate different networks, and must be adapted to networks based on complex data.
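The kind of network summaries mentioned above can be sketched on an aggregated bike-share network, where each link already carries an aggregated trip count rather than individual rides. Station names and counts are invented; the density and weighted-degree formulas used here are the standard ones, before any adaptation to complex-valued links.

```python
from collections import defaultdict

# Hypothetical aggregated flows: (origin, destination, number of trips).
flows = [("S1", "S2", 40), ("S2", "S1", 35), ("S1", "S3", 12), ("S3", "S2", 5)]

stations = {s for o, d, _ in flows for s in (o, d)}
n = len(stations)

# Density of the directed flow network: observed links / possible links.
density = len({(o, d) for o, d, _ in flows}) / (n * (n - 1))

# A simple weighted degree centrality: total flow through each station.
strength = defaultdict(int)
for o, d, c in flows:
    strength[o] += c
    strength[d] += c

print(density)         # 4 links out of 6 possible
print(dict(strength))
```

Adapting such summaries to complex data would mean, for instance, replacing the scalar count on each link with an interval or a distribution of trip durations, at which point even "total flow through a station" needs a new definition.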