Network analysis is a great way for non-technical users to explore their data and pursue investigative leads. This was demonstrated by the recent leak of the Panama Papers, where investigative journalists used graph database and network visualisation technology to uncover offshore tax-haven activity. Following the revelations from 11.5 million documents and 2.6 terabytes of information drawn from Mossack Fonseca’s internal database, more people are aware of the power of analytics in general. In this blog, I’m sharing my experience as a software engineer who works with business users to apply similar techniques in their everyday jobs.
Choosing cost-effective tools
There are enterprise-ready network analysis tools available on the market; however, they tend to be expensive and/or lack extensibility. The good news is that modern web and Big Data technologies make it possible to deliver a scalable network analysis capability at a more reasonable cost. The challenge lies in integrating the various components into an end-to-end solution and developing a front-end application that is robust and user-friendly.
Understanding user requirements
On a recent project delivering a large-scale network analysis capability on top of a relational database containing billions of records, I worked with business users to understand how they use their search results as the starting point for exploration. This led to the development of a network exploration solution which needed to be accessible on an internal network via a web browser and able to handle queries very rapidly. In this case, we chose to develop our app with Angular.js, KeyLines, Node.js and Express, knowing that by using web technologies we could easily integrate our app with back-end databases capable of meeting our performance requirements. We used Neo4j to deliver those quick responses over billions of relationships and Elasticsearch to enable instant searching over large volumes of unstructured data.
One of our first challenges was getting all of our relational data out of the MPP database and into our search index and graph database. We found that the simplest way to do this was to dump it out to CSVs using an ETL tool, and then load Elasticsearch and Neo4j from the CSVs. Neo4j has an import tool for exactly this purpose, and it took around 5 hours to build a graph from the CSV files, with another 2-3 hours to build schema indexes for fast lookups. Neo4j’s “property graph” model enabled us to translate our relational dataset into a graph with little effort. For Elasticsearch we used Python to read the CSV data into a Pandas DataFrame a chunk at a time, and then write each chunk out to our search index as JSON. However, this was quite slow by comparison. If we were to do it again, we would probably try using Embulk to improve performance. That said, the data load was successful and both of our data stores were populated relatively simply, ready for analytics development.
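The chunk-at-a-time load into Elasticsearch can be sketched roughly as follows. This is a minimal illustration and not the project's actual loader: it uses Python's standard `csv` module in place of Pandas so that it runs without extra dependencies, and the index name `records` and the sample columns are hypothetical. It only builds the newline-delimited JSON bodies that Elasticsearch's `_bulk` endpoint accepts; sending them over HTTP is left out.

```python
import csv
import io
import json
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterator of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def to_bulk_body(chunk, index_name):
    """Serialise one chunk as an NDJSON body for the Elasticsearch _bulk endpoint.

    Each document is preceded by an action line naming the target index.
    """
    lines = []
    for row in chunk:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical sample standing in for one exported CSV file.
SAMPLE_CSV = "id,name\n1,Alice\n2,Bob\n3,Carol\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
bodies = [to_bulk_body(chunk, "records") for chunk in iter_chunks(reader, 2)]
```

Each element of `bodies` would be sent as one bulk request, so memory use stays bounded by the chunk size rather than by the size of the CSV file.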
Quick search results from an initial prototype
The search functionality was one of the core requirements of our web app, and it was quite straightforward to implement with our chosen technologies. With some front-end web development and Node.js/Express routing we created a search bar to submit queries to the Elasticsearch REST API, enabling users to search rapidly over billions of records, with a data table to filter the results. This was already hugely beneficial for business users who don’t know SQL and who would in any case have waited a lot longer to get the same results from the relational database. However, our web app really started to take shape when we added the network visualisation.
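As a rough illustration of what a search bar submits to the Elasticsearch REST API, the sketch below builds the path and JSON body for a paginated free-text search. It is written in Python for brevity rather than in the Node.js/Express routing the app used, and the index name `records`, the page size and the choice of a `simple_query_string` query are assumptions, not details from the project.

```python
import json


def build_search_request(index_name, user_text, page=0, page_size=25):
    """Return the (path, JSON body) pair for a paginated free-text search.

    simple_query_string is used here because it tolerates malformed user
    input instead of raising a parse error; this is an illustrative choice.
    """
    body = {
        "query": {"simple_query_string": {"query": user_text}},
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,         # number of hits to return
    }
    return f"/{index_name}/_search", json.dumps(body)


# Hypothetical example: the third page of results for a user's search text.
path, payload = build_search_request("records", "mossack fonseca", page=2)
```

The resulting body would be sent as a POST to the returned path, and the hits in the response would populate the data table.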
Combining search tools for greater speed
Having already done some prototyping with Neo4j and KeyLines, we knew this was a winning combination for large-scale network visualisation and that we could develop features quickly. KeyLines talks to Neo4j by submitting Cypher queries to the REST API and rendering the resulting network on an HTML5 canvas. To implement our basic network exploration functionality, we used click events to return the immediate connections of a node when it is double-clicked. We then integrated the search feature with the network visualisation so that users could click on search results to seed the network for exploration. This combination proved to be incredibly powerful: two specialised NoSQL data stores with an application layer that plays to the strengths of each. When we first tested this we were blown away by the speed and ease with which we could search over all the data!
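The double-click expansion described above boils down to a parameterised Cypher query for a node's immediate neighbours. The sketch below shows one plausible shape for the request payload, in Python rather than the browser-side JavaScript the app used. The `entity_id` property and the result limit are hypothetical, the `$param` syntax is that of current Cypher (older Neo4j versions used `{param}`), and the `statements` wrapper follows the format of Neo4j's transactional HTTP endpoint.

```python
import json

# Match every relationship touching the selected node, in either direction.
# entity_id is a hypothetical property used to identify the clicked node.
NEIGHBOURS_QUERY = (
    "MATCH (n)-[r]-(m) "
    "WHERE n.entity_id = $entity_id "
    "RETURN n, r, m LIMIT $limit"
)


def build_expand_payload(entity_id, limit=100):
    """Build the request payload for expanding one node's neighbourhood.

    Passing values as parameters (rather than concatenating them into the
    query text) avoids injection and lets Neo4j reuse the query plan.
    """
    return {
        "statements": [
            {
                "statement": NEIGHBOURS_QUERY,
                "parameters": {"entity_id": entity_id, "limit": limit},
            }
        ]
    }


# Hypothetical example: the user double-clicks the node with entity_id "E-42".
payload = build_expand_payload("E-42")
request_body = json.dumps(payload)
```

Capping the result with a limit is a design choice worth considering for highly connected nodes, where returning every neighbour would swamp the chart.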
Turning sceptical users into advocates
With the core functionality of our app in place, we moved on to developing more advanced features such as network metrics, time bar filtering and chart saving functionality. Meanwhile, our business users were testing for us and providing feedback and additional feature requests. Over the course of the Alpha phase, our initially sceptical users became increasingly enthusiastic as they realised how efficiently they could pursue investigative leads and deliver analytical outputs. This in turn led to increased interest from the wider business, and a strong desire to progress to a Beta phase.
Our next challenge is the transition to a Beta service. That means scaling to more users, adding new features, and supporting high-frequency data ingestion.
Next Friday my colleague Dave da Silva will be blogging about procuring open source software.
I joined Capgemini as a graduate two years ago, and get to work on interesting projects as part of a large team of data scientists. We’re currently recruiting, so if you’d like to be part of Capgemini’s data science team, you can apply to join now! Here are the links to our Data Science, Data Engineer and Data Visualisation job specs.
Software Engineer at Capgemini