Data Science 101: Getting started with Neo4j and the Gephi tool
Executing queries in the Neo4j graph database, loading CSV data, running graph statistics scripts, and displaying various graph layouts in the Gephi tool
Neo4j Tool
Neo4j is a graph database management system. In Neo4j, everything is stored as a node, an edge (called a relationship in Neo4j), or an attribute (called a property). Each node and edge can have any number of attributes. Nodes can carry labels and every edge has a type, and both can be used to narrow searches.
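The model described above can be sketched as a small in-memory data structure. This is only an illustration of the idea, not how Neo4j stores data internally; the class names `Node` and `Relationship` and their fields are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a set of labels and a dictionary of attributes."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed edge between two nodes, with a type and optional attributes."""
    start: Node
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Two nodes and one edge, written by hand for illustration.
person = Node({"Person"}, {"name": "Rob Reiner"})
movie = Node({"Movie"}, {"title": "A Few Good Men", "released": 1992})
directed = Relationship(person, "DIRECTED", movie)

all_nodes = [person, movie]

# Using a label to narrow a search: keep only the nodes labelled Movie.
movies_only = [n for n in all_nodes if "Movie" in n.labels]
print([n.properties["title"] for n in movies_only])
```

Filtering on the `Movie` label here plays the same role as writing `(m:Movie)` in a query pattern.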
Below, I have created a simple Neo4j project using the Movies dataset provided with Neo4j and run various queries to visualize the data. The queries and what they return are as follows:
1. Show movies that were released after the year 2005.
Query:
MATCH (m:Movie) WHERE m.released > 2005 RETURN m
2. Query movies released after 2002 and limit the result to 5 movies.
MATCH (m:Movie) WHERE m.released > 2002 RETURN m LIMIT 5
3. The query below returns the people who directed movies released after 2007, the DIRECTED relationships, and the movies themselves, limited to 5 results. In the graph view, the relationships between the nodes are drawn as edges.
MATCH (p:Person)-[d:DIRECTED]->(m:Movie) WHERE m.released > 2007 RETURN p, d, m LIMIT 5
4. To list the people stored in the database, use the following query, which returns Person nodes but limits the output to 20.
MATCH (p:Person) RETURN p LIMIT 20
5. The next query gets the title and release year of every movie. The data is displayed in tabular form.
MATCH (m:Movie) RETURN m.title, m.released
6. To check whether a movie with a particular title is present, match on the title property. The following query searches for the movie named A Few Good Men.
MATCH (m:Movie {title: 'A Few Good Men'}) RETURN m
7. We can also list movies whose release year falls within a particular interval. The example below lists movies released between 2010 and 2017, inclusive.
MATCH (m:Movie) WHERE m.released >= 2010 AND m.released <= 2017 RETURN m
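To make the filtering and limiting in the queries above concrete, here is a plain-Python analogue that runs on a small hand-written list. It is a sketch of the semantics only: it does not connect to Neo4j, the sample list is not an export of the Movies dataset, and the helper names are hypothetical.

```python
# Hand-written sample records for illustration only.
movies = [
    {"title": "A Few Good Men", "released": 1992},
    {"title": "The Matrix", "released": 1999},
    {"title": "The Da Vinci Code", "released": 2006},
    {"title": "Speed Racer", "released": 2008},
    {"title": "Cloud Atlas", "released": 2012},
]


def released_after(records, year, limit=None):
    """Analogue of: MATCH (m:Movie) WHERE m.released > year RETURN m LIMIT limit."""
    matches = [m for m in records if m["released"] > year]
    return matches if limit is None else matches[:limit]


def released_between(records, start, end):
    """Analogue of: WHERE m.released >= start AND m.released <= end."""
    return [m for m in records if start <= m["released"] <= end]


def find_by_title(records, title):
    """Analogue of: MATCH (m:Movie {title: title}) RETURN m."""
    return [m for m in records if m["title"] == title]


after_2005 = [m["title"] for m in released_after(movies, 2005)]
first_two_after_2002 = released_after(movies, 2002, limit=2)
in_range = [m["title"] for m in released_between(movies, 2010, 2017)]
found = find_by_title(movies, "A Few Good Men")

print(after_2005)
print(in_range)
```

One difference worth keeping in mind: the Python list has a fixed order, whereas a Cypher query with LIMIT but no ORDER BY does not guarantee which matching rows come back.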
Advantages of Neo4j databases
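- Performance: In relational databases, query performance tends to suffer as the number and depth of relationships increase, because each extra hop usually means another join. In graph databases like Neo4j, relationships are stored directly and followed during traversal, so relationship-heavy queries can stay fast even when the amount of data grows significantly.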
- Flexibility: Neo4j is flexible, as the structure and schema of a graph model can be easily adjusted to the changes in an application. Also, you can easily upgrade the data structure without damaging existing functionality.
- Agility: The structure of a Neo4j database is easy to upgrade, so the data store can evolve along with your application.
Gephi Tool
Gephi is an open-source network analysis and visualization software package. It is mainly used for visualizing, manipulating, and exploring networks and graphs from raw edge and node graph data. It is an excellent tool for data analysts and data science enthusiasts to explore and understand graphs.
In this demo I have chosen the simple karate.gml dataset and performed some basic Gephi operations on it. So let's get started.
1. Open Gephi and click New Project. Then choose File->Open and load the dataset of your choice as shown below. On loading the dataset, Gephi shows the number of nodes and edges present in the dataset as well as the type of the graph.
2. Below is how all the nodes and edges are displayed when the data is first loaded.
3. Now we can represent the data in various layouts. In the left pane, open the Layout panel, choose the layout you want, and click Run. In the image below I have chosen the ForceAtlas 2 layout, which displays the data in the following form.
4. Next we can differentiate the nodes by a ranking such as their In-Degree, Out-Degree, or Degree and show them in different colors. To do this, choose Nodes->Ranking at the top of the left pane and pick a ranking. In the image below In-Degree is chosen: pink nodes have a lower in-degree than white nodes, and green nodes have the highest in-degree.
5. Clearer visualizations can also be made by displaying the nodes in various sizes. For instance, in the image below, nodes with a higher degree are larger than nodes with a lower degree, i.e. the green nodes have a higher degree than the white and pink nodes.
To vary the node size, select the Size option in the Appearance section of the left pane, then enter the minimum and maximum node sizes you want. I have set the Min size to 5 and the Max size to 20.
6. Next we generate a degree distribution graph for Degree, In-Degree, and Out-Degree, and also get the Average Degree value across all nodes. To generate the graph, open the Statistics tab in the right pane and run Average Degree in the Network Overview section.
A report will be generated, and the degree columns will be added to the dataset table.
To see the Data Table, select Window->Data Table in the top menu bar. You will see a table like the one in the image above, where running Average Degree has added In-Degree, Out-Degree, and Degree columns for each node.
7. To calculate the Average Path Length between nodes and generate a chart for it, run Avg. Path Length in the Edge Overview section of the Statistics pane.
8. Now we can try out other functionality and layouts in the Gephi tool. In the image below I have used the Force Atlas layout.
To display the node labels, click the Show Node Labels icon in the bottom bar.
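For readers who want to see what the statistics in the steps above measure, the following is a minimal pure-Python sketch. It is not Gephi's implementation and does not read karate.gml: the edge list is a hypothetical toy graph, the size mapping assumes a simple linear interpolation between the Min and Max sizes, and the path length treats edges as undirected. Gephi's own reports may use different conventions, so the numbers need not match.

```python
from collections import Counter, deque

# Hypothetical directed edge list (source, target); not the karate.gml data.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
nodes = sorted({n for edge in edges for n in edge})

# Per-node In-Degree, Out-Degree and Degree, like the columns in the Data Table.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)
degree = {n: in_degree[n] + out_degree[n] for n in nodes}

# Mean of the Degree column over all nodes.
average_degree = sum(degree.values()) / len(nodes)

# Degree distribution: how many nodes have each degree value.
degree_distribution = Counter(degree.values())


def scale_sizes(values, min_size=5.0, max_size=20.0):
    """Linearly map each node's value onto the range [min_size, max_size]."""
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {n: min_size for n in values}
    step = (max_size - min_size) / (hi - lo)
    return {n: min_size + (v - lo) * step for n, v in values.items()}


node_sizes = scale_sizes(degree)


def bfs_distances(adjacency, source):
    """Shortest hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Build an undirected adjacency map from the edge list.
adjacency = {n: set() for n in nodes}
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)

avg_path_length = average_path_length(adjacency)
print(degree, average_degree, node_sizes, avg_path_length)
```

Here `scale_sizes` uses the same Min 5 and Max 20 values chosen in the size step, so the lowest-degree node gets size 5 and the highest-degree node gets size 20.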
Conclusion:
Learn More Here about Neo4j and Gephi