Need:
- Make intelligent data similarity suggestions.
- Reduce mutual redundancy of data.
- Leverage data caching for faster processing and retrieval.
Solution:
- Create virtual data indices.
- Ingest and transform data using Logstash.
- Create Flask end-points to host user requests.
- Run Query DSL queries in Elasticsearch and feed the results to Python's difflib.
- Format and present data to end users.
Benefits:
- Reduce the mutually redundant data score by 23%.
- Reduce rework by encouraging end users to reuse existing data per their needs.
- Reduce time taken for document creation by 38%.
Business Challenge
Need:
The system traverses a data load of about 2.5 billion records a day on average. At present, the application is used by multiple financial businesses across the globe, 58% of whose payload comes from businesses in the United States. With large-scale data comes large-scale mutual redundancy: across all the data available, the company predicts an average data similarity score of about 62.7%. The company would like to leverage such large-scale data to make intelligent real-time data similarity suggestions to its online users.
Solution:
Since this was a real-time, comprehensive data analysis need, the challenge was to procure a technology stack that not only did the job but did it fast. After several intensive brainstorming sessions, the Elastic Stack emerged as the most apt choice. The Elastic Stack reliably and securely takes data from any source, in any format, and searches, analyzes, and visualizes it in real time. Elasticsearch is a search engine based on Apache Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
The idea was to implement an “intellisense”-like plugin for the rich text editor that gave real-time data similarity suggestions. To achieve this, a quick data caching mechanism would be developed that would then be traversed for similarity checks, the output of which would be streamed to the front end. In practice, here’s how this was achieved:
- Create a virtual data index.
- Extract data from the remote data server using Logstash and feed it to the index.
- Create a Flask end-point that will host the user requests.
- Flask then runs the Query DSL in Elasticsearch, extracts the required data from the appropriate index, and sends it to Python's difflib library, which calculates the similarity score and displays the records along with the differences; this data is then formatted and presented to the user.
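The scoring half of the last step can be sketched in plain Python. This is a minimal sketch, not the production code: the Flask end-point and the Elasticsearch Query DSL call are elided, and `candidates` stands in for the records that the query returns; all names and the 0.3 threshold are illustrative assumptions.

```python
import difflib

def similarity_suggestions(text, candidates, threshold=0.3):
    """Score each candidate record against the user's text with difflib,
    keep matches above `threshold`, and return them best-first with diffs."""
    suggestions = []
    for doc_id, body in candidates:
        # SequenceMatcher.ratio() returns a similarity score in [0, 1].
        ratio = difflib.SequenceMatcher(None, text, body).ratio()
        if ratio < threshold:
            continue
        # unified_diff shows the end user exactly where the records differ.
        diff = list(difflib.unified_diff(
            text.splitlines(), body.splitlines(), lineterm=""))
        suggestions.append({"id": doc_id, "score": round(ratio, 3), "diff": diff})
    return sorted(suggestions, key=lambda s: s["score"], reverse=True)

# Stand-in for documents returned by the Elasticsearch query (hypothetical).
candidates = [
    ("doc-1", "Quarterly revenue report for US financial clients"),
    ("doc-2", "Weekly operations summary"),
]
results = similarity_suggestions(
    "Quarterly revenue report for EU financial clients", candidates)
```

In the real pipeline, the list of candidates would come from a `match` query against the virtual index, and the sorted results would be serialized by the Flask end-point and streamed to the editor plugin.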
From The End User’s Perspective:
The process starts when the query string is fed to the parser. The parser goes as far as it can and returns an error when it reaches the incomplete part of the query string. Based on this error and the entered text, custom JavaScript then suggests a list of terms that can come next in the sequence. When the query is complete, it can be sent to Elasticsearch in the form of a query string. If it helps usability, the query can be formatted as a custom “Datasense Query” and then translated into an Elasticsearch query string.
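The suggestion loop can be sketched as follows. Since the real parser grammar and the “Datasense Query” format are not described in the source, this sketch assumes a flat vocabulary of field names and operators as a stand-in; every name here is hypothetical.

```python
# Assumed vocabulary: field prefixes and boolean operators (illustrative only).
VOCABULARY = ["status:", "region:", "client:", "AND", "OR", "NOT"]

def suggest_next(partial_query):
    """Return vocabulary terms that could legally continue the query."""
    tokens = partial_query.split()
    if not tokens or partial_query.endswith(" "):
        # At a term boundary: every vocabulary term is a candidate.
        return list(VOCABULARY)
    # Otherwise, complete the token the user is still typing.
    last = tokens[-1]
    return [term for term in VOCABULARY if term.startswith(last) and term != last]

suggest_next("status: open AND re")  # → ["region:"]
```

In the described system this logic runs in the browser's custom JavaScript, driven by the parser error; the Python version above only illustrates the prefix-completion idea.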
Benefits:
This will help in eradicating two major system pain-points:
- The final phase of this implementation should reduce the mutually redundant data score by 23%.
- It will reduce rework, encouraging end users to reuse existing data per their needs. This would eventually reduce the time taken for document creation by 38%.