Collection engine Tornado
Big data processing begins with the generation or collection of data. In a traditional database (DB) environment, data is generated by an application that acts as the DB’s front end, and processing starts from that internally generated data rather than from data imported from outside. In a big data environment, by contrast, data is brought in from outside rather than generated internally, so data processing starts with data collection.
The big data collection engine (Tornado) supports both active and passive collection approaches. It performs real-time, automatic, and parallel collection of the data users want from big data generated across sources and industries such as the deep web, SNS, shopping sites, IoT, and streaming data. Tornado provides a big data collection environment optimized for real-time analysis of social big data, competitors, markets and products, as well as for risk management and recognizing the voice of the customer.
Tornado is designed with attention to preventing data loss and duplication, data compression, data structuring, encryption of stored data, data validation, and user convenience. Along with its more powerful web collection functions, it automatically extracts, converts, and stores big data from hidden web pages. Tornado is built as a powerful big data collection engine that can collect social big data such as news, RSS, Twitter, Facebook, and Weibo.
< big data collection engine concept map >
- Built-in features for collecting various big data
Various types of collection features (collection based on user scenarios, RSS collection, web collection, deep web collection, social collection, and OpenAPI-based collection) are built in for the internal and external big data collection that users need.
- Built-in collection rule editor (workbench) to ensure data extraction performance
A web-based collection rule editor designed for usability is built in, making it easy to extract and collect data from various types of dynamic websites, such as those that rely on JavaScript (JS) and AJAX.
- Parallel distributed collection and support for various operating systems
Using the distributed parallel method, it can collect a large amount of data simultaneously under many configured rules, faster and more stably, and it can also be installed and operated on various operating systems (UNIX, Windows, etc.).
- Collection simulation and preview features
For user convenience, it provides a preview feature that runs a collection simulation with previously created collection rules, so that users can confirm the quality of the collected data before the actual collection starts.
- Easy and convenient management tools
It provides an integrated dashboard that monitors the overall condition of the collection engine, so that operators and managers can check the current status easily and quickly. It also provides an operation management tool for monitoring, in real time, the collection policies and schedule settings of each collection source.
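The parallel collection described in the list above can be illustrated on a single machine with a thread pool. This is only a minimal sketch and not Tornado's implementation: `collect_parallel` and the `collect_one` callback are hypothetical names, and collection distributed across several machines would additionally need a shared work queue or similar mechanism that is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_parallel(sources, collect_one, max_workers=4):
    """Run collect_one(source) for every source concurrently.

    Returns (results, errors): results maps a source to its collected data,
    errors maps a failed source to the error message, so one failing source
    does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which source.
        futures = {pool.submit(collect_one, src): src for src in sources}
        for future in as_completed(futures):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception as exc:  # keep collecting the remaining sources
                errors[src] = str(exc)
    return results, errors
```

Here `collect_one` stands in for whatever actually fetches a source (a web page, a feed, an API call). Threads suit this kind of work because collection is mostly waiting on network I/O.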
Main features and specifications
To handle the various internal and external data collection processes required for intelligent integrated analysis of structured data and big data, the Big Data Collection Engine (Tornado) of Big Data Suite provides collection functions such as user scenario-based collection, RSS-based collection, deep web collection, meta search collection, social media collection, and OpenAPI-based collection. The collection engine’s internal simulator can be used to verify that a user-defined collection task is carried out as intended. It also provides scheduling, status monitoring, and operation management functions, so that collection results can be monitored in real time while collection runs in actual operation.
< Collection engine operation procedures >
Scenario-based collection feature
Based on scenarios that users create for various sites such as news sites, blogs, shopping malls, and general homepages, data about the collection target is extracted and collected. It provides a scheduling feature to set collection cycles and a collection status history view to check the collection status within the workbench.
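The idea of a collection cycle can be sketched as a check for which collection jobs are due. This is an illustrative assumption about how such scheduling might work, not a description of Tornado's scheduler; the job fields `name`, `cycle`, and `last_run` and the function `due_jobs` are hypothetical.

```python
from datetime import datetime, timedelta


def due_jobs(jobs, now):
    """Return the names of jobs whose collection cycle has elapsed.

    A job that has never run (last_run is None) is always due.
    """
    due = []
    for job in jobs:
        last_run = job["last_run"]
        if last_run is None or now - last_run >= job["cycle"]:
            due.append(job["name"])
    return due


# Example: three hypothetical collection targets with different cycles.
jobs = [
    {"name": "news", "cycle": timedelta(hours=1), "last_run": datetime(2024, 1, 1, 10, 30)},
    {"name": "blogs", "cycle": timedelta(days=1), "last_run": datetime(2024, 1, 1, 0, 0)},
    {"name": "shop", "cycle": timedelta(minutes=30), "last_run": None},
]
print(due_jobs(jobs, now=datetime(2024, 1, 1, 12, 0)))
```

A scheduler loop would call such a check periodically, run the due jobs, and record each run so that it appears in the collection status history.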
Deep web collection feature
It can easily collect the information within websites, either by collecting site-wide information starting from a URL or by filtering with URL patterns or keywords. It provides a scheduling feature to set the collection cycle and a collection status history view to check the collection status.
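Filtering by URL pattern or keyword can be sketched as follows. This is a minimal illustration under stated assumptions, not Tornado's rule format: the function `select_urls`, the use of a regular expression as the URL pattern, and the `(url, text)` candidate pairs are all hypothetical.

```python
import re
from urllib.parse import urlparse


def select_urls(candidates, seed_url, url_pattern=None, keywords=()):
    """Pick the URLs to collect from (url, text) candidate pairs.

    Keeps only URLs on the same host as seed_url. If url_pattern is given,
    the URL must match it; if keywords are given, the text must contain at
    least one of them (case-insensitive). Duplicate URLs are dropped.
    """
    seed_host = urlparse(seed_url).netloc
    pattern = re.compile(url_pattern) if url_pattern else None
    selected, seen = [], set()
    for url, text in candidates:
        if urlparse(url).netloc != seed_host:
            continue  # stay within the target site
        if pattern and not pattern.search(url):
            continue
        if keywords and not any(k.lower() in text.lower() for k in keywords):
            continue
        if url not in seen:
            seen.add(url)
            selected.append(url)
    return selected
```

Calling it with neither a pattern nor keywords corresponds to site-wide collection; adding either one narrows the collection target.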
Social media collection feature
It makes it easy to collect various types of social data, such as Twitter, public Facebook pages, and Weibo timelines. It provides a scheduling function to set the collection cycle for each collection target and a collection status history view to check the status.
Collection engine operation management feature
① Operation management feature – status monitoring dashboard feature
② User (manager) management feature
③ Management feature per collection target (project)
RSS collection feature
It reads RSS (Really Simple Syndication) feeds and extracts not only the data within the collection target feed but also the original data that each feed entry links to. It provides a scheduling function to set the collection cycle and a collection status history feature with which the collection status can be checked within the workbench.
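Reading a feed and then following each entry to its linked original can be sketched with the Python standard library. This is an illustrative sketch only, assuming a plain RSS 2.0 document; `parse_rss_items`, `collect_feed`, the `fetch_page` callback, and the sample feed are hypothetical and do not describe Tornado's internals.

```python
import xml.etree.ElementTree as ET

# A small hypothetical RSS 2.0 document used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title><link>https://example.com/posts/1</link>
<description>Short summary</description></item>
<item><title>Second post</title><link>https://example.com/posts/2</link></item>
</channel></rss>"""


def parse_rss_items(rss_xml):
    """Extract title, link, and summary from every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return items


def collect_feed(rss_xml, fetch_page):
    """Collect the feed entries plus the original page each entry links to.

    fetch_page(url) is supplied by the caller, for example an HTTP client;
    it is kept as a parameter so the sketch has no network dependency.
    """
    records = []
    for entry in parse_rss_items(rss_xml):
        original = fetch_page(entry["link"]) if entry["link"] else None
        records.append({**entry, "original": original})
    return records
```

Each record then holds both the feed-level data (title, link, summary) and the linked original, which matches the two kinds of data described above.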
Open API-based collection feature
It makes it easy to collect various open documents and data, including domestic and overseas public data and local government public data. It provides a scheduling function to set the collection cycle for each collection target and a collection status history view to check the status.
Meta search collection feature
It has a keyword-based collection feature that sends user keywords to a variety of search engines, including Google, Bing, Daum, Naver, and Yahoo, and consolidates the search results into a single list. It also provides a scheduling function to set the collection cycle for each collection target and a collection status history view to check the status.
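Consolidating results from several search engines into a single list can be sketched as a merge that removes duplicate URLs. This is a minimal sketch under assumptions: the input shape (ranked `(title, url)` lists per engine), the URL normalization, and the ordering by best rank are illustrative choices, and `normalize_url` and `consolidate` are hypothetical names. Querying the engines themselves is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Normalize a URL so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase the host and drop the fragment; keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def consolidate(results_by_engine):
    """Merge ranked (title, url) lists from several engines into one list.

    Duplicates are detected by normalized URL. Entries are ordered by their
    best rank in any engine, then by how many engines returned them.
    """
    merged = {}
    for engine, results in results_by_engine.items():
        for rank, (title, url) in enumerate(results, start=1):
            key = normalize_url(url)
            entry = merged.setdefault(
                key, {"url": key, "title": title, "engines": [], "best_rank": rank}
            )
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
            entry["best_rank"] = min(entry["best_rank"], rank)
    return sorted(merged.values(), key=lambda e: (e["best_rank"], -len(e["engines"])))
```

Ordering by best rank is only one possible policy; a production system might instead weight engines differently or interleave their results.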