Data Generation, Analysis, and Usage – Current Scenario
The last decade has seen an exponential increase in the data generated from traditional as well as non-traditional sources. An International Data Corporation (IDC) report estimates that the data generated in 2020 alone will be a staggering 40 zettabytes, a 50-fold growth from 2010. The volume of data generated has risen to 2.5 quintillion bytes per day, and with the advent of innovations like the Internet of Things, it is poised to grow even more rapidly. This increase in data generation, coupled with a growing ability to store the various types of data being generated, has resulted in a vast repository of data that is now available for scrutiny.
According to reports by wealth management firm Merrill Lynch, 80 percent of the business-relevant information in all this data originates in unstructured form. Unstructured data refers to information that either does not conform to a pre-defined data model or is not organized in a pre-defined manner. It could be images, videos, emails, social media data, or even sonar readings. Essentially, these are data points that cannot be captured in our traditional relational databases.
Analysis of Unstructured Data
As the ability to store varied data increased, so did our ability to analyze it and derive actionable insights from it. Companies started realizing the significance of analyzing unstructured data along with structured data and began investing more in it. As a result, the potential benefits that could be harnessed from this previously useless data became more apparent. Personalized loan offerings from banks, customized offers from e-commerce sites, and exclusive loyalty discounts offered by retail chains are just a few examples of how organizations have started diving deep into unstructured data to come up with tailored offerings.
This blog post brings out the significance of two data storage repositories, the Data Warehouse and the Data Lake, compares them, and suggests the approaches to adopt based on the implementation decision and architecture.
Traditional Data Warehouse Challenges
Storage and Performance:
A Data Warehouse is a conceptual architecture that stores structured, subject-oriented, time-variant, non-volatile data for decision making. Historical as well as real-time data from various sources is transformed and loaded into a structured form.
While a traditional Data Warehouse can act as a master repository for all the structured data across the organization, its inability to store unstructured data prevents it from acting as a unified data source for analytics, which hampers its ability to extract value from such huge volumes of data. Because unstructured data constitutes such a large chunk of business-related information, enterprises can no longer afford to neglect it, and leaving this data out of the purview of analytics could prove detrimental.
Also, with the exponential increase in the data being generated each day, storing it all in traditional databases could prove expensive for organizations. And with such huge volumes of data being stored, performance also suffers unless we invest more heavily in hardware.
From an implementation standpoint, one of the main challenges a data warehousing project poses pertains to data quality. Combining data from disparate sources often results in duplicates, inconsistent values, missing data, and logical conflicts. Varying levels of standardization across different databases add to the issue. These problems surface at a later stage as faulty reporting and analytics, which in turn affect decision making.
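The data quality issues described above can be illustrated with a small sketch. Everything here is hypothetical: the two source systems, the field names, and the records are made up for illustration, and the checks are deliberately simplistic compared with what a real data quality tool would do.

```python
from collections import defaultdict

# Hypothetical customer records from two source systems (illustrative data only).
crm_rows = [
    {"customer_id": 1, "name": "Asha Rao", "country": "IN"},
    {"customer_id": 2, "name": "John Smith", "country": "US"},
    {"customer_id": 2, "name": "John Smith", "country": "US"},   # exact duplicate
]
billing_rows = [
    {"customer_id": 1, "name": "Asha Rao", "country": "India"},  # different standard
    {"customer_id": 3, "name": "Mei Lin", "country": None},      # missing value
]

def profile(rows):
    """Flag exact duplicates, rows with missing values, and conflicting countries."""
    seen = set()
    duplicates, missing = [], []
    countries_by_customer = defaultdict(set)
    for row in rows:
        # A row's fingerprint is its full set of (field, value) pairs.
        fingerprint = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            duplicates.append(row["customer_id"])
            continue
        seen.add(fingerprint)
        if any(value is None for value in row.values()):
            missing.append(row["customer_id"])
        if row["country"] is not None:
            countries_by_customer[row["customer_id"]].add(row["country"])
    # A customer with more than one distinct country value is a conflict.
    conflicts = sorted(k for k, v in countries_by_customer.items() if len(v) > 1)
    return {"duplicates": duplicates, "missing": missing, "conflicts": conflicts}

report = profile(crm_rows + billing_rows)
print(report)
```

Even with five records, the combined load surfaces one duplicate, one missing value, and one conflict caused purely by differing country code standards.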
Because they hold data from many different databases, data warehouse projects often cater to varied reports and analytics as per user demand. Since data warehouses are 'schema-on-write', such reporting and analytics needs must be considered in the design upfront, as the schema has to be defined before data is loaded into the database. However, envisioning all such reports at the outset can be difficult for business users who are not familiar with the capabilities of the tools, and this often results in rework for the technical team.
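A minimal sketch of what schema-on-write means in practice, using Python's built-in sqlite3 module as a stand-in for a warehouse database. The table, columns, and the try_load helper are hypothetical and only meant to show that a record must match the pre-defined schema to be loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to exist before any data can be written.
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

def try_load(record):
    """Attempt to insert a record; return True if it fits the schema."""
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    try:
        conn.execute(
            f"INSERT INTO sales ({columns}) VALUES ({placeholders})",
            tuple(record.values()),
        )
        return True
    except sqlite3.Error:
        # Unknown columns or violated constraints are rejected at write time.
        return False

fits_schema = try_load({"order_id": 1, "amount": 99.5})
extra_field = try_load({"order_id": 2, "amount": 10.0, "channel": "web"})
print(fits_schema, extra_field)
```

The second record carries a field nobody anticipated at design time, so the load is rejected until the schema is changed.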
Because data warehouse projects are structure driven, they do not adapt easily to change. The effort and resources required to accommodate changes are often substantial and will most likely drive up the cost significantly. For instance, if a new business requirement emerges at a later point and fundamentally changes the original data structure, it would necessitate remodeling the Data Warehouse, which can be extremely time-consuming.
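As a small-scale illustration of this rigidity, the sketch below uses sqlite3 with a hypothetical orders table and a hypothetical new requirement (reporting by sales channel). Even this mild change needs a schema alteration, and the rows loaded earlier have no value for the new attribute. A fundamental remodel would involve far more work than this single statement suggests.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (2, 75.5)])

# A new requirement arrives later: reports must be broken down by channel.
# The schema has to be altered before the new attribute can be stored.
warehouse.execute("ALTER TABLE orders ADD COLUMN channel TEXT")
warehouse.execute("INSERT INTO orders VALUES (?, ?, ?)", (3, 40.0, "web"))

# Historical rows were loaded before the change, so their channel is NULL
# and would need a backfill (if the source data still exists at all).
unknown_channel = warehouse.execute(
    "SELECT COUNT(*) FROM orders WHERE channel IS NULL"
).fetchone()[0]
print(unknown_channel)
```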