
Briefing Paper Marking Guide - Big Data

Introduction

In this briefing paper, the ICT topic of Big Data is reviewed from the literature. Big data is the IT term for the voluminous data processed in different business processes across industries. The volume of structured and unstructured data has grown explosively over the last decades. In the early days of information technology, organizations implemented databases to work with the data related to their business processes, but in recent years the emergence of social media and e-commerce has accelerated the growth of data generated outside organizations and by individuals. For example, on social media platforms such as Facebook, people upload and share large volumes of images, text and video from different parts of the world, and at the same time their location details, device details and so on are also circulated. Analysis of this kind of data reveals interesting information about people's lifestyles and choices, and businesses are very interested in such information. Working with this kind of data using typical database management software and information technologies is difficult. Big data and its technologies have added a new dimension here: the term covers all the technologies that help address the volume and complexity of such data. (Madden, 2012) This paper discusses the technology, current research and trends in big data in detail.

The Problem

As noted above, processing large volumes of structured and unstructured data using traditional database management or data processing systems is difficult. This is the problem that led to the big data concept and its related technologies. (Madden, 2012) There is a common confusion around the term big data: does it refer to the technology or to the volume of data? When vendors use the term, they generally mean the technologies, processes and tools that help in working with large volumes of data efficiently. The term therefore encompasses any collection of data sets so large and complex that it is difficult to handle with typical database management or data processing applications, and it also covers the tools that support processing such data through operations like searching, curation, sharing, transfer, storage and visualization. (Zikopoulos, 2011)

Big Data: Characteristics

Certain characteristics make a data set a big data set:

A. Volume. Volume refers to the quantity of data generated by a process or system. The potential and value of a data set are directly related to its volume. This characteristic is the first criterion for classifying a data set as big data, and it is reflected in the term "big data" itself. (Marz & Warren, 2014)

B. Velocity. Velocity is the rate at which data is generated, or how fast it must be processed to deliver the desired outcome.

C. Variety. Big data takes data from heterogeneous sources into consideration. Variety in the data sets helps in analysing them from different aspects and in deriving different outcomes.

D. Veracity. Veracity refers to the quality of the captured data. It plays a significant role in the accuracy of the outcomes of any analysis of the data sets. (Zikopoulos, 2011)

E. Variability. Variability refers to the level of inconsistency in the data, which can show up at any time during processing and can be problematic for data analysts.

F. Complexity. Managing the processing of big data is a complex process in itself, and it becomes more complex when the data comes from heterogeneous sources and is large in volume. Such data needs to be interlinked, correlated and connected; otherwise it is difficult to work with.

Big Data: Technologies

Computation power and storage for large data sets are not a serious problem nowadays. Advances in electronics and digital technology have made these solutions more efficient, more easily available and cheaper, which has helped big data emerge. There has been a paradigm shift from computer architecture towards the mechanisms of data processing, and there is a growing demand for data mining and analysis applications for big data. (Barlow, 2013)

A wide range of tools and technologies support the concept of big data and its analysis and processing. These include crowdsourcing, A/B testing and data fusion, along with machine learning, natural language processing, time-series analysis, integration, simulation, genetic algorithms, signal processing and visualization. Tensors are a representation of multidimensional big data, and tensor-based computation methods such as multilinear subspace learning help here. Beyond that, database technologies, parallel processing, search-based applications, distributed file systems and databases, data mining, cloud computing and the Internet all support the big data revolution. Big data analytics processes the data and helps find patterns in it, and these patterns give critical insights into the data sets.

Storage is an important issue for big data. One proposed solution is distributed and shared storage; storage area networks (SANs) and network-attached storage (NAS) fall into this category, although big data practitioners are not especially interested in these solutions. There are also RDBMS-based storage solutions for big data that are capable of storing petabytes of data. (Madden, 2012)

All these technologies support big data in analysing data from the web, network monitoring logs, click streams and so on. There are data science applications such as simulations for massive-scale data analysis and large sensor deployments. Parallel database systems like Vertica, Teradata and Greenplum are powerful but expensive and hard to administer, and they lack fault tolerance for longer queries. Hadoop is a popular big data technology accepted worldwide. (Roebuck, 2011)
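To make this concrete, below is a minimal sketch of the MapReduce style of computation that Hadoop popularized. It is plain Python rather than code against any real Hadoop API, and the phases run sequentially in one process purely for illustration; in a real cluster the map and reduce calls would run in parallel across many machines.

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce model (illustrative only).

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return (key, sum(values))

documents = ["big data needs big tools", "data about data is metadata"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'needs': 1, ...}
```

The same word-count structure scales with cluster size because each map call is independent of the others and each reduce call only needs the values for its own key.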
Big Data: Process

Data processing in big data goes through a number of phases, explained below.

1. Data Acquisition

Big data takes in data arising from different industries, scientific research, demographics, social media and e-commerce. However, not all data is equally important for a particular goal, so after collection the data is filtered. Data is collected from systems, social media and numerous other sources; it can be operational or transactional, structured or unstructured. In big data, all types of data are collected irrespective of format and type, and later this data is filtered and compressed before processing. The most challenging part of data acquisition is filtering out the unnecessary data. It must be done in such a way that useful information does not get discarded. Data science deals with numerous issues that help define filters which ensure the accuracy and relevancy of the collected data. (Marz & Warren, 2014)

For data streaming from online sources, it is not always possible to store the data and filter it later. Rather, an on-the-fly approach is needed to work on such streams from the web, and there are online data analytics applications and systems that help in filtering and collecting data from online streams. The next big challenge is to create metadata from the acquired data, which is not easy: the metadata should give details about the sources and structure of the data. There are metadata acquisition systems that can automatically record metadata without human intervention, but there is still much to do with the metadata after recording it correctly, since there is a pipeline for the analysis of big data and metadata is required at every stage of that pipeline. Acquisition of data therefore refers to the collection of technologies, tools and processes for collecting data, filtering it and recording its metadata, without having to store and process all of the data every time.
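As a concrete illustration of this on-the-fly approach, the following sketch consumes a stream of records, filters each one immediately, and records simple metadata (source and arrival time) as it goes, so the raw stream never has to be stored in full. The record fields and the relevance rule are hypothetical.

```python
import time

def stream():
    # Stand-in for an online source such as a social media feed.
    yield {"source": "facebook", "text": "new phone launch event"}
    yield {"source": "sensor-17", "text": ""}
    yield {"source": "twitter", "text": "phone reviews are out"}

def is_relevant(record):
    # Hypothetical filter rule: drop empty records, keep mentions of "phone".
    return bool(record["text"]) and "phone" in record["text"]

def acquire(records):
    for record in records:
        if not is_relevant(record):
            continue  # filtered out on the fly, never stored
        # Record metadata about the origin and arrival time of the data.
        record["metadata"] = {"source": record["source"],
                              "acquired_at": time.time()}
        yield record

for kept in acquire(stream()):
    print(kept["text"], "from", kept["metadata"]["source"])
```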
2. Cleaning and Extraction of Information

Data analysis needs some level of uniformity in the data, so after acquisition the data must be cleaned and made ready for processing. Analysis requires data in the correct format; otherwise the results will be neither accurate nor effective. This calls for an information extraction process that brings out the required information from the piles of data collected from heterogeneous sources and then presents the extracted data in a structured form. The process is technically challenging: for data such as images and videos, extracting information and presenting it in a structured format is genuinely hard. A common misconception is that big data always provides truth; this is not the case all the time. The truthfulness of big data and its analysis depends on these extraction steps, on how effectively truth is extracted from the raw data. There are well-recognized constraints on valid data and well-recognized error models, but in many domains of big data such constraints are still not available. (A toy sketch of this step, together with step 3, follows after step 4.)

3. Integration, Aggregation and Representation of Data

As already discussed, data comes from heterogeneous sources and is often neither structured nor in the right format. It is not enough to acquire and clean data and then simply store it in data repositories: processes are needed for integrating and aggregating the data and then representing it in the right format for future storage and processing. Data analysis is a complex process, and for large-scale analysis to be effective it should be automated. In the analysis process, the different semantics and data structures need to be expressed in formats that computers can read and resolve automatically. Data integration is therefore important, and additional work is needed to make the data error-free using automated systems. There are different alternatives to databases for storing data, each with its own advantages and disadvantages, so the database or storage solution must be designed very carefully; many decision-making tools provide assistance in designing databases.

4. Processing of Queries, Data Modelling and Analysis of Data

Making queries in traditional databases and processing queries over big data are fundamentally different. Big data contains volumes of dynamic, interrelated, heterogeneous data that form large networks of interrelated data. There are high levels of data redundancy, and these redundancies can be explored through validation, cross-checking and similar techniques. There are also inherent clusters, and these clusters reveal relationships among collections of data. (Roebuck, 2011) Data mining is a related topic here: it requires cleaned, integrated, trustworthy and easily accessible data that can support declarative queries through data mining interfaces and computing environments. Big data supports interactive data analysis in real-time applications, and complex queries can be scaled. However, there is a problem with the analysis of big data: a lack of coordination between the systems that store data and support SQL queries and the analytics systems that perform non-SQL data processing such as statistical analysis and data mining.
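To ground steps 2 and 3, here is a toy sketch in which raw records from two heterogeneous sources are cleaned, the required fields are extracted into one structured form, and the result is aggregated. The source formats and field names are invented for illustration.

```python
# Two heterogeneous sources: CSV-like strings and ready-made dictionaries.
source_a = ["alice,32,london", "bob,,paris"]
source_b = [{"user": "carol", "years": 45, "city": "LONDON"}]

def extract_from_a(line):
    # Extraction: parse the CSV-like line into the common schema,
    # cleaning as we go (missing age becomes None, city is normalized).
    name, age, city = line.split(",")
    return {"name": name, "age": int(age) if age else None,
            "city": city.strip().lower()}

def extract_from_b(obj):
    # Extraction from the second source into the same common schema.
    return {"name": obj["user"], "age": obj["years"],
            "city": obj["city"].strip().lower()}

# Integration: one structured, uniform representation of all records.
records = ([extract_from_a(r) for r in source_a]
           + [extract_from_b(r) for r in source_b])

# Aggregation: count records per city over the integrated data.
by_city = {}
for r in records:
    by_city[r["city"]] = by_city.get(r["city"], 0) + 1
print(by_city)  # {'london': 2, 'paris': 1}
```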
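For step 4, the following sketch loads redundant records from two hypothetical sources into an in-memory SQLite table, answers a declarative SQL query over them, and then cross-checks the redundant values in ordinary Python, which is exactly the kind of SQL/non-SQL split that the coordination problem above describes. The table layout and values are invented.

```python
import sqlite3

# Redundant readings about the same entities from two hypothetical sources.
rows = [  # (entity, source, reported_value)
    ("device-1", "source_a", 70),
    ("device-1", "source_b", 70),
    ("device-2", "source_a", 55),
    ("device-2", "source_b", 61),  # sources disagree; should be flagged
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (entity TEXT, source TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Declarative SQL query: average value per entity across redundant sources.
for entity, avg in con.execute(
        "SELECT entity, AVG(value) FROM readings GROUP BY entity"):
    print(entity, avg)

# Non-SQL cross-check: flag entities whose redundant sources disagree.
values = {}
for entity, _, value in rows:
    values.setdefault(entity, set()).add(value)
print("conflicts:", [e for e, vs in values.items() if len(vs) > 1])
```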
5. Interpretation

Obtaining results from an analysis is not enough on its own. Enough explanatory detail must be provided with the results so that someone can interpret them; visualizations are often used for this purpose. (Marz & Warren, 2014)

Challenges in Big Data

There are a number of challenges in big data, some of which have already been explained in their related contexts. The most prevalent challenges are:

- The heterogeneous sources and nature of the data.
- Incompleteness in the data.
- Problems with effective cleaning and extraction of the data.
- The scale and volume of the data.
- The timeliness of the data.
- The privacy of the data. (Tene & Polonetsky, 2013)
- The human collaboration needed in certain phases, and the lack of it in others.
- The lack of a suitable and effective system architecture designed specifically for big data.

Conclusion

This briefing paper has discussed an emerging topic in ICT called big data. After the introduction, the problem statement that gave rise to the concept of big data was presented. The subsequent sections discussed the characteristics and technologies related to big data, followed by a detailed description of the phases in processing big data. Finally, the challenges in big data were summarized.

References

Barlow, M., 2013. Real-Time Big Data Analytics: Emerging Architecture. s.l.: O'Reilly Media, Inc.

Boyd, D. & Crawford, K., 2011. Six Provocations for Big Data, s.l.: SSRN.

Leskovec, J., Rajaraman, A. & Ullman, J. D., 2014. Mining of Massive Datasets. s.l.: Cambridge University Press.

Madden, S., 2012. From Databases to Big Data. IEEE Internet Computing, 16(3), pp. 4-6.

Marz, N. & Warren, J., 2014. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. s.l.: Manning Publications Company.

Roebuck, K., 2011. Storing and Managing Big Data - NoSQL, Hadoop and More: High-impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. s.l.: Emereo Pty Limited.

Tene, O. & Polonetsky, J., 2013. Big Data for All: Privacy and User Control in the Age of Analytics. Northwestern Journal of Technology and Intellectual Property, XI(5).

Zikopoulos, P., 2011. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. s.l.: McGraw-Hill Professional.
