Learning How to Prepare Data


   It has been about a month since I last worked on my thesis project. I need to schedule a phone meeting with my mentor to discuss my research. Until then, I cannot work with the equipment serial numbers. In the meantime, I will continue to deepen my understanding of data science.

   I read the section in the CRISP-DM Process Model about data preparation. My mentor warned me that I would spend the bulk of my time cleaning up the raw data so that I can work with it. That is normal in this field: data scientists spend the majority of their time preparing data for analysis.

   According to the model, there are five steps in preparing the data: Select Data, Clean Data, Construct Data, Integrate Data, and Format Data.

      1. Select Data- which data am I going to include or exclude? The criteria are "relevance to data mining goals, quality, and technical constraints such as limits on data volumes or data types."
      2. Clean Data- "raise the data quality to the level required by selected analysis techniques." I might have to fill in missing data through modeling or find appropriate subsets of the data. Then, I have to report what actions I took and what decisions I made, what possible impact they might have for analysis, and how I "transformed" the data.
      3. Construct Data- now I have to plan which operations I would use to build the data set I need. "This task includes... production of derived attributes*, entire new records, or transformed values for existing attributes." Basically, I need to create new attributes or records from the data I already have.
      4. Integrate Data- "information is combined from multiple tables or records to create new records or values."
      5. Format Data- "syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool." I might have to change the order of the records or attributes, or even remove commas or trim values to fit a character limit.
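
   The five steps above can be sketched as a small pipeline. This is only a minimal sketch in plain Python (not R, and not code from the CRISP-DM paper). The records, field names, and function names are all made up for illustration, and dropping incomplete records is just one possible cleaning choice; filling in the missing values would be another.

```python
# Hypothetical records: every field name and value is made up for illustration.
parts = [
    {"serial": "A1", "length": 2.0, "width": 3.0},
    {"serial": "B2", "length": 4.0, "width": None},  # missing width
    {"serial": "C3", "length": 1.5, "width": 2.0},
]
# A second hypothetical table with different information about the same objects.
locations = {"A1": "Lab, North", "C3": "Shop"}


def select(records):
    """Select Data: keep only records that have a serial number."""
    return [r for r in records if r.get("serial")]


def clean(records):
    """Clean Data: drop records with a missing measurement and report which ones."""
    kept = [r for r in records if r["length"] is not None and r["width"] is not None]
    dropped = [r["serial"] for r in records if r not in kept]
    return kept, dropped


def construct(records):
    """Construct Data: add a derived attribute, area = length x width."""
    return [{**r, "area": r["length"] * r["width"]} for r in records]


def integrate(records, lookup):
    """Integrate Data: merge in the location from the second table."""
    return [{**r, "location": lookup.get(r["serial"], "unknown")} for r in records]


def format_rows(records):
    """Format Data: strip commas from text fields without changing their meaning."""
    return [
        {k: v.replace(",", "") if isinstance(v, str) else v for k, v in r.items()}
        for r in records
    ]


cleaned, dropped = clean(select(parts))
prepared = format_rows(integrate(construct(cleaned), locations))
```

   The `dropped` list is there because the Clean Data step asks for a report of what was done to the data, not just the cleaned result.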
               
   There are two ways to understand my data: syntactic understanding and semantic understanding. You use both in data preparation. Syntactic understanding is based on how the data is stored. Semantic understanding is "understanding based on how the data is used rather than how it is stored."

   I researched different data structures in the R programming language. There are vectors, lists, matrices, factors, and data frames. Not only did I learn the advantages and disadvantages of each, but I also learned how to code them in R. The book even gave example code for each. I anticipate that for my project, I might use matrices or data frames. A data frame is "part matrix and part list." It is "a table of observations." Think of it like an Excel spreadsheet.
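
   One way to picture the "part matrix and part list" idea is a set of named columns that all have the same length. The sketch below imitates that in plain Python; it is not R and not how R stores a data frame internally, and the column names and values are hypothetical.

```python
# A data-frame-like table: a dict of named columns, all the same length.
# Column names and values are hypothetical.
table = {
    "serial": ["A1", "B2", "C3"],
    "length": [2.0, 4.0, 1.5],
    "in_service": [True, False, True],
}

# Like a list: each column is a named element and may hold a different type.
lengths = table["length"]

# Like a matrix: every column has the same number of rows, so row i is well defined.
n_rows = len(table["serial"])
assert all(len(col) == n_rows for col in table.values())


def row(tbl, i):
    """Return observation i as a dict of column name -> value."""
    return {name: col[i] for name, col in tbl.items()}


second = row(table, 1)
```

   Each column plays the role of one spreadsheet column, and `row(table, i)` reads across one spreadsheet row.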




VOCAB:
derived attributes- new attributes that are constructed from one or more existing attributes in the same record. Example: area = length x width
merging tables- joining together two or more tables that have different information about the same object.
aggregation- operations where new values are computed by summarizing together information from multiple records and/or tables.
true type- a label applied to data points xi, such that xi are mutually comparable.
data density- indication of how data is clumped together. It is an assumption underlying any conclusions drawn from the data.
Recycling Rule- used in R for vector math and functions. It processes different-length vectors in pairs. When the shorter vector runs out of elements, R goes back to the beginning of that vector and "recycles" its elements until the operation is finished.
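
   The Recycling Rule can be imitated outside of R to see the pairing step by step. This is a minimal sketch in plain Python, not R's own implementation; the function name `recycled_add` is made up for illustration.

```python
from itertools import cycle, islice


def recycled_add(a, b):
    """Add two lists element by element, recycling the shorter one."""
    if not a or not b:
        return []
    n = max(len(a), len(b))
    # cycle() restarts a sequence from its beginning; islice() stops after n items.
    return [x + y for x, y in zip(islice(cycle(a), n), islice(cycle(b), n))]


# The shorter list [10, 20] is reused three times to cover all six elements.
result = recycled_add([1, 2, 3, 4, 5, 6], [10, 20])
print(result)  # [11, 22, 13, 24, 15, 26]
```

   In R, c(1, 2, 3, 4, 5, 6) + c(10, 20) pairs the elements the same way.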

Sources Used:
Chapman, P., Kerber, R., Clinton, J., Khabaza, T., Reinartz, T., & Wirth, R. (1999). The CRISP-DM Process Model [PDF file]. Discussion Paper. Retrieved from My Mentor.

Teetor, P. (2011). R Cookbook [PDF file]. Sebastopol, California: O'Reilly Media, Inc. Retrieved from My Mentor.

https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#697d2c756f63

https://www.slideshare.net/HadoopSummit/data-preparation-of-data-science

http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1092&context=icis2001

https://www.datasciencecentral.com/profiles/blogs/in-big-data-preparing-the-data-is-most-of-the-work

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/Data_Preparation_and_Metadata

http://ucanalytics.com/blogs/master-art-data-preparation-data-science/
