My mentor told me that I have clearance to work with equipment data from the United States Marine Corps. Any data that she sends to me outside of the Marine Corps must first be cleared by the security department. She gave me a list of possible topics:
"- Top cost drivers in field level maintenance of a specific equipment set. This could just be basic data analysis and summary statistics. If we have time, we can try some linear regression models based on usage data and using unit characteristics. We can focus on either LAVs or MTVRs. Google them and see if you have a preference. This will also depend on the amount of data available on each. I've attached an article that a friend of mine worked on that explains some of the things we've done in the past for different equipment sets.
- Clustering analysis to understand which repair parts are frequently ordered together. Again, we would probably look at LAVs, or we could look at AAVs for this one.
- Develop a function to link two disparate datasets by equipment serial numbers. This one is actually much more difficult than it sounds, but we have a high demand for it."
I'm leaving the list of topics in this blog post so that I can find them easily if I need to change my topic later on. For now, I am planning on doing the third topic, linking two datasets by equipment serial numbers. I am excited to see where this will lead me.
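To get a feel for the third topic, here is a minimal Python sketch of linking two datasets by serial number. Everything in it is an assumption made for illustration: the field names (`serial`, `SerialNo`), the sample records, and the idea that the mismatches are only formatting differences such as case, spaces, and hyphens. The real data will likely have harder problems (typos, reused or missing serials), which may be why my mentor says the task is more difficult than it sounds.

```python
import re

def normalize_serial(raw):
    """Reduce a serial number to uppercase letters and digits only."""
    return re.sub(r"[^A-Z0-9]", "", str(raw).upper())

def link_by_serial(left, right, left_key, right_key):
    """Pair rows from two lists of dicts whose normalized serials match.

    Returns (linked, unmatched): linked is a list of (left_row, right_row)
    pairs, unmatched is the list of left rows with no partner.
    """
    # Index the right-hand dataset by normalized serial for fast lookup.
    index = {}
    for row in right:
        key = normalize_serial(row[right_key])
        if key:  # skip blank serials so they never match each other
            index.setdefault(key, []).append(row)

    linked, unmatched = [], []
    for row in left:
        matches = index.get(normalize_serial(row[left_key]), [])
        if matches:
            linked.extend((row, match) for match in matches)
        else:
            unmatched.append(row)
    return linked, unmatched

# Hypothetical records, invented for this example (not real USMC data).
maintenance = [
    {"serial": "lav-00123", "repair_cost": 1500},
    {"serial": "LAV 00456", "repair_cost": 320},
    {"serial": "MTVR-9", "repair_cost": 75},
]
usage = [
    {"SerialNo": "LAV00123", "miles": 4200},
    {"SerialNo": "lav_00456 ", "miles": 980},
]

linked, unmatched = link_by_serial(maintenance, usage, "serial", "SerialNo")
for left_row, right_row in linked:
    print(left_row["serial"], "->", right_row["SerialNo"].strip())
print("unmatched:", [row["serial"] for row in unmatched])
```

Keeping the unmatched rows separate seems useful, since the records that fail to link are probably where the difficult cases are.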
First, I read the PDF article written by one of my mentor's friends. Even though I am not researching cost drivers (at the moment), it gives me an example of how to conduct a statistical analysis. The authors also included a formula for a combinatorial identity.
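Combinatorics- the area of mathematics concerned with methods of counting finite, discrete structures. A combinatorial identity is an equation between counting expressions, such as binomial coefficients, that holds for every valid value of its variables. I've also included lists of combinatorial identities in my sources below. There are a lot of formulas on those lists, and I may need them later.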
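To make the idea of a combinatorial identity concrete, here is a small Python sketch that checks three standard identities numerically: Pascal's rule, the row-sum identity, and Vandermonde's identity. These are my own picks from the well-known ones, not necessarily the formula used in the article, and checking a range of values is only a sanity check, not a proof.

```python
from math import comb

def pascal_rule_holds(n, k):
    """C(n, k) == C(n-1, k-1) + C(n-1, k), for 1 <= k <= n."""
    return comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)

def row_sum_holds(n):
    """The binomial coefficients C(n, 0..n) add up to 2**n."""
    return sum(comb(n, k) for k in range(n + 1)) == 2 ** n

def vandermonde_holds(m, n, r):
    """C(m+n, r) == sum over k of C(m, k) * C(n, r-k)."""
    return comb(m + n, r) == sum(comb(m, k) * comb(n, r - k) for k in range(r + 1))

# Check each identity over a small range of values.
print(all(pascal_rule_holds(n, k) for n in range(1, 12) for k in range(1, n + 1)))
print(all(row_sum_holds(n) for n in range(12)))
print(all(vandermonde_holds(m, n, r)
          for m in range(6) for n in range(6) for r in range(m + n + 1)))
```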
I read the MIT lecture about combinatorics. I do not know yet how I'm going to link the data sets, but these formulas should help give me an idea.
Continuing on data mining, I began reading the "Data Mining Problem Types" section of The CRISP-DM Process Model. That section discusses data description and summarization, which describe the "characteristics of the data, typically in an elementary and aggregated form." These summaries are usually ignored when beginning a data mining project, but they help a lot in the middle and at the end of the project: once you understand the data you are working with, you can decide what to do with it as the project goes along.
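As a rough picture of what "elementary and aggregated" description could look like, here is a minimal Python sketch that computes summary statistics per equipment type. The records and the choice of statistics are assumptions made up for this example; they do not come from the CRISP-DM paper or from any real data.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical (equipment type, repair cost) records, invented for illustration.
records = [
    ("LAV", 1200.0),
    ("LAV", 800.0),
    ("MTVR", 450.0),
    ("MTVR", 550.0),
    ("LAV", 1000.0),
]

def summarize(values):
    """Elementary description of one list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        # Sample standard deviation needs at least two values.
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }

def summarize_by_group(rows):
    """Aggregate (group, value) rows into one summary per group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: summarize(values) for group, values in groups.items()}

summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```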
CRISP-DM method- "a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance."
Besides data description and summarization, the types of data mining problems are segmentation, concept description, classification, prediction, and dependency analysis.
segmentation- "aims at the separation of data into interesting and meaningful subgroups or classes. All members of a subgroup share common characteristics."
"Appropriate Techniques: clustering techniques, neural nets, visualization"
concept description- "aims at an understandable description of concepts or classes. The purpose is... to gain insights."
"Appropriate Techniques: rule induction methods, concept clustering"
classification- "assumes that there is a set of objects- characterized by some attributes or features- which belong to different classes. The class label is a discrete (symbolic) value and is known for each object."
"Appropriate Techniques: discriminant analysis, rule induction methods, neural nets, k Nearest Neighbor, case-based reasoning, genetic algorithms"
prediction- "very similar to classification.... It means that the aim of prediction is to find the numerical value of the target attribute for unseen objects."
"Appropriate Techniques: regression analysis, regression trees, neural nets, k Nearest Neighbor, Box-Jenkins methods, genetic algorithms"
dependency analysis- "consists of finding a model which describes significant dependencies (or associations) between data items or events."
"Appropriate Techniques: correlation analysis, regression analysis, association rules, Bayesian networks, Inductive Logic Programming, visualization techniques"
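To see how one of the listed techniques works, here is a minimal Python sketch of k Nearest Neighbor used for classification: a new object gets the class label that is most common among the k closest labeled objects. The two numeric features, the labels "low" and "high", and the training points are all made up for this example, and a real analysis would normally scale the features first.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort the labeled points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (feature 1, feature 2) -> class label.
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((8.0, 8.0), "high"),
    ((8.5, 9.0), "high"),
    ((9.0, 8.5), "high"),
]

print(knn_classify(train, (2.0, 2.0)))
print(knn_classify(train, (8.0, 9.0)))
```

For prediction, the same neighbors could be used by averaging their numeric target values instead of taking a vote.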
At this point, class is about to end. Next time, I will study time series, which is covered in the last section of the R Cookbook.
Sources Used:
Bagley, B., Bodden, H., DeGrange, W., DeZwarte, C., Reitter, N., Schwamm, H., & Vinyard, B. (2016). Cost Driver in Vehicle Maintenance An Analytic Perspective [PDF file]. USMC: Phalanx. Retrieved from My Mentor.
https://artofproblemsolving.com/wiki/index.php?title=Combinatorial_identity
http://www.math.wvu.edu/~gould/Vol.4.PDF
http://people.qc.cuny.edu/faculty/christopher.hanusa/courses/636fa13/Documents/636fa13ch21.pdf
http://math.mit.edu/~fox/MAT307-lecture01.pdf
Chapman, P., Kerber, R., Clinton, J., Khabaza, T., Reinartz, T., & Wirth, R. (1999). The CRISP-DM Process Model [PDF file]. Discussion Paper. Retrieved from My Mentor.
Teetor, P. (2011). R Cookbook [PDF file]. Sebastopol, California: O'Reilly Media, Inc. Retrieved from My Mentor.