Introduction
Statistics is the art of collecting, organizing, summarizing, and analyzing available information and drawing conclusions from it. In other words, Statistics is a branch of knowledge that deals with the general problems of collecting, measuring, monitoring, analyzing, and comparing large amounts of information, and that studies the quantitative side of social phenomena in numerical form. The information used in Statistics is called data.
This book is designed to teach the fundamental principles of Statistics. It briefly covers three main branches of Statistics: Descriptive Analysis, Probability, and Inferential Analysis.
Descriptive Analysis presents, describes, sorts, and analyzes collected data using graphing tools and numerical characteristics. These topics are explained in Chapter 1 (graphical analysis) and Chapter 2 (numerical characteristics).
Probability tells us how likely an event is to happen. As a theoretical method, probability is widely used in mathematical applications to real-life problems. It is the most mathematical part of this book. We will cover the essential concepts of probability in Chapters 3-5.
Inferential Analysis allows us to make predictions and inferences based on the analysis of the collected data. Various methods can be used for this purpose. Inferential Analysis is a crucial part of our book; it will be covered in Chapters 6-10.
Why should you study Statistics?
Professionals need to understand how studies are performed and how data are presented in order to judge whether the conclusions of reported studies should be trusted. Appropriate decisions can then be made. Non-professionals can also benefit from the results of statistical studies.
The word “statistics” comes from the Latin “status”, meaning a state of affairs. A specialist in the field of statistics is called a statistician. In science, the term “statistics” was introduced by the German scientist Gottfried Achenwall in 1746, when he proposed replacing the name of the course “State Studies”, taught in German universities, with “Statistics”, thereby laying the foundation for the development of statistics as a science and an academic discipline. Statistical record keeping, however, was practised much earlier: population censuses were carried out in ancient China, the military potential of states was compared, records of citizens' property were kept in ancient Rome, and so on. Statistics develops a special methodology for collecting and processing material: mass statistical observation, the method of groupings, averages, indices, the balance method, graphical methods, cluster, discriminant, factor, and component analyses, optimization, and other methods of analyzing statistical data.
Basic Definitions
The objectives of Statistics include:
- Collecting data
- Organizing the collected data
- Analyzing the data
- Developing conclusions and making decisions
Censuses, regularly organized by governments, are good examples of collecting data. One of the first documented censuses was conducted in the year 1 B.C. by Roman Emperor Augustus. He ordered all the inhabitants of the Roman Empire to go to their town of birth to be counted. After the fall of the Roman Empire, from 400 to 1800, there was no real population census in Europe because of a lack of administrative resources. The census conducted in the 13th century by Mongol Emperor Möngke counted households, the number of men aged 15-60, and the number of fields, livestock, and vineyards.
There are many effective ways of collecting data. In Statistics, there is a branch called the design of experiments, which is dedicated to developing methods of data collection.
The collected data need to be organized before the analysis because, in most cases, it is difficult to see the main features of the initial observations. To distinguish the initial observations from the organized data, we call the former raw data.
Example I.1
Tom recorded the ages of his classmates for his Statistics course project. His records show the following: Camila, 21 years old; Madison, 27 years old; Emily, 19 years old; Boris, 21 years old; Ali, 32 years old.
We can consider this as raw data. This data can be organized using a table, indicating ages from youngest to oldest:
| Students | Emily | Boris | Camila | Madison | Ali |
| --- | --- | --- | --- | --- | --- |
| Ages | 19 | 21 | 21 | 27 | 32 |
Another option for data organization would be the alphabetical order of the first names of students.
| Students | Ali | Boris | Camila | Emily | Madison |
| --- | --- | --- | --- | --- | --- |
| Ages | 32 | 21 | 21 | 19 | 27 |
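The two orderings above can be reproduced with a minimal Python sketch. Python and the variable names (`raw_data`, `by_age`, `by_name`) are illustrative assumptions, not something the book prescribes. Students of the same age are listed alphabetically here, which matches the first table.

```python
# Raw data from Example I.1, in the order Tom recorded it.
raw_data = [
    ("Camila", 21),
    ("Madison", 27),
    ("Emily", 19),
    ("Boris", 21),
    ("Ali", 32),
]

# Organize from youngest to oldest; ties in age are broken by first name.
by_age = sorted(raw_data, key=lambda record: (record[1], record[0]))

# Organize alphabetically by first name.
by_name = sorted(raw_data, key=lambda record: record[0])

print([name for name, _ in by_age])
print([name for name, _ in by_name])
```

Both results hold the same five observations; only the order changes, which is what organizing raw data means in this example.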
The organized data can be analyzed depending on the purpose of the study. For instance, we might need to find the average age of the students in Example I.1:
Average = (19 + 21 + 21 + 27 + 32) / 5 = 120 / 5 = 24
We could also determine the range of the collected data:
Range = Largest Observation – Smallest Observation = 32 – 19 = 13
Now we can draw several conclusions about the ages of the students:
- The youngest student is 19 years old.
- The oldest student is 32 years old.
- The average age of the students is 24 years.
- 0.6, or 60%, of the students are younger than the average age.
- 0.4, or 40%, of the students are older than the average age.
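The calculations in Example I.1 can be checked with a short Python sketch. The variable names are illustrative assumptions. Only built-in functions are used, so nothing here depends on a particular statistics library.

```python
# Ages from Example I.1, ordered from youngest to oldest.
ages = [19, 21, 21, 27, 32]

# Average (mean): the sum of the observations divided by their count.
average = sum(ages) / len(ages)

# Range: largest observation minus smallest observation.
data_range = max(ages) - min(ages)

# Proportions of students younger and older than the average age.
below = sum(1 for age in ages if age < average) / len(ages)
above = sum(1 for age in ages if age > average) / len(ages)

print(average, data_range, below, above)
```

With these five ages, the sketch gives an average of 24, a range of 13, and proportions of 0.6 and 0.4, in agreement with the conclusions listed above.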
Later in this book, we will introduce more sophisticated procedures for organizing and analyzing collected data.
Data can be collected through surveys or studies. Surveys can be conducted in various ways, such as by phone, email, mail, or in-person interviews. Existing records can also be used if they are available. In observational studies, researchers collect data by observing outcomes.
We use the term population to denote the full collection of all observations. In Statistics, a population is not restricted to the biological or demographic sense. For example, later in this book, we will analyze the principles of quality control in light bulb production. We will consider all light bulbs produced during a shift as a population.
Usually, collecting data from the whole population is too time-consuming or expensive; sometimes it is simply impossible. For instance, the Canadian government spent about $650 million on the 2016 Census. Moreover, collecting data from the entire population does not always provide sufficient accuracy, because of inconsistent responses or possible bias. In the light bulb example mentioned above, testing can destroy the bulbs, which would defeat the purpose of collecting information about the entire population.
To resolve these problems, statisticians choose a subset of the population, collect data on this subset, analyze the collected data, and draw conclusions about the population based on this analysis. The selected subset of the population is called a sample.
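As a rough illustration of choosing a sample, the Python sketch below draws a simple random sample from a hypothetical population of light bulbs. The population size (1,000 bulbs), the sample size (50), and the seed are assumptions made only for this illustration. Simple random sampling is one possible selection method, not necessarily the one used in any study described in this book.

```python
import random

# Hypothetical population: ID numbers of 1,000 bulbs produced during one shift.
population = list(range(1, 1001))

# A fixed seed (arbitrary choice) makes the sketch reproducible.
rng = random.Random(2017)

# Simple random sample without replacement: every bulb has the same chance
# of being selected, and no bulb is selected twice.
sample = rng.sample(population, k=50)

print(len(sample), "bulbs selected out of", len(population))
```

Only the 50 sampled bulbs would need to be tested; conclusions about the whole shift would then be based on them.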
Example I.2.
The authors of this book conducted a research project in Prince Albert, Saskatchewan, in 2017, to study the relationship among educational attainment, employment, and income in the Indigenous and Non-Indigenous populations of the city. A reliable analysis required participants to answer a large number of questions. It was impossible to survey the entire city population in a timely manner: a full population survey would take so long that some answers (for instance, participants’ employment status) would change during the survey. Therefore, the researchers randomly selected 95 Indigenous and 105 Non-Indigenous residences for the survey. The collected data were organized and analyzed. The researchers then drew conclusions about the entire population of the city based on the analyses performed for these two samples.
The use of Statistics in making the right decisions is the most exciting part of this subject. Later in this book, we will discuss the ideas of developing conclusions and making decisions in more detail.