Welcome to Data Science: An Introduction

I'm codebeast, and in this course we are going to have a brief, accessible, and non-technical overview of the field of Data Science

Now, some people, when they hear Data Science, start thinking things like this: they hear Data and think about piles of equations and numbers, and then throw Science on top of that and think about people working in their lab, and they start to say, eh, that's not for me

I'm not really a technical person, and that just seems much too techy

Well, here's the important thing to know

While a lot of people get really fired up about the technical aspects of Data Science, the important thing is that Data Science is not so much a technical discipline as a creative one

And, really, that's true

The reason I say that is because in Data Science you use tools that come from coding, statistics, and math, but you use those to work creatively with data

The idea is that there's always more than one way to solve a problem or answer a question, and most importantly to get insight, because the goal, no matter how you go about it, is to get insight from your data

And what makes Data Science unique, compared to so many other things, is that you try to listen to all of your data, even when it doesn't fit in easily with your standard approaches and paradigms. You're trying to be much more inclusive in your analysis, and the reason you want to do that is because everything signifies

Everything carries meaning, and everything can give you additional understanding and insight into what's going on around you. So in this course, what we are trying to do is give you a map to the field of Data Science and how you can use it. Now you have the map in your hands, and you can get ready to get going with Data Science

Welcome back to Data Science: An Introduction

And we're going to begin this course by defining data science

That makes sense

But, we are going to be doing it in kind of a funny way

The first thing I am going to talk about is the demand for data science

So, let's take a quick look

Now, data science can be defined in a few ways

I am going to give you some short definitions

Take one on a definition is that data science is coding, math, and statistics in applied settings

That's a reasonable working definition

But, if you want to be a little more concise, I've got take two on a definition

It is that data science is the analysis of diverse data, or data that you didn't think would fit into standard analytic approaches

A third way to think about it is that data science is inclusive analysis

It includes all of the data, all of the information that you have, in order to get the most insightful and compelling answer to your research questions

Now, you may say to yourself, "Wait. That's it?" Well, if you're not impressed, let me show you a few things

First off, let's take a look at this article

It says, "Data Scientist: the Sexiest Job of the 21st Century." And please note, this is coming from Harvard Business Review

So, this is an authoritative source, and it is the official source of this saying: that data science is sexy! Now, again, you may be saying to yourself, "Sexy? I hardly think so." Oh yeah, it's sexy

And the reason data science is sexy is because first, it has rare qualities, and second, it has high demand

Let me say a little more about those

The rare qualities are that data science takes unstructured data, then finds order, meaning, and value in the data

Those are important, but they're not easy to come across

Second, high demand

Well, the reason it's in high demand is because data science provides insight into what's going on around you and, critically, it provides competitive advantage, which is a huge thing in business settings

Now, let me go back and say a little more about demand

Let's take a look at a few other sources

So, for instance, the McKinsey Global Institute published a very well-known paper, and you can get it with this URL

And if you go to that webpage, this is what's going to come up

And we're going to take a quick look at this one, the executive summary

It's a PDF that you can download

And if you open that up, you will find this page

And let's take a look at the bottom right corner

Two numbers here, I'm going to zoom in on those

The first one is that they are projecting a need in the next few years for somewhere between 140,000 and 190,000 deep analytical talent positions

So, this means actual practicing data scientists

That's a huge number; but almost ten times as high is the 1.5 million more data-savvy managers who will be needed to take full advantage of big data in the United States

Now, those are people who aren't necessarily doing the analysis but have to understand it, who have to speak data

And that's one of the main purposes of this particular course: to help people who may or may not be the practicing data scientists learn to understand what they can get out of data, and some of the methods used to get there

Let's take a look at another article from LinkedIn

Here is a shortcut URL for it, and that will bring you to this webpage: "The 25 hottest job skills that got people hired in 2014." And take a look at number one here: statistical analysis and data mining, very closely related to data science

And just to be clear, this was number one in Australia, and Brazil, and Canada, and France, and India, and the Netherlands, and South Africa, and the United Arab Emirates, and the United Kingdom


And if you need a little more, let's take a look at Glassdoor, which published an article this year, 2016, and it's about the "25 Best Jobs in America." And look at number one right here, it's data scientist

And we can zoom in on this information

It says there are going to be 1,700 job openings, with a median base salary of over $116,000, and fabulous career opportunities and job scores

So, if you want to take all of this together, the conclusion you can reach is that data science pays

And I can show you a little more about that

So for instance, here's a list of the top ten highest-paying jobs that I got from US News

We have physicians (or doctors), dentists, and lawyers, and so on

Now, if we add data scientist to this list, using data from O'Reilly.com, we have to push things around

And it goes in third with an average total salary (not the base we had in the other one, but the total compensation) of about $144,000 a year


So in sum, what do we get from all this? First off, we learn that there is a very high demand for data science

Second, we learn that there is a critical need for both specialists (those are the practicing data scientists) and generalists (the people who speak the language and know what can be done)

And of course, excellent pay

And all together, this makes Data Science a compelling career alternative and a way of making you better at whatever you are doing

Back here in data science, we're going to continue our attempt to define data science by looking at something that's really well known in the field: the Data Science Venn Diagram

Now if you want to, you can think of this in terms of, "What are the ingredients of data science?" Well, we're going to first say thanks to Drew Conway, the guy who came up with this

And if you want to see the original article, you can go to this address

But, what Drew said is that data science is made of three things

And we can put them as overlapping circles because it is the intersection that's important

Here on the top left is coding or computer programming, or as he calls it: hacking

On the top right is stats or mathematics, or quantitative abilities in general

And on the bottom is domain expertise, or intimate familiarity with a particular field of practice: business, or health, or education, or science, or something like that

And the intersection here in the middle, that is data science

So it's the combination of coding and statistics and math and domain knowledge

Now, let's say a little more about coding

The reason coding is important is because it helps you gather and prepare the data

Because a lot of the data comes from novel sources and is not necessarily ready for you to gather, and it can be in very unusual formats

And so coding is important because it can require some real creativity to get the data from the sources and put it into your analysis

Now, a few kinds of coding are important; for instance, there is statistical coding

A couple of major languages in this are R and Python

Two open-source, free programming languages

R, specifically for data

Python is general-purpose, but well adapted to data

The ability to work with databases is important too

The most common language there is SQL (usually pronounced "sequel"), which stands for Structured Query Language, and it matters because databases are where the data is
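To make the database point concrete, here is a minimal sketch of running a SQL query from Python with the standard-library `sqlite3` module. The `sales` table, its columns, and the rows are all made up for illustration; they do not come from the talk, and a real project would connect to an existing database rather than build one in memory.

```python
import sqlite3


def total_sales_by_region():
    """Build a throwaway in-memory table and summarize it with SQL."""
    conn = sqlite3.connect(":memory:")  # temporary database, gone on close
    # Hypothetical table and rows, invented for this sketch only.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 120.0), ("west", 80.0), ("east", 30.0)],
    )
    # The SQL itself: group the rows by region and add up the amounts.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


print(total_sales_by_region())
```

The query string is the part that is SQL; the surrounding Python only opens the database and collects the rows.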

Also, there is the command line interface, or if you are on a Mac, people just call it "the terminal." The most common language there is Bash, which actually stands for Bourne-again shell

And then searching is important, and that means regex, or regular expressions

While there is not a huge amount to learn there (it's a small little field), it's sort of like super-powered wildcard searching that makes it possible for you to both find the data and reformat it in ways that are going to be helpful for your analysis
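As a rough illustration of that "find and reformat" idea, the sketch below uses Python's standard `re` module. The sample sentence and the assumption that dates arrive as MM/DD/YYYY are invented for the example.

```python
import re

# Hypothetical messy text; the MM/DD/YYYY date format is an assumption.
raw = "Orders shipped 03/14/2016 and 11/02/2015, invoice pending."

# Find: two digits, slash, two digits, slash, four digits.
date_pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")
found = date_pattern.findall(raw)

# Reformat: rewrite every match as YYYY-MM-DD using the captured groups.
reformatted = date_pattern.sub(r"\3-\1-\2", raw)

print(found)
print(reformatted)
```

The same pattern does both jobs: `findall` locates the data and `sub` rewrites it into a form that sorts and parses more easily.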

Now, let's say a few things about the math

You're going to need things like a little bit of probability, some algebra, and, of course, regression (a very common statistical procedure)

Those things are important

And the reason you need the math is that it is going to help you choose the appropriate procedures to answer the question with the data that you have

And probably even more importantly, it is going to help you diagnose problems when things don't go as expected

And given that you are trying to do new things with new data in new ways, you are probably going to come across problems

So the ability to understand the mechanics of what is going on is going to give you a big advantage

And the third element of the data science Venn Diagram is some sort of domain expertise

Think of it as expertise in the field that you're in

Business settings are common

You need to know about the goals of that field, the methods that are used, and the constraints that people come across

And it's important because whatever your results are, you need to be able to implement them well

Data science is very practical and is designed to accomplish something

And your familiarity with a particular field of practice is going to make it that much easier and more impactful when you implement the results of your analysis

Now, let's go back to our Venn Diagram here just for a moment

Because this is a Venn, we also have these intersections of two circles at a time

At the top is machine learning

At the bottom right is traditional research

And on the bottom left hand is what Drew Conway called "the danger zone." Let me talk about each of these

First off, machine learning, or ML

Now, you think about machine learning, and the idea here is that it represents coding, or statistical programming, and mathematics, without any real domain expertise

Sometimes these are referred to as "black box" models

You kind of throw data in, and you don't even necessarily have to know what it means or what language it is in, and it will just kind of crunch through it all and it will give you some regularities

That can be very helpful, but machine learning is considered slightly different from data science because it doesn't involve the particular applications in a specific domain

Also, there's traditional research

This is where you have math or statistics and you have domain knowledge, often very intensive domain knowledge, but without the coding or programming

Now, you can get away with that because the data that you use in traditional research is highly structured

It comes in rows and columns, and is typically complete and is typically ready for analysis

That doesn't mean your life is easy, because now you have to expend an enormous amount of effort on the methods, the design of the project, and the interpretation of the data

So, still very heavy intellectual cognitive work, but it comes from a different place

And then finally, there is what Conway called "the danger zone." And that's the intersection of coding and domain knowledge, but without math or statistics

Now he says it is unlikely to happen, and that is probably true

On the other hand, I can think of some common examples, what are called "word counts," where you take a large document or a series of documents, and you count how many times a word appears in there

That can actually tell you some very important things

And also, drawing maps and showing how things change across place and maybe even across time

You don't necessarily have to have the math, but it can be very insightful and helpful

So, let's think about a couple of backgrounds that people come from

First is coding

You can have people who are coders, who can do math, stats, and business

So, you get the three things, and this is probably the most common: most of the people come from a programming background

On the other hand, there is also stats, or statistics

And you can get statisticians who can code and who also can do business

That's less common, but it does happen

And finally, there are people who come into data science from a particular domain

And these are, for instance, business people who can code and do numbers

And they are the least common

But, all of these are important to data science

And so in sum, here is what we can take away

First, several fields make up Data Science

Second, diverse skills and backgrounds are important and they are needed in data science

And third, there are many roles involved because there are a lot of different things that need to happen

We'll say more about that in our next movie

The next step in our data science introduction and our definition of data science is to talk about the Data Science Pathway

So I like to think of this as, when you are working on a major project, you have got to do one step at a time to get it from here to there

In data science, you can take the various steps and you can put them into a couple of general categories

First, there are the steps that involve planning

Second, there's the data prep

Third, there's the actual modeling of the data

And fourth, there's the follow-up

And there are several steps within each of these; I'll explain each of them briefly

First, let's talk about planning

The first thing that you need to do is define the goals of your project so you know how to use your resources well, and also so you know when you are done

Second, you need to organize your resources

So you might have data from several different sources; you might have different software packages, you might have different people

Which gets us to the third one: you need to coordinate the people so they can work together productively

If you are doing a hand-off, it needs to be clear who is going to do what and how their work is going to go together

And then, really to state the obvious, you need to schedule the project so things can move along smoothly and you can finish in a reasonable amount of time

Next is the data prep, which is like food prep, where you are getting the raw ingredients ready

First, of course, you need to get the data

And it can come from many different sources and be in many different formats

You need to clean the data and, the sad thing is, this tends to be a very large part of any data science project

And that is because you are bringing in unusual data from a lot of different places

You also want to explore the data; that is, really see what it looks like, how many people are in each group, what the shapes of the distributions are like, what is associated with what

And you may need to refine the data

And that means choosing variables to include, choosing cases to include or exclude, and making any transformations to the data you need to do

And of course these steps can kind of bounce back and forth from one to the other

The third group is modeling or statistical modeling

This is where you actually want to create the statistical model

So for instance, you might do a regression analysis or you might do a neural network

But, whatever you do, once you create your model, you have to validate the model

You might do that with a holdout validation

You might even do it with a very small replication, if you can
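To show what holdout validation can look like in practice, here is a small sketch using only the Python standard library. The data are synthetic (a straight line plus random noise), the 40/10 split is an arbitrary choice, and the helper names `fit_line` and `mean_squared_error` are invented for this example; none of this is taken from the talk.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(points, intercept, slope):
    """Average squared gap between observed and predicted y."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y = 2 + 3x plus noise with sd 1.
data = [(x, 2.0 + 3.0 * x + rng.gauss(0, 1)) for x in range(50)]
rng.shuffle(data)

# Hold out 10 of the 50 cases; the model never sees them while fitting.
train, holdout = data[:40], data[40:]

intercept, slope = fit_line(train)
holdout_mse = mean_squared_error(holdout, intercept, slope)

print(f"slope={slope:.3f} intercept={intercept:.3f} holdout MSE={holdout_mse:.3f}")
```

The key design choice is that the error is measured only on the cases set aside before fitting, so it estimates how the model behaves on data it has not seen.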

You also need to evaluate the model

So, once you know that the model is accurate, what does it actually mean and how much does it tell you? And then finally, you need to refine the model

So, for instance, there may be variables you want to throw out; maybe additional ones you want to include

You may want to, again, transform some of the data

You may want to get it so it is easier to interpret and apply

And that gets us to the last part of the data science pathway

And that's follow up

And once you have created your model, you need to present the model

Because it is usually work that is being done for a client, could be in house, could be a third party

But you need to take the insights that you got and share them in a meaningful way with other people

You also need to deploy the model; it is usually being done in order to accomplish something

So, for instance, if you are working with an e-commerce site, you may be developing a recommendation engine that says, "people who bought this and this might buy this." You need to actually stick it on the website and see if it works the way that you expected it to
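A toy version of that "people who bought this might buy this" idea can be sketched by counting which items show up together in past orders. This is only an illustration: the order history is invented, the function names are hypothetical, and production recommendation engines are far more involved than simple co-purchase counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history, invented for this sketch.
orders = [
    {"tent", "stove", "lantern"},
    {"tent", "stove"},
    {"tent", "lantern"},
    {"stove", "kettle"},
]


def co_purchase_counts(orders):
    """Count how often each ordered pair of items appears in one basket."""
    counts = Counter()
    for basket in orders:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # record both directions for easy lookup
    return counts


def recommend(item, orders, top_n=2):
    """Items most often bought with `item`; ties are broken alphabetically."""
    counts = co_purchase_counts(orders)
    partners = [(other, n) for (first, other), n in counts.items() if first == item]
    partners.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in partners[:top_n]]


print(recommend("tent", orders))
```

Deploying something like this means checking it against real behavior, which is the point made above: see whether it works the way you expected once it is on the site.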

Then you need to revisit the model because a lot of the time, the data that you worked on is not necessarily all of the data, and things can change when you get out in the real world or things just change over time

So, you have to see how well your model is working

And then, just to be thorough, you need to archive the assets, document what you have, and make it possible for you or for others to repeat the analysis or develop off of it in the future

So, those are the general steps of what I consider the data science pathway

And in sum, what we get from this is three things

First, data science isn't just a technical field, it is not just coding

Things like planning and presenting and implementing are just as important

Also, contextual skills, knowing how it works in a particular field, knowing how it will be implemented, those skills matter as well

And then, as you got from this whole thing, there are a lot of things to do

And if you go one step at a time, there will be less backtracking and you will ultimately be more productive in your data science projects

We'll continue our definition of data science by looking at the roles that are involved in data science

The way that different people can contribute to it

That's because it tends to be a collaborative thing, and it's nice to be able to say that we are all together, working together towards a single goal

So, let's talk about some of the roles involved in data science and how they contribute to the projects

First off, let's take a look at engineers

These are people who focus on the back end hardware

For instance, the servers and the software that runs them

This is what makes data science possible, and it includes people like developers, software developers, or database administrators

And they provide the foundation for the rest of the work

Next, you can also have people who are Big Data specialists

These are people who focus on computer science and mathematics, and they may do machine learning algorithms as a way of processing very large amounts of data

And they often create what are called data products

So, a thing that tells you what restaurant to go to, or that says, "you might know these friends," or provides ways of linking up photos

Those are data products, and those often involve a huge amount of very technical work behind them

There are also researchers; these are people who focus on domain-specific research

So, for instance, physics, or genetics, or whatever

And these people tend to have very strong statistics, and they can use some of the procedures and some of the data that comes from the other people like the Big Data specialists, but they focus on the specific questions

Also in the data science realm, you will find analysts

These are people who focus on the day-to-day tasks of running a business

So for instance, they might do web analytics (like Google Analytics), or they might pull data from a SQL database

And this information is very important and good for business

So, analysts are key to the day-to-day function of business, but they may not be exactly Data Science proper, because most of the data they are working with is going to be pretty structured

Nevertheless, they play a critical role in business in general

And then, speaking of business

You have the actual business people; the men and women who organize and run businesses

These people need to be able to frame business-relevant questions that can be answered with the data

Also, the business person manages the project and the efforts and the resources of others

And while they may not actually be doing the coding, they must speak data; they must know how the data works, what it can answer, and how to implement it

You can also have entrepreneurs

So, you might have a data startup; they are starting their own little social network, their own little web search platform

An entrepreneur needs data and business skills

And truthfully, they have to be creative at every step along the way

Usually because they are doing it all themselves at a smaller scale

Then we have in data science something known as "the full stack unicorn." And this is a person who can do everything at an expert level

They are called a unicorn because truthfully, they may not actually exist

I will have more to say about that later

But for right now, we can sum up what we got out of this video in three things

Number one, data science is diverse

There are a lot of different people who go into it, and they have different goals for their work, and they bring in different skills and different experiences and different approaches

Also, they tend to work in very different contexts

An entrepreneur works in a very different place from a business manager, who works in a very different place from an academic researcher

But, all of them are connected in some way to data science and make it a richer field

The last thing I want to say in "Data Science: An Introduction," where I am trying to define data science, is to talk about teams in data science

The idea here is that data science has many different tools, and different people are going to be experts in each one of them

Now, you have, for instance, coding and you have statistics

Also, you have fields like design, or business and management, that are involved

And the question, of course, is: "Who can do all of it? Who's able to do all of these things at the level that we need?" Well, that's where we get this saying (I have mentioned it before): it's the unicorn

And just like in ancient history, the unicorn is a mythical creature with magical abilities

In data science, it works a little differently

It is a mythical Data Scientist with universal abilities

The trouble is, as we know from the real world, there are really no unicorns (animals), and there are really not very many unicorns in data science

Really, there are just people

And so we have to find out how we can do the projects even though we don't have this one person who can do everything for everybody

So let's take a hypothetical case, just for a moment

I am going to give you some fictional people

Here is my fictional person Otto, who has strong visualization skills, who has good coding, but has limited analytic or statistical ability

And we can graph out his abilities

So, here we have five things that we need to have happen

And for the project to work, they all have to happen at least at a level of eight on the zero-to-ten scale

If we take his coding ability, he is almost there

Statistics, not quite halfway

Graphics, yes he can do that

And then, business, eh, alright

And project, pretty good

So, what you can see here is that in only one of these five areas is Otto sufficient on his own

On the other hand, let's pair him up with somebody else

Let's take a look at Lucy

And Lucy has strong business training, has good tech skills, but has limited graphics

And if we get her profile on the same thing that we saw, there is coding, pretty good

Statistics, pretty good

Graphics, not so much

Business, good

And projects, OK

Now, the important thing here is that we can make a team

So let's take our two fictional people, Otto and Lucy, and we can put together their abilities

Now, I actually have to change the scale here a little bit to accommodate the both of them

But our criterion still is at eight; we need a level of eight in order to do the project competently

And if we combine them: oh look, coding is now past eight

Statistics is past eight

Graphics is way past

Business way past

And then the projects, they are too

So when we combine their skills, we are able to get the level that we need for everything

Or to put it another way, we have now created a unicorn by team, and that makes it possible to do the data science project
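The team arithmetic above can be written out as a short sketch. The talk gives no exact scores, so every number below is an assumption chosen only to match the verbal descriptions (for example, Otto clearing the bar in graphics alone); the threshold of eight is the one stated above.

```python
THRESHOLD = 8  # the level each area must reach, per the example above

# Hypothetical zero-to-ten scores; the talk describes them only in words.
otto = {"coding": 7, "statistics": 4, "graphics": 9, "business": 5, "project": 6}
lucy = {"coding": 6, "statistics": 7, "graphics": 3, "business": 8, "project": 5}


def areas_met(profile, threshold=THRESHOLD):
    """Return the areas where a profile reaches the threshold, sorted by name."""
    return sorted(area for area, score in profile.items() if score >= threshold)


# Combine the two people by adding their scores area by area,
# which is why the scale has to stretch past ten.
team = {area: otto[area] + lucy[area] for area in otto}

print("Otto alone:", areas_met(otto))
print("Team:", areas_met(team))
```

Adding scores is the simplification the example itself uses; a real team's combined ability in an area is not literally a sum.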

So, in sum: you usually can't do data science on your own

That's a very rare individual

Or more specifically: people need people, and in data science you have the opportunity to take several people and make collective unicorns, so you can get the insight that you need in your project and you can get the things done that you want

In order to get a better understanding of data science, it can be helpful to look at contrasts between data science and other fields

Probably the most informative is with Big Data because these two terms are actually often confused

It makes me think of situations where you have two things that are very similar, but not the same

Like we have here in the Piazza San Carlo in Italy

Part of the problem stems from the fact that data science and big data both have Venn Diagrams associated with them

So, for instance, Venn number one for data science is something we have seen already

We have three circles and we have coding and we have math and we have some domain expertise, which put together give you data science

On the other hand, Venn Diagram number two is for Big Data

It also has three circles

And we have the high volume of data, the rapid velocity of data, and the extreme variety of data

Take those three v's together and you get Big Data

Now, we can also combine these two if we want in a third Venn Diagram, which we call Big Data and Data Science

This time it is just two circles

With Big Data on the left and Data Science on the right

And the intersection in the middle, there is Big Data Science, which actually is a real term

But, if you want to do a compare and contrast, it kind of helps to look at how you can have one without the other

So, let's start by looking at Big Data without Data Science

So, these are situations where you may have the volume or velocity or variety of data but don't need all the tools of data science

So, we are just looking at the left side of the equation right now

Now, truthfully, this only works if you have Big Data without all three V's

Some say you have to have the volume, velocity, and variety for it to count as Big Data

I basically say anything that doesn't fit into a standard machine is probably Big Data

I can think of a couple of examples here of things that might count as Big Data, but maybe don't count as Data Science

Machine learning, where you can have very large and probably very complex data sets, doesn't require very much domain expertise, so that may not be data science

Word counts, where you have an enormous amount of data and it's actually a pretty simple analysis, again don't require much sophistication in terms of quantitative skills or even domain expertise

So, maybe/maybe not data science

On the other hand, to do any of these you are going to need to have at least two skills

You are going to need to have the coding and you will probably have to have some sort of quantitative skills as well

So, how about data science without Big Data? That's the right side of this diagram

Well, to make that happen you are probably talking about data with just one of the three V's from Big Data

So, either volume or velocity or variety, but singly

So for instance, genetics data

You have a huge amount of data, and it comes in a very set structure, and it tends to come in all at once

So, you have got a lot of volume and it is a very challenging thing to work with

You have to use data science, but it may or may not count as Big Data

Similarly, streaming sensor data, where you have data coming in very quickly, but you are not necessarily saving it; you are just looking at these windows in it

That is a lot of velocity, and it is difficult to deal with, and it takes Data Science, the full skill set, but it may not require Big Data, per se

Or facial recognition, where you have enormous variety in the data because you are getting photos or videos that are coming in

Again, very difficult to deal with, requires a lot of ingenuity and creativity, and may or may not count as Big Data, depending on how much of a stickler you are about definitions

Now, if you want to combine the two, we can talk about Big Data Science

In that case, we are looking right here at the middle

This is a situation where you have volume, and velocity, and variety in your data and truthfully, if you have the three of those, you are going to need the full Data Science skill set

You are going to need coding, and statistics, and math, and you are going to have to have domain expertise

Primarily because of the variety you are dealing with, but taken all together you do have to have all of it

So in sum, here is what we get

Big Data is not equal to, it is not identical to, data science

Now, there is common ground, and a lot of people who are good at Big Data are good at data science and vice versa, but they are conceptually distinct

On the other hand, there is the shared middle ground of Big Data Science that unifies the two separate fields

Another important contrast you can make in trying to understand data science is to compare it with coding or computer programming

Now, this is where you are trying to work with machines and you are trying to talk to that machine, to get it to do things

In one sense you can think of coding as just giving task instructions; how to do something

It is a lot like a recipe when you're cooking

You get some sort of user input or other input, and then maybe you have if/then logic, and you get output from it

To take an extremely simple example, if you are programming in Python version 2, you write: print, and then in quotes, "Hello, world!", and that will put the words "Hello, world!" on the screen

So, you gave it some instructions and it gave you some output

Very simple programming
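The example above describes Python 2, where `print` is a statement. As a small sketch in Python 3, where `print` is a function called with parentheses, the same idea can be extended to show input, if/then logic, and output together. The `greet` function and its `name` parameter are invented for illustration.

```python
def greet(name):
    """Take some input, apply if/then logic, and return the output text."""
    if name:  # if a name was supplied, greet that person
        return f"Hello, {name}!"
    return "Hello, world!"  # otherwise fall back to the classic greeting


# Python 3 syntax: print is a function, so the text goes in parentheses.
print(greet(""))
print(greet("Ada"))
```

The instructions go in, the logic picks a branch, and the text comes out on the screen.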

Now, coding and data gets a little more complicated

So, for instance, there are word counts, where you take a book or a whole collection of books, you take the words and you count how many there are in there

Now, this is a conceptually simple task, and domain expertise and really math and statistics are not vital

But to make valid inferences and generalizations in the face of variability and uncertainty in the data you need statistics, and by extension, you need data science
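The word count described above really is a short program. Here is a minimal sketch with the standard library; the sample sentence is made up, and treating a "word" as any run of letters or apostrophes is a simplifying assumption.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split it into words, and tally each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text standing in for a book or a collection of books.
sample = "The data tell a story. The story is in the data."
counts = word_counts(sample)

print(counts.most_common(3))
```

Counting is the easy part; deciding what the counts mean, and how far they generalize beyond this text, is where the statistics come in.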

It might help to comparethe two by looking at the tools of the respective trades

So for instance, there are tools forcoding or generic computer programming, and there are tools that are specific for datascience

So, what I have right here is a list from the IEEE of the top ten programming languagesof 2015

And it starts at Java and C and goes down to Shell

And some of these are also used for data science

So for instance, Python and R and SQL are used for data science, but the other ones aren't major ones in data science

So, let's, in fact, take a look at a different list of most popular tools for data science and you see that things move around a little bit

Now, R is at the top, SQL is there, Python is there, but for me what is the most interesting on the list is that Excel is number five, which would never be considered programming, per se, but it is, in fact, a very important tool for data science

And that is one of the ways that we can compare and contrast computer programming with data science

In sum, we can say this: data science is not equal to coding

They are different things

On the other hand, they share some of the tools and they share some practices, specifically when coding for data

Still, there is one very big difference, in that statistical ability is one of the major separators between general purpose programming and data science programming

When we talk about data science and contrast it with other fields, another pair that a lot of people confuse and think are the same thing is data science and statistics

Now, I will tell you there is a lot in common, but we can talk a little bit about the different focuses of each

And we also get into the issue of definitionalism: that data science is different because we define it differently, even when there is an awful lot in common between the two

It helps to take a look at some of the things that go on in each field

So, let's start here with statistics

Put a little circle here and we will put data science

And, to borrow a term from Stephen Jay Gould, we can call these non-overlapping magisteria, or NOMA

So, you think of them as separate fields that are sovereign unto themselves with nothing to do with each other

But, you know, that doesn't seem right; and part of that is that if we go back to the Data Science Venn Diagram, statistics is one part of it

There it is in the top corner

So, now what do we do? What's the relationship? It doesn't make sense to say these are totally separate areas; because data science and statistics share procedures, maybe data science is a subset or specialty of statistics, more like this

But, if data science were just a subset or specialty within statistics then it would follow that all data scientists would first be statisticians

And interestingly that's just not so

Say, for instance, we take a look at the data science stars, the superstars in the field

We go to a rather intimidating article; it's called "The World's 7 Most Powerful Data Scientists" from Forbes


You can see the article if you go to this URL

There's actually more than seven people, because sometimes he brings them up in pairs

Let's check their degrees, see what their academic training is in

If we take all the people on this list, we have five degrees in computer science, three in math, two in engineering, and one each in biology, economics, law, speech pathology, and one in statistics

And so that tells us, of course, these major people in data science are not trained as statisticians

Only one of them has formal training in that

So, that gets us to the next question

Where do these two fields, statistics and data science, diverge? Because they seem like they should have a lot in common, but they don't have a lot in common in training

Specifically, we can look at the training

Most data scientists are not trained, formally, as statisticians

Also, in practice, things like machine learning and big data, which are central to data science, are not shared, generally, with most of statistics

So, they have separate domains there

And then there is the important issue of context

Data scientists tend to work in different settings than statisticians

Specifically, data scientists very often work in commercial settings where they are trying to get recommendation engines or ways of developing a product that will make them money

So, maybe instead of having data science as a subset of statistics, we can think of it more as these two fields having different niches

They both analyze data, but they do different things in different ways

So, maybe it is fair to say they share, they overlap, they have the analysis of data in common, but otherwise, they are ecologically distinct

So, in sum: what we can say here is that data science and statistics both use data and they analyze it

But the people in each tend to come from different backgrounds, and they tend to function with different goals and contexts

And that renders them conceptually distinct fields, despite the apparent overlap

As we work to get a grasp on data science, there is one more contrast I want to make explicitly, and that is between data science and business intelligence, or BI

The idea here is that business intelligence is data in real life; it's very, very applied stuff

The purpose of BI is to get data on internal operations, on market competitors, and so on, and make justifiable decisions as opposed to just sitting in the bar and doing whatever comes to your mind

Now, data science is involved with this, except, you know, really there is no coding in BI

There's using apps that already exist

And the statistics in business intelligence tend to be very simple, they tend to be counts and percentages and ratios

And so, it's simple, the light bulb is simple; it just does its one job, and there is nothing super sophisticated there

Instead the focus in business intelligence is on domain expertise and on really useful direct utility

It's simple, it's effective and it provides insight

Now, one of the main associations with business intelligence is what are called dashboards, or data dashboards

They look like this; it is a collection of charts and tables that go together to give you a very quick overview of what is going on in your business

And while a lot of data scientists may, let's say, look down their noses at dashboards, I'll say this: most of them are very well designed and you can learn a huge amount about user interaction and the accessibility of information from dashboards

So really, where does data science come into this? What is the connection between data science and business intelligence? Well, data science can be useful to BI in terms of setting it up

Identifying data sources and creating or setting up the framework for something like a dashboard or a business intelligence system

Also, data science can be used to extend it

Data science can be used to get past the easy questions and the easy data, to get to the questions that are actually most useful to you, even if they sometimes require data that is really hard to wrangle and work with

And also, there is an interesting interaction here that goes the other way

Data science practitioners can learn a lot about design from good business intelligence applications

So, I strongly encourage anybody in data science to look at them carefully and see what they can learn

In sum: business intelligence, or BI, is very goal oriented

Data science perhaps prepares the data and sets up the form for business intelligence, but also data science can learn a lot about usability and accessibility from business intelligence

And so, it is always worth taking a close look

Data science has a lot of really wonderful things about it, but it is important to consider some ethical issues, and I will specifically call this "do no harm" in your data science projects

And for that we can say thanks to Hippocrates, the guy who gave us the Hippocratic Oath of Do No Harm

Let's specifically talk about some of the important ethical issues, very briefly, that come up in data science

Number one is privacy

Data tells you a lot about people and you need to be concerned about confidentiality

If you have private information about people, their names, their social security numbers, their addresses, their credit scores, their health, that's private, that's confidential, and you shouldn't share that information unless they specifically gave you permission

Now, one of the reasons this presents a special challenge in data science is that, as we will see later, a lot of the sources that are used in data science were not intended for sharing

If you scrape data from a website or from PDFs, you need to make sure that it is ok to do that

But it was originally created without the intention of sharing, so privacy is something that really falls upon the analyst to make sure they are doing it properly

Next, is anonymity

One of the interesting things we find is that it is really not hard to identify people in data

If you have a little bit of GPS data and you know where a person was at four different points in time, you have about a 95% chance of knowing exactly who they are

You look at things like HIPAA, that's the Health Insurance Portability and Accountability Act

Before HIPAA, it was really easy to identify people from medical records

Since then, it has become much more difficult to identify people uniquely

That's a really important thing for people's well-being

And then also, proprietary data; if you are working for a client, a company, and they give you their own data, that data may have identifiers

You may know who the people are, they are not anonymous anymore

So, anonymity may or may not be there, but there are major efforts to make data anonymous

But really, the primary thing is, even if you do know who they are, that you still maintain the privacy and confidentiality of the data

Next, there is the issue about copyright, where people try to lock down information

Now, just because something is on the web, doesn't mean that you are allowed to use it

Scraping data from websites is a very common and useful way of getting data for projects

You can get data from web pages, from PDFs, from images, from audio, from really a huge number of things

But, again, the assumption that because it is on the web, it's ok to use it is not true

You always need to check copyright and make sure that it is acceptable for you to access that particular data

Next, and our very ominous picture, is data security, and the idea here is that when you go through all the effort to gather data, to clean it up and prepare it for analysis, you have created something that is very valuable to a lot of people, and you have to be concerned about hackers trying to come in and steal the data, especially if the data is not anonymous and it has identifiers in it

And so, there is an additional burden placed on the analyst to ensure to the best of their ability that the data is safe and cannot be broken into and stolen

And that can include very simple things, like a person who was on the project but no longer is, and who took the data on a flash drive

You have to find ways to make sure that that can't happen as well

There's a lot of possibilities, it's tricky, but it is something that you have to consider thoroughly

Now, two other things that come up in terms of ethics, but usually don't get addressed in these conversations

Number one is potential bias

The idea here is that the algorithms or the formulas that are used in data science are only as neutral or bias-free as the rules and the data that they get

And so, the idea here is that if you have rules that address something that is associated with, for instance, gender or age or race or economic standing, you might unintentionally be building in those factors

Which, say, under Title IX, for instance, you are not supposed to do

You might be building those into the system without being aware of it, and an algorithm has this sheen of objectivity, and people say they can place confidence in it without realizing that it is replicating some of the prejudices that may happen in real life

Another issue is overconfidence

And the idea here is that analyses are limited simplifications

They have to be, that is just what they are

And because of this, you still need humans in the loop to help interpret and apply this

The problem is when people run an algorithm to get a number, say to ten decimal places, and they say, "this must be true," and treat it as written-in-stone, absolutely unshakeable truth, when in fact, if the data were biased going in, if the algorithms were incomplete, if the sampling was not representative, you can have enormous problems and go down the wrong path with too much confidence in your own analyses

So, once again humility is in order when doing data science work

In sum: data science has enormous potential, but it also has significant risks involved in the projects

Part of the problem is that analyses can't be neutral, that you have to look at how the algorithms are associated with the preferences, prejudices, and biases of the people who made them

And what that means is that no matter what, good judgment is always vital to quality and success of a data science project

Data Science is a field that is strongly associated with its methods or procedures

In this section of videos, we're going to provide a brief overview of the methods that are used in data science

Now, just as a quick warning, in this section things can get kind of technical and that can cause some people to sort of freak out

But, this course is a non-technical overview

The technical hands-on stuff is in the other courses

And it is really important to remember that tech is simply the means to doing data science

Insight or the ability to find meaning in your data, that's the goal

Tech only helps you get there

And so, we want to focus primarily on insight and the tools and the tech as they serve to further that goal

Now, there's a few general categories we are going to talk about, again, with an overview for each of these

The first one is sourcing or data sourcing

That is how to get the data that goes into data science, the raw materials that you need

Second is coding

That again is computer programming that can be used to obtain and manipulate and analyze the data

After that, a tiny bit of math and that is the mathematics behind data science methods that really form the foundations of the procedures

And then stats, the statistical methods that are frequently used to summarize and analyze data, especially as applied to data science

And then there is machine learning, ML, this is a collection of methods for finding clusters in the data, for predicting categories or scores on interesting outcomes

And even across these five things, even then, the presentations aren't too techie-crunchy, they are basically still friendly

Really, that's the way it is

So, that is the overview of the overviews

In sum: we need to remember that data science includes tech, but data science is greater than tech, it is more than those procedures

And above all, that tech while important to data science is still simply a means to insight in data

The first step in discussing data science methods is to look at the methods of sourcing, or getting data that is used in data science

You can think of this as getting the raw materials that go into your analyses

Now, you have got a few different choices when it comes to this in data science

You can use existing data, you can use something called data APIs, you can scrape web data, or you can make data

We'll talk about each of those very briefly in a non-technical manner

For right now, let me say something about existing data

This is data that already is at hand and it might be in-house data

So if you work for a company, it might be your company records

Or, you might have open data; for instance, many governments and many scientific organizations make their data available to the public

And then there is also third party data

This is usually data that you buy from a vendor, but it exists and it is very easy to plug it in and go

You can also use APIs

Now, that stands for Application Programming Interface, and this is something that allows various computer applications to communicate directly with each other

It's like phones for your computer programs

It is the most common way of getting web data, and the beautiful thing about it is it allows you to import that data directly into whatever program or application you are using to analyze the data

Next is scraping data

And this is where you want to use data that is on the web, but they don't have an existing API

And what that means is usually data that's in HTML web tables and pages, maybe PDFs

And you can do this either by using specialized applications for scraping data or you can do it in a programming language, like R or Python, and write the code to do the data scraping
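
Just to give a feel for the write-the-code option, here is a minimal sketch in Python using only the standard library's html.parser; the little table, the city names, and the class name TableScraper are all made up for illustration, and real pages are usually messier and often call for a dedicated scraping library:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# A made-up page fragment standing in for something found on the web.
page = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```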

Or another option is to make data

And this lets you get exactly what you need; you can be very specific and you can get what you need

You can do something like interviews, or you can do surveys, or you can do experiments

There are a lot of approaches, and most of them require some specialized training in terms of how to gather quality data

And that is actually important to remember, because no matter what method you use for getting or making new data, you need to remember this one little aphorism you may have heard from computer science

It goes by the name of GIGO: that actually stands for "Garbage In, Garbage Out," and it means if you have bad data that you are feeding into your system, you are not going to get anything worthwhile, any real insights out of it

Consequently, it is important to pay attention to metrics, or methods for measuring, and to their meaning: exactly what it is that they tell you

There's a few ways you can do this

For instance, you can talk about business metrics, you can talk about KPIs, which means Key Performance Indicators, also used in business settings

Or SMART goals, which is a way of describing the goals that are actionable and timely and so on

You can also talk about, in a measurement sense, classification accuracy

And I will discuss each of those in a little more detail in a later movie
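
As a small preview of the last of those, classification accuracy in its simplest form is just the share of cases you labeled correctly; here is a sketch with made-up spam and ham labels, where the function name accuracy is an assumption for illustration:

```python
def accuracy(predicted, actual):
    """Fraction of cases where the predicted label matches the actual one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels: four of the five predictions match.
predicted = ["spam", "ham", "spam", "ham", "ham"]
actual = ["spam", "ham", "ham", "ham", "ham"]
print(accuracy(predicted, actual))
```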

But for right now, in sum, we can say this: data sourcing is important because you need to get the raw materials for your analysis

The nice thing is there's many possible methods, many ways that you can use to get the data for data science

But no matter what you do, it is important to check the quality and the meaning of the data so you can get the most insight possible out of your project

The next step we need to talk about in data science methods is coding, and I am going to give you a very brief non-technical overview of coding in data science

The idea here is that you are going to get in there and you are going to be king of the jungle, master of your domain, and make the data jump when you need it to jump

Now, if you remember when we talked about the Data Science Venn Diagram at the beginning, coding is up here on the top left

And while we often think about sort of people typing lines of code (which is very frequent), it is more important to remember when we talk about coding (or just computers in general), what we are really talking about here is any technology that lets you manipulate the data in the ways that you need to perform the procedures you need to get the insight that you want out of your data

Now, there are three very general categories that we will be discussing here on datalab

The first is apps; these are specialized applications or programs for working with data

The second is data; or specifically, data formats

There's special formats for web data, I will mention those in a moment

And then, code; there are programming languages that give you full control over what the computer does and how you interact with the data

Let's take a look at each one very briefly

In terms of apps, there are spreadsheets, like Excel or Google Sheets

These are the fundamental data tools of probably a majority of the world

There are specialized applications, like Tableau for data visualization, or SPSS, which is a very common statistical package in the social sciences and in businesses, and one of my favorites, JASP, which is a free open source analog of SPSS, which actually I think is a lot easier to use and replicate research with

And, there are tons of other choices

Now, in terms of web data, it is helpful to be familiar with things like HTML, and XML, and JSON, and other formats that are used to encapsulate data on the web, because those are the things that you are going to have to program around and interact with when you get your data
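
To make one of those formats concrete, here is a minimal sketch of reading JSON in Python with the standard json module; the station name and the temperature readings are invented for illustration:

```python
import json

# A made-up JSON document, like something a web service might send back.
raw = '{"station": "example-01", "readings": [{"day": "Mon", "temp_c": 21.5}, {"day": "Tue", "temp_c": 19.0}]}'

# json.loads turns the text into ordinary Python dictionaries and lists.
record = json.loads(raw)
temps = [reading["temp_c"] for reading in record["readings"]]
average = sum(temps) / len(temps)
print(record["station"], average)
```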

And then there are actual coding languages

R is probably the most common, along with Python, which is a general purpose language, but it has been well adapted for data use

There's SQL, the structured query language for databases, and very basic languages like C, C++, and Java, which are used more in the back-end of data science

And then there is Bash, the most common command line interface, and regular expressions

And we will talk about all of these in other courses here at datalab

But, remember this: tools are just tools

They are only one part of the data science process

They are a means to the end, and the end, the goal is insight

You need to know where you are trying to go and then simply choose the tools that help you reach that particular goal

That's the most important thing

So, in sum, here's a few things: number one, use your tools wisely

Remember your questions need to drive the process, not the tools themselves

Also, I will just mention that a few tools is usually enough

You can do an awful lot with Excel and R

And then, the most important thing is: focus on your goal and choose your tools and even your data to match the goal, so you can get the most useful insights from your data

The next step in our discussion of data science methods is mathematics, and I am going to give a very brief overview of the math involved in data science

Now, the important thing to remember is that math really forms the foundation of what we're going to do

If you go back to the Data Science Venn Diagram, we've got stats up here in the right corner, but really it's math and stats, or quantitative ability in general, but we'll focus on the math part right here

And probably the most important question is how much math is enough to do what you need to do? Or to put it another way, why do you need math at all, because you have got a computer to do it? Well, I can think of three reasons you don't want to rely on just the computer, and why it is helpful to have some sound mathematical understanding

Here they are: number one, you need to know which procedures to use and why

So you have your question, you have your data and you need to have enough of an understanding to make an informed choice

That's not terribly difficult

Two, you need to know what to do when things don't work right

Sometimes you get impossible results

I know that in statistics you can get a negative adjusted R-squared; that's not supposed to happen

And it is good to know the mathematics that go into calculating that so you can understand how something apparently impossible can work
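
Here is a small worked sketch of how that can happen, assuming the usual textbook formula for adjusted R-squared, which is 1 - (1 - R2)(n - 1)/(n - p - 1), with n cases and p predictors; the numbers plugged in are made up for illustration:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A weak model (R-squared of 0.05) with 3 predictors and only 10 cases:
# the penalty for extra predictors outweighs the tiny fit, so the
# adjusted value drops below zero.
weak = adjusted_r2(0.05, n=10, p=3)
print(weak)

# A strong model with plenty of data stays comfortably positive.
strong = adjusted_r2(0.90, n=100, p=3)
print(strong)
```

So the negative number is not a computer error; it is the formula telling you the predictors are doing less than you would expect from chance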

Or, you are trying to do a factor analysis or principal components and you get a rotation that won't converge

It helps to understand what it is about the algorithm that's happening, and why that won't work in that situation

And number three, interestingly, some procedures, some math, is easier and quicker to do by hand than by firing up the computer

And I'll show you a couple of examples in later videos, where that can be the case

Now, fundamentally there is a nice sort of analogy here

Math is to data science as, for instance, chemistry is to cooking, kinesiology is to dancing, and grammar is to writing

The idea here is that you can be a wonderful cook without knowing any chemistry, but if you know some chemistry it is going to help

You can be a wonderful dancer without knowing kinesiology, but it is going to help

And you can probably be a good writer without having an explicit knowledge of grammar, but it is going to make a big difference

The same thing is true of data science; you will do it better if you have some of the foundational information

So, the next question is: what kinds of math do you need for data science? Well, there's a few answers to that

Number one is algebra; you need some elementary algebra

That is, the basically simple stuff

You may have to do some linear or matrix algebra because that is the foundation of a lot of the calculations

And you can also have systems of linear equations where you are trying to solve several equations all at once

It is a tricky thing to do, in theory, but this is one of the things that is actually easier to do by hand sometimes
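
For a sense of what solving several equations at once looks like, here is a minimal sketch for two equations in two unknowns using Cramer's rule; the equations themselves and the function name solve_2x2 are made up for illustration, and larger systems would normally go to a linear algebra library:

```python
from fractions import Fraction

def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e and c*x + d*y = f by Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("no unique solution")
    x = Fraction(e * d - b * f, det)
    y = Fraction(a * f - e * c, det)
    return x, y

# Made-up system: x + y = 10 and 2x - y = 2
x, y = solve_2x2(1, 1, 2, -1, 10, 2)
print(x, y)
```

By hand you would just add the two equations to get 3x = 12, which is the kind of shortcut that can beat firing up the computer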

Now, there's more math

You can get some Calculus

You can get some big O, which has to do with the order of a function, or roughly how fast its cost grows as the input gets bigger

Probability theory can be important, and Bayes' theorem, which is a way of getting what is called a posterior probability, can also be a really helpful tool for answering some fundamental questions in data science
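
As a quick sketch of what a posterior probability is, here is Bayes' theorem applied to a made-up screening test; the prior, the sensitivity, and the false positive rate are invented numbers, and the function name posterior is an assumption for illustration:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Made-up numbers: 1% of people have the condition, the test catches 90%
# of real cases, and it wrongly flags 5% of people who are fine.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a fairly good test, the posterior comes out well below one half here, because the condition is rare to begin with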

So in sum: a little bit of math can help you make informed choices when planning your analyses

Very significantly, it can help you find the problems and fix them when things aren't going right

It is the ability to look under the hood that makes a difference

And then truthfully, some mathematical procedures, like systems of linear equations, can even be done by hand, sometimes faster than you can do them with a computer

So, you can save yourself some time and some effort and move ahead more quickly toward your goal of insight

Now, data science wouldn't be data science and its methods without a little bit of statistics

So, I am going to give you a brief statistics overview here of how things work in data science

Now, you can think of statistics as really an attempt to find order in chaos, find patterns in an overwhelming mess

Sort of like trying to see the forest and the trees

Now, let's go back to our little Venn Diagram here

We recently had math and stats here in the top corner

We are going to go back to talking about stats, in particular

What you are trying to do here, for one thing, is to explore your data

You can have exploratory graphics, because we are visual people and it is usually easiest to see things

You can have exploratory statistics, a numerical exploration of the data

And you can have descriptive statistics, which are the things that most people would have talked about when they took a statistics class in college (if they did that)

Next, there is inference

I've got smoke here because you can infer things about the wind and the air movement by looking at patterns in smoke

The idea here is that you are trying to take information from samples and infer something about a population

You are trying to go from one source to another

One common version of this is hypothesis testing

Another common version is estimation, sometimes called confidence intervals

There are other ways to do it, but all of these let you go beyond the data at hand to making larger conclusions
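
Here is a rough sketch of the estimation idea: a confidence interval for a mean, using a normal approximation from Python's statistics module; the sample values are made up, and for a sample this small a t-based interval would be a bit wider, so treat this as an illustration and not a recommended procedure:

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    m = mean(sample)
    standard_error = stdev(sample) / len(sample) ** 0.5
    # z is about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error

# Made-up sample measurements.
sample = [12, 15, 11, 14, 13, 16, 12, 15]
low, high = mean_confidence_interval(sample)
print(low, high)
```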

Now, one interesting thing about statistics is you're going to have to be concerned with some of the details and arranging things just so

For instance, you get to do something like feature selection, and that's picking variables that should be included, or combinations of them, and there are problems that come up frequently, and I will address some of those in later videos

There's also the matter of validation

When you create a statistical model you have to see if it is actually accurate

Hopefully, you have enough data that you can have a holdout sample and do that, or you can replicate the study
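
The holdout idea can be sketched in a few lines: set some cases aside before you build the model, then check the model against them afterward; the 20% fraction, the seed, and the function name holdout_split are assumptions for illustration:

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]

# Ten made-up case IDs: eight for building the model, two held out.
rows = list(range(10))
training, holdout = holdout_split(rows)
print(len(training), len(holdout))
```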

Then, there is the choice of estimators that you use; how you actually get the coefficients or the combinations in your model

And then there's ways of assessing how well your model fits the data

All of these are issues that I'll address briefly when we talk about statistical analysis at greater length

Now, I do want to mention one thing in particular here, and I just call this "beware the trolls"

There are people out there who will tell you that if you don't do things exactly the way they say to do it, that your analysis is meaningless, that your data is junk and you've lost all your time

You know what? They're trolls

So, the idea here is don't listen to that

You can make enough of an informed decision on your own to go ahead and do an analysis that is still useful

Probably one of the most important things to think about in this is this wonderful quote from a very famous statistician, and it says, "All models, or all statistical models, are wrong, but some are useful"

And so the question isn't whether you're technically right, or you have some sort of level of intellectual purity, but whether you have something that is useful

That, by the way, comes from George Box

And I like to think of it basically as this: wave your flag, wave your "do it yourself" flag, and just take pride in what you're able to accomplish even when there are people who may be criticizing it

Go ahead, you're doing something, go ahead and do it

So, in sum: statistics allows you to explore and describe your data

It allows you to infer things about the population

There are a lot of choices available, a lot of procedures

But no matter what you do, the goal is useful insight

Keep your eyes on that goal and you will find something meaningful and useful in your data to help you in your own research and projects

Let's finish our data science methods overview by getting a brief overview of Machine Learning

Now, I've got to admit when you say the term "machine learning," people start thinking something like, "the robot overlords are going to take over the world"

That's not what it is

Instead, let's go back to our Venn Diagram one more time, and in the intersection at the top between coding and stats is Machine Learning or as it's commonly called, just ML

The goal of Machine Learning is to go and work in data space so that, for instance, you can take a whole lot of data (we've got tons of books here), and then you can reduce the dimensionality

That is, take a very large, scattered, data set and try to find the most essential parts of that data

Then you can use these methods to find clusters within the data; like goes with like

You can use methods like k-means
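
Just so you can see that k-means is not magic, here is a tiny sketch on a single column of made-up numbers with hand-picked starting centers; the function name kmeans_1d is an assumption for illustration, and real implementations work in many dimensions and choose starting points more carefully:

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on a list of numbers, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        # Step 1: put each value with its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Step 2: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up data with two obvious clumps.
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(groups)
```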

You can also look for anomalies or unusual cases that show up in the data space

Or, if we go back to categories again, I talked about like goes with like

You can use things like logistic regression or k-nearest neighbors, KNN

You can use Naive Bayes for classification or Decision Trees or SVM, which is Support Vector Machines, or artificial neural nets

Any of those will help you find the patterns and the clumping in the data so you can get similar cases next to each other, and get the cohesion that you need to make conclusions about these groups
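
To show the flavor of one of those, here is a minimal sketch of k-nearest neighbors: a new case gets the label that is most common among the k closest known cases; the points, the labels, and the function name knn_predict are made up for illustration:

```python
from collections import Counter

def knn_predict(train, point, k=3):
    """train is a list of ((x, y), label) pairs; vote among the k nearest."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up labeled cases in two clumps.
train = [
    ((1, 1), "small"), ((1, 2), "small"), ((2, 1), "small"),
    ((8, 8), "large"), ((8, 9), "large"), ((9, 8), "large"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (9, 9)))
```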

Also, a major element of machine learning is predictions

You're going to point your way down the road

The most common approach here, the most basic, is linear regression, or multiple regression

There is also Poisson regression, which is used for modeling count or frequency data
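
For the most basic of these, linear regression with one predictor, here is a minimal least-squares sketch in plain Python; the data are made up to lie exactly on a line, and the function name fit_line is an assumption for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Use the fitted line to predict a score for a new case.
print(slope * 6 + intercept)
```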

And then there is the issue of Ensemble models, where you create several models and you take the predictions from each of those and you put them together to get an overall more reliable prediction
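
The simplest way to put predictions together is to average them; here is a sketch with three made-up stand-in models, where the function name ensemble_predict is an assumption for illustration and real ensembles often weight or vote instead:

```python
from statistics import mean

def ensemble_predict(models, x):
    """Average the predictions of several models for the same input."""
    return mean(model(x) for model in models)

# Three made-up models that disagree a little with each other.
models = [
    lambda x: 2.0 * x + 1.0,
    lambda x: 1.8 * x + 2.0,
    lambda x: 2.2 * x,
]
combined = ensemble_predict(models, 10)
print(combined)
```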

Now, I will talk about each of these in a little more detail in later courses, but for right now I mostly just want you to know that these things exist, and that's what we mean when we refer to Machine Learning

So, in sum: machine learning can be used to categorize cases and to predict scores on outcomes

And there's a lot of choices, many choices and procedures available

But, again, as I said with statistics, and I'll also say again many times after this, no matter what, the goal is not that "I'm going to do an artificial neural network or an SVM," the goal is to get useful insight into your data

Machine learning is a tool, and you use it to the extent that it helps you get that insight that you need

In the last several videos I've talked about the role in data science of technical things

On the other hand, communicating is essential to the practice, and the first thing I want to talk about there is interpretability

The idea here is that you want to be able to lead people through a path on your data

You want to tell a data-driven story, and that's the entire goal of what you are doing with data science

Now, another way to think about this is: when you are doing your analysis, what you're trying to do is solve for value

You're making an equation

You take the data, you're trying to solve for value

The trouble isthis: a lot of people get hung up on analysis, but they need to remember that analysis isnot the same thing as value

Instead, I like to think of it this way: that analysis timesstory is equal to value

Now, please note that's multiplicative, not additive, and soone consequence of that is when you go back to, analysis times story equals value

Well,if you have zero story you're going to have zero value because, as you recall, anythingtimes zero is zero

So, instead of that let's go back to this and say what we really wantto do is, we want to maximize the story so that we can maximize the value that resultsfrom our analysis

Again, maximum value is the overall goal here

The analysis, the tools,the tech, are simply methods for getting to that goal

So, let's talk about goals

For instance, an analysis is goal-driven

You are trying to accomplish something that's specific, so the story, or the narrative, or the explanation you give about your project should match those goals

If you are working for a client that has a specific question that they want you to answer, then you have a professional responsibility to answer those questions clearly and unambiguously, so they know whether you said yes or no and they know why you said yes or no

Now, part of the problem here is the fact the client isn't you and they don't see what you do

And as I show here, simply covering your face doesn't make things disappear

You have to worry about a few psychological abstractions

You have to worry about egocentrism

And I'm not talking about being vain, I'm talking about the idea that you think other people see and know and understand what you know

That's not true; otherwise, they wouldn't have hired you in the first place

And so you have to put it in terms that the client works with, and that they understand, and you're going to have to get out of your own center in order to do that

Also, there's the idea of false consensus; the idea that, "well, everybody knows that"

And again, that's not true; otherwise, they wouldn't have hired you

You need to understand that they are going to come from a different background with a different range of experience and interpretation

You're going to have to compensate for that

A funny little thing is the idea about anchoring

When you give somebody an initial impression, they use that as an anchor, and then they adjust away from it

So if you are going to try to flip things over on their heads, watch out for giving a false impression at the beginning unless you absolutely need to

But most importantly, in order to bridge the gap between the client and you, you need to have clarity and explain yourself at each step

You can also think about the answers

When you are explaining the project to the client, you might want to start with a very simple procedure: state the question that you are answering

Give your answer to that question, and if you need to, qualify it

And then, go in order top to bottom, so you're trying to make it as clear as possible what you're saying, what the answer is, and make it really easy to follow

Now, in terms of discussing your process, how you did this all

Most of the time it is probably the case that they don't care; they just want to know what the answer is and that you used a good method to get that

So, in terms of discussing processes or the technical details, only when absolutely necessary

That's something to keep in mind

The thing to remember here is what analysis means: breaking something apart

This, by the way, is a mechanical typewriter broken into its individual components

Analysis means to take something apart, and analysis of data is an exercise in simplification

You're taking the overall complexity, sort of the overwhelmingness of the data, and you're boiling it down and finding the patterns that make sense and serve the needs of your client

Now, let's go to a wonderful quote from our friend Albert Einstein here, who said, "Everything should be made as simple as possible, but not simpler"

That's true in presenting your analysis

Or, if you want to go see the architect and designer Ludwig Mies van der Rohe, who said, "Less is more"

It is actually Robert Browning who originally said that, but Mies van der Rohe popularized it

Or, if you want another way of putting a principle that comes from my field, I'm actually a psychological researcher; they talk about being minimally sufficient

Just enough to adequately answer the question

If you're in commerce you know about a minimum viable product; it is sort of the same idea within analysis here, the minimum viable analysis

So, here's a few tips: when you're giving a presentation, more charts, less text, great

And then, simplify the charts; remove everything that doesn't need to be in there

Generally, you want to avoid tables of data because those are hard to read

And then, one more time because I want to emphasize it, less text again

Charts, tables can usually carry the message

And so, let me give you an example here

I'm going to give a very famous dataset on Berkeley admissions

Now, these are not stairs at Berkeley, but it gives the idea of trying to get into something that is far off and distant

Here's the data; this is graduate school admissions in 1973, so it's over 40 years ago

The idea is that men and women were both applying for graduate school at the University of California Berkeley

And what we found is that 44 percent of the men who applied were admitted, that's the part in green

And of the women, only 35 percent of women were admitted when they applied

So, really at first glance this is bias, and it actually led to a lawsuit, it was a major issue

So, what Berkeley then tried to do was find out, "well which programs are responsible for this bias?" And they got a very curious set of results

If you break the applications down by program (and here we are calling them A through F), six different programs

What you find, actually, is that in each of these, male applicants are on the left and female applicants are on the right

If you look at program A, women actually got accepted at a higher rate, and the same is true for B, and the same is true for D, and the same is true for F

And so, this is a very curious set of responses and it is something that requires explanation

Now in statistics, this is something that is known as Simpson's Paradox

But here is the paradox: bias may be negligible at the department level

And in fact, as we saw in four of the departments, there was a possible bias in favor of women

And the problem is that women applied to more selective programs, programs with lower acceptance rates
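Here is a small sketch of how that reversal can happen. The numbers below are invented purely for illustration and are not the Berkeley figures; the program names "easy" and "hard" are also hypothetical. Each entry is (admitted, applied).

```python
# Hypothetical admissions counts, NOT the real Berkeley data
applications = {
    "easy": {"men": (80, 100), "women": (18, 20)},
    "hard": {"men": (4, 20), "women": (30, 100)},
}


def admit_rate(admitted, applied):
    """Proportion of applicants who were admitted."""
    return admitted / applied


def overall_rate(group):
    """Admission rate for one group, pooled across all programs."""
    admitted = sum(program[group][0] for program in applications.values())
    applied = sum(program[group][1] for program in applications.values())
    return admitted / applied


# Within each program, compare the two groups
for name, groups in applications.items():
    print(name, admit_rate(*groups["men"]), admit_rate(*groups["women"]))

# Pooled over programs, the comparison flips
print("overall", overall_rate("men"), overall_rate("women"))
```

In this made-up example women are admitted at a higher rate within each program, yet at a lower rate overall, because most of the women applied to the more selective program.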

Now, some people stop right here and say therefore, "nothing is going on, nothing to complain about"

But you know, that's still ending the story a little bit early

There are other questions that you can ask, and when you are producing a data-driven story, this is stuff that you would want to do

So, for instance, you may want to ask, "why do the programs vary in overall class size? Why do the acceptance rates differ from one program to the other? Why do men and women apply to different programs?" And you might want to look at things like the admissions criteria for each of the programs, the promotional strategies, how they advertise themselves to students

You might want to look at the kinds of prior education the students have in the programs, and you really want to look at funding level for each of the programs

And so, really, you get one answer, at least more questions, maybe some more answers, and more questions, and you need to address enough of this to provide a comprehensive overview and solution to your client

In sum, let's say this: stories give value to data analysis

And when you tell the story, you need to make sure that you are addressing your client's goals in a clear, unambiguous way

The overall principle here is be minimally sufficient

Get to the point, make it clear

Say what you need to, but otherwise be concise and make your message clear

The next step in discussing data science and communicating is to talk about actionable insights, or information that can be used productively to accomplish something

Now, to give sort of a bizarre segue here, you look at a game controller

It may be a pretty thing, it may be a nice object, but remember: game controllers exist to do something

They exist to help you play the game and to do it as effectively as possible

They have a function, they have a purpose

Same way data is for doing

Now, that's a paraphrase of one of my favorite historical figures

This is William James, the father of American psychology and of pragmatism in philosophy

And he has this wonderful quote; he said, "My thinking is first and last and always for the sake of my doing"

And the idea applies to analysis

Your analysis and your data is for the sake of your doing

So, you're trying to get some sort of specific insight in how you should proceed

What you want to avoid is the opposite of this from one of my other favorite cultural heroes, the famous Yankees catcher Yogi Berra, who said, "We're lost, but we're making good time"

The idea here is that frantic activity does not make up for lack of direction

You need to understand what you are doing so you can reach the particular goal

And your analysis is supposed to do that

So, when you're giving your analysis, you're going to try to point the way

Remember, why was the project conducted? The goal is usually to direct some kind of action, reach some kind of goal for your client

And the analysis should be able to guide that action in an informed way

One thing you want to do is, you want to be able to give the next steps to your client

Give the next steps; tell them what they need to do now

You want to be able to justify each of those recommendations with the data and your analysis

As much as possible be specific, tell them exactly what they need to do

Make sure it's doable by the client, that it's within their range of capability

And each step should build on the previous step

Now, that being said, there is one really fundamental sort of philosophical problem here, and that's the difference between correlation and causation

Basically, it goes this way: your data gives you correlation; you know that this is associated with that

But your client doesn't simply want to know what's associated; they want to know what causes something

Because if they are going to do something, that's an intervention designed to produce a particular result

So, really, how do you get from the correlation, which is what you have in the data, to the causation, which is what your client wants? Well, there's a few ways to do that

One is experimental studies; these are randomized, controlled trials

Now, that's theoretically the simplest path to causality, but it can be really tricky in the real world

There are quasi-experiments, and these are methods, a whole collection of methods

They use non-randomized data, usually observational data, adjusted in particular ways to get an estimate of causal effects

Or, there's the theory and experience

And this is research-based theory and domain-specific experience

And this is where you actually get to rely on your client's information

They can help you interpret the information, especially if they have greater domain expertise than you do

Another thing to think about are the social factors that affect your data

Now, you remember the data science Venn Diagram

We've looked at it lots of times

It has got these three elements

Some proposed adding a fourth circle to this Venn diagram, and we'll kind of put that in there and say that social understanding is also important, critical really, to valid data science

Now, I love that idea, and I do think that it's important to understand how things are going to play out

There are a few kinds of social understanding

You want to be aware of your client's mission

You want to make sure that your recommendations are consistent with your client's mission

Also, that your recommendations are consistent with your client's identity; not just, "This is what we do," but, "This is really who we are"

You need to be aware of the business context, sort of the competitive environment and the regulatory environment that they're working in

As well as the social context; and that can be outside of the organization, but even more often within the organization

Your recommendations will affect relationships within the client's organization

And you are going to try to be aware of those as much as you can to make it so that your recommendations can be realized the way they need to be

So, in sum: data science is goal focused, and when you're focusing on that goal for your client you need to give specific next steps that are based on your analysis and justifiable from the data

And in doing so, be aware of the social, political, and economic context that gives you the best opportunity of getting something really useful out of your analysis

When you're working in data science and trying to communicate your results, presentation graphics can be an enormously helpful tool

Think of it this way: you are trying to paint a picture for the benefit of your client

Now, when you're working with graphics there can be a couple of different goals; it depends on what kind of graphics you're working with

There's the general category of exploratory graphics

These are ones that you are using as the analyst

And for exploratory graphics, you need speed and responsiveness, and so you get very simple graphics

This is a base histogram in R

And they can get a little more sophisticated and this is done in ggplot2

And you can break it down into a couple other histograms, or you can make it a different way, or make it see-through, or split them apart into small multiples

But in each case, this is done for the benefit of you as the analyst understanding the data

These are quick, they're effective

Now, they are not very well-labeled, and they are usually for your insight, and then you do other things as a result of that

On the other hand, presentation graphics, which are for the benefit of your client, need clarity and they need a narrative flow

Now, let me talk about each of those characteristics very briefly

Clarity versus distraction

There are things that can go wrong in graphics

Number one is color

Colors can actually be a problem

Also, three-dimensional or false dimensions are nearly always a distraction

One that gets a little touchy for some people is interaction

We think of interactive graphics as really cool, great things to have, but you run the risk of people getting distracted by the interaction and starting to play around with it

Going, like, "Ooh, I press here, it does that"

And that distracts from the message

So actually, it may be important to not have interaction

And then the same thing is true of animation

Flat, static graphics can often be more informative because they have fewer distractions in them

Let me give you a quick example of how not to do things

Now, this is a chart that I made

I made it in Excel, and I did it based on some of the mistakes I've seen in graphics submitted to me when I teach

And I guarantee you, everything in here I have seen in real life, just not necessarily combined all at once

Let's zoom in on this a little bit, so we can see the full badness of this graphic

And let's see what's going on here

We've got a scale here that starts at 8% and goes to 28% and is tiny; it doesn't even cover the range of the data

We've got this bizarre picture on the wall

We've got no axis lines on the walls

We come down here; the labels for educational levels are in alphabetical order, instead of the more logical order of increasing levels of education

Then we've got the data represented as cones, which are difficult to read and compare, and it's only made worse by the colors and the textures

You know, if you want to take an extreme, this one for grad degrees doesn't even make it to the floor value of 8% and this one for high school grad is cut off at the top at 28%

This, by the way, is a picture of a sheep, and people do this kind of stuff and it drives me crazy

If you want to see a better chart with the exact same data, this is it right here

It is a straight bar chart

It's flat, it's simple, it's as clean as possible

And this is better in many ways

Most effective here is that it communicates clearly

There's no distractions, it's a logical flow

This is going to get the point across so much faster

And I can give you another example of it; here's a chart from earlier about salaries

I have a list here, I've got data scientist in it

If I want to draw attention to it, I have the option of putting a circle around it and I can put a number next to it to explain it

That's one way to make it easy to see what's going on

We don't even have to get fancy

You know, I just got out a pen and a post-it note and I drew a bar chart of some real data about life expectancy

This tells the story as well, that there is something terribly amiss in Sierra Leone

But, now let's talk about creating narrative flow in your presentation graphics

To do this, I am going to pull some charts from my most cited academic paper, which is called, A Third Choice: A Review of Empirical Research on the Psychological Outcomes of Restorative Justice

Think of it as mediation for juvenile crimes, mostly juvenile

And this paper is interesting because really it's about fourteen bar charts with just enough text to hold them together

And you can see there's a flow

The charts are very simple; this is judgments about whether the criminal justice system was fair

The two bars on the left are victims; the two bars on the right are offenders

And for each group on the left are people who participated in restorative justice, so more victim-offender mediation for crimes

And for each set on the right are people who went through standard criminal procedures

It says court, but it usually means plea bargaining

Anyhow, it's really easy to see that in both cases the restorative justice bar is higher; people were more likely to say it was fair

They also felt that they had an opportunity to tell their story; that's one reason why they might think it's fair

They also felt the offender was held accountable more often

In fact, if you go to court on the offenders, that one's below fifty percent and that's the offenders themselves making the judgment

Then you can go to forgiveness and apologies

And again, this is actually a simple thing to code and you can see there's an enormous difference

In fact, one of the reasons there is such a big difference is because in court proceedings, the offender very rarely meets the victim

It also turns out I need to qualify this a little bit because a bunch of the studies included drunk driving with no injuries or accidents

Well, when we take them out, we see a huge change

And then we can go to whether a person is satisfied with the outcome

Again, we see an advantage for restorative justice

Whether the victim is still upset about the crime, now the bars are a little bit different

And whether they are afraid of revictimization and that is over a two to one difference

And then finally recidivism for offenders or reoffending; and you see a big difference there

And so what I have here is a bunch of charts that are very very simple to read, and they kind of flow in how they're giving the overall impression and then detailing it a little bit more

There's nothing fancy here, there's nothing interactive, there's nothing animated, there's nothing kind of flowing in seventeen different directions

It's easy, but it follows a story and it tells a narrative about the data and that should be your major goal with the presentation graphics

In sum: presenting, or the graphics you use for presenting, are not the same as the graphics you use for exploring

They have different needs and they have different goals

But no matter what you are doing, be clear in your graphics and be focused in what you're trying to tell

And above all create a strong narrative that gives different levels of perspective and answers questions as you go to anticipate a client's questions and to give them the most reliable solid information and the greatest confidence in your analysis

The final element of data science and communicating that I wanted to talk about is reproducible research

And you can think of it as this idea; you want to be able to play that song again

And the reason for that is data science projects are rarely "one and done;" rather they tend to be incremental, they tend to be cumulative, and they tend to adapt to the circumstances that they're working in

So, one of the important things here, probably, if you want to summarize it very briefly, is this: show your work

There's a few reasons for this

You may have to revise your research at a later date, your own analyses

You may be doing another project and you want to borrow something from previous studies

More likely you'll have to hand it off to somebody else at a future point and they're going to have to be able to understand what you did

And then there's the very significant issue in both scientific and economic research of accountability

You have to be able to show that you did things in a responsible way and that your conclusions are justified; that's for clients, funding agencies, regulators, academic reviewers, any number of people

Now, you may be familiar with the concept of open data, but you may be less familiar with the concept of open data science; that's more than open data

So, for instance, I'll just let you know there is something called the Open Data Science Conference, or ODSC

And it meets three times a year in different places

And this is entirely, of course, devoted to open data science, using open data but also making the methods transparent to people around them

One thing that can make this really simple is something called the Open Science Framework, which is at OSF

It's a way of sharing your data and your research with an annotation on how you got through the whole thing with other people

It makes the research transparent, which is what we need

One of my professional organizations, the Association for Psychological Science, has a major initiative on this called open practices, where they are strongly encouraging people to share their data as much as is ethically permissible and to absolutely share their methods before they even conduct a study as a way of getting rigorous intellectual honesty and accountability

Now, another step in all of this is to archive your data, make that information available, put it on the shelf

And what you want to do here is, you want to archive all of your datasets; both the totally raw, before-you-did-anything-with-it dataset, and every step in the process until your final clean dataset

Along with that, you want to archive all of the code that you used to process and analyze the data

If you used a programming language like R or Python, that's really simple

If you used a program like SPSS you need to save the syntax files, and then that can be done that way

And again, no matter what, make sure to comment liberally and explain yourself

Now, part of that is you have to explain the process, because you are not just this lone person sitting on the sofa working by yourself, you're with other people and you need to explain why you did it the way that you did

You need to explain the choices, the consequences of those choices, the times that you had to backtrack and try it over again

This also works into the principle of future-proofing your work

You want to do a few things here

Number one; the data

You want to store the data in non-proprietary formats like a CSV or Comma Separated Values file because anything can read CSV files

If you stored it in the proprietary SPSS .sav format, you might be in a lot of trouble when somebody tries to use it later and they can't open it
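As a small illustration of the non-proprietary route, here is a sketch that writes a dataset to CSV and reads it back using only Python's standard library. The column names, values, and file name are made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical cleaned dataset; values are kept as text so the round trip is exact
rows = [
    {"id": "1", "group": "a", "score": "3.5"},
    {"id": "2", "group": "b", "score": "4.0"},
]


def save_csv(path, rows):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path):
    """Read a CSV file back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Write to a temporary folder and confirm the data survives the round trip
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "clean_data.csv")
    save_csv(path, rows)
    print(load_csv(path) == rows)
```

The resulting file is plain text, so it can be opened later in R, a spreadsheet, or a text editor without any particular software license.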

Also, there's storage; you want to place all of your files in a secure, accessible location; GitHub is probably one of the best choices

And then the code: you may want to use a dependency management package like Packrat for R or Virtual Environment for Python as a way of making sure that there are always versions of the packages you use that work, because sometimes things get updated and something gets broken

This is a way of making sure that the system that you have will always work
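Those tools do the real work of isolating an environment. As a lighter-weight illustration of the same idea, the sketch below records which package versions are installed alongside an analysis, using Python's standard `importlib.metadata` module (Python 3.8 or later). The function name `snapshot_environment` is made up, and this is not how Packrat or a virtual environment works internally; it only shows the kind of version information worth archiving.

```python
import sys
from importlib import metadata


def snapshot_environment():
    """Return lines describing the interpreter and installed package versions."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    lines = [f"# python {version}"]
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip any distribution whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the snapshot is stable between runs
    lines.extend(sorted(pins, key=str.lower))
    return lines


# Show the first few lines; in practice you would save all of them with the project
for line in snapshot_environment()[:5]:
    print(line)
```

Saving that list next to the code gives whoever picks up the project later a record of which versions were known to work.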

Overall, you can think of this too: you want to explain yourself and a neat way to do that is to put your narrative in a notebook

Now, you can have a physical lab book or you can also do digital books

A really common one, especially if you're using Python, is Jupyter with a "y" there in the middle

Jupyter notebooks are interactive notebooks

So, here's a screenshot of one that I made in Python, and you have titles, you have text, you have the graphics

If you are working in R, you can do this through something called RMarkdown

Which works in the same way; you do it in RStudio, you use Markdown and you can annotate it

You can get more information about that at rmarkdown

And so for instance, here's an R analysis I did, and as you can see the code is on the left and you see the markdown version on the right

What's neat about this is that this little bit of code here, this title and this text and this little bit of R code, then is displayed as this formatted heading, as this formatted text, and this turns into the entire R output right there

It's a great way to do things

And if you do RMarkdown, you actually have the option of uploading the document into something called RPubs; and that's an online document that can be made accessible to anybody

Here's a sample document

And if you want to go see it, you can go to this address

It's kind of long, so I am going to let you write that one down yourself

But, in sum: here's what we have

You want to do your work and archive the information in a way that supports collaboration

Explain your choices, say what you did, show how you did it

This allows you to future-proof your work, so it will work in other situations for other people

And as much as possible, no matter how you do it, make sure you share your narrative so people understand your process and they can see that your conclusions are justifiable, strong and reliable

Now, something I've mentioned several times when talking about data science, and I'll do it again in this conclusion, is that it's important to give people next steps

And I'm going to do that for you right now

If you're wondering what to do after having watched this very general overview course, I can give you a few ideas

Number one, maybe you want to start trying to do some coding in R or Python; we have courses for those

You might want to try doing some data visualization, one of the most important things that you can do

You may want to brush up on statistics and maybe some math that goes along with it

And you may want to try your hand at machine learning

All of these will get you up and rolling in the practice of data science

You can also try looking at data sourcing, finding the information that you are going to work with

But, no matter what happens try to keep it in context

So, for instance, data science can be applied to marketing, and sports, and health, and education, and the arts, and really a huge number of other things

And we will have courses here at datalab.cc that talk about all of those

You may also want to start getting involved in the community of data science

One of the best conferences that you can go to is O'Reilly Strata, which meets several times a year around the globe

There's also Predictive Analytics World, again several times a year around the world

Then there's much smaller conferences; I love Tapestry, or tapestryconference.com, which is about storytelling in data science

And Extract, a one-day conference about data stories that is put on by import.io, one of the great data sourcing applications that's available for scraping web data

If you want to start working with actual data, a great choice is to go to Kaggle.com; they sponsor data science competitions, which actually have cash rewards

There's also wonderful data sets you can work with there to find out how they work and compare your results to those of other people

And once you are feeling comfortable with that, you may actually try turning around and doing some service; datakind.org is the premier organization for data science as humanitarian service

They do major projects around the world

I love their examples

There are other things you can do; there's an annual event called Do Good Data, and then datalab.cc will be sponsoring twice-a-year data charrettes, which are opportunities for people in the Utah area to work with the local nonprofits on their data

But above all of this, I want you to remember this one thing: data science is fundamentally democratic

It's something that everybody needs to learn to do in some way, shape or form

The ability to work with data is a fundamental ability and everybody would be better off by learning to work with data intelligently and sensitively

Or, to put it another way: data science needs you

Thanks so much for joining me in this introductory course and I hope it has been good and I look forward to seeing you in the other courses here at datalab


Welcome to "Data Sourcing"

I'm Barton Poulson and in this course, we're going to talk about Data Opus, which is Latin for "Data Needed"

The idea here is that no data, no data science; and that is a sad thing

So, instead of leaving it at that we're going to use this course to talk about methods for measuring and evaluating data and methods for accessing existing data and even methods for creating new, custom data

Take those together and it's a happy situation

At the same time, we'll do all of this still at an accessible, conceptual and non-technical level because the technical hands-on stuff will happen in other, later courses

But for now, let's talk data

For data sourcing, the first thing we want to talk about is measurement

And within that category, we're going to talk about metrics

The idea here is that you actually need to know what your target is if you want to have a chance to hit it

There's a few particular reasons for this

First off, data science is action-oriented; the goal is to do something as opposed to simply understand something, which is something I say as an academic practitioner

Also, your goal needs to be explicit and that's important because the goals can guide your effort

So, you want to say exactly what you are trying to accomplish, so you know when you get there

Also, goals exist for the benefit of the client, and they can prevent frustration; they know what you're working on, they know what you have to do to get there

And finally, the goals and the metrics exist for the benefit of the analyst because they help you use your time well

You know when you're done, you know when you can move ahead with something, and that makes everything a little more efficient and a little more productive

And when we talk about this the first thing you want to do is try to define success in your particular project or domain

Depending on where you are, in commerce that can include things like sales, or click-through rates, or new customers

In education it can include scores on tests; it can include graduation rates or retention

In government, it can include things like housing and jobs

In research, it can include the ability to serve the people that you're trying to better understand

So, whatever domain you're in there will be different standards for success and you're going to need to know what applies in your domain

Next are specific metrics, or ways of measuring

Now again, there are a few different categories here

There are business metrics, there are key performance indicators or KPIs, there are SMART goals (that's an acronym), and there's also the issue of having multiple goals

I'll talk about each of those for just a second now

First off, let's talk about business metrics

If you're in the commercial world there are some common ways of measuring success

A very obvious one is sales revenue; are you making more money, are you moving the merchandise, are you getting sales

Also, there's the issue of leads generated, new customers, or new potential customers because that, then, in turn, is associated with future sales

There's also the issue of customer value or lifetime customer value, so you may have a small number of customers, but they all have a lot of revenue and you can use that to really predict the overall profitability of your current system

And then there's churn rate, which has to do with, you know, losing and gaining new customers and having a lot of turnover

So, any of these are potential ways for defining success and measuring it

These are potential metrics, there are others, but these are some really common ones
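
As a rough sketch of how two of these metrics might be put into numbers, the Python below uses one common convention for each: churn rate as the customers lost during a period divided by the customers at the start of it, and a simple lifetime value as revenue per customer per period divided by churn. Neither formula is given in this course; both, along with the function names and the sample numbers, are assumptions for illustration.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting customers who left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def simple_lifetime_value(revenue_per_customer_per_period, churn):
    """Rough lifetime value: revenue per period times expected lifetime (1 / churn)."""
    if not 0 < churn <= 1:
        raise ValueError("churn must be in (0, 1]")
    return revenue_per_customer_per_period / churn


# Hypothetical numbers, for illustration only
monthly_churn = churn_rate(customers_at_start=200, customers_lost=10)
lifetime_value = simple_lifetime_value(50.0, monthly_churn)
print(f"monthly churn: {monthly_churn:.1%}")
print(f"approximate lifetime value: {lifetime_value:.2f}")
```

Real businesses often use more detailed definitions (discounting, margins, cohorts), so treat this only as a starting point.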

Now, I mentioned earlier something called a key performance indicator or KPI

This account of KPIs comes from David Parmenter, and he's got a few ways of describing them

He says a key performance indicator for business, number one, should be nonfinancial: not just the bottom line, but something else that might be associated with it or that measures the overall productivity of the organization

They should be timely, for instance, weekly, daily, or even constantly gathered information

They should have a CEO focus, so the senior management teams are the ones who generally make the decisions that affect how the organization acts on the KPIs

They should be simple, so everybody in the organization knows what they are and knows what to do about them

They should be team-based, so teams can take joint responsibility for meeting each one of the KPIs

They should have significant impact; what that really means is that they should affect more than one important outcome, so you can get profitability and market reach, or improved manufacturing time and fewer defects

And finally, an ideal KPI has a limited dark side, which means there are fewer possibilities for reinforcing the wrong behaviors and rewarding people for sort of exploiting the system

Next, there are SMART goals, where SMART stands for Specific, Measurable, Assignable to a particular person, Realistic (meaning you can actually do it with the resources you have at hand), and Time-bound (so you know when it can get done)

So, whenever you form a goal you should try to assess it on each of these criteria and that's a way of saying that this is a good goal to be used as a metric for the success of our organization

Now, the trick, however, is when you have multiple goals, multiple possible endpoints

And the reason that's difficult is because, well, it's easy to focus on one goal if you're just trying to maximize revenue or if you're just trying to maximize graduation rate

There are a lot of things you can do

It becomes more difficult when you have to focus on many things simultaneously, especially because some of these goals may conflict

The things that you do to maximize one may impair the other

And so when that happens, you actually need to start engaging in a deliberate process of optimization

And there are ways that you can do this if you have enough data; you can do mathematical optimization to find the ideal balance of efforts to pursue one goal and the other goal at the same time
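
To make the idea of balancing two goals a little more concrete, here is a minimal Python sketch that tries every split of a fixed budget between two goals and keeps the split with the best weighted score. The two response curves, the weights, and the budget are all hypothetical; in practice they would be estimated from your own data, and a proper optimization routine would usually replace this simple grid search.

```python
import math


def revenue_gain(spend):
    # Hypothetical diminishing-returns response for goal 1
    return 10 * math.sqrt(spend)


def retention_gain(spend):
    # Hypothetical diminishing-returns response for goal 2
    return 6 * math.sqrt(spend)


def best_split(budget, weight_revenue=0.5, weight_retention=0.5, steps=100):
    """Grid search: try each split of the budget and keep the best weighted score."""
    best = None
    for i in range(steps + 1):
        to_revenue = budget * i / steps
        to_retention = budget - to_revenue
        score = (weight_revenue * revenue_gain(to_revenue)
                 + weight_retention * retention_gain(to_retention))
        if best is None or score > best[0]:
            best = (score, to_revenue, to_retention)
    return best


score, to_revenue, to_retention = best_split(budget=100)
print(f"spend {to_revenue:.0f} on goal 1 and {to_retention:.0f} on goal 2 (score {score:.2f})")
```

The weights are where the judgment comes in: changing them changes which balance counts as ideal.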

Now, this is a very general summary and let me finish with this

In sum, metrics or methods for measuring can raise awareness of how well your organization is functioning and how well you're reaching your goals

There are many different methods available for defining success and measuring progress towards those things

The trick, however, comes when you have to balance efforts to reach multiple goals simultaneously, which can bring in the need for things like optimization

When talking about data sourcing and measurement, one very important issue has to do with the accuracy of your measurements

The idea here is that you don't want to have to throw away all your ideas; you don't want to waste effort

One way of doing this in a very quantitative fashion is to make a classification table

So, what that looks like is this: you talk about, for instance, positive results and negative results

And in fact let's start by looking at the top here

The middle two columns here talk about whether an event is present, whether your house is on fire, or whether a sale occurs, or whether you have got a tax evader, whatever

So, that's whether a particular thing is actually happening or not

On the left here is whether the test or the indicator suggests that the thing is or is not happening

And then you have these combinations: true positives, where the test says it's happening and it really is; false positives, where the test says it's happening, but it is not; and then true negatives, where the test says it isn't happening and that's correct; and then false negatives, where the test says there's nothing going on, but there is in fact the event occurring

And then you start to get the column totals, the total number of events present or absent, then the row totals about the test results

Now, from this table what you get is four kinds of accuracy, or really four different ways of quantifying accuracy using different standards

And they go by these names: sensitivity, specificity, positive predictive value, and negative predictive value

I'll show you very briefly how each of them works

Sensitivity can be expressed this way: if there's a fire, does the alarm ring? You want that to happen

And so, that's a matter of looking at the true positives and dividing that by the total number of events present, that is, the total number of fires

So, the test positive means there's an alarm and the event present means there's a fire; you want it to always have an alarm when there's a fire

Specificity, on the other hand, is sort of the flip side of this

If there isn't a fire, does the alarm stay quiet? This is where you're looking at the ratio of true negatives to total absent events, where there's no fire, and the alarms aren't ringing, and that's what you want

Now, those are looking at columns; you can also go sideways across rows

So, the first one there is positive predictive value, often abbreviated as PPV, and we flip around the order a little bit

This one says, if the alarm rings, was there a fire? So, now you're looking at the true positives and dividing it by the total number of positives

Total number of positives is any time the alarm rings

True positives are because there was a fire

And negative predictive value, or NPV, says if the alarm doesn't ring, does that in fact mean that there is no fire? Well, here you are looking at true negatives and dividing it by total negatives, the times that it doesn't ring

And again, you want to maximize that so the true negatives account for all of the negatives, the same way you want the true positives to account for all of the positives and so on

Now, you can put numbers on all of these going from zero percent to 100% and the idea is to maximize each one as much as you can
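
Here is a small Python sketch of the four measures, computed from the four cells of the classification table. Sensitivity and specificity divide by the column totals (events present, events absent), while PPV and NPV divide by the row totals (test positive, test negative). The function name and the fire-alarm counts are made up for illustration.

```python
def accuracy_measures(true_pos, false_pos, false_neg, true_neg):
    """Return the four accuracy measures from the cells of a 2x2 classification table."""
    return {
        # Columns of the table: condition on whether the event is really present
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        # Rows of the table: condition on what the test said
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }


# Hypothetical fire-alarm counts
measures = accuracy_measures(true_pos=45, false_pos=10, false_neg=5, true_neg=940)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

With rare events, it is common to see high sensitivity and specificity alongside a noticeably lower PPV, which is one reason to look at all four numbers rather than just one.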

So, in sum, from these tables we get four kinds of accuracy and there's a different focus for each one

But the overall goal is the same: you want to identify the true positives and true negatives and avoid the false positives and the false negatives

And this is one way of putting numbers on, an index really, on the accuracy of your measurement

Now data sourcing may seem like a very quantitative topic, especially when we're talking about measurement

But, I want to mention one important thing here, and that is the social context of measurement

The idea here really, is that people are people, and they all have their own goals, and they're going their own ways

And we all have our own thoughts and feelings that don't always coincide with each other, and this can affect measurement

And so, for instance, when you're trying to define your goals and you're trying to maximize them you want to look at things like, for instance, the business model

An organization's business model, the way they conduct their business, the way they make their money, is tied to its identity and its reason to be

And if you make a recommendation and it's contrary to their business model, that can actually be perceived as a threat to their core identity, and people tend to get freaked out in that situation

Also, restrictions, so for instance, there may be laws, policies, and common practices, both organizationally and culturally, that may limit the ways the goals can be met

Now, most of these make a lot of sense, so the idea is you can't just do anything you want, you need to have these constraints

And when you make your recommendations, maybe you'll work creatively within them as long as you're still behaving legally and ethically, but you do need to be aware of these constraints

Next, is the environment

And the idea here is that competition occurs both between organizations, this company here is trying to reach a goal, but they're competing with company B over there, but probably even more significantly there is competition within the organization

This is really a recognition of office politics

And when you, as a consultant, make a recommendation based on your analysis, you need to understand you're kind of dropping a little football into the office and things are going to further one person's career, maybe to the detriment of another

And in order for your recommendations to have maximum effectiveness they need to play out well in the office

That's something that you need to be aware of as you're making your recommendations

Finally, there's the issue of manipulation

And a sad truism about people is that any reward system, any reward system at all, will be exploited and people will generally game the system

This happens especially when you have a strong cutoff; you need to get at least 80 percent, or you get fired and people will do anything to make their numbers appear to be 80 percent

This happens an awful lot when you look at executive compensation systems, it happens a lot when you have very high-stakes school testing, it happens in an enormous number of situations, and so, you need to be aware of the risk of exploitation and gaming

Now, don't think, then, that all is lost

Don't give up, you can still do really wonderful assessment, you can get good metrics, just be aware of these particular issues and be sensitive to them as you both conduct your research and as you make your recommendations

So, in sum, social factors affect goals and they affect the way you meet those goals

There are limits and consequences, both on how you reach the goals and, really, on what the goals should be, and when you're giving advice on how to reach those goals please be sensitive to how things play out with metrics and how people will adapt their behavior to meet the goals

That way you can make something that's more likely to be implemented the way you meant and more likely to predict accurately what can happen with your goals

When it comes to data sourcing, obviously the most important thing is to get data

But the easiest way to do that, at least in theory, is to use existing data

Think of it as going to the bookshelf and getting the data that you have right there at hand

Now, there are a few different ways to do this: you can get in-house data, you can get open data, and you can get third-party data

Another nice way to think of that is proprietary, public, and purchased data; the three Ps I've heard it called

Let's talk about each of these a little bit more

So, in-house data, that's stuff that's already in your organization

What's nice about that, it can be really fast and easy, it's right there and the format may be appropriate for the kind of software in the computer that you are using

If you're fortunate, there's good documentation, although sometimes when it's in-house people just kind of throw it together, so you have to watch out for that

And there's the issue of quality control

Now, this is true with any kind of data, but you need to pay attention with in-house, because you don't know the circumstances necessarily under which people gathered the data and how much attention they were paying to something

There's also an issue of restrictions; there may be some data that, while it is in-house, you may not be allowed to use, or you may not be able to publish the results or share the results with other people

So, these are things that you need to think about when you're going to use in-house data, in terms of how can you use it to facilitate your data science projects

Specifically, there are a few pros and cons

In-house data is potentially quick, easy, and free

Hopefully it's standardized; maybe even the original team that conducted this study is still there

And you might have identifiers in the data which make it easier for you to do an individual level analysis

On the con side however, the in-house data simply may not exist, maybe it's just not there

Or the documentation may be inadequate and of course, the quality may be uncertain

Always true, but may be something you have to pay more attention to when you're using in-house data

Now, another choice is open data like going to the library and getting something

This is prepared data that's freely available, consists of things like government data and corporate data and scientific data from a number of sources

Let me show you some of my favorite open data sources just so you know where they are and that they exist

Probably, the best one is data.gov here in the US

That is, as it says right here, the home of the US government's open data

Or, you may have a state level one

For instance, I'm in Utah and we have data


gov, also a great source of more regional information

If you're in Europe, you have open-data


eu, the European Union open data portal

And then there are major non-profit organizations, so the UN has unicef.org/statistics for their statistical and monitoring data

The World Health Organization has the global health observatory at who


And then there are private organizations that work in the public interest, such as the Pew Research Center, which shares a lot of its data sets and the New York Times, which makes it possible to use APIs to access a huge amount of the data of things they've published over a huge time span

And then two of the mother lodes, there's Google, which at google.com has public data which is a wonderful thing

And then Amazon at aws


com/datasets has gargantuan datasets

So, if you needed a data set that was like five terabytes in size, this is the place that you would go to get it

Now, there are some pros and cons to using this kind of open data

First, is that you can get very valuable datasets that maybe cost millions of dollars to gather and to process

And you can get a very wide range of topics and times and groups of people and so on

And often, the data is very well formatted and well documented

There are, however, a few cons

Sometimes there are biased samples

Say, for instance, you only get people who have internet access, and that can mean, not everybody

Sometimes the meaning of the data is not clear or it may not mean exactly what you want it to

A potential problem is that sometimes you may need to share your analyses and if you are doing proprietary research, well, it's going to have to be open instead, so that can create a crimp with some of your clients

And then finally there are issues with privacy and confidentiality and in public data that usually means that the identifiers are not there and you are going to have to work at a larger aggregate level of measurement

Another option is to use data from a third party; these go by the name Data as a Service or DaaS

You can also call them data brokers

And the thing about data brokers is they can give you an enormous amount of data on many different topics, plus they can save you some time and effort, by actually doing some of the processing for you

And that can include things like consumer behaviors and preferences, they can get contact information, they can do marketing identity and finances, there's a lot of things

There's a number of data brokers around, here's a few of them

Acxiom is probably the biggest one in terms of marketing data

There's also Nielsen which provides data primarily for media consumption

And there's another organization Datasift, that's a smaller newer one

And there's a pretty wide range of choices, but these are some of the big ones

Now, the thing about using data brokers, there are some pros and there are some cons

The pros are first, that it can save you a lot of time and effort

It can also give you individual level data which can be hard to get from open data

Open data is usually at the community level; data brokers can give you information about specific consumers

They can even give you summaries and inferences about things like credit scores and marital status

Possibly even whether a person gambles or smokes

Now, the con is this, number one, it can be really expensive, I mean this is a huge service; it provides a lot of benefit and is priced accordingly

Also, you still need to validate it, you still need to double check that it means what you think it means and that it works in with what you want

And probably the real sticking point here is the use of third-party data is distasteful to many people, and so you have to be aware of that as you're making your choices

So, in sum, as far as sourcing existing data goes, obviously data science needs data and there are the three Ps of data sources, Proprietary and Public and Purchased

But no matter what source you use, you need to pay attention to quality and to the meaning and the usability of the data to help you along in your own projects

When it comes to data sourcing, a really good way of getting data is to use what are called APIs

Now, I like to think of these as the digital version of Prufrock's mermaids

If you're familiar with The Love Song of J. Alfred Prufrock by T. S. Eliot, he says, "I have heard the mermaids singing, each to each," that's T. S. Eliot

And I like to adapt that to say, "APIs have heard apps singing, each to each," and that's by me

Now, more specifically when we talk about an API, what we're talking about is something called an Application Programming Interface, and this is something that allows programs to talk to each other

Its most important use, in terms of data science, is it allows you to get web data

It allows your program to directly go to the web, on its own, grab the data, bring it back in almost as though it were local data, and that's a really wonderful thing

Now, the most common version of APIs for data science are called REST APIs; that stands for Representational State Transfer

That's the software architectural style of the world wide web and it allows you to access data on web pages via HTTP, that's the hypertext transfer protocol

They, you know, run the web as we know it

And when you download the data, you usually get it in JSON format, which stands for JavaScript Object Notation

The nice thing about that is that it's human readable, but it's even better for machines

Then you can take that information and you can send it directly to other programs

And the nice thing about REST APIs is that they're what is called language agnostic, meaning any programming language can call a REST API, can get data from the web, and can do whatever it needs to with it

Now, there are a few kinds of APIs that are really common

The first is what are called Social APIs; these are ways of interfacing with social networks

So, for instance, the most common one is Facebook; there's also Twitter

Google Talk has been a big one and FourSquare as well and then SoundCloud

These are on lists of the most popular ones

And then there are also what are called Visual APIs, which are for getting visual data, so for instance, Google Maps is the most common, but YouTube is something that accesses YouTube on a particular website or AccuWeather which is for getting weather information

Pinterest for photos, and Flickr for photos as well

So, these are some really common APIs and you can program your computer to pull in data from any of these services and sites and integrate it into your own website or here into your own data analysis

Now, there's a few different ways you can do this

You can program it in R, the statistical programming language, you can do it in Python, also you can even use it in the very basic Bash command line interface, and there's a ton of other applications

Basically, anything can access an API one way or another

Now, I'd like to show you how this works in R

So, I'm going to open up a script in RStudio and then I'm going to use it to get some very basic information from a webpage

Let me go to RStudio and show you how this works

Let me open up a script in RStudio that allows me to do some data sourcing here

Now, I'm just going to use a package called jsonlite, I'm going to load that one up, and then I'm going to go to a couple of websites

I'm going to be getting historical data from Formula One car races and I'm going to be getting it from Ergast


Now, if we go to this page right here, I can go straight to my browser right now

And this is what it looks like; it gives you the API documentation, so what you're doing for an API, is you're just entering a web address and in that web address it includes the information you want

I'll go back to R here just for a second

And if I want to get information about 1957 races in JSON format, I go to this address

I can skip over to that for a second, and what you see is it's kind of a big long mess here, but it is all labeled and it is clear to the computer what's going on here

Let's go back to R

And so what I'm going to do is, I am going to save that URL into an object here, in R, and then I'm going to use the fromJSON command to read that URL and save it into R

Which it has now done

And I'm going to zoom in on that so you can see what's happened

I've got this sort of mess of text, this is actually a list object in R

And then I'm going to get just the structure of that object, so I'm going to do this one right here and you can see that it's a list and it gives you the names of all the variables within each one of the lists

And what I'm going to do is, I'm going to convert that list to a data frame

I went through the list and found where the information I wanted was located, you have to use this big long statement here, that will give me the names of the drivers

Let me zoom in on that again

There they are

And then I'm going to get just the column names for that bit of the data frame

So,what I have here is six different variables

And then what I'm going to do is, I'm going to pick just the first five cases and I'm going to select some variables and put them in a different order

And when I do that, this is what I get

I will zoom in on that again

And the first five people listed in this data set that I pulled from 1957, are Juan Fangio, makes sense one of the greatest drivers ever, and other people who competed in that year

And so what I've done is by using this API call in R, a very simple thing to do, I was able to pull data off that webpage in a structured format, and do a very simple analysis with it
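
The walkthrough above is in R; as a rough equivalent, here is a Python sketch of the same steps: turn JSON text into nested lists, walk down to the records you want, look at the variable names, and then keep the first few cases with selected variables in a new order. To keep it runnable offline, it parses a small hand-written stand-in for a response rather than calling a live service. The key names and every value other than the driver's name are assumptions made for this sketch, not the documented layout of any particular API.

```python
import json

# Hand-written stand-in for an API response; the key names and all values
# except the driver's name are assumptions made for this sketch.
response_text = """
{"MRData": {"DriverTable": {"season": "1957", "Drivers": [
  {"driverId": "fangio", "givenName": "Juan", "familyName": "Fangio", "nationality": "placeholder"},
  {"driverId": "driver_b", "givenName": "Given B", "familyName": "Family B", "nationality": "placeholder"},
  {"driverId": "driver_c", "givenName": "Given C", "familyName": "Family C", "nationality": "placeholder"}
]}}}
"""

data = json.loads(response_text)                     # JSON text -> nested dicts and lists
drivers = data["MRData"]["DriverTable"]["Drivers"]   # walk down to the records we want

print(sorted(drivers[0].keys()))                     # the available "columns"

# Keep the first five records (fewer here) and reorder selected fields
columns = ["familyName", "givenName", "driverId"]
rows = [[d[c] for c in columns] for d in drivers[:5]]
for row in rows:
    print(row)
```

With a live service, `response_text` would instead come from an HTTP request, for example with `urllib.request.urlopen(url).read()` from the standard library.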

And let's sum up what we've learned from all this

First off, APIs make it really easy to work with web data, they structure it, they call it for you, and then they feed it straight into the program for you to analyze

And they are one of the best ways of getting data and getting started in data science

When you're looking for data, another great way of getting data is through scraping

And what that means is pulling information from webpages

I like to think of it as when data is hiding in the open; it's there, you can see it, but there's not an easy, immediate way to get that data

Now, when you're dealing with scraping, you can get data in several different formats

You can get HTML text from webpages, you can get HTML tables from the rows and columns that appear on webpages

You can scrape data from PDFs, and you can scrape all sorts of data from images and video and audio

Now, we will make one very important qualification before we say anything else: pay attention to copyright and privacy

Just because something is on the web, doesn't mean you're allowed to pull it out

Information gets copyrighted, and so when I use examples here, I make sure that this is stuff that's publicly available, and you should do the same when you are doing your own analyses

Now, if you want to scrape data there are a couple of ways to do it

Number one, is to use apps that are developed for this

So, for instance, import.io is one of my favorites

It is both a webpage, that's its address, and it's a downloadable app

There's also ScraperWiki

There's an application called Tabula, and you can even do scraping in Google Sheets, which I will demonstrate in a second, and Excel

Or, if you don't want to use an app or if you want to do something that apps don't really let you do, you can code your scraper

You can do it directly in R, or Python, or Bash, or even Java or PHP

Now, what you're going to do is you're going to be looking for information on the webpage

If you're looking for HTML text, what you're going to do is pull structured text from webpages, similar to how a reader view works in a browser

It uses HTML tags on the webpage to identify what's the important information

So, there's things like body, and h1 for header one, and p for paragraph, and the angle brackets
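
As a minimal sketch of tag-based text scraping, the Python below uses the standard library's `html.parser` to collect the text inside `h1` and `p` tags and skip everything else. The class name and the sample page are made up for illustration, and the sketch assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collect the text found inside selected tags such as h1 and p."""

    def __init__(self, wanted=("h1", "p")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None      # wanted tag we are currently inside, if any
        self.found = []          # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag
            self.found.append((tag, ""))

    def handle_data(self, data):
        if self.current:
            tag, text = self.found[-1]
            self.found[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


# Made-up page for illustration
page = "<html><body><h1>Results</h1><p>First <b>bold</b> paragraph.</p><div>skip me</div></body></html>"
scraper = TextScraper()
scraper.feed(page)
print(scraper.found)
```

Dedicated scraping libraries handle messy real-world pages far better; this only shows how the tags act as signposts for the text you want.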

You can also get information from HTML tables, although this is a physical table of rows and columns I am showing you

This also uses HTML table tags, that is like table, and tr for table row, and td for table data, that's the cell

The trick is when you're doing this, you need the table number and sometimes you just have to find that through trial and error
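
The role of the table number can be sketched in a few lines of Python with the standard library's `html.parser`: collect every table on the page in order, then pick one by its position. The class name and the sample page (with a layout table first and the data table second) are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect every table on a page as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])                 # start a new table
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])             # start a new row
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")         # start a new cell
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell:
            self.tables[-1][-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False


# Made-up page: the first table is just layout, the second holds the data
page = """
<table><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Chef</th><th>Wins</th></tr>
  <tr><td>Chef A</td><td>12</td></tr>
  <tr><td>Chef B</td><td>9</td></tr>
</table>
"""
scraper = TableScraper()
scraper.feed(page)
table_number = 2                        # counting from 1, found by trial and error
wanted = scraper.tables[table_number - 1]
for row in wanted:
    print(row)
```

Pages often use tables for layout, which is why the data you want is not always table number 1.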

Let me give you an example of how this works

Let's take a look at this Wikipedia page on the Iron Chef America Competition

I'm going to go to the web right now and show you that one

So, here we are in Wikipedia, Iron Chef America

And if you scroll down a little bit, you see we have got a whole bunch of text here, we have got our table of contents, and then we come down here, we have a table that lists the winners, the statistics for the winners

And let's say we want to pull that from this webpage into another program for us to analyze

Well, there is an extremely easy way to do this with Google Sheets

All we need to do is open up the Google Sheet and in cell A1 of that Google Sheet, we paste in this formula

It's IMPORTHTML, then you give the webpage and then you say that you are importing a table, you have to put that stuff in quotes, and the index number for the table

I had to poke around a little bit to figure out this was table number 2

So, let me go to Google Sheets and show you how this works

Here I have a Google Sheet and right now it's got nothing in it

But watch this; if I come here to this cell, and I simply paste in that information, all the stuff just sort of magically propagates into the sheet, makes it extremely easy to deal with, and now I can, for instance, save this as a CSV file, put it in another program

Lots of options

And so this is one way that I'm scraping the data from a webpage because I didn't use an API, but I just used a very simple, one-line command to get the information

Now, that was an HTML table

You can also scrape data from PDFs

You have to be aware of if it's a native PDF, I call that a text PDF, or a scanned or imaged PDF

And what it does with native PDFs, it looks for text elements; again those are like code that indicates this is text

And you can deal with raster images, that's pixel images, or vector, which draws the lines, and that's what makes them infinitely scalable in many situations

And then in PDFs, you can deal with tabular data, but you probably have to use a specialized program like ScraperWiki or Tabula in order to get that

And then finally media, like images and video and audio

Getting images is easy; you can download them in a lot of different ways

And then if you want to read data from them, say for instance, you have a heat map of a country, you can go through it, but you will probably have to write a program that loops through the image pixel-by-pixel to read the data and then encode it numerically into your statistical program
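
Here is a minimal Python sketch of that pixel-by-pixel idea. It assumes you already have the image as a grid of RGB values and a legend that says which color stands for which number; each pixel is matched to the closest legend color. The legend, the tiny stand-in image, and the function name are all hypothetical, and a real heat map would first need to be read with an imaging library.

```python
# Hypothetical colour legend: RGB colour -> numeric value it stands for
legend = {
    (255, 255, 204): 10,   # pale yellow
    (253, 141, 60): 50,    # orange
    (189, 0, 38): 90,      # dark red
}


def nearest_value(pixel):
    """Return the legend value whose colour is closest to this pixel (squared RGB distance)."""
    def distance(colour):
        return sum((a - b) ** 2 for a, b in zip(pixel, colour))
    return legend[min(legend, key=distance)]


# Tiny stand-in image: 2 rows x 3 columns of RGB pixels
image = [
    [(255, 250, 200), (250, 140, 62), (190, 5, 40)],
    [(253, 141, 60), (254, 255, 210), (185, 0, 30)],
]

# Loop through the image row by row, pixel by pixel, and encode each one numerically
values = [[nearest_value(pixel) for pixel in row] for row in image]
print(values)
```

The resulting grid of numbers is what you would then hand to your statistical program.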

Now, that's my very brief overview and let's summarize that

First off, if the data you are trying to get at doesn't have an existing API, you can try scraping and you can write code in a language like R or Python

But, no matter what you do, be sensitive to issues of copyright and privacy, so you don't get yourself in hot water, but instead, you make an analysis that can be of use to you or to your client

The next step in data sourcing is making data

And specifically, we're talking about getting new data

I like to think of this as, you're getting hands-on and you're getting "data de novo," new data

So, can't find the data that you need for your analysis? Well, one simple solution is, do it yourself

And we're going to talk about a few general strategies used for doing that

Now, these strategies vary on a few dimensions

First off is the role

Are you passive and simply observing stuff that's happening already, or are you active where you play a role in creating the situation to get the data? And then there's the "Q/Q question," and that is, are you going to get quantitative, or numerical, data, or are you going to get qualitative data, which usually means text, paragraphs, sentences as well as things like photos and videos and audio? And also, how are you going to get the data? Do you want to get it online, or do you want to get it in person? Now, there are other choices than these, but these are some of the big delineators of the methods

When you look at those, you get a few possible options

Number one is interviews, and I'll say more about those

Another one is surveys

A third one is card sorting

And a fourth one is experiments, although I actually want to split experiments into two kinds of categories

The first one is laboratory experiments, and that's in-person projects where you shape the information or an experience for the participants as a way of seeing how that involvement changes their reactions

It doesn't necessarily mean that you're a participant, but you create the situation

And then there's also A/B testing

This is automated, online testing of two or more variations on a webpage

It's a very, very simple kind of experimentation that's actually very useful for optimizing websites
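
Once an A/B test has run, you still have to decide whether the difference between the two versions is more than chance. One common way to do that, which is an assumption here rather than something covered in this course, is a two-proportion z-test. The Python sketch below uses only the standard library; the function name and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that both versions convert equally well
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided, normal approximation
    return z, p_value


# Hypothetical counts for page versions A and B
z, p_value = two_proportion_z_test(120, 2400, 156, 2400)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the two versions really do perform differently, though with small samples or many variations you would want a more careful analysis.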

So, in sum, from this very short introduction make sure you can get exactly what you need

Get the data you need to answer your question

And if you can't find it somewhere, then make it

And, as always, you have many possible methods, each of which has its own strengths and its own compromises

And we'll talk about each of those in the following sections

The first method of data sourcing where you're making new data that I want to talk about is interviews

And that's not because it's the most common, but because it's the one you would do for the most basic problem

Now, basically an interview is nothing more than a conversation with another person or a group of people

And, the fundamental question is, why do interviews as opposed to doing a survey or something else? Well, there are a few good reasons to do that

Number one: you're working with a new topic and you don't know what people's responses will be, how they'll react

And so you need something very open-ended

Number two: you're working with a new audience and you don't know how they will react in particular to what it is you're trying to do

And number three: something's going on with the current situation, it's not working anymore, and you need to find what's going on, and you need to find ways to improve

The open-ended information where you get past your existing categories and boundaries can be one of the most useful methods for getting that data

If you want to put it another way, you want to do interviews when you don't want to constrain responses

Now, when it comes to interviews, you have one very basic choice, and that's whether you do a structured or an unstructured interview

With a structured interview, you have a predetermined set of questions, and everyone gets the same questions in the same order

It gives a lot of consistency even though the responses are open-ended

And then you can also have what's called an unstructured interview

And this is a whole lot more like a conversation, where your questions as the interviewer arise in response to the other person's answers

Consequently, an unstructured interview can be different for each person that you talk to

Also, interviews are usually done in person, but not surprisingly, they can be done over the phone, or often online

Now, a couple of things to keep in mind about interviews

Number one is time

Interviews can range from just a few minutes to several hours per person

Second is training

Interviewing's a special skill that usually requires specific training

Now, asking the questions is not necessarily the hard part

The really tricky part is the analysis

The hardest part of interviews by far is analyzing the answers for themes, as a way of extracting the new categories and the dimensions that you need for your further research

The beautiful thing about interviews is that they allow you to learn things that you never expected

So, in sum: interviews are best for new situations or new audiences

On the other hand, they can be time-consuming, and they also require special training, both to conduct the interview and to analyze the highly qualitative data that you get from them

The next logical step in data sourcing and making data is surveys

Now, think of this: if you want to know something, just ask

That's the easy way

And you want to do a survey under certain situations

The real question is, do you know your topic and your audience well enough to anticipate their answers, to know the range of their answers and the dimensions and the categories that are going to be important?

If you do, then a survey might be a good approach

Now, just as there were a few dimensions for interviews, there are a few dimensions for surveys

You can do what is called a closed-ended survey; that is also called a forced choice

It is where you give people just particular options, like a multiple choice

You can have an open-ended survey, where you have the same questions for everybody, but you allow them to write in a free-form response

You can do surveys in person, and you can also do them online or over the mail or phone or however

And now, it is very common to use software when doing surveys

Some really common applications for online surveys are SurveyMonkey and Qualtrics, or at the very simple end there is Google Forms, and at the simple and pretty end there is Typeform

There are a lot more choices, but these are some of the major players in how you can get data from online participants in survey format

Now, the nice thing about surveys is, they are really easy to do, they are very easy to set up, and they are really easy to send out to large groups of people

You can get tons of data really fast

On the other hand, the same way that they are easy to do, they are also really easy to do badly

The problem is that the questions you ask can be ambiguous, they can be double-barreled, they can be loaded, and the response scales can be confusing

So, if you say, "I never think this particular way" and the person puts strongly disagree, they may not know exactly what you are trying to get at

So, you have to take special effort to make sure that the meaning is clear and unambiguous, and that the rating scale, the way that people respond, is very clear and they know where their answer falls

Which gets us into one of the things about people behaving badly, and that is: beware the push poll

Now, especially during election time, like we are in right now, a push poll is something that sounds like a survey, but really what it is is a very biased attempt to get data, just fodder for social media campaigns, or "I am going to make a chart that says that 98% of people agree with me"

A push poll is one that is so biased, there is really only one way to answer the questions

This is considered extremely irresponsible and unethical from a research point of view

Just hang up on them

Now, aside from that egregious violation of research ethics, you do need to do other things, like watch out for bias in the question wording, in the response options, and also in the sample selection, because any one of those can push your responses off one way or another without you really being aware that it is happening

So, in sum, let's say this about surveys

You can get lots of data quickly; on the other hand, it requires familiarity with the possible answers in your audience

So, you know, sort of, what to expect

And no matter what you do, you need to watch for bias to make sure that your answers are going to be representative of the group that you are really concerned about understanding

An interesting topic in Data Sourcing when you are making data is Card Sorting

Now, this isn't something that comes up very often in academic research, but in web research, this can be a really important method

Think of it this way: what you are trying to do is like building a model of a molecule, except here you are trying to build a model of people's mental structures

Put more specifically, how do people organize information intuitively? And also, how does that relate to the things that you are doing online? Now, the basic procedure goes like this: you take a bunch of little topics and you write each one on a separate card

And you can do this physically, with like three by five cards, or there are a lot of programs that allow you to do a digital version of it

Then what you do is you give this information to a group of respondents and the people sort those cards

So, they put similar topics with each other, different topics over here, and so on

And then you take that information and from that you are able to calculate what is called dissimilarity data

Think of it as like the distance or the difference between various topics

And that gives you the raw data to analyze how things are structured
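
To make the idea of dissimilarity data a little more concrete, here is a minimal sketch in Python. It assumes one simple definition, which is only one of several possible: the dissimilarity between two cards is the share of respondents who did not put them in the same pile. The function name, the card labels, and the three example sorts are all made up for illustration.

```python
from itertools import combinations

def dissimilarity(sorts, cards):
    """Map each pair of cards to the share of respondents who separated them."""
    pairs = list(combinations(sorted(cards), 2))
    together = {pair: 0 for pair in pairs}
    for piles in sorts:                      # one respondent's complete sort
        for pile in piles:                   # one pile of cards
            for pair in combinations(sorted(pile), 2):
                together[pair] += 1          # these two cards shared a pile
    n = len(sorts)
    return {pair: 1 - count / n for pair, count in together.items()}

# Hypothetical cards and three hypothetical respondents
cards = ["Pricing", "Plans", "Contact", "Support"]
sorts = [
    [{"Pricing", "Plans"}, {"Contact", "Support"}],
    [{"Pricing", "Plans", "Contact"}, {"Support"}],
    [{"Pricing"}, {"Plans"}, {"Contact", "Support"}],
]

d = dissimilarity(sorts, cards)
for pair, value in sorted(d.items()):
    print(pair, round(value, 2))
```

Pairs that most people pile together end up with values near 0, and pairs that nobody piles together end up at 1, which is the kind of distance table a clustering method can work from.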

Now, there are two very general kinds of card sorting tasks

There's generative and there's evaluative

A generative card sorting task is one in which respondents create their own sets, their own piles of cards, using any number of groupings they like

And this might be used, for instance, to design a website

If people are going to be looking for one kind of information next to another one, then you are going to want to put that together on the website, so they know where to expect it

On the other hand, if you've already created a website, then you can do an evaluative card sorting

This is where you have a fixed number or fixed names of categories

Like for instance, the way you have set up your menus already

And then what you do is you see if people actually put the cards into these various categories that you have created

That's a way of verifying that your hierarchical structure makes sense to people

Now, whichever method you do, generative or evaluative, what you end up with when you do a card sort is an interesting kind of visualization called a dendrogram

That actually means branches

And what we have here is actually a hundred and fifty data points; if you are familiar with Fisher's Iris data, that's what's going on here

And it groups it from one giant group on the left and then splits it in pieces and pieces and pieces until you end up with lots of different observations, well actually, individual-level observations at the end

But you can cut things off into two or three groups or whatever is most useful for you here, as a way of visualizing the entire collection of similarity or dissimilarity between the individual pieces of information that you had people sort

Now, I will just mention very quickly that digital card sorting makes your life infinitely easier, because keeping track of physical cards is really hard

You can use something like Optimal Workshop, UserZoom, or UX Suite

These are some of the most common choices

Now, let's just sum up what we've learned about card sorting in this extremely brief overview

Number one, card sorting allows you to see intuitive organization of information in a hierarchical format

You can do it with physical cards or you can also have digital choices for doing the same thing

And when you are done, you actually get this hierarchical or branched visualization of how the information is structured and related to each other

When you are doing your Data Sourcing and you are making data, sometimes you can't get what you want through the easy ways, and you've got to take the hard way

And you can do what I am calling laboratory experiments

Now of course, when I mention laboratory experiments people start to think of stuff like, you know, Doctor Frankenstein in his lab, but lab experiments are less like this and in fact they are a little more like this

Nearly every experiment I have done in my career has been a paper and pencil one with people in a well-lighted room and it's not been the threatening kind

Now, the reason you do a lab experiment is because you want to determine cause and effect

And this is the single most theoretically viable way of getting that information

Now, what makes an experiment an experiment is the fact that researchers play active roles in experiments with manipulations

Now, people get a little freaked out when they hear manipulations, think that you are coercing people and messing with their mind

All that means is that you are manipulating the situation; you are causing something to be different for one group of people or for one situation than another

It's a benign thing, but it allows you to see how people react to those different variations

Now, if you are going to do an experiment, you are going to want to have focused research; it is usually done to test one thing or one variation at a time

And it is usually hypothesis-driven; usually you don't do an experiment until you have done enough background research to say, "I expect people to react this way to this situation and this way to the other"

A key component to all of this is that experiments almost always have random assignment: regardless of how you got your sample, when they are in your study, you randomly assign them to one condition or another

And what that does is it balances out the pre-existing differences between groups, and that's a great way of taking care of confounds and artifacts

Those are the things that are unintentionally associated with differences between groups that provide alternate explanations for your data

If you have done good random assignment and you have a large enough group of people, then those confounds and artifacts are basically minimized
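
Random assignment itself is a very small piece of code. Here is a minimal sketch in Python, assuming you already have a list of participant IDs; the function name, the IDs, and the condition labels are hypothetical. It shuffles the list and then deals people out to the conditions in turn, so the groups come out equal in size, or within one person of equal.

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle the participants, then deal them out to the conditions in turn."""
    rng = random.Random(seed)       # a seed makes the assignment reproducible
    shuffled = list(participants)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Twenty hypothetical participants, two hypothetical conditions
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, ["control", "treatment"], seed=42)
for condition, members in groups.items():
    print(condition, len(members), members)
```

Because chance alone decides who lands in which group, pre-existing differences tend to even out across conditions as the sample grows.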

Now, some places where you are likely to see laboratory experiments in this version are, for instance, eye tracking and web design

This is where you have to bring people in front of a computer and you stick a thing there that sees where they are looking

That's how we know, for instance, that people don't really look at ads on the side of web pages

Another very common place is research in medicine and education and in my field, psychology

And in all of these, what you find is that experimental research is considered the gold standard for reliable, valid information about cause and effect

On the other hand, while it is a wonderful thing to have, it does come at a cost

Here's how that works

Number one, experimentation requires extensive, specialized training

It is not a simple thing to pick up

Two, experiments are often very time consuming and labor intensive

I have known some that take hours per person

And number three, experiments can be very expensive

So, what that all means is that you want to make sure that you have done enough background research and you need to have a situation where it is sufficiently important to get really reliable cause and effect information to justify these costs for experimentation

In sum, laboratory experimentation is generally considered the best method for causality or assessing causality

That's because it allows you to control for confounds through randomization

On the other hand, it can be difficult to do

So, be careful and thoughtful when considering whether you need to do an experiment and how to actually go about doing it

There's one final procedure I want to talk about in terms of Data Sourcing and Making New Data

It's a form of experimentation and it is simply called A/B testing and it's extremely common in the web world

So, for instance, I just barely grabbed a screenshot of Amazon.com's homepage and you've got these various elements on the homepage, and I just noticed, by the way, when I did this that this woman is actually an animated gif, so she moves around

That was kind of weird; I have never seen that before

But the thing about this is, this entire layout, how things are organized and how they are on there, will have been determined by variations on A/B testing by Amazon

Here's how it works

For your webpage, you pick one element, like what's the headline or what are the colors or what's the organization or how do you word something, and you create multiple versions, maybe just two, version A and version B, which is why you call it A/B testing

Then when people visit your webpage you randomly assign these visitors to one version or another; you have software that does that for you automatically

And then you compare the response rates on some response

I will show you those in a second

And then, once you have enough data, you implement the best version, you sort of set that one solid, and then you go on to something else

Now, in terms of response rates, there are a lot of different outcomes you can look at

You can look at how long a person is on a page, you can actually do mouse tracking if you want to

You can look at click-throughs, you can also look at shopping cart value or abandonment

A lot of possible outcomes

All of these contribute through A/B testing to the general concept of website optimization: to make your website as effective as it can possibly be

Now, the idea also is that this is something that you are going to do a lot

You can perform A/B tests continually

In fact, I have seen one person say that what A/B testing really stands for is "always be testing"

Kind of cute, but it does give you the idea that improvement is a constant process

Now, if you want some software to do A/B testing, two of the most common choices are Optimizely and VWO, which stands for Visual Website Optimizer

Now, many others are available, but these are especially common, and when you get the data you are going to use statistical hypothesis testing to compare the differences, or really the software does it for you automatically

But you may want to adjust the parameters because most software packages cut off testing a little too soon and the information is not quite as reliable as it should be
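
If you are curious what that hypothesis test can look like, here is a minimal sketch in Python of one common choice, a two-proportion z-test on click-through rates. This is an illustration under stated assumptions, not what Optimizely or VWO actually run internally; the function name and the visitor counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: 2,400 visitors per version
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

One caution that goes with the point about cutting off testing too soon: checking the p-value repeatedly and stopping the first time it looks good inflates false positives, so it is safer to settle on a sample size before the test starts.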

But, in sum, here is what we can say about A/B testing

It is a version of website experimentation; it is done online, which makes it really easy to get a lot of data very quickly

It allows you to optimize the design of your website for whatever outcome is important to you

And it can be done as a series of continual assessments, testing, and development to make sure that you're accomplishing what you want to as effectively as possible for as many people as possible

The very last thing I want to talk about in terms of data sourcing is the next steps

And probably the most important thing is, you know, don't just sit there

I want you to go and see what you already have

Try to explore some open data sources

And if it helps, check with a few data vendors

And if those don't give you what you need to do your project, then consider making new data

Again, the idea here is get what you need and get going

Thanks for joining me and good luck on your own projects

Welcome to "Coding in Data Science"

I'm Bart Poulson and what we are going to do in this series of videos is we're going to take a little look at the tools of Data Science

So, I am inviting you to know your tools, but probably even more important than that is to know their proper place

Now, I mention that because a lot of the time when people talk about data tools, they talk about them as though that were the same thing as data science, as though they were the same set

But, I think if you look at it for just a second that is not really the case

Data tools are simply one element of data science, because data science is made up of a lot more than the tools that you use

It includes things like business knowledge, it includes the meaning making and interpretation, it includes social factors, and so there's much more than just the tools involved

That being said, you will need at least a few tools, and so we're going to talk about some of the things that you can use in data science if it works well for you

In terms of getting started, the basic things

#1 is spreadsheets; it is the universal data tool, and I'll talk about how they play an important role in data science

#2 is a visualization program called Tableau; there is Tableau Public, which is free, and there's Tableau Desktop, and there is also something called Tableau Server

Tableau is a fabulous program for data visualization and I'm convinced that for most people it provides the great majority of what they need

And though it is not a tool, I do need to talk about the formats used in web data, because you have to be able to navigate that when doing a lot of data science work

Then we can talk about some of the essential tools for data science

Those include the programming language R, which is specifically for data; there's the general purpose programming language Python, which has been well adapted to data

And there's the database language "sequel," or SQL, for Structured Query Language

Then if you want to go beyond that, there are some other things that you can do

There are the general purpose programming languages C, C++, and Java, which are very frequently used to form the foundation of data science, and sort of high level production code is going to rely on those as well

There's the command line interface language Bash, which is very common, a very quick tool for manipulating data

And then there's the, sort of wild card, supercharged regular expressions, or Regex

We'll talk about all of these in separate courses

But, as you consider all the tools that you can use, don't forget the 80/20 rule

Also known as the Pareto Principle

And the idea here is that you are going to get a lot of bang for your buck out of a small number of things

And I'm going to show you a little sample graph here

Imagine that you have ten different tools and we'll call them A through J

A does a lot for you, B does a little bit less, and it kind of tapers down to where you have got a bunch of tools that do just a little of the stuff that you need

Now, instead of looking at the individual effectiveness, look at the cumulative effectiveness

How much are you able to accomplish with a combination of tools? Well, the first one's right here at 60%, where the tools started, and then you add on the 20% from B and it goes up, and then you add on C and D and you add up smaller and smaller pieces, and by the time you get to the end, you have got 100% of effectiveness from your ten tools combined

The important thing about this is, you only have to go to the second tool, that is two out of ten, that's B, that's 20% of your tools, and in this made-up example, you have got 80% of your output

So, 80% of the output from 20% of the tools; that's a fictional example of the Pareto Principle, but I find in real life it tends to work something approximately like that
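
The cumulative arithmetic in that made-up example is easy to reproduce. Here is a minimal sketch in Python. The 60% for tool A and the 20% for tool B come from the example above; the values for the other eight tools are assumptions, picked only so that the ten add up to 100%.

```python
from itertools import accumulate

# Individual effectiveness in percent. A and B follow the example;
# C through J are assumed values chosen so the total is 100.
effectiveness = {
    "A": 60, "B": 20, "C": 6, "D": 4, "E": 3,
    "F": 2, "G": 2, "H": 1, "I": 1, "J": 1,
}

# Running total: how much you can do with the first k tools combined
cumulative = dict(zip(effectiveness, accumulate(effectiveness.values())))

for tool, total in cumulative.items():
    print(f"through {tool}: {total}%")
```

The running total reaches 80% at the second tool and 100% at the tenth, which is the 80/20 shape described above.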

And so, you don't necessarily have to learn everything and you don't have to learn how to do everything in everything

Instead you want to focus on the tools that will be most productive, and specifically most productive for you

So, in sum, let's say these three things

Number one, coding, or simply the ability to manipulate data with programs and computers

Coding is important, but data science is much greater than the collection of tools that's used in it

And then finally, as you're trying to decide what tools to use and what you need to learn and how to work, remember the 80/20 rule; you are going to get a lot of bang from a small set of tools

So, focus on the things that are going to be most useful for you in conducting your own data science projects

As we begin our discussion of Coding and Data Science, I actually want to begin with something that's not coding

I want to talk about applications or programs that are already created that allow you to manipulate data

And we are going to begin with the most basic of these, spreadsheets

We're going to do the rows and columns and cells of Excel

And the reason for this is you need spreadsheets

Now, you may be saying to yourself, "no no no not me, because you know what, I'm fancy, I'm working in my big set of servers, I've got fancy things going on"

But, you know what, you too, fancy people, you need spreadsheets as well

There are a few reasons for this

Most importantly, spreadsheets can be the right tool for data science in a lot of circumstances; there are a few reasons for that

Number one, spreadsheets, they're everywhere, they're ubiquitous, they're installed on a billion machines around the world and everybody uses them

There are probably more data sets in spreadsheets than anything else, and so it's a very common format

Importantly, it's probably your client's format; a lot of your clients are going to be using spreadsheets for their own data

I've worked with billion dollar companies that keep all of their data in spreadsheets

So, when you're working with them, you need to know how to manipulate that and how to work with it

Also, regardless of what you're doing, spreadsheets, and specifically CSV (comma-separated values) files, are sort of the lingua franca or the universal interchange format for data transfer, to allow you to take it from one program to another

And then, truthfully, in a lot of situations they're really easy to use

And if you want a second opinion on this, let's take a look at this ranking

There's a survey of data mining experts, it's the KDnuggets data mining poll, and these are the tools they most use in their own work

And look at this: lowly Excel is fifth on the list, and in fact, what's interesting about it is it's above Hadoop and Spark, two of the major big data fancy tools

And so, Excel really does have pride of place in a toolkit for a data analyst

Now, since we're going to sort of the low tech end of things, let's talk about some of the things you can do with a spreadsheet

Number one, they are really good for data browsing

You really get to see all of the data in front of you, which isn't true if you are doing something like R or Python

They're really good for sorting data: sort by this column, then this column, then this column

They're really good for rearranging columns and cells and moving things around

They're good for finding and replacing and seeing what happens so you know that it worked right

Some more uses: they're really good for formatting, especially conditional formatting

They're good for transposing data, switching the rows and the columns; they make that really easy

They're good for tracking changes

Now it's true, if you're a big fancy data scientist you're probably using GitHub, but for everybody else in the world spreadsheets and the tracking changes is a wonderful way to do it

You can make pivot tables; that allows you to explore the data in a very hands-on way, in a very intuitive way

And they're also really good for arranging the output for consumption

Now, when you're working with spreadsheets, however, there's one thing you need to be aware of: they are really flexible, but that flexibility can be a problem, in that when you are working in data science, you specifically want to be concerned about something called Tidy Data

That's a term I borrowed from Hadley Wickham, a very well-known developer in the R world

Tidy Data is for transferring data and making it work well

There are a few rules here that undo some of the flexibility inherent in spreadsheets

Number one, what you want to do is have a column be equivalent to a variable; columns, variables, they are the same thing

And then, rows are equal to, exactly the same thing as, cases

Then you have one sheet per file, and you have one level of measurement (say, individual, then organization, then state) per file

Again, this is undoing some of the flexibility that's inherent in spreadsheets, but it makes it really easy to move the data from one program to another

Let me show you how all this works

You can try this in Excel

If you have downloaded the files for this course, we simply want to open up this spreadsheet

Let me go to Excel and show you how it works

So, when you open up this spreadsheet, what you get is totally fictional data here that I made up, but it is showing sales over time of several products at two locations, like if you're selling stuff at a baseball field

And this is the way spreadsheets often appear; we've got blank rows and columns, we've got stuff arranged in a way that makes it easy for the person to process it

And we have got totals here, with formulas putting them all together

And that's fine, that works well for the person who made it

And then, that's for one month, and then we have another month right here, and then we have another month right here, and then we combine them all for first quarter of 2014

We have got some headers here, we've got some conditional formatting and changes, and if we come to the bottom, we have got a very busy line graphic that eventually loads; it's not a good graphic, by the way

But, similar to what you will often find

So, this is the stuff that, while it may be useful for the client's own personal use, you can't feed into R or Python; it will just choke and it won't know what to do with it

And so, you need to go through a process of tidying up the data

And what this involves is undoing some of the stuff

So, for instance, here's data that is almost tidy

Here we have a single column for date, a single column for the day, a column for the site, so we have two locations A and B, and then we have six columns for the six different things that are sold and how many were sold on each day

Now, in certain situations, you would want the data laid out exactly like this; if you are doing, for instance, a time series, you will do something vaguely similar to this

But, for true tidy stuff, we are going to collapse it even further

Let me come here to the tidy data

And now what I have done is, I have created a new column that says what is the item being sold

And so, by the way, what this means is that we have got a really long data set now; it has got over a thousand rows

Come back up to the top here

But, what that shows you is that now it's in a format that's really easy to import from one program to another; that makes it tidy, and you can re-manipulate it however you want once you get to each of those
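
That last collapsing step, going from the almost tidy layout with one column per item to the long layout with a single item column, can also be done in code. Here is a minimal sketch in plain Python. The column names, the item names, and the numbers are all assumptions made up for illustration; they are not the contents of the course spreadsheet.

```python
def wide_to_long(rows, id_columns, var_name="Item", value_name="Sales"):
    """Turn one-column-per-item rows into one row per (case, item) pair."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}   # Date, Day, Site stay put
        for col, value in row.items():
            if col not in id_columns:                 # every other column is an item
                long_rows.append({**ids, var_name: col, value_name: value})
    return long_rows

# Hypothetical "almost tidy" data: one column per item sold
wide = [
    {"Date": "2014-01-01", "Day": "Wed", "Site": "A", "Hot dogs": 12, "Soda": 30},
    {"Date": "2014-01-01", "Day": "Wed", "Site": "B", "Hot dogs": 7, "Soda": 18},
]

tidy = wide_to_long(wide, id_columns=["Date", "Day", "Site"])
for row in tidy:
    print(row)
```

Each wide row turns into one long row per item, which is why the tidy sheet gets so much longer. If you work in pandas, `DataFrame.melt` does the same kind of reshaping.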

So, let's sum up our little presentation here, in a few lines

Number one, no matter who you are, no matter what you are doing in data science, you need spreadsheets

And the reason for that is that spreadsheets are often the right tool for data science

Keep one thing in mind though, that is, as you are moving back and forth from one language to another, tidy data or well-formatted data is going to be important for exporting data into your analytical program or language of choice

As we move through "Coding and Data Science," and specifically the applications that can be used, there's one that stands out for me more than almost anything else, and that's Tableau and Tableau Public

Now, if you are not familiar with these, these are visualization programs

The idea here is that when you have data, the most important thing you can do is to first look and see what you have and work with it from there

And in fact, I'm convinced that for many organizations Tableau might be all that they really need

It will give them the level of insight that they need to work constructively with data

So, let's take a quick look by going to tableau


Now, there are a few different versions of Tableau

Right here we have Tableau Desktop and Tableau Server, and these are the paid versions of Tableau

They actually cost a lot of money, unless you work for a nonprofit organization, in which case you can get them for free

Which is a beautiful thing

What we're usually looking for, however, is not the paid version, but we are looking for something called Tableau Public

And if you come in here and go to products and we have got these three paid ones, over here to Tableau Public

We click on that, it brings us to this page

It is public



And this is the one that has what we want, it's the free version of Tableau with one major caveat: you don't save files locally to your computer, which is why I didn't give you a file to open

Instead, it saves them to the web in a public form

So, if you are willing to trade privacy, you can get an immensely powerful application for data visualization

That's a catch for a lot of people, which is why people are willing to pay a lot of money for the desktop version

And again, if you work for a nonprofit you can get the desktop version for free

But, I am going to show you how things work in Tableau Public

So, that's something that you can work with personally

The first thing you want to do is, you want to download it

And so, you put in your email address, you download; it is going to know what you are on

It is a pretty big download

And once it is downloaded, you can install and open up the application

And here I am in Tableau Public, right here, this is the blank version

By the way, you also need to create an account with Tableau in order to save your stuff online to see it

I will show you what that looks like

But, you are presented with a blank thing right here and the first thing you need to do is, you need to bring in some data

I'm going to bring in an Excel file

Now, if you downloaded the files for the course, you will see that there is this one right here, DS03_2_2_TableauPublic



In fact, it is the one that I used in talking about spreadsheets in the first video in this course

I'm going to select that one and I'm going to open it

And a lot of programs don't like bringing in Excel because it's got all the worksheets and all the weirdness in it

This one works better with it, but what I'm going to do is, I am going to take the tidy data

By the way, you see that it put them in alphabetical order here

I'm going to take tidy data and I'm going to drag it over to let it know that it's the one that I want

And now what it does is it shows me a version of the data set along with things that you can do here

You can rename it, I like that you can create bin groups, there are a lot of things that you can do here

I'm going to do something very, very quick with this particular one

Now, I've got the data set right here, what I'm going to do now is I'm going to go to a worksheet

That's where you actually create stuff

Cancel that and go to worksheet one


This is a drag and drop interface

And so what we are going to do is, we are going to pull the bits and pieces of information we want to make graphics

There's immense flexibility here

I'm going to show you two very basic ones

I'm going to look at the sales of my fictional ballpark items

So, I'm going to grab sales right here and I'm going to put that as the field that we are going to measure


And you see, put it down right here and this is our total sales

We're going to break it down by item and by time

So, let me take item right here, andyou can drag it over here, or I can put it right up here into rows

Those will be myrows and that will be how many we have sold total of each of the items

Fine, that's reallyeasy

And then, let's take date and we will put that here in columns to spread it across

Now, by default it is doing it by year, I don't want to do that, I want to have threemonths of data

So, what I can do is, I can click right here and I can choose a differenttime frame

I can go to quarter, but that's not going to help because I only have onequarter's worth of data, that's three months

I'm going to come down to week

Actually,let me go to day

If I do day, you see it gets enormously complicated, so that's nogood

So, I'm going to back up to week

And I've got a lot of numbers there, but whatI want is a graph

And so, to get that, I'm going to come over here and click on thisand tell it that I want a graph

And so, we're seeing the information, except it lost items

So, I'm going to bring item and put it back up into this graph to say this is a row forthe data

And now I've got rows for sales by week for each of my items
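If you want to see what that drag and drop is doing underneath, here is a minimal sketch in Python of the same idea: take tidy rows and total the sales for each item in each week. The rows, the item names, and the `weekly_sales` function are made up for illustration; this is not the course data set and it is not how Tableau itself is implemented.

```python
from collections import defaultdict
from datetime import date

# Hypothetical tidy rows: (date, item, site, sales). Not the course file.
rows = [
    (date(2016, 4, 4), "Hot dog", "North", 120.0),
    (date(2016, 4, 6), "Hot dog", "South", 80.0),
    (date(2016, 4, 5), "Soda", "North", 60.0),
    (date(2016, 4, 12), "Hot dog", "North", 150.0),
    (date(2016, 4, 13), "Soda", "South", 40.0),
]


def weekly_sales(rows):
    """Sum sales for each (item, ISO week number) pair."""
    totals = defaultdict(float)
    for day, item, _site, sales in rows:
        week = day.isocalendar()[1]  # ISO week of the year
        totals[(item, week)] += sales
    return dict(totals)


totals = weekly_sales(rows)
for (item, week), amount in sorted(totals.items()):
    print(f"{item:8s} week {week}: {amount:.2f}")
```

Switching from week to day or quarter in the interface amounts to changing the grouping key, and adding site to the key would give the further breakdown by place.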

That's great

I want to break it down one more by putting in the site, the place that it sold

So, I'm going to grab that and I'm going to put it right over here

And now you see I've got it broken down by the item that is sold and the different sites

I'm going to color the sites, and all I've got to do to do that is, I'm going to grab site and drag it onto color

Now, I've got two different colors for my sites

And this makes it a lot easier to tell what is going on

And in fact, there is some other cool stuff you can do

One of the things I'm going to do is come over here to analytics and I can tell it to put an average line through everything, so I'll just drag this over here

Now we have the average for each line


And I can even do forecasting

Let me get a little bit of a forecast right here

I will drag this on and if you can go over here

I will get this out of the way for a second

Now, I have a forecast for the next few weeks, and that's a really convenient, quick, and easy thing

And again, for some organizations that might be all that they really need

And so, what I'm showing you here is the absolute basic operation of Tableau, which allows you to do an incredible range of visualizations and manipulate the data and create interactive dashboards

There's so much to it and we'll show that in another course, but for right now I want to show you one last thing about Tableau Public, and that is saving the files

So now, when I come here and save it, it's going to ask me to sign into Tableau Public

Now, I sign in and it asks me how I want to save this, same name as the video

There we go, and I'm going to hit save

And then that opens up a web browser, and since I'm already logged into my account, see here's my account and my profile

Here's the page that I created

And it's got everything that I need there; I'm going to edit just a few details

I'm going to say, for instance, I'm going to leave its name just like that

I can put more of a description in there if I wanted

I can allow people to download the workbook and its data; I'm going to leave that there so you can download it if you need to

If I had more than one tab, I would do this thing that says show the different sheets as tabs

Hit save

And there's my data set and also it's published online and people can now find it

And so what you have here is an incredible tool for creating interactive visualizations; you can create them with drop-down menus, and you can rearrange things, and you can make an entire dashboard

It's a fabulous way of presenting information, and as I said before, I think that for some organizations this may be as much as they need to get really good, useful information out of their data

And so I strongly recommend that you take some time to explore with Tableau, either the paid desktop version or the public version and see what you can do to get some really compelling and insightful visualizations out of your work in data science

For many people, their first experience of "Coding and Data Science" is with the application SPSS

Now, I think of SPSS and the first thing that comes to my mind is sort of life in the Ivory tower, though this looks more like Harry Potter

But, if you think about it the package name SPSS comes from Statistical Package for the Social Sciences

Although, if you ask IBM about it now, they act like it doesn't stand for anything

But, it has its background in social science research which is generally academic

And truthfully, I'm a social psychologist and that's where I first learned how to use SPSS

But, let's take a quick look at their webpage ibm


If you type that in, that will just be an alias that will take you to IBM's main webpage

Now, IBM didn't create SPSS, but they bought it around version 16, and it was very briefly known as PASW (Predictive Analytics Software); that only lasted briefly and now it's back to SPSS, which is where it's been for a long time

SPSS is a desktop program; it's pretty big, it does a lot of things, it's very powerful, and is used in a lot of academic research

It's also used in a lot of business consulting, management, even some medical research

And the thing about SPSS is, it looks like a spreadsheet but has drop-down menus to make your life a little bit easier compared to some of the programming languages that you can use

Now, you can get a free temporary version, if you're a student you can get a cheap version, otherwise SPSS costs a lot of money

But, if you have it one way or another, when you open it up this is what it is going to look like

I'm showing SPSS version 22, now it's currently on 24

And the thing about SPSS versioning is, in anything other than software packaging, these would be point updates, so I sort of feel like we should be on 17.3, as opposed to 23 or 24

Because the variations are so small that anything you learn from the early ones is going to work on the later ones and there is a lot of backwards and forwards compatibility, so I'd almost say that this one, the version I have, practically doesn't matter

You get this little welcome splash screen, and if you don't want to see it anymore you can get rid of it

I'm just going to hit cancel here

And this is our main interface

It looks a lot like a spreadsheet, the difference is, you have a separate pane for looking at variable information and then you have separate windows for output and then an optional one for something called Syntax

But, let me show you how this works by first opening up a data set

SPSS has a lot of sample data sets in them, but they are not easy to get to and they are really well hidden

On my Mac, for instance, let me go to where they are

In my mac I go to the finder, I have to go to Mac, to applications, to the folder IBM, to SPSS, to statistics, to 22 the version number, to samples, then I have to say I want the ones that are in English, and then it brings them up


.sav files are the actual data files; there are different kinds in here, so .sav is a different kind of file and then we have a different one about planning analyses

So, there are versions of it

I'm going to open up a file here called "market values.sav," a small data set in SPSS format

And if you don't have that, you can open up something else; it really doesn't matter for now

By the way, in case you haven't noticed, SPSS tends to be really really slow when it opens

Also, despite being on version 24, it tends to be kind of buggy and crashes

So, when you work with SPSS, you want to get in the habit of saving your work constantly

And also, being patient when it is time to open the program

So, here is a data set that just shows addresses and house values, and square feet for information

This, I don't even know if this is real information, it looks artificial to me

But, SPSS lets you do point and click analyses, which is unusual for a lot of things

So, I am going to come up here and I am going to say, for instance, make a graph

I'm going to make a- I'm going to use what is called a legacy dialog to get a histogram of house prices

So, I simply click values

Put that right there and I will put a normal curve on top of it and click ok

This is going to open up a new window, and it opened up a microscopic version of it, so I'm going to make that bigger

This is the output window, this is a separate window and it has a navigation pane here on the side

It tells me where the data came from, and it saves the command here, and then, you know, there's my default histogram

So, we see most of the houses were right around $125,000, and then they went up to at least $400,000

I have a mean of $256,000, a standard deviation of about $80,000, and then there are 94 houses in the data set
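Those three numbers printed with the histogram (the mean, the standard deviation, and N) are easy to reproduce in code. Here is a minimal sketch using Python's standard `statistics` module; the five house values are stand-ins I made up, so the results will not match the SPSS sample file.

```python
import statistics

# Hypothetical house values in dollars; stand-ins, not the SPSS sample data.
values = [125000, 180000, 240000, 310000, 400000]

n = len(values)
mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

print(f"N = {n}, mean = {mean:.0f}, sd = {sd:.0f}")
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would give the population version if that is what you need.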

Fine, that's great

The other thing I can do is, if I want to do some analyses, let me go back to the data just for a moment

For instance, I can come here to analyze and I can do descriptive and I'm actually going to do one here called Explore

And I'll take the purchase price and I'll put it right here and I'm going to get a whole bunch just by default

I'm going to hit ok

And it goes back to the output window

Once again made it tiny

And so, now you see beneath my chart I now have a table and I've got a bunch of information

A stem and leaf plot, and a box plot too, a great way of checking for outliers
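The way a box plot flags outliers can be sketched in a few lines. This assumes the common convention of fences at 1.5 times the interquartile range beyond the quartiles; the prices are made up, and different programs compute quartiles slightly differently, so treat the exact cut points as illustrative rather than as what SPSS would report.

```python
import statistics

# Hypothetical purchase prices (in thousands); one deliberately extreme value.
prices = [120, 130, 135, 140, 150, 155, 160, 170, 175, 400]

# Three cut points: first quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1

# Box plot fences: anything beyond 1.5 * IQR from the box is flagged.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < low_fence or p > high_fence]

print("quartiles:", q1, median, q3)
print("outliers:", outliers)
```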

And so this is a really convenient way to save things

You can export this information as images, you can export the entire file as an HTML, you can do it as a pdf or a PowerPoint

There's a lot of options here and you can customize everything that's on here

Now, I just want to show you one more thing that makes your life so much easier in SPSS

You see right here that it's putting down these commands, it's actually saying graph, and then histogram, and normal equals value

And then down here, we've got this little command right here

Most people don't know how to save their work in SPSS, and that's something where you kind of just have to do it over again every time, but there's a very simple way to do this

What I'm going to do is, I'm going to open up something called a Syntax file

I'm going to go to new, Syntax

And this is just a blank window that's a programming window, it's for saving code

And let me go back to my analysis I did a moment ago

I'll go back to analyze and I can still get at it right here

Descriptives and explore, my information is still there

And what happens here is, even though I set it up with drop-down menus and point and click, if I do this thing, paste, then what it does is, it takes the code that creates that command and it saves it to this syntax window

And this is just a text file

It saves it as .sps, but it is a text file that can be opened in anything

And what's beautiful about this is, it is really easy to copy and paste, and you can even take this into Word and do a find and replace on it, and it's really easy to replicate the analyses

And so for me, SPSS is a good program

But, until you use Syntax you don't know the true power of it and it makes your life so much easier as a way of operating it

Anyhow, this is my extremely brief introduction to SPSS

All I want to say is that it is a very common program, kind of looks like a spreadsheet, but it gives you a lot more power and options and you can use both drop-down menus and text-based Syntax commands as well to automate your work and make it easier to replicate it in the future

I want to take a look at one more application for "Coding and Data Science", that's called JASP

This is a new application, not very familiar to a lot of people and still in beta, but with an amazing promise

You can basically think of it as a free version of SPSS and you know what, we love free

But, JASP is not just free, it's also open source, and it's intuitive, and it makes analyses replicable, and it even includes Bayesian approaches

So, take that all together, you know, we're pretty happy and we're jumping for joy

So, before we move on, you just may be asking yourself, JASP, what is that? Well, the creator has emphatically denied that it stands for Just Another Statistics Program, but be that as it may, we will just go ahead and call it JASP and use it very happily

You can get to it by going to jasp-stats


And let's take a look at that right now

JASP is a new program, they say a low fat alternative to SPSS, but it is a really wonderful great way of doing statistics

You're going to want to download it, by supplying your platform; it even comes in Linux format, which is beautiful

And again, it's beta so stay posted, things are updating regularly

If you're on Mac, you're going to need to use XQuartz, that's an easy thing to install and it makes a lot of things work better

And it's a wonderful way to do analyses

When you open up JASP, it's going to look like this

It's a pretty blank interface, but it's really easy to get going with it

So for instance, you can come over here to file and you can even choose some example data sets

So for instance, here's one called Big 5 that's personality factors

And you've got data here that's really easy to work with

Let me scroll this over here for a moment

So, there's our five variables and let's do some quick analyses with these

Say for instance, we want to get descriptives; we can pick a few variables

Now, if you're familiar with SPSS, the layout feels very much the same and the output looks a lot the same

You know, all I have to do is select what I want and it immediately pops up over here

Then I can choose additional statistics, I can get quartiles, I can get the median

And you can choose plots; let's get some plots, all you have to do is click on it and they show up

And that's a really beautiful thing and you can modify these things a little bit, so for instance, I can take the plot points

Let's see if I can drag that down and if I make it small enough I can see the five plots, I went a little too far on that one

Anyhow, you can do a lot of things here

And I can hide this, I can collapse that and I can go on and do other analyses

Now, what's really neat though is when I navigate away, so I just clicked in a blank area of the results page, we are back to the data here

But if I click on one of these tables, like this one right here, it immediately brings up the commands that produced it and I can just modify it some more if I want

Say I want skewness and kurtosis, boom they are in there

It is an amazing thing and then I can come back out here, I can click away from that and I can come down to the plots, expand those, and if I click on that it brings up the commands that made them

It's an amazingly easy and intuitive way to do things

Now, there's another really nice thing about JASP and that is that you can share the information online really well through a program called osf

That stands for the Open Science Framework; that's its web address, osf

So, let's take a quick look at what that's like

Here's the Open Science Framework website and it's a wonderful service, it's free and it's designed to support open, transparent, accessible, accountable, collaborative research and I really can't say enough nice things about it

What's neat about this is once you sign up for OSF you can create your own area and I've got one of my own, I will go to that now

So, for instance, here's the datalab page in Open Science Framework

And what I've done is I created a version of this JASP analysis and I've saved it here; in fact, let's open up my JASP analysis in JASP and I'll show you what it looks like in OSF

So, let's first go back to JASP

When we're here we can come over to file and click computer and I just saved this file to the desktop

Click on desktop, and you should have been able to download this with all the other files, DS03_2_4_JASP, double click on that to open it and now it's going to open up a new window and you see I was working with the same data set, but I did a lot more analyses

I've got these graphs; I have correlations and scatter plots

Come down here, I did a linear regression

And we just click on that and you can see the commands that produce it as well as the options

I didn't do anything special for that, but I did do some confidence intervals and specified that and it's really a great way to work with all this

I'll click back in an empty area and you see the commands go away and so I've got my output here in JASP, but when I saved it though, I had the option of saving it to OSF, in fact if you go to this webpage osf.io/3t2jg you'll actually be able to go to the page where you can see and download the analyses that I conducted, let's take a look

This is that page, there's the address I just barely gave you and what you see here is the same analysis that I conducted, it's all right here, so if you're collaborating with people or if you want to show things to people, this is a wonderful way to do it

Everything is right there, this is a static image, but up at the top people have the option of downloading the original file and working with it on their own

In case you can't tell, I'm really enthusiastic about JASP and about its potential, still in beta, still growing rapidly

I see it really as an open source, free and collaborative replacement to SPSS and I think it is going to make data science work so much easier for so many people

I strongly recommend you give JASP a close look

Let's finish up our discussion of "Coding and Data Science", the applications part of it, by just briefly looking at some other software choices

And I'll have to admit it gets kind of overwhelming because there are just so many choices

Now, in addition to the spreadsheets, and Tableau, and SPSS, and JASP, that we have already talked about, there's so much more than that

I'm going to give you a range of things that I'm aware of and I'm sure I've left out some important ones or things that other people like really well, but these are some common choices and some less common, but interesting ones

Number one, in terms of ones that I haven't mentioned, is SAS

SAS is an extremely common analytical program, very powerful, used for a lot of things

It's actually the first program that I learned and on the other hand it can be kind of hard to use and it can be expensive, but there's a couple of interesting alternatives

SAS also has something called the SAS University Edition, if you're a student this is free and it's slightly reduced in what it does, but the fact is, it's free

And also it runs in a virtual machine which makes it an enormous download, but it's a good way to learn SAS if it's something that you want to do

SAS also makes a program that I would really love were it not so extraordinarily expensive, and that is called JMP, and it's visualization software

Think a little bit of Tableau, how we saw it, you work with it visually and in this one you can drag things around, it's a really wonderful program

I personally find it prohibitively expensive

Another very common choice among working analysts is Stata and some people use Minitab

Now, for mathematical people, there's MATLAB and then of course there's Mathematica itself, but it is really more of a language than a program

On the other hand, Wolfram, who makes Mathematica, is also the people who give us Wolfram Alpha; most people don't think of this as a stats application because you can run it on your iPhone

But, Wolfram Alpha is incredibly capable, and especially if you pay for the pro account, you can do amazing things in this, including analyses, regression models, visualizations and so it's worth taking a little closer look at that

Also, because it provides a lot of the data that you need, so Wolfram Alpha is an interesting one

Now, there are several applications that are more specifically geared towards data mining, so you don't want to do your regular, you know, little t tests and stuff on these

But, there's RapidMiner and there's KNIME and Orange and those are all really nice to use because they are control languages where you drag nodes onto a screen and you connect them with lines and you can see how things run through

All three of them are free or have free versions and all three of them work in pretty similar manners

There's also BigML, which is for machine learning and this is unusual because it's browser based, it runs on their servers

There's a free version, though you can't download a whole lot, it doesn't cost a lot to use BigML and it's a very friendly, very accessible program

Then in terms of programs you can actually install for free on your own computer, there's one called SOFA Statistics, it means Statistics Open For All, it's kind of a cheesy title, but it's a good program

And then one with a web page straight out of 1990 is Past 3, this is paleontological software, on the other hand it does do very general stuff, it runs on many platforms and it's a really powerful thing and it's free, but it is relatively unknown

And then speaking of relatively unknown, one that's near and dear to my heart is a web application called StatCrunch, it costs, but it costs like $6 or $12 a year, it's really cheap and it's very good, especially for basic statistics and for learning, I used it in some of the classes that I was teaching

And then if you're deeply wedded to Excel and you can't stand to leave that environment, you can purchase add-ons like XLSTAT, which give you a lot of statistical functions within the Excel environment itself

That's a lot of choices and the most important thing here is don't get overwhelmed

There's a lot of choices, but you don't even have to try all of them

Really the important question is what works best for you and the project that you're working on? Here's a few things you want to consider in that regard

First off is functionality, does it actually do what you want or does it even run on your machine? You don't need everything that a program can do

When you think about the stuff Excel can do, people probably use five percent of what's available

Second is ease of use

Some of these programs are a lot easier to use than the others and I personally find that the ones that are easier to use, I like them, so you might say, "No, I need to program because I need custom stuff"

But I'm willing to bet that 95% of what people do does not require anything custom

Also, the existence of a community

Constantly when you're working you come across problems and don't know how to solve them, and being able to get online and do a search for an answer and have enough of a community that there are people there who have put answers up and discuss these things

Those are wonderful

Some of these programs have very substantial communities and for some of them it is practically nonexistent, and it is up to you to decide how important it is to you

And then finally of course there is the issue of cost

Many of these programs I mentioned are free, some of them are very cheap, some of them run some sort of freemium model and some of them are extremely expensive

So, you don't buy them unless somebody else is paying for it

So, these are some of the things that you want to keep in mind when you're trying to look at various programs

Also, let's mention this; don't forget the 80/20 rule

You're going to be able to do most of the stuff that you need to do with only a small number of tools, one or two, maybe three, will probably be all that you ever need

So, you don't need to explore the range of every possible tool

Find something that you need, find something you're comfortable with and really try to extract as much value as you can out of that

So, in sum, in our discussion of available applications for coding and data science

First remember applications are tools, they don't drive you, you use them

And that your goals are what drive the choice of your applications and the way that you do it

And the single most important thing is to remember, what works well for somebody else may not work for you; if you're not comfortable with it, if it's not the questions you address, then it's more important to think about what works for you and the projects that you're working on as you make your own choices for tools, for working in data science

When you're "Coding in Data Science," one of the most important things you can do is be able to work with web data

And if you work with web data you're going to be working with HTML

And in case you're not familiar with it, HTML is what makes the World Wide Web go ‘round

What it stands for is HyperText Markup Language - and if you've never dealt with web pages before, here's a little secret: web pages are just text

It is just a text document, but it uses tags to define the structure of the document and a web browser knows what those tags are and it displays them the right way

So, for instance, some of the tags, they look like this

They are in angle brackets, and you have an angle bracket and then the beginning tag, so body, and then you have the body, the main part of your text, and then you have in angle brackets with a forward slash, /body, to let the computer know that you are done with that part

You also have p and /p for paragraphs

H1 is for header one and you put the header text in between those tags

TD is for table data or the cell in a table and you mark it off that way
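To make the tag idea concrete, here is a minimal sketch in Python that reads a tiny hand-written page and pulls the text out of each td cell. The page content and the `CellCollector` class are made up for illustration; `HTMLParser` is Python's built-in parser, and this is one simple way to do it, not the only one.

```python
from html.parser import HTMLParser

# A tiny hand-written page: standard HTML tags, made-up content.
PAGE = """
<html>
<head></head>
<body>
<h1>Sales</h1>
<p>A small table follows.</p>
<table>
<tr><td>Hot dog</td><td>120</td></tr>
<tr><td>Soda</td><td>60</td></tr>
</table>
</body>
</html>
"""


class CellCollector(HTMLParser):
    """Collect the text inside every <td>...</td> pair, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new table row begins
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":  # the closing tag, written </td>, ends the cell
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


parser = CellCollector()
parser.feed(PAGE)
print(parser.rows)
```

The parser only reacts to the opening and closing tags, which is why knowing the tag structure is what lets you get data off a page.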

If you want to see what it looks like just go to this document: DS03_3_1_HTML


I'm going to go to that one right now

Now, depending on what text editor you open this up in, it may actually give you the web preview

I've opened it up in TextMate and so it actually is showing the text the way I typed it

I typed this manually; I just typed it all in there

And I have HTML to say what the document is, I have an empty header, but that sort of needs to be there

This, I say what the body is, and then I have some text

li is for list items, I have headers, this is for a link to a webpage, then I have a small table

And if you want to see what this looks like when displayed as a web page, just go up here to window and show web preview

This is the same document, but now it is in a browser and that's how you make a web page

Now, I know this is very fundamental stuff, but the reason this is important is because if you're going to be extracting data from the web, you have to understand how that information is encoded in the web, and it is going to be in HTML most of the time for a regular web page

Now, I will mention that there's another thing called CSS

Web pages use CSS to define the appearance of a document

HTML is theoretically there to give the content and CSS gives the appearance

And CSS stands for Cascading Style Sheets

I'm not going to worry about that right now because we're really interested in the content

And now you have the key to being able to read web pages and pull data from web pages for your data science project

So, in sum; first, the web runs on HTML and that's what makes the web pages that are there

HTML defines the page structure and the content that is on the page

And you need to learn how to navigate the tags and the structure in order to get data from the web pages for your data science projects

The next step in "Coding and Data Science" when you're working with web data is to understand a little bit about XML

I like to think of this as the part of web data that follows the imperative, "Data, define thyself"

XML stands for eXtensible Markup Language, and what it is, XML is semi-structured data

What that means is that tags define data so a computer knows what a particular piece of information is

But, unlike HTML, the tags are free to be defined any way you want

And so you have this enormous flexibility in there, but you're still able to specify it so the computer can read it

Now, there's a couple of places where you're going to see XML files

Number one is in web data

HTML defines the structure of a web page, but if they're feeding data into it, then that will often come in the form of an XML file

Interestingly, Microsoft Office files, if you have .xlsx, the X part at the end stands for a version of XML that's used to create these documents

If you use iTunes, the library information that has all of your artists, and your genres, and your ratings and stuff, that's all stored in an XML file

And then finally, data files that often go with particular programs can be saved as XML as a way of representing the structure of the data to the program

And for XML, tags use opening and closing angle brackets just like HTML did

Again, the major difference is that you're free to define the tags however you want

So for instance, thinking about iTunes, you can define a tag that's genre, and you have the angle brackets with genre to begin that information, and then you have the angle brackets with the forward slash to let it know you're done with that piece of information

Or, you can do it for composer, or you can do it for rating, or you can do it for comments, and you can create any tags you want and you put the information in between those two things
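Here is a minimal sketch in Python of what those self-defined tags look like and how a program reads them. The tag names (library, track, genre, composer, rating) are ones I made up in the spirit of a music library; this is not the actual iTunes file format.

```python
import xml.etree.ElementTree as ET

# Made-up tags in the spirit of a music library; not the real iTunes schema.
DOC = """
<library>
  <track>
    <name>Example Song</name>
    <genre>Jazz</genre>
    <composer>A. Composer</composer>
    <rating>5</rating>
  </track>
</library>
"""

root = ET.fromstring(DOC.strip())

# Each child tag names the piece of information it carries.
for track in root.findall("track"):
    record = {child.tag: child.text for child in track}
    print(record)
```

Because the tags name the data, the program can turn each track into a labeled record without being told in advance what the fields are.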

Now, let's take an example of how this works

I'm going to show you a quick dataset that comes from the web

It's at ergast.com/api, and this is a website that stores information about automobile Formula One racing

Let's go to this webpage and take a quick look at what it's like

So, here we are at Ergast.com, and it's the API for Formula One

And what I'm bringing up is the results of the 1957 season in Formula One racing

And here you can see who the competitors were in each race, and how they finished and so on

So, this is a dataset that is being displayed in a web page

If you want to see what it looks like in XML, all you have to do is type XML onto the end of this:

I've done that already, so I'm just going to go to that one

And as you see, it's only this bit that I've added:

Now, it looks exactly the same because the web page is structuring XML data by default but if you want to see what it looks like in its raw format, just do an option-click on the web page, and go to view page source

At least that's how it works in Chrome, and this is the structured XML page

And you can see we have tags here

It says Race Name, Circuit Name, Location, and obviously, these are not standard HTML tags

They are defined for the purposes of this particular dataset

But we begin with one

We have Circuit Name right there, and then we close it using the forward slash right there

And so this is structured data; the computer knows how to read it, which is exactly how it displays it by default

So, it's a really good way of displaying data and it's a good way to know how to pull data from the web

You can actually use what is called an API, an Application Programming Interface, to access this XML data and it pulls it in along with its structure which makes working with it really easy

What's even more interesting is how easy it is to take XML data and convert it between different formats, because it's structured and the computer knows what you're dealing with

So for example, one: it's really easy to convert XML to CSV or comma separated value files (that's the spreadsheet format) because it knows exactly what the headings are; what piece of information goes in each column

Example two: it's really easy to convert HTML documents to XML because you can think of HTML with its restricted set of tags as sort of a subset of the much freer XML

And three, you can convert CSV, or your spreadsheet comma separated value, to XML and vice versa

You can bounce them all back and forth because the structure is made clear to the programs you're working with
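The first of those conversions, XML to CSV, can be sketched like this in Python. The tag names and the two results are made up for illustration and are not the Ergast schema, and the hypothetical `xml_to_csv` helper assumes every record has the same child tags in the same order.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical race results; illustrative tags, not the Ergast schema.
DOC = """<results>
  <result><driver>Driver A</driver><position>1</position></result>
  <result><driver>Driver B</driver><position>2</position></result>
</results>"""


def xml_to_csv(xml_text):
    """Turn each child of the root into one CSV row; tag names become headings."""
    root = ET.fromstring(xml_text)
    records = [{child.tag: child.text for child in rec} for rec in root]
    fieldnames = list(records[0].keys())  # headings come from the first record
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


csv_text = xml_to_csv(DOC)
print(csv_text)
```

The column headings come straight from the tag names, which is the point: the structure is already in the data, so the program does not have to guess.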

So in sum, here's what we can say

Number one, XML is semi-structured data

What that means is that it has tags to tell the computer what the piece of information is, but you can make the tags whatever you want them to be

And XML is very common for web data, and it's really easy to translate between formats: XML, HTML, CSV, and so on and so forth

It's really easy to translate them back and forth, which gives you a lot of flexibility in manipulating data so you can get it into the format you need for your own analysis

The last thing I want to mention about "Coding and Data Science" and web data is something called JSON

And I like to think of it as a version of smaller is better

Now, what JSON stands for is JavaScript Object Notation, although JavaScript is supposed to be one word

And what it is, is that like XML, JSON is semi-structured data

That is, you have tags that define the data, so the computer knows what each piece of information is, but like XML the tags can vary freely

And so there's a lot in common between XML and JSON

So XML is a Markup Language (that's what the ML stands for), and that gives meaning to the text; it lets the computer know what each piece of information is

Also, XML allows you to make comments in the document, and it allows you to put metadata in the tags, so you can actually put information there in the angle brackets to provide additional context

JSON, on the other hand, is specifically designed for data interchange, and so it's got that special focus

And the structure: JSON corresponds with data structures; you know, it directly represents objects and arrays and numbers and strings and booleans, and that works really well with the programs that are used to analyze data
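As a small illustration of how JSON lines up with ordinary data structures, here is a sketch using Python's standard `json` module. The JSON text is invented for the example; only the idea (objects, arrays, numbers, strings, and booleans mapping onto native types) comes from the discussion above.

```python
import json

# Made-up JSON text for illustration; the keys and values are assumptions.
JSON_TEXT = '{"series": "f1", "season": 1957, "finished": true, "rounds": [1, 2, 3], "winner": null}'

record = json.loads(JSON_TEXT)

# Each JSON type arrives as the matching Python type.
print(type(record).__name__)              # JSON object  -> dict
print(type(record["rounds"]).__name__)    # JSON array   -> list
print(type(record["season"]).__name__)    # JSON number  -> int
print(type(record["series"]).__name__)    # JSON string  -> str
print(type(record["finished"]).__name__)  # JSON boolean -> bool
```

Because the parsed result is already a dictionary containing lists and simple values, there is no separate step of interpreting tags before you can start working with the data.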

Also, JSON is typically shorter than XML because it does not require the closing tags

Now, there are ways to do that with XML, but that's not typically how it's done

As a result of these differences, JSON is basically taking XML's place in web data

XML still exists, it's still used for a lot of things, but JSON is slowly replacing it

And we'll take a look at the comparison between the three by going back to the example we used in XML

This is data about Formula One car races in 1957 from ergast


You can just go to the first web page here, then we will navigate to the others from that

So this is the general page

This is if you just type it in without the .XML or .JSON or anything

So it's a table of information about races in 1957

And we saw earlier that if you just add .XML to the end of this, it looks exactly the same

That's because this browser is displaying XML properly by default

But, if you were to right click on it, and go to view page source, you would get this instead, and you can see the structure

This is still XML, and so everything has an opening tag and a closing tag and some extra information in there

But, if you type in .JSON, what you really get is this jumbled mess

Now that's unfortunate because there is a lot of structure to this

So, what I am going to do is, I am actually going to copy all of this data, then I'm going to go to a little web page; there's a lot of things you can do here, and it's a cute phrase

It's called JSON Pretty Print

And that is, make it look structured so it's easier to read

I just paste that in there and hit Pretty Print JSON, and now you can see the hierarchical structure of the data
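The same pretty-printing step can be sketched locally with Python's standard `json` module instead of a web page. The compact string below is a made-up stand-in for the jumbled output, and the values in it are placeholders.

```python
import json

# Placeholder one-line JSON standing in for the unformatted output from the site.
compact = '{"series":"f1","season":"1957","races":[{"round":"1","raceName":"Example Grand Prix"}]}'

# Parse the text, then write it back out with indentation so the hierarchy is visible.
pretty = json.dumps(json.loads(compact), indent=2)
print(pretty)
```

Nothing about the data changes; parsing the pretty version gives back exactly the same structure as parsing the compact one.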

The interesting thing is that JSON only has tags at the beginning

It says series in quotes, then a colon, then it gives the piece of information in quotes, and a comma, and it moves on to the next one

And this is a lot more similar to the way data would be represented in something like R or Python

It is also more compact

Again, there are things you can do with XML, but this is one of the reasons that JSON is becoming preferred as a data carrier for websites

And as you may have guessed, it's really easy to convert between the formats

It's easy to convert between XML, JSON, CSV, etc

You can get a web page where you can paste a version in and you get the other version out

There are some differences, but for the vast majority of situations, they are just interchangeable

In sum: what did we get from this? Like XML, JSON is semi-structured data, where there are tags that say what the information is, but you define the tags however you want

JSON is specifically designed for data interchange, and because it reflects the structure of the data in the programs, that makes it really easy

Also, because it's relatively compact, JSON is gradually replacing XML on the web as the container for data on web pages

If we are going to talk about "Coding and Data Science" and the languages that are used, then first and foremost is R

The reason for that is, according to many standards, R is the language of data and data science

For example, take a look at this chart

This is a ranking based on a survey of data mining experts of the software they use in doing their work, and R is right there at the top

R is first, and in fact that's important because there's Python, which is usually taken hand in hand with R for Data Science

But R sees 50% more use than Python does, at least in this particular list

Now there's a few reasons for that popularity

Number one, R is free and it's open source, both of which make things very easy

Second, R is specially developed for vector operations

That means it's able to go through an entire list of data without having to write 'for' loops to go through

If you've ever had to write 'for' loops, you know that would be kind of disastrous having to do that with data analysis

Next, R has a fabulous community behind it

It's very easy to get help on things with R; you Google it, and you're going to end up in a place where you're going to be able to find good examples of what you need

And probably most importantly, R is very capable

R has 7,000 packages that add capabilities to R

Essentially, it can do anything

Now, when you are working with R, you actually have a choice of interfaces

That is, how you actually do the coding and how you get your results

R comes with its own IDE, or Integrated Development Environment

You can do that, or if you are on a Mac or Linux you can actually do R through the Terminal, through the command line

If you've installed R, you just type R and it starts up

There is also a very popular development environment called RStudio.com, and that's actually the one I use and the one I will be using for all my examples

But another new competitor is Jupyter, which is very commonly used for Python; that's what I use for examples there

It works in a browser window, even though it's locally installed

And with RStudio and Jupyter there are pluses and minuses to each one of them, and I'll mention them as we get to each one of them

But no matter which interface you use, R is command line: you're typing lines of code in order to give the commands

Some people get really scared about that, but really there are some advantages to that in terms of the replicability and really the accessibility, the transparency of your commands

So for instance, here's a short example of some of the commands in R

You can enter them into what is called a console, and that's just one line at a time, and that's called an interactive way

Or you can save scripts and run bits and pieces selectively, and that makes your life a lot easier

No matter how you do it, if you are familiar with programming in other languages, then you're going to find that R's a little weird

It has an idiosyncratic model

It makes sense once you get used to it, but it is a different approach, and so it takes some adaptation if you are accustomed to programming in different languages

Now, once you do your programming to get your output, what you're going to get is graphs in a separate window

You're going to get text and numbers, numerical output in the console, and no matter what you get, you can save the output to files

So that makes it portable; you can do it in other environments

But most importantly, I like to think of this: here's our box of chocolates, where you never know what you're going to get

The beauty of R is in the packages that are available to expand its capabilities

Now there are two sources of packages for R

One goes by the name of CRAN, and that stands for the Comprehensive R Archive Network, and that's at cran



And what that does is take the 7,000 different packages that are available and organize them into topics that they call task views

And for each one, if they have done their homework, they have datasets that come along with the package

You have a manual in .pdf format, and you can even have vignettes where they run through examples of how to do it

Another interface is called Crantastic! And the exclamation point is part of the title

And that is at crantastic


And what this is, is an alternative interface that links to CRAN

So if you find something you like in Crantastic! and you click on the link, it's going to open in CRAN

But the nice thing about Crantastic! is it shows the popularity of packages, and it also shows how recently they were updated, and that can be a nice way of knowing you're getting sort of the latest and greatest

Now from this very abstract presentation, we can say a few things about R: Number one, according to many, R is the language of data science, and it's a command line interface

You're typing lines of code, so that gives it both a strength and a challenge for some people

But the beautiful thing is the thousands and thousands of packages of additional code and capability that are available for R, which make it possible to do nearly anything in this statistical programming language

When talking about "Coding and Data Science" and the languages, along with R, we need to talk about Python

Now, Python, the snake, is a general-purpose language that can do it all, and that's its beauty

If we go back to the survey of the software used by data mining experts, you see that Python's there and it's number three on the list

What's significant about that is that on this list, Python is the only general-purpose programming language

It's the only one that can be theoretically used to develop any kind of application that you want

That gives it some special powers compared to all the others, most of which are very specific to data science work

The nice things about Python are: number one, it's general purpose

It's also really easy to use, and if you have a Macintosh or Linux computer, Python is built into it

Also, Python has a fabulous community around it with hundreds of thousands of people involved, and also Python has thousands of packages

Now, it actually has 70 or 80,000 packages, but in terms of ones that are for data, there are still thousands available that give it some incredible capabilities

A couple of things to know about Python

First, is about versions

There are two versions of Python that are in wide circulation: there's 2.x, so that means like 2.5 or 2.6, and 3.x, so 3.1, 3


Version 2 and version 3 are similar, but they are not identical

In fact, the problem is this: there are some compatibility issues where code that runs in one does not run in the other
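Two well-known examples of that incompatibility are the print statement and integer division. The sketch below is written for Python 3 (an assumption about the interpreter you run it with); the comments note how Python 2 behaves differently.

```python
# Division: in Python 3, / between two integers gives a float.
# In Python 2 the same expression, 7 / 2, gave the integer 3.
half = 7 / 2
floor_half = 7 // 2  # floor division gives 3 in both versions

# Printing: Python 2 allowed the statement form `print "data"`.
# Python 3 only accepts the function form, so the old form will not even compile.
try:
    compile('print "data"', "<python-2-style>", "exec")
    py2_print_compiles = True
except SyntaxError:
    py2_print_compiles = False

print(half, floor_half, py2_print_compiles)
```

Small differences like these are why a script written for one version may need changes before it runs in the other.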

And consequently, most people have to choose between one and the other

And what this leads to is that many people still use 2.x

I have to admit, in the examples that I use, I'm using 2.x because so many of the data science packages are developed with that in mind

Now let me say a few things about the interfaces for Python

First, Python does come with its own Integrated Development and Learning Environment, and they call it IDLE

You can also run it from the Terminal, or command line interface, or any IDE that you have

A very common and a very good choice is Jupyter

Jupyter is a browser-based framework for programming, and it was originally called IPython

That served as its initial form, so a lot of the time when people are talking about IPython, what they are really talking about is Python in Jupyter, and the two are sometimes used interchangeably

One of the neat things you can do: there are two companies, Continuum and Enthought, both of which have made special distributions of Python with hundreds and hundreds of packages preconfigured to make it very easy to work with data

I personally prefer Continuum Anaconda; it's the one that I use, and a lot of other people use it, but either one is going to work and it's going to get you up and running

And like I said with R, no matter what interface you use, all of them are command line

You're typing lines of code

Again, there is tremendous strength to that, but it can be intimidating to some people at first

In terms of the actual commands of Python, we have some examples here on the side, and the important thing to remember is that it's a text interface

On the other hand, Python is familiar to millions of people because it is very often the first programming language people learn for general-purpose programming

And there are a lot of very simple adaptations for data that make it very powerful for data science work

So, let me say something else again: data science loves Jupyter, and Jupyter is the browser-based framework

It's a local installation, but you access it through a web browser, and that makes it possible to really do some excellent work in data science

There's a few reasons for this

When you're working in Jupyter you get text output, and you can use what's called Markdown as a way of formatting documents

You can get inline graphics, so the graphics show up directly beneath the code that produced them

It's also really easy to organize, present, and share analyses that are done in Jupyter

Which makes it a strong contender for your choices in how you do data science programming

Another one of the beautiful things about Python, like R, is there are thousands of packages available

In Python, there is one main repository; it goes by the name PyPI, which is for the Python Package Index

Right here it says there are over 80,000 packages, and 7 or 8,000 of those are for data-specific purposes

Some of the packages that you will get to be very familiar with are NumPy and SciPy, which are for scientific computing in general; Matplotlib and a development of it called Seaborn are for data visualization and graphics

Pandas is the main package for doing statistical analysis

And for machine learning, almost nothing beats scikit-learn

And when I go through hands-on examples in Python, I will be using all of these as a way of demonstrating the power of the program for working with data

In sum we can say a few things: Python is a very popular program, very familiar to millions of people, and that makes it a good choice

Second, of all the languages we use for data science on a frequent basis, this is the only one that's general purpose

Which means it can be used for a lot of things other than processing data

And it gets its power, like R does, from having thousands of contributed packages, which greatly expand its capabilities, especially in terms of doing data science work

A choice for "Coding in Data Science," one of the languages that may not come immediately to mind when people think of data science, is Sequel, or SQL

SQL is the language of databases, and we think, "Why do we want to work in SQL?" Well, to paraphrase the famous bank robber Willie Sutton, who apparently explained why he robbed banks by saying, "Because that's where the money is"

The reason we would work with SQL in data science is because that's where the data is

Let's take another look at our ranking of software among data mining professionals, and there's SQL

Third on the list, and also, of this list, it's the first database tool

Other tools, for instance, get much fancier and much newer and shinier, but SQL has been around for a while and is very, very capable

There's a few things to know about SQL

You will notice that I am saying Sequel even though it stands for Structured Query Language

SQL is a language, not an application

There's not a program SQL; it's a language that can be used in different applications

Primarily, SQL is designed for what are called relational databases

And those are special ways of storing structured data that you can pull in

You can put things together, you can join them in special ways, you can get summary statistics, and then what you usually do is export that data into your analytical application of choice

The big word here is RDBMS - Relational Database Management System; that is where you will usually see SQL being used as a query language

In terms of Relational Database Management Systems, there are a few very common choices

In the industrial world, where people have some money to spend, Oracle Database is a very common one, and so is Microsoft SQL Server

In the open source world, two very common choices are MySQL; even though we generally say Sequel, when it's here you generally say MySQL

Another one is PostgreSQL

These are both open source, free implementations of the language, sort of dialects of it, that make it possible for you to work with your databases and to get your information out

The neat thing about them: no matter what you do, databases minimize data redundancy by using connected tables

Each table has rows and columns, and they store different levels of abstraction or measurement, which means you only have to put the information in one place and then it can be referred to from lots of other tables

Makes it very easy to keep things organized and up to date
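Here is a minimal sketch of the connected-tables idea, using Python's built-in `sqlite3` module with an in-memory database. SQLite is swapped in only because it needs no server; the table names, column names, and rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database that lives only in memory

# Hypothetical schema: each circuit is stored once, and races point to it by id.
conn.executescript("""
CREATE TABLE circuits (circuit_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE races (
    race_id INTEGER PRIMARY KEY,
    race_name TEXT,
    circuit_id INTEGER REFERENCES circuits(circuit_id)
);
INSERT INTO circuits VALUES (1, 'Circuit A', 'City A'), (2, 'Circuit B', 'City B');
INSERT INTO races VALUES (10, 'Race One', 1), (11, 'Race Two', 1), (12, 'Race Three', 2);
""")

# A join puts the pieces back together when you need them side by side.
joined = conn.execute("""
    SELECT races.race_name, circuits.name, circuits.city
    FROM races
    JOIN circuits ON races.circuit_id = circuits.circuit_id
    ORDER BY races.race_id
""").fetchall()
conn.close()

for row in joined:
    print(row)
```

Circuit A's city is written down once even though two races use it, so a correction only has to be made in one place.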

When you are looking into a way of working with a Relational Database Management System, you get to choose, in part, whether to use a graphical user interface, or GUI

Some of those include SQL Developer and SQL Server Management Studio, two very common choices

And there are a lot of other choices, such as Toad, that are graphical interfaces for working with these databases

There are also text-based interfaces

So really, any command line interface, and any interactive development environment or programming tool, is going to be able to do that

Now, you can think of yourself on the command deck of your ship and think of a few basic commands that are very important for working with SQL

There are just a handful of commands that can get you where you need to go

There is the Select command, where you're choosing the fields that you want to include

From: says what tables you are going to be extracting them from

Where: is a way of specifying conditions, so that only the cases that meet them come back, and then Order By: obviously is just a way of putting it all in order

This works because usually when you are in a SQL database you're just pulling out the information

You want to select it, you want to organize it, and then what you are going to do is send the data to your program of choice for further analysis, like R or Python or whatever
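The four commands can be seen working together in one short query. This is a sketch using Python's built-in `sqlite3` module (chosen only because it needs no setup); the `results` table and every value in it are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (driver TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("Driver A", "Team X", 8),
        ("Driver B", "Team Y", 3),
        ("Driver C", "Team X", 6),
        ("Driver D", "Team Y", 0),
    ],
)

# SELECT picks the columns, FROM names the table,
# WHERE keeps only rows meeting the condition, ORDER BY sorts what is left.
rows = conn.execute(
    "SELECT driver, points FROM results WHERE points > ? ORDER BY points DESC",
    (2,),
).fetchall()
conn.close()

print(rows)
```

From here the list of rows could be handed to whatever analysis tool you prefer, which matches the pull-it-out-then-analyze workflow described above.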

In sum, here's what we can say about SQL: Number one, as a language it's generally associated with relational databases, which are very efficient and well-structured ways of storing data

Just a handful of basic commands can be very useful when working with databases

You don't have to be a super ninja expert; really, a handful

Five, 10 commands will probably get you everything you need out of a SQL database

Then once the data is organized, the data is typically exported to some other program for analysis

When you talk about coding in any field, one of the languages, or one of the groups of languages, that comes up most often is C, C++, and Java

These are extremely powerful languages and very frequently used for professional, production-level coding

In data science, the place where you will see these languages most often is in the bedrock

The absolute fundamental layer that makes the rest of data science possible

For instance, C and C++

C is from the '70s, C++ is from the '80s, and they have extraordinarily wide usage, and their major advantage is that they're really, really fast

In fact, C is usually used as the benchmark for how fast a language is

They are also very, very stable, which makes them really well suited to production-level code and, for instance, server use

What's really neat is that in certain situations, if time is really important, if speed's important, then you can actually use C code in R or other statistical languages

Next is Java

Java is based on C++; its major contribution was WORA, or Write Once Run Anywhere

The idea that you were going to be able to develop code that is portable to different machines and different environments

Because of that, Java is the most popular computer programming language overall across all tech situations

The place you would use these in data science, like I said, is when time is of the essence, when something has to be fast, it has to get the job accomplished quickly, and it has to not break

Then these are the ones you're probably going to use

The people who are going to use it are primarily going to be engineers

The engineers and the software developers who deal with the inner workings of the algorithms in data science, or the back end of data science

The servers and the mainframes and the entire structure that makes analysis possible

Analysts, the people who are actually analyzing the data, typically don't do hands-on work with the foundational elements

They don't usually touch C or C++; more of the work is on the front end, or closer to the high-level languages like R or Python

In sum: C, C++ and Java form a foundational bedrock in the back end of data and data science

They do this because they are very fast and they are very reliable

On the other hand, given their nature, that work is typically reserved for the engineers who are working with the equipment that runs in the back that makes the rest of the analysis possible

I want to finish our extremely brief discussion of "Coding in Data Science" and the languages that can be used by mentioning one other, called Bash

Bash really is a great example of old tools that have survived and are still being used actively and productively with new data

You can think of it this way, it's almost like typing on your typewriter

You're working at the command line, you're typing out code through a command line interface, or CLI

This method of interacting with computers practically goes back to the typewriter phase, because it predates monitors

So, before you even had a monitor, you would type out the code and it would print it out on a piece of paper

The important thing to know about the command line is it's simply a method of interacting

It's not a language, because lots of languages can run at the command line

For instance, it is important to talk about the concept of a shell

In computer science, a shell is a language or something that wraps around the computer

It's a shell around the system, that is, the interaction level where the user gets things done at a lower level that isn't really human-friendly

On Mac computers and Linux, the most common is Bash, which is short for Bourne Again Shell

On Windows computers, the most common is PowerShell

But whatever you do, there actually are a lot of choices: there's the Bourne Shell, the C shell (which is why I have a seashell right here), the Z shell, there's fish, for Friendly Interactive Shell, and a whole bunch of other choices

Bash is the most common on Mac and Linux, and PowerShell is the most common on Windows, as a method of interacting with the computer at the command line level

There's a few things you need to know about this

You have a prompt of some kind; in Bash, it's a dollar sign, and that just means type your command here

Then, the other thing is you type one line at a time

It's actually amazing how much you can get done with a one-liner program, by sort of piping things together, so one feeds into the other

You can run more complex commands if you use a script

So, you call a text document that has a bunch of things in it, and you can get much more elaborate analyses done

Now, we have our tools here

In Bash we talk about utilities, and what these are, are specific programs that accomplish specific tasks

Bash really thrives on "Do one thing, and do it very well"

There are two general categories of utilities for Bash

Number one, is the Built-ins

These are the ones that come installed with it, and so you're able to use them anytime by simply calling them by name

Some of the more common ones are: cat, which is for catenate; that's to put information together

There's awk, which is its own interpreted language, but it's often used for text processing from the command line

By the way, the name 'Awk' comes from the initials of the people who created it

Then there's grep, which is for Global search with a Regular Expression and Print

It's a way of searching for information

And then there's sed, which stands for Stream Editor, and its main use is to transform text

You can do an enormous amount with just these 4 utilities

A few more are head & tail, which display the first or last 10 lines of a document

Sort & uniq, which sort and count the number of unique lines in a document

Wc, which is for word count, and printf, which formats the output that you get in your console
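To give a feel for what a few of these utilities compute, here is a rough Python sketch of the equivalents of head, tail, sort piped into uniq with counts, and wc. It is only an analogy written in Python, not how the utilities are implemented, and the sample text is made up.

```python
from collections import Counter

# Made-up file contents, one item per line.
text = "pear\napple\npear\nfig\napple\npear\n"
lines = text.splitlines()

head = lines[:2]    # like `head -n 2`: the first two lines
tail = lines[-2:]   # like `tail -n 2`: the last two lines

# Like `sort | uniq -c`: each distinct line with how often it appears, in sorted order.
unique_counts = sorted(Counter(lines).items())

line_count = len(lines)         # like `wc -l`
word_count = len(text.split())  # like `wc -w`

print(head, tail, unique_counts, line_count, word_count)
```

At the command line each of these would be a separate small program, and piping is what chains them so that the output of one becomes the input of the next.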

And while you can get a huge amount of work done with just this small number of built-in utilities, there is also a wide range of installable, or other, command line utilities that you can add to Bash, or whatever programming language you're using

So, some really good ones that have been recently developed are jq, which is for pulling in JSON, or JavaScript Object Notation, data from the web

And then there's json2csv, which is a way of converting JSON to CSV format, which is what a lot of statistical programs are going to be happy with

There's Rio, which allows you to run a wide range of commands from the statistical programming language R in the command line as part of Bash

And then there's BigMLer

This is a command line tool that allows you to access BigML's machine learning servers through the command line

Normally, you do it through a web browser and it accesses their servers remotely

It's an amazingly useful program, but to be able to just pull it up when you're in the command line is an enormous benefit

What's interesting is that even though you have all these opportunities, all these different utilities, and you can do all these amazing things, there's still active development of utilities for the command line

So, in sum: despite being in one sense as old as the dinosaurs, the command line survives because it is extremely well evolved and well suited to its purpose of working with data

The utilities, both the built-in and the installable, are fast and they are easy

In general, they do one thing and they do it very, very well

And then surprisingly, there is an enormous amount of very active development of command line utilities for these purposes, especially with data science

One critical task when you are Coding in Data Science is to be able to find the things that you are looking for, and Regex (which is short for Regular Expressions) is a wonderful way to do that

You can think of it as the supercharged method for finding needles in haystacks

Now, Regex tends to look a little cryptic, so, for instance, here's an example

It's something that's designed to determine if something is a valid email address, and it specifies what can go in the beginning, you have the at sign in the middle, then you've got a certain number of letters and numbers, then you have to have a dot something at the end

And so, this is a special kind of code for indicating what can go where
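The exact pattern from the slide is not reproduced in the transcript, so the sketch below uses a deliberately simplified pattern of my own that follows the same outline: something before the at sign, letters and numbers after it, and a dot something at the end. Real email validation is considerably more involved, and the helper name `looks_like_email` is hypothetical.

```python
import re

# Simplified, assumed pattern: not the one shown in the lecture and not a full validator.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_email(text):
    """Return True when the text fits the simplified email shape."""
    return EMAIL_PATTERN.match(text) is not None


print(looks_like_email("someone@example.com"))     # has name, @, domain, and .com
print(looks_like_email("no-at-sign.example.com"))  # missing the @
print(looks_like_email("someone@example"))         # missing the dot something
```

Reading the pattern left to right mirrors the description: the first bracket says what can go at the beginning, the @ is a literal, and the final part insists on a period followed by at least two letters.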

Now regular expressions, or regex, are really a form of pattern matching in text

And it's a way of specifying what needs to be where, what can vary, and how much it can vary

And you can write both specific patterns, say I only want a one-letter variation here, or very general ones, like the email validator that I showed you

And the idea here is that you can write this search pattern, your little wild card thing, you can find the data, and then once you identify those cases, you export them into another program for analysis

So here's a short example of how it can work

What I've done is taken some text documents, they're actually the texts of Emma and Pygmalion, two books I got off of Project Gutenberg, and this is the command

grep ^l.ve *.txt - so what I'm looking for in either of these books are lines that start with 'l', then they can have one character, which can be whatever, then that's followed by 've', and then the *.txt means search all the text files in the particular folder

And what it found were lines that began with love, and lived, and lovely, and so on
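The same search can be sketched in Python with the standard `re` module. Instead of the Project Gutenberg files, the example runs over a handful of made-up lines, so the sample text is an assumption; the pattern `^l.ve` is the one from the command above.

```python
import re

# Same pattern as the grep command: line starts with 'l', any one character, then 've'.
pattern = re.compile(r"^l.ve")

# Made-up sample lines standing in for the text of the two books.
sample_lines = [
    "love is patient",
    "lived in the country",
    "lovely weather today",
    "alive and well",
    "leave it be",
    "Love, capitalised",
]

matches = [line for line in sample_lines if pattern.search(line)]
for line in matches:
    print(line)
```

Only the first three lines are kept: "alive" does not start with 'l', "leave" has two characters between the 'l' and the 've', and "Love" starts with a capital letter, which this case-sensitive pattern does not accept.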

Now in terms of the actual nuts and bolts of regular expressions, there are certain elements

There are literals, and those are things that are exactly what they mean

You type the letter 'l', you're looking for the letter 'l'

There are also metacharacters, which specify, for instance, that things need to go here; they're characters, but they are really code that gives representations

Now, there are also escape sequences, which say: normally this character is used as a wildcard, but I really want to look for a period as opposed to a placeholder

Then you have the entire search expression that you create, and you have the target string, the thing that it is searching through

So let me give you a few very short examples

^ this is the caret

This is sometimes called a hat or, in French, a circonflexe

What that means is you're looking for something at the beginning of the string you are searching

For example, you can have ^ and capital M; that means you need something that begins with capital M

For instance the word "Mac": true, it will find that

But if you have iMac, it's a capital M, but it's not the first letter, and so that would be false; it won't find that

The $ means you are looking for something at the end of the string

So for example: ing$ will find the word 'fling' because it ends in 'ing', but it won't find the word 'flings' because it actually ends with an 's'

And then the dot, the period, simply means that we are looking for one letter and it can be anything

So, for example, you can write 'at.'

And that will find 'data' because it has an 'a', a 't', and then one letter after it

But it won't find 'flat', because 'flat' doesn't have anything after the 'at'

And so these are extremely simple examples of how it can work
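These three examples are easy to check for yourself. Here is a small sketch with Python's standard `re` module; the helper name `found` is just something made up for the example.

```python
import re


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


results = {
    "^M in Mac": found(r"^M", "Mac"),            # M is the first character
    "^M in iMac": found(r"^M", "iMac"),          # M is there, but not first
    "ing$ in fling": found(r"ing$", "fling"),    # ends in ing
    "ing$ in flings": found(r"ing$", "flings"),  # ends in s
    "at. in data": found(r"at.", "data"),        # a, t, then one more character
    "at. in flat": found(r"at.", "flat"),        # nothing after the at
}

for label, outcome in results.items():
    print(label, outcome)
```

Each pair comes out true for the first word and false for the second, exactly as described above.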

Obviously, it gets more complicated, and the real power comes when you start combining these bits and elements

Now, one interesting thing about this is you can actually treat this as a game

I love this website; it's called Regex golf and it's at regex



And what it does is bring up lists of words, in two columns, and your job is to write a regular expression at the top that matches all the words in the left column and none of the words on the right

And uses the fewest characters possible, and you get a score! And it's a great way of learning how to do regular expressions and learning how to search in a way that is going to get you the data you need for your projects

So, in sum: Regex, or regular expressions, help you find the right data for your project; they're very powerful and they're very flexible

Now, on the other hand, they are cryptic, at least when you first look at them, but at the same time, it's like a puzzle, and it can be a lot of fun if you practice it and you see how you can find what you need

I want to thank you for joining me in "Coding in Data Science," and we'll wrap up this course by talking about some of the specific next steps you can take for working in data science

The idea here is that you want to get some tools and you want to start working with those tools

Now, please keep in mind something that I've said at another time

Data tools and data science are related, they're important, but don't make the mistake of thinking that if you know the tools, you have done the same thing as actually conducting data science

That's not true; people sometimes get a little enthusiastic and they get a little carried away

What you need to remember is the relationship really is this: Data Tools are an important part of data science, but data science itself is much bigger than just the tools

Now, speaking of tools, remember there's a few kinds that you can use, and you might want to get some experience with these

#1, in terms of just apps, specific built applications, Excel & Tableau are really fundamental, both for getting the data from clients and for doing some basic data browsing, and Tableau is really wonderful for interactive data visualization

I strongly recommend you get very comfortable with both of those

In terms of code, it's a good idea to learn either R or Python, or ideally to learn both

Ideally, because you can use them hand in hand

In terms of utilities, it's a great idea to work with Bash, the command line utility, and to use regular expressions, or regex

You can actually use regular expressions in lots and lots of programs

So they can have a very wide application

And then finally, data science requires some sort of domain expertise

You're going to need some sort of field experience or intimate understanding of a particular domain: the challenges that come up, what constitutes workable answers, and the kind of data that's available

Now, as you go through all of this, you don't need to build this monstrous list of things

Remember, you don't need everything

You don't need every tool, you don't need every function, you don't need every approach

Instead, remember to get what's best for your needs and for your style

But no matter what you do, remember that tools are tools; they are a means to an end

Instead, you want to focus on the goal of your data science project, whatever it is

And I can tell you, really, the goal is in the meaning: extracting meaning out of your data to make informed choices

In fact, I'll say a little more

The goal is always meaning

And so with that, I strongly encourage you to get some tools, get started in data science, and start finding meaning in the data that's around you

Welcome to "Mathematics in Data Science"

I'm Barton Poulson and we're going to talk about how Mathematics matters for data science

Now, you may be saying to yourself, "Why math?" and "Computers can do it, I don't need to do it"

And really fundamentally, "I don't need math, I am just here to do my work"

Well, I am here to tell you, no

You need math

That is, if you want to be a data scientist, and I assume that you do

So we are going to talk about some of the basic elements of Mathematics, really at a conceptual level, and how they apply to data science

There are a few ways that math really matters to data science

#1, it allows you to know which procedures to use and why

So you can answer your questions in a way that is the most informative and the most useful

#2, if you have a good understanding of math, then you know what to do when things don't work right

You might get impossible values, or things won't compute, and knowing the math makes a huge difference

And then #3, an interesting thing is that some mathematical procedures are easier and quicker to do by hand than by actually firing up the computer

And so for all 3 of these reasons, it's really helpful to have at least a grounding in Mathematics if you're going to do work in data science

Now, probably the most important thing to start with is algebra

And there are 3 kinds of algebra I want to mention

The first is elementary algebra, that's the regular x + y

Then there is linear or matrix algebra, which looks more complex, but conceptually it is what computers use to actually do the calculations

And then finally I am going to mention systems of linear equations, where you have multiple equations simultaneously that you're trying to solve

Now there's more math than just algebra

A few other things I'm going to cover in this course

Calculus, and a little bit of Big O, or order, which has to do with the speed and complexity of operations

A little bit of probability theory and a little bit of Bayes, or Bayes' theorem, which is used for getting posterior probabilities and changes the way you interpret the results of an analysis

And for the purposes of this course, I'm going to demonstrate the procedures by hand; of course you would use software to do this in the real world, but we are dealing with simple problems at a conceptual level

And really, the most important thing to remember is that even though a lot of people get put off by math, really, you can do it! And so, in sum: let's say these three things about math

First off, you do need some math to do good data science

It helps you diagnose problems, it helps you choose the right procedures, and interestingly you can do a lot of it by hand, or you can use software to do the calculations as well

As we begin our discussion of the role of "Mathematics and Data Science", we'll of course begin with the foundational elements

And in data science nothing is more foundational than Elementary Algebra

Now, I'd like to begin this with really just a bit of history

In case you're not aware, the first book on algebra was written in 820 by Muhammad ibn Musa al-Khwarizmi

And it was called "The Compendious Book on Calculation by Completion and Balancing"

Actually, it was called this, which if you transliterate that comes out to this, but look at this word right here

That's the "algebra", from al-jabr, which means restoration

In any case, that's where it comes from, and for our concerns, there are several kinds of algebra that we're going to talk about

There's Elementary Algebra, there's Linear Algebra and there are systems of linear equations

We'll talk about each of those in different videos

But to put it into context, let's take an example here of salaries

Now, this is based on real data from a survey of the salaries of people employed in data science, and I'm going to give a simple version of it

The salary was equal to a constant, that's sort of an average value that everybody started with, and to that you added years, then some measure of bargaining skills, and how many hours they worked per week

And that gave you your prediction, but that wasn't exact; there's also some error to throw into it to get to the precise value that each person has

Now, if you want to abbreviate this, you can write it kind of like this: S = C + Y + B + H + E, although it's more common to write it symbolically like this, and let's go through this equation very quickly

The first thing we have is the outcome; we call that the variable y for person i, where "i" stands for each case in our observations

So, here's outcome y for person i

This letter here is a Greek beta, and it represents the intercept or the average; it has a zero as its subscript because we don't multiply it times anything

But right next to it we have a coefficient for variable 1

So beta, which means a coefficient, sub 1 for the first variable, and then x 1 means variable 1, and the i means it's the score on that variable for person i, whoever we are talking about

Then we do the same thing for variables 2 and 3, and at the end, we have a little epsilon here with an i for the error term for person i, which says how far off from the prediction their actual score was
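Since the slide itself is not reproduced in this transcript, here is the equation as it can be reconstructed from the spoken description. It is a sketch under the assumption that variables 1, 2, and 3 are years, bargaining skills, and hours per week, in that order.

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i
```

Here the intercept is the constant, each remaining beta multiplies one person's score on one variable, and the epsilon term is that person's error.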

Now, I'm going to run through some of these procedures and we'll see how they can be applied to data science

But for right now let's just say this in sum

First off, Algebra is vital to data science

It allows you to combine multiple scores, get a single outcome, do a lot of other manipulations

And really, the calculations, they're easy for one case at a time

Especially when you're doing it by hand

The next step for "Mathematics for Data Science" foundations is to look at linear algebra, an extension of elementary algebra

And depending on your background, you may know this by another name, and I like to think of it as "welcome to the Matrix"

Because it's also known as matrix algebra, because we are dealing with matrices

Now, let's go back to an example I gave in the last video about salary

Where salary is equal to a constant plus years, plus bargaining, plus hours, plus error; okay, that's a way to write it out in words, and if you want to put it in symbolic form, it's going to look like this

Now before we get started with matrix algebra, we need to talk about a few new words; maybe you're familiar with them already

The first is scalar, and this means a single number

And then a vector is a single row or a single column of numbers that can be treated as a collection

That usually means a variable

And then finally, a matrix consists of many rows and columns

Sort of a big rectangle of numbers; the plural of that, by the way, is matrices, and the thing to remember is that machines love matrices

Now let's take a look at a very simple example of this

Here is a very basic representation of matrix algebra or linear algebra

Where we are showing data on two people, on four variables

So over here on the left, we have the outcomes for cases 1 and 2, or people 1 and 2

And we put it into the square brackets to indicate that it's a vector or a matrix

Here on the far left, it's a vector because it's a single column of values

Next to that is a matrix that has, here on the top, the scores for case 1, which I've written as x's

X1 is for variable 1, X2 is for variable 2, and the second subscript indicates that it's for person 1

Below that are the scores for case 2, the second person

And then over here, in another vertical column, are the regression coefficients; that's a beta there that we are using

And then finally, we've got a tiny little vector here which contains the error terms for cases 1 and 2

Now, even though you would not do this by hand, it's helpful to run through the procedure, so I'm going to show it to you by hand

And we are going to take two fictional people

This will be fictional person #1, we'll call her Sophie

We'll say that she's 28 years old, that she has good bargaining skills, a 4 on a scale of 5, that she works 50 hours a week, and that her salary is $118,000

Our second fictional person, we'll call him Lars, and we'll say that he's 34 years old, he has moderate bargaining skills, 3 out of 5, works 35 hours per week, and has a salary of $84,000

And so if we are trying to look at salaries, we can look at the matrix representation that we had here, with our variables indicated with their Latin and sometimes Greek symbols

And we will replace those variables with actual numbers

We have the salary for Sophie, our first person

So why don't we plug in the numbers here, and let's start with the result here

Sophie's salary is $118,000.00, and here's how all these numbers add up to get that

The first thing here is the intercept

And we just multiply that times 1, so that's sort of the starting point, and then we get this number 10, which actually has to do with years over 18

She's 28, so that's 10 years over 18, and we multiply each year by $1,395

Next is bargaining skills

She's got a 4 out of 5, and for each step up you get $5,900

By the way, these are real coefficients from a study of a survey of salaries of data scientists

And then finally hours per week

For each hour, you get $382

Now you can add these up and get a predicted value for her, but it's a little low

It's $30,000.00 low

You may be saying that's pretty messed up; well, that's because there are something like 40 variables in the full equation, including whether she might be the owner, and if she's the owner then yes, she's going to make a lot more

And then we do a similar thing for the second case, but what's neat about matrix algebra or linear algebra is that this means the same stuff, and what we have here are these bolded variables

They stand in for entire vectors or matrices

So for instance, this bold Y stands for the vector of outcome scores

This bolded X is the entire matrix of values that each person has on each variable

This bolded beta is all of the regression coefficients, and then this bolded epsilon is the entire vector of error terms

And so it's a really super compact way of representing the entire collection of data and coefficients that you use in predicting values
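The matrix-times-vector step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the intercept is never given as a number in this walkthrough, so it is left out, and Lars's "years over 18" value of 16 is assumed by applying the same rule that was used for Sophie. The helper name `mat_vec` is hypothetical.

```python
# Rows are people; columns are years over 18, bargaining (1-5), hours per week.
X = [
    [10, 4, 50],  # Sophie: 28 years old, so 10 years over 18
    [16, 3, 35],  # Lars: 34 years old, so 16 years over 18 (assumed, same rule)
]
beta = [1395, 5900, 382]   # dollars per unit, as quoted in the walkthrough
salary = [118000, 84000]   # observed outcomes, the vector Y

def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Contribution of the three variables only. The intercept is not stated
# in the transcript, so it is deliberately omitted here.
partial = mat_vec(X, beta)
print(partial)  # [56650, 53390]
```

Adding the intercept to each entry would give the predicted salaries, and subtracting those predictions from `salary` would give the error vector.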

So in sum, let's say this

First off, computers use matrices

They like to do linear algebra to solve problems, and it is conceptually simpler because you can put it all in there in this tight formation

In fact, it's a very compact notation and it allows you to manipulate entire collections of numbers pretty easily

And that's the major benefit of learning a little bit about linear or matrix algebra

Our next step in "Mathematics for Data Science Foundations" is systems of linear equations

And maybe you are familiar with this, but maybe you're not

And the idea here is that there are times when you actually have many unknowns and you're trying to solve for them all simultaneously

And what makes this really tricky is that a lot of these are interlocked

Specifically, that means X depends on Y, but at the same time Y depends on X

What's funny about this is it's actually pretty easy to solve these by hand, and you can also use linear or matrix algebra to do it

So let's take a little example here of sales

Let's imagine that you have a company and that you've sold 1,000 iPhone cases, so that the phones are not running around naked like they are in this picture here

Some of them sold for $20 and others sold for $5

You made a total of $5,900.00, and so the question is "How many were sold at each price?" Now, you would know this if you were keeping your records, but you can also calculate it from this little bit of information

And to show you, I'm going to do it by hand

Now, we're going to start with this

We know that sales at the two price points, x + y, add up to 1,000 total cases sold

And for revenue, we know that if you multiply a certain number times $20 and another number times $5, it all adds up to $5,900

Between the two of those we can figure out the rest

Let's start with sales

Now, what I'm going to do is try to isolate the values

I am going to do that by subtracting y from both sides, so I'm left with x is equal to 1,000 minus y

Normally I'd solve for y, but here I solved for x; you'll see why in just a second

Then we go to revenue

We know from earlier that our sales at these two price points add up to $5,900.00 total

Now what we are going to do is take the x that's right here and replace it with the equation we just got, which is 1,000 minus y

Then we multiply that through and we get $20,000.00 minus $20y plus $5y equals $5,900

Well, we can combine these two because they are the same kind of term

So, minus $20y plus $5y gives us minus $15y, and then we subtract $20,000.00 from both sides

So there it is, right there on the left, and that disappears, then I get it over on the right side

And then I do the math there, and I get minus $14,100

Well, then I divide both sides by negative $15.00, and when we do that we get y equals 940

Okay, so that's one of our values for sales

Let's go back to sales

We have x plus y equals 1,000

We take the value we just got, 940, we stick that into the equation, then we can solve for x

Just subtract 940 from each side, there we go

We get x is equal to 60

So, let's put it all together, just to recap what happened

What this tells us is that 60 cases were sold at $20.00 each

And that 940 cases were sold at $5 each
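The same two equations can be checked with a short Python sketch. Note that this uses Cramer's rule rather than the substitution shown by hand; it is swapped in here only because it is compact, and the function name `solve_2x2` is hypothetical.

```python
def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 using Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("no unique solution")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y

# Sales:   1*x + 1*y = 1000   (cases sold)
# Revenue: 20*x + 5*y = 5900  (dollars)
x, y = solve_2x2(1, 1, 1000, 20, 5, 5900)
print(x, y)

# Plug the answer back in to confirm both equations hold.
print(x + y, 20 * x + 5 * y)
```

The determinant here is minus 15 and the numerator for y is minus 14,100, which are the same two numbers that showed up in the hand calculation.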

Now, what's interesting about this is you can also do this graphically

We're going to draw it

So, I'm going to graph the two equations

Here are the original ones we had

This one predicts sales, this one gives price

The problem is, these aren't in the standard form for creating graphs

That needs to be y equals something else, so we're going to solve both of these for y

We subtract x from both sides, there it is on the left, we subtract that

Then we have y is equal to minus x plus 1,000

That's something we can graph

Then we do the same thing for price

Let's divide by 5 all the way through; that gets rid of the 5 on the y, and then we've got this 4x, so let's subtract 4x from each side

And what we are left with is y equals minus 4x plus 1,180, which is also something we can graph

So this first line, this indicates cases sold

It originally said x plus y equals 1,000, but we rearranged it to y is equal to minus x plus 1,000

And so that's the line we have here

And then we have another line, which indicates earnings

And this one was originally written as $20.00 times x plus $5.00 times y equals $5,900.00 total

We rearranged that to y equals minus 4x plus 1,180

That's the equation for the line, and then the solution is right here at the intersection

There's our intersection, and it's at 60 for the number of cases sold at $20.00 and 940 for the number of cases sold at $5.00, and that also represents the solution of the joint equations

It's a graphical way of solving a system of linear equations

So in sum, systems of linear equations allow us to balance several unknowns and find unique solutions

And in many cases, it's easy to solve by hand, and it's really easy with linear algebra when you use software to do it

As we continue our discussion of "Mathematics for Data Science" and the foundational principles, the next thing we want to talk about is Calculus

And I'm going to give a little more history right here

The reason I'm showing you pictures of stones is because the word Calculus is Latin for stone, as in a stone used for tallying

People would actually have a bag of stones and they would use it to count sheep or whatever

And the system of Calculus was formalized in the 1600s, simultaneously and independently, by Isaac Newton and Gottfried Wilhelm Leibniz

And there are 3 reasons why Calculus is important for data science

#1, it's the basis for most of the procedures we do

Things like least squares regression and probability distributions, they use Calculus in getting those answers

The second one is if you are studying anything that changes over time

If you are measuring quantities or rates that change over time, then you have to use Calculus

And third, Calculus is used in finding the maxima and minima of functions, especially when you're optimizing

Which is something I'm going to show you separately

Also, it is important to keep in mind, there are two kinds of Calculus

The first is differential Calculus, which talks about rates of change at a specific time

It's also known as the Calculus of change

The second kind of Calculus is Integral Calculus, and this is where you are trying to calculate the quantity of something at a specific time, given the rate of change

It's also known as the Calculus of Accumulation

So, let's take a look at how this works, and we're going to focus on differential Calculus

So I'm going to graph an equation here; I'm going to do y equals x squared, a very simple one, but it's a curve, which makes it harder to calculate things like the slope

Let's take a point here that's at minus 2, that's the middle of the red dot

X is equal to minus 2

And because y is equal to x squared, if we want to get the y value, all we've got to do is take that negative 2 and square it, and that gives us 4

So that's pretty easy

So the coordinates for that red point are minus 2 on x, and plus 4 on the y

Here's a harder question

"What is the slope of the curve at that exact point?" Well, it's actually a little tricky because the curve is always curving; there's no flat part on it

But we can get the answer by getting the derivative of the function

Now, there are several different ways of writing this; I am using the one that's easiest to type

And let's start with this: the n here is the exponent, and that is the squared part, so that we have x squared

And you see that same n comes down in front as a multiplier, and the exponent becomes n minus 1, so we put the value 2 in right there, and we put the 2 in right here

And then we can do a little bit of subtraction

2 minus 1 is 1, and truthfully you can just ignore an exponent of 1, and then you get 2x

That is the derivative, so what we have here is the derivative of x squared is 2x

That means the slope at any given point in the curve is 2x
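The slide is not reproduced in this transcript, so here is the rule being described, written out. This is the standard power rule, applied to the case n = 2.

```latex
\frac{d}{dx}\,x^{n} = n\,x^{n-1}
\qquad\Longrightarrow\qquad
\frac{d}{dx}\,x^{2} = 2\,x^{2-1} = 2x
```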

So, let's go back to the curve we had a moment ago

Here's our curve, here's our point at x equals minus 2, and so the slope is equal to 2x; well, we put in the minus 2, and we multiply it and we get minus 4

So that is the slope at this exact point in the curve

Okay, what if we choose a different point? Let's say we came over here to x is equal to 3? Well, the slope is equal to 2x, so that's 2 times 3, which is equal to 6
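If you want to check those two slopes without trusting the rule, a small numerical sketch can do it. This uses a central difference approximation, which is a swapped-in technique and not what was done by hand above; the step size `h` is an arbitrary assumption.

```python
def f(x):
    """The curve from the example: y equals x squared."""
    return x ** 2

def slope(func, x, h=1e-6):
    """Approximate the derivative of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Compare the numerical slope with the 2x rule at the two points used above.
for x in (-2, 3):
    print(x, round(slope(f, x), 4), 2 * x)
```

The numerical estimate and the 2x rule agree closely at both points, which is a useful sanity check whenever you derive a slope by hand.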

Great! And on the other hand, you might be saying to yourself, "And why do I care about this?" There's a reason that this is important, and it is that you can use these procedures to optimize decisions

And if that seems a little too abstract to you, that means you can use them to make more money

And I'm going to demonstrate that in the next video

But for right now, in sum, let's say this

Calculus is vital to practical data science, it's the foundation of statistics, and it forms the core that's needed for doing optimization

In our discussion about Mathematics and data science foundations, the last thing I want to talk about right here is calculus and how it relates to optimization

I like to think of this, in other words, as the place where math meets reality, or it meets Manhattan or something

Now if you remember this graph I made in the last video, y is equal to x squared, that shows this curve here, and we have the derivative, so the slope can be given by 2x

And so when x is equal to 3, the slope is equal to 6, fine

And this is where this comes into play

Calculus makes it possible to find values that maximize or minimize outcomes

And if you want to think of something a little more concrete here, let's think of an example; by the way, that's Cupid and Psyche

Let's talk about pricing for online dating

Let's assume you've created a dating service and you want to figure out how much you can charge for it that will maximize your revenue

So, let's get a few hypothetical parameters involved

First off, let's say that annual subscriptions cost $500.00 each year, and you can charge that for a dating service

And let's say you sell 180 new subscriptions every week

On the other hand, based on your previous experience manipulating prices, you have some data that suggests that for each $5 you discount from the price of $500.00 you will get 3 more sales

Also, because it's an online service, let's make our life a little easier right now and assume there is no increase in overhead

It's not really how it works, but we'll do it for now

And I'm actually going to show you how to do all this by hand

Now, let's go back to price first

We have this: $500.00 is the current annual subscription price, and you're going to subtract $5.00 for each unit of discount; that's why I'm using d

So, one discount is $5.00, two discounts is $10.00, and so on

And then we have a little bit of data about sales, that you're currently selling 180 subscriptions per week and that you will add 3 more for each unit of discount that you give

So, what we're going to do here is find sales as a function of price

Now, to do that, the first thing we have to do is get the y intercept

So we have price here: $500.00, the current annual subscription price, minus $5 times d

And what we are going to do is get the y intercept by solving for when this equals zero. Okay, well, we take the $500, we subtract that from both sides, and then we end up with minus $5d is equal to minus $500

Divide both sides by minus $5 and we are left with d is equal to 100

That is, when d is equal to 100, the price is 0

And that tells us how we can get the y intercept, but to get that we have to substitute this value into sales

So we take d is equal to 100, and the intercept is equal to 180 plus 3 times d; 180 is the number of new subscriptions per week, and then we take the three and we multiply that times our 100

So, 180 plus 3 times 100: 3 times 100 is equal to 300, add those together and you get 480

And that is the y intercept in our equation, so when we've discounted the price all the way to zero, the expected sales figure is 480

Of course that's not going to happen in reality, but it's necessary for finding the slope of the line

So now let's get the slope

The slope is equal to the change in y on the y axis divided by the change in x

One way we can get this is by looking at sales; we get our 180 new subscriptions per week plus 3 for each unit of discount, and we take our information on price

That's $500.00 a year minus $5.00 for each unit of discount, and then we take the 3d and the $5d and those will give us the slope

So it's plus 3 divided by minus 5, and that's just minus 0.6

So that is the slope of the line

Slope is equal to minus 0.6

And so what we have from this is sales as a function of price, where sales is equal to 480, because that is the y intercept when price is equal to zero, minus 0.6 times price

So, this isn't the final thing

Now what we have to do is turn this into revenue; there's another stage to this

Revenue is equal to sales times price: how many things did you sell and how much did each cost

Well, we can substitute some information in here

If we take sales and we put it in as a function of price, because we just calculated that a moment ago, then we do a little bit of multiplication and we get that revenue is equal to 480 times the price minus 0.6 times the price squared

Okay, that's a lot of stuff going on there

What we're going to do now is get the derivative; that's the calculus that we talked about

Well, the derivative of 480 times the price, where price is sort of the x, is simply 480, and the minus 0.6 times price squared? Well, that's similar to what we did with the curve

And what we end up with is 0.6 times 2, which is 1.2, so that term becomes minus 1.2 times the price

This is the derivative of the original equation

We can solve that for zero now, and just in case you are wondering

Why do we solve it for zero? Because that is going to give us the place where y is at a maximum

Now, we had a minus on the squared term, so the shape of the curve is inverted

We are trying to look for this value right here, when it's at the very tippy top of the curve, because that will indicate maximum revenue

Okay, so what we're going to do is solve for zero

Let's go back to our equation here

We want to find out when that is equal to zero. Well, we subtract 480 from each side, there we go, and we divide by minus 1.2 on each side

And this gives us $400.00, our price for maximum revenue

So we've been charging $500.00 a year, but this says we'll have more total income if we charge $400.00 instead

And if you want to find out how many sales we can get, currently we have 180, and you want to know what the sales volume is going to be at the new price

Well, you take the 480, which is the hypothetical y intercept when the price is zero, but then we put in our actual price of $400.00, multiply that by 0.6, we get 240, do the subtraction and we get 240 total

So, that would be 240 new subscriptions per week

So let's compare this

Current revenue is 180 new subscriptions per week at $500.00 per year

And that means our current revenue is $90,000.00 per week; I know it sounds really good, but we can do better than that

Because the formula for maximum value is 240 times $400.00, and when you multiply those you get $96,000

And so the improvement is just a ratio of those two

$96,000.00 divided by $90,000.00 is equal to 1.07

And what that means is a 7% increase, and anybody would be thrilled to get a 7% increase in their business simply by changing the price and increasing the overall revenue

So, let's summarize what we found here

If you lower the cost by 20%, going from $500.00 per year to $400.00 per year, and assuming all of our other information is correct, then you can increase sales by 33%; that's more than the 20% you gave up in price, and that increases total revenue by 7%

And so we can optimize the price to get the maximum total revenue, and it has to do with that little bit of calculus and the derivative of the function
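The whole pricing walkthrough can be reproduced in a short Python sketch. The numbers are the ones from the example (intercept 480, slope minus 0.6, current price $500); the function names are hypothetical, and the "no increase in overhead" assumption from the example still applies.

```python
def sales(price):
    """Weekly sales as a function of price: intercept 480, slope -0.6."""
    return 480 - 0.6 * price

def revenue(price):
    """Revenue is sales times price, which expands to 480p - 0.6p^2."""
    return sales(price) * price

# The derivative of 480p - 0.6p^2 is 480 - 1.2p.
# Setting it to zero and solving gives the revenue-maximizing price.
best_price = 480 / 1.2

print("best price:", best_price)
print("sales at best price:", sales(best_price))
print("revenue at best price:", revenue(best_price))
print("revenue at current price:", revenue(500))
print("improvement ratio:", round(revenue(best_price) / revenue(500), 2))
```

You can also try prices just above and just below the best price and confirm that revenue drops on both sides, which is what being at the top of the curve means.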

So in sum, calculus can be used to find the minima and maxima of functions, including prices

It allows for optimization, and that in turn allows you to make better business decisions

Our next topic in "Mathematics and Data Principles" is something called Big O

And if you are wondering what Big O is all about, it is about time

Or, you can think of it as how long it takes to do a particular operation

It's the speed of the operation

If you want to be really precise, the growth rate of a function, that is, how much more it requires as you add elements, is called its order

That's why it's called Big O; that's for Order

And Big O gives the rate at which things grow as the number of elements grows, and what's funny is there can be really surprising differences

Let me show you how it works with a few different kinds of growth rates or Big O

First off, there are the ones that I say are sort of on the spot; you can get stuff done right away

The simplest one is O(1), and that is constant order

That's something that takes the same amount of time, no matter what

You can send an email out to 10,000 people, just hit one button; it's done

The number of elements, the number of people, the number of operations, it just takes the same amount of time

Up from that is logarithmic, where you take the number of operations, you get the logarithm of that, and you can see it's increased, but really it's only a small increase; it tapers off really quickly

So an example is finding an item in a sorted array

Not a big deal

Next, one up from that, now this looks like a big change, but in the grand scheme, it's not a big change

This is a linear function, where each operation takes the same unit of time

So if you have 50 operations, you have 50 units of time

If you're storing 50 objects, it takes 50 units of space

So, finding an item in an unsorted list is usually going to be linear time
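To see the gap between logarithmic and linear growth, here is a minimal sketch that counts comparisons for the two search examples just mentioned. The function names are hypothetical, and the counts are for the worst case of the linear scan, where the target is the last element.

```python
def linear_search_steps(items, target):
    """Count comparisons made scanning left to right."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def binary_search_steps(sorted_items, target):
    """Count comparisons made by repeatedly halving a sorted list."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps

for n in (10, 1000, 100000):
    data = list(range(n))
    last = n - 1  # the last element is the worst case for the linear scan
    print(n, linear_search_steps(data, last), binary_search_steps(data, last))
```

The linear count grows in step with the size of the list, while the halving count barely moves as the list gets a hundred times bigger.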

Then we have the functions where I say, you know, you'd better just pack a lunch because it's going to take a while

The best example of this is called log linear

You take the number of items and you multiply that number times the log of the items

An example of this is the fast Fourier transform, which is used for dealing, for instance, with sound or anything that sort of unfolds over time

You can see it takes a lot longer; if you have 30 elements you're way up there at the top of this particular chart at 100 units of time, or 100 units of space, or however you want to put it

And it looks like a lot

But really, that's nothing compared to the next set, where I say, you know, you're just going to be camping out, you may as well go home

That includes something like the quadratic

You square the number of elements, and you see how that kind of just shoots straight up

That's quadratic growth

And so with multiplying two n-digit numbers, if you're multiplying two numbers that each have 10 digits, it's going to take you that long; it's going to take a long time

Even more extreme is this one; this is the exponential, two raised to the power of the number of items you have

You'll see, by the way, the red line does not even go all the way to the top

That's because the graphing software that I'm using doesn't draw it when it goes above my upper limit there, so it kind of cuts it off

But this is a really demanding kind of thing; it's, for instance, finding an exact solution for what's called the Travelling Salesman Problem using dynamic programming

That's an example of an exponential rate of growth

And then one more I want to mention, which is sort of catastrophic, is factorial

You take the number of elements and you take its factorial, written with the exclamation point, and you see that one cuts off very soon because it basically goes straight up

Once you have more than a handful of elements, it's going to be hugely demanding

And for instance, if you're familiar with the Travelling Salesman Problem, trying to find the solution through brute force search takes a huge amount of time

And you know, before something like that is done, you're probably going to turn to stone and wish you'd never even started

The other thing to know about this is that not only do some things take longer than others, some methods and some functions are more variable than others

So for instance, if you're working with data that you want to sort, there are different kinds of sorting methods

So for instance, there is something called an insertion sort

And when you catch this on its best day, it's linear

It's O of n, that's not bad

On the other hand, the average is Quadratic, and that's a huge difference between the two

Selection sort, on the other hand: the best is quadratic and the average is quadratic

It's always consistent, so it's kind of funny: it takes a long time, but at least you know how long it's going to take, versus the variability of something like an insertion sort
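One way to see that difference is to count comparisons. The sketch below uses textbook versions of both sorts written for this illustration (the function names are my own), and it counts how many comparisons each one makes on already-sorted input versus reversed input.

```python
def insertion_sort_comparisons(items):
    """Sort a copy of items with insertion sort; return (sorted_list, comparisons)."""
    data = list(items)
    comparisons = 0
    for i in range(1, len(data)):
        current = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if data[j] <= current:
                break  # current is already in place relative to data[j]
            data[j + 1] = data[j]  # shift the larger element right
            j -= 1
        data[j + 1] = current
    return data, comparisons

def selection_sort_comparisons(items):
    """Sort a copy of items with selection sort; return (sorted_list, comparisons)."""
    data = list(items)
    comparisons = 0
    n = len(data)
    for i in range(n - 1):
        smallest = i
        for j in range(i + 1, n):
            comparisons += 1
            if data[j] < data[smallest]:
                smallest = j
        data[i], data[smallest] = data[smallest], data[i]
    return data, comparisons

already_sorted = list(range(10))
reversed_input = list(range(9, -1, -1))
print(insertion_sort_comparisons(already_sorted)[1])  # best case: 9
print(insertion_sort_comparisons(reversed_input)[1])  # worst case: 45
print(selection_sort_comparisons(already_sorted)[1])  # always 45 for 10 items
print(selection_sort_comparisons(reversed_input)[1])  # always 45 for 10 items
```

With 10 items, insertion sort ranges from 9 comparisons to 45 depending on the input, while selection sort does 45 no matter what.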

So in sum, let me say a few things about Big O

#1, you need to know that certain functions or procedures vary in speed, and the same thing applies to the demands they make on a computer's memory or storage space or whatever

They vary in their demands

Also, some are inconsistent

Some are really efficient sometimes and really slow or difficult at other times

Probably the most important thing here is to be aware of the demands of what you are doing

You can't, for instance, run through every single possible solution, or your company will be dead before you get an answer

So be mindful of that so you can use your time well and get the insight you need, in the time that you need it

A really important element of "Mathematics and Data Science", and one of its foundational principles, is Probability

Now, one of the places where Probability comes in intuitively for a lot of people is something like rolling dice or looking at sports outcomes

And really the fundamental question is: what are the odds of something?

That gets at the heart of Probability

Now let's take a look at some of the basic principles

We've got our friend, Albert Einstein, here to explain things

The Principles of Probability work this way

Probabilities range from zero to 1; that's like a zero percent to one hundred percent chance

When you put P, and then A in parentheses, that means the Probability of whatever is in the parentheses

So P(A) means the Probability of A

And then P(B) is the Probability of B

When you take all of the probabilities together, you get what is called the probability Space

And that's why we have S, and that all adds up to 1, because you've now covered 100% of the possibilities

Also, you can talk about the complement

The tilde here is used to say the probability of not A is equal to 1 minus the probability of A, because those have to add up to 1

So, let's also take a look at conditional probabilities, which are really important in statistics

A conditional probability is the probability of something if something else is true

You write it this way: P(A | B), where that vertical line is called a Pipe and is read as "assuming that" or "given that"

So you can read this as the probability of A given B; it's the probability of A occurring if B is true

So you can ask, for instance, given this picture: if something's orange, what's the probability that it's a carrot?

Now, the place that this becomes really important for a lot of people is the probability of type one and type two errors in hypothesis testing, which we'll mention at some other point

But I do want to say something about arithmetic with probabilities, because it does not always work out the way people think it will

Let's start by talking about adding probabilities

Let's say you have two events, A and B, and let's say you want to find the probability of either one of those events occurring

So that's like adding the probabilities of the two events

Well, it's kind of easy

You take the probability of event A and you add the probability of event B; however, you may have to subtract something, the probability of A and B together, because maybe there is some overlap between the two of them

On the other hand, if A and B are disjoint, meaning they never occur together, then that overlap is equal to zero

And then you subtract zero, which just gets you back to the sum of the original probabilities

Let's take a really easy example of this

I've created my super simple sample space; I have 10 shapes

I have 5 squares on top, 5 circles on the bottom, and I've got a couple of red shapes on the right side

Let's say we want to find the probability of a square or a red shape

So we are adding the probabilities, but we have to adjust for the overlap between the two

Well, here are our squares on top

5 out of the 10 are squares, and over here on the right we have two red shapes, two out of 10

Let's go back to our formula here and let's change it a little bit

Change the A and the B to S and R for square and red

Now we can start this way: let's get the probability that something is a square

Well, we go back to our probability space and you see we have 5 squares out of 10 shapes total

So we do 5 over 10, and that reduces to 0.5

Okay, next up is the probability of something red in our sample space

Well, we have 10 shapes total, and two of them on the far right are red

That's two over 10, and you do the division and get 0.2

Now, the trick is the overlap between these two categories: do we have anything that is both square and red? Because we don't want to count that twice, we have to subtract it

Let's go back to our sample space, and we are looking for something that is square, there are the squares on top, and there are the things that are red on the side

And you see they overlap, and this is our little overlapping square

So there's one shape that meets both of those, one out of 10

So we come back here, one out of 10, that reduces to 0.1, and then we just do the addition and subtraction here

0.5 plus 0.2 minus 0.1 gets us 0.6

And so what that means is, there is a 60% chance of an object being square or red

And you can look at it right here

We have 6 shapes outlined now, and so that's the visual interpretation that lines up with the mathematical one we just did
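The same calculation can be checked by counting. This is a minimal Python sketch of the 10-shape sample space; the tuple layout and the colour label "other" for the non-red shapes are assumptions made for the illustration.

```python
from fractions import Fraction

# 5 squares and 5 circles; one square and one circle are red.
shapes = ([("square", "other")] * 4 + [("square", "red")]
          + [("circle", "other")] * 4 + [("circle", "red")])

def probability(event):
    """Share of shapes in the sample space for which event(shape) is true."""
    matches = sum(1 for shape in shapes if event(shape))
    return Fraction(matches, len(shapes))

def is_square(shape):
    return shape[0] == "square"

def is_red(shape):
    return shape[1] == "red"

p_square = probability(is_square)                          # 5/10
p_red = probability(is_red)                                # 2/10
p_both = probability(lambda s: is_square(s) and is_red(s)) # 1/10

# Addition rule: P(S or R) = P(S) + P(R) - P(S and R)
p_either = p_square + p_red - p_both
print(float(p_either))  # 0.6
```

Counting the shapes that are square or red directly gives the same 6 out of 10, which is the check that the overlap was subtracted exactly once.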

Now let's talk about multiplication for Probabilities

Now the idea here is you want to get joint probabilities, so the probability of two things occurring together, simultaneously

And what you need to do here is multiply the probabilities

And we can say the probability of A and B, because we are asking about A and B occurring together, a joint occurrence

And that's equal to the probability of A times the probability of B; that's easy

But you do have to expand it just a little bit, because one event can change how likely the other is, and so you actually need a conditional probability: the probability of A times the probability of B given A

Again, that's that vertical pipe there

On the other hand, if A and B are independent, meaning B is no more or less likely to occur if A happens, then the probability of B given A just reduces to the probability of B, and you get your slightly simpler equation

But let's go and take a look at our sample space here

So we've got our 10 shapes, 5 of each kind, and then two that are red

Originally we looked at the probability of something being square or red; now we are going to look at the probability of it being square and red

Now, I know we can eyeball this one real easy, but let's run through the math

The first thing we need to do is get the ones that are square

There are those 5 on the top, and the ones that are red, and there are those two on the right

In terms of the ones that are both square and red, yes, obviously there's just this one red square at the top right

But let's do the numbers here

We change our formula to be S and R for square and red, and we get the probability of square

Again that's those 5 out of 10, so we do 5/10, which reduces to 0.5

And then we need the probability of red given that it's a square

So, we only need to look at the squares here

There are the squares, 5 of them, and one of them is red

So that's 1 over 5

That reduces to 0.2

You multiply those two numbers, 0.5 times 0.2, and what you get is 0.10, or a 10% chance; 10 percent of our total sample space is red squares

And you come back and you look at it and you say, yeah, there's one out of 10

So, that just confirms what we were able to do intuitively
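Here is the multiplication written out as a short Python sketch over the same 10 shapes. The tuple layout and the colour label "other" are assumptions for the illustration.

```python
from fractions import Fraction

# 5 squares and 5 circles; one square and one circle are red.
shapes = ([("square", "other")] * 4 + [("square", "red")]
          + [("circle", "other")] * 4 + [("circle", "red")])

squares = [s for s in shapes if s[0] == "square"]

# P(S): squares out of all shapes.
p_square = Fraction(len(squares), len(shapes))

# P(R | S): red shapes counted among the squares only.
p_red_given_square = Fraction(sum(1 for s in squares if s[1] == "red"), len(squares))

# Multiplication rule: P(S and R) = P(S) * P(R | S)
p_square_and_red = p_square * p_red_given_square
print(float(p_square_and_red))  # 0.1
```

The key design point is that the conditional probability is computed within the squares only, which is what "given that it's a square" means.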

So, that's our short presentation on probabilities, and in sum, what did we get out of that? #1, Probability is not always intuitive

And also the idea that conditional probabilities can help in a lot of situations, but they may not work the way you expect them to

And really, the arithmetic of Probability can surprise people, so pay attention when you are working with it so you can get a more accurate conclusion in your own calculations

Let's finish our discussion of "Mathematics and Data Science" and the basic principles by looking at something called Bayes' theorem

And if you're familiar with regular probability and inferential testing, you can think of Bayes' theorem as the flip side of the coin

You can also think of it in terms of intersections

So for instance, standard inferential tests and calculations give you the probability of the data, that's our D, given the hypothesis

So, if you assume a known hypothesis is true, this will give you the probability of the data arising by chance

The trick is, most people actually want the opposite of that

They want the probability of the hypothesis given the data

And unfortunately, those two things can be very different in many circumstances

On the other hand, there's a way of dealing with it; Bayes' theorem does it, and this is our guy right here

Reverend Thomas Bayes, 18th Century English minister and statistician

He developed a method for getting what are called posterior probabilities, which use prior probabilities

It combines those with test information or something like base rates, how common something is overall, to get the posterior, or after-the-fact, Probability

Here's the general recipe for how this works: you start with the probability of the data given the hypothesis, which is the likelihood of the data

You also get that from a standard inferential test

To that, you need to bring in the probability of the hypothesis, or the cause, being true

That's called the prior, or the prior probability

Then you bring in P(D), the probability of the data, which is called the marginal probability

And then you combine those in a special way to get the probability of the hypothesis given the data, or the posterior probability

Now, if you want to write it as an equation, you can write it in words like this: posterior is equal to likelihood times prior divided by marginal

You can also write it in symbols like this: the probability of H given D, the probability of the hypothesis given the data, that's the posterior probability

It is equal to the probability of the data given the hypothesis, that's the likelihood, multiplied by the probability of the hypothesis and divided by the probability of the data overall
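Written out in symbols, the equation just described looks like this. The second line is the standard law of total probability, added here as a reminder of where the marginal comes from; it uses the tilde for "not", as in the complement rule.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}

P(D) = P(D \mid H)\, P(H) + P(D \mid {\sim}H)\, P({\sim}H)
```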

But this is a lot easier if we look at a visual version of it

So, let's go to this example here

Let's say we have a square here that represents 100% of all people, and we are looking at a medical condition

And what we are going to say here is that we've got this group up here that represents people who have a disease, so that's a portion of all people

And then what we say is, we have a test, and of the people with the disease, 90% will test positive, so they're marked in red

Now, it does mean that over here on the far left, the people with the disease who test negative, that's 10%

Those are our false negatives

And so if the test catches 90% of the people who have the disease, that's good, right? Well, let's look at it this way

Let me ask you a basic question

"If a person tests positive for a disease, then what is the probability they really have the disease?" And if you want a hint, I'm going to give you one

It's not 90%

Here's how it goes

So this is the information I gave you before: 90% of the people who have the disease test positive, and that's a conditional probability

But what about the other people, the people in the big white area below, the rest of "all people"?

We need to look at them and ask whether any of them ever test positive: do we ever get false positives? And with any test, you are going to get false positives

And so let's say that of our people without the disease, 90% test negative, the way they should

But of the people who don't have the disease, 10% test positive; those are false positives

And so if you really want to answer the question, "If you test positive, do you have the disease?", here's what you need

What you need is the number of people with the disease who test positive divided by all people who test positive

Let's look at it this way

So here's our information

We've got 29.7% of all people in this darker red box; those are the people who have the disease and test positive, alright, that's good

Then we have 6.7% of the entire group; that's the people without the disease who test positive

So what we want to do is take the percentage who have the disease and test positive and then divide that by all the people who test positive

And that bottom part is made up of two things

It's made up of the people who have the disease and test positive, and the people who don't have the disease and test positive

Now we can take our numbers and start plugging them in

Those who have the disease and test positive, that's 29.7% of the total population of everybody

We can also put that number right here

That's fine, but we also need to look at the percentage who do not have the disease and test positive; of the total population, that's 6.7%

So, we just need to rearrange: we add those two numbers on the bottom, we get 36.4%, and we do a little bit of division

And the number we get is 81.6%; here's what that means

A positive test result still only means a probability of 81.6% of having the disease

So, the test is advertised as having 90% accuracy; well, if you test positive, there's really only an 82% chance you have the disease

Now that's not really a big difference

But consider this: what if the numbers change? For instance, what if the probability of the disease changes? Here's what we originally had

Let's move it around a little bit

Let's make the disease much less common

And so now, 4.5% of all people are people who have the disease and test positive

And then, because there is a larger number of people who don't have the disease, we are going to have a relatively larger proportion of false positives

Again, compared to the entire population, it's going to be 9.5% of everybody

So we are going to go back to our formula here in words and start plugging in the numbers

We get 4.5% right there, and right there

And then we add in our other number, the false positives; that's 9.5%

Well, we rearrange and we start adding things up; that's 14%, and when we divide, we get 32.1%

Here's what that number means

A positive test result now means you only have a probability of 32.1% of having the disease

That's only about a third of the advertised accuracy of 90%, and in case you can't tell, that's a really big difference

And that's why Bayes' theorem matters: it answers the question that people actually ask, and the answer can be dramatically different depending on the base rate of the thing you are talking about
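Both versions of the example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `posterior` is my own, and the base rates of 33% and 5% are not stated in the talk; they are inferred from its figures (29.7% = 0.33 × 0.90 and 4.5% = 0.05 × 0.90).

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) from Bayes' theorem.

    prior: share of all people who have the disease (the base rate)
    sensitivity: P(positive | disease)
    false_positive_rate: P(positive | no disease)
    """
    true_positives = prior * sensitivity                   # disease and positive
    false_positives = (1 - prior) * false_positive_rate    # no disease and positive
    return true_positives / (true_positives + false_positives)

# Common disease: inferred base rate of 33%.
print(round(posterior(0.33, 0.90, 0.10), 3))  # 0.816

# Rare disease: inferred base rate of 5%.
print(round(posterior(0.05, 0.90, 0.10), 3))  # 0.321
```

The test itself is identical in both calls; only the base rate changes, and that alone moves the answer from about 82% to about 32%.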

And so in sum, we can say this

Bayes' theorem allows you to answer the right question, the one people really want to know: what's the probability that I have the disease, given a positive test?

That's not the same as the probability of getting a positive if I have the disease

They want to know whether they have the disease

And to do this, you need to have prior probabilities: you need to know how common the disease is, and you need to know how many people get positive test results overall

But if you can get that information and run it through, it can dramatically change your answers and, really, the emotional significance of what you're dealing with

Let's wrap up some of our discussion of "Mathematics and Data Science" and the data principles, and talk about some of the next steps

Things you can do afterwards

Probably the most important thing is this: you may have learned about math a long time ago, but now is a good time to dig out some of those books and go over some of the principles you've used before

The idea here is that a little math can go a long way in data science

So, things like Algebra and things like Calculus and things like Big O and Probability

All of those are important in data science, and it's helpful to have at least a working understanding of each

You don't have to know everything, but you do need to understand the principles of the procedures that you select when you do your projects

There are two reasons for that, very generally speaking

First, you need to know if a procedure will actually answer your question

Does it give you the outcome that you need? Will it give you the insight that you need? Second, and really critical, you need to know what to do when things go wrong

Things don't always work out, numbers don't always add up, you get impossible results, or things just aren't responding

You need to know enough about the procedure and enough about the mathematics behind it so you can diagnose the problem and respond appropriately

And to repeat myself once again, no matter what you're working on in data science, no matter what tool you're using, what procedure you're doing, focus on your goal

And in case you can't remember that, your goal is meaning

Your goal is always meaning

Welcome to "Statistics in Data Science"

I'm Barton Poulson, and what we are going to be doing in this course is talking about some of the ways you can use statistics to see the unseen

To infer what's there, even when most of it's hidden

Now this shouldn't be surprising

If you remember the data science Venn Diagram we talked about a while ago, we have math up here at the top right corner, but if you were to go to the original description of this Venn Diagram, its full name was math and stats

And let me just mention something, in case it's not completely obvious, about why statistics matters to data science

And the idea is this: counting is easy

It's easy to say how many times a word appears in a document, it's easy to say how many people voted for a particular candidate in one part of the country

Counting is easy, but summarizing and generalizing those things is hard

And part of the problem is there's no such thing as a definitive analysis

All analyses, really, depend on the purposes that you're dealing with

So as an example, let me give you a couple of pairs of words, and you try to summarize the difference between them in just two or three words

In a word or two, how is a souffle different from a quiche, or how is an Aspen different from a Pine tree? Or how is Baseball different from Cricket? And how are musicals different from opera? It really depends on who you are talking to, it depends on your goals, and it depends on the shared knowledge

And so, there's not a single definitive answer, and then there's the matter of generalization

Think about it again, take music

Listen to three concerti by Antonio Vivaldi, and do you think you can safely and accurately describe all of his music? Now, I actually chose Vivaldi on purpose, because even Igor Stravinsky said you could; he said Vivaldi didn't write 500 concertos, he wrote the same concerto 500 times

But take something more real world, like politics

If you talk to 400 registered voters in the US, can you then accurately predict the behavior of all of the voters? There's about 100 million voters in the US, and that's a matter of generalization

That's the sort of thing we try to take care of with inferential statistics

Now there are different methods that you can use in statistics, and all of them are designed to give you a map, a description of the data you're working on

There are descriptive statistics, there are inferential statistics, there's the inferential procedure of Hypothesis testing, and there's also estimation, and I'll talk about each of those in more depth

There are a lot of choices that have to be made, and some of the things I'm going to discuss in detail are, for instance, the choice of Estimators, which is different from estimation

Different measures of fit

Feature selection, for knowing which variables are the most important in predicting your outcome

Also common problems that arise when trying to model data, and the principles of model validation

But through all of this, the most important thing to remember is that analysis is functional

It's designed to serve a particular purpose

And there's a very wonderful quote within the statistics world that says all models are wrong

All statistical descriptions of reality are wrong, because they are not exact depictions, they are summaries, but some are useful, and that's from George Box

And so the point is, you're not trying to be totally, completely accurate, because in that case you just wouldn't do an analysis

The real question is, are you better off doing your analysis than not doing it? And truthfully, I bet you are

So in sum, we can say three things: #1, you want to use statistics to both summarize your data and to generalize from one group to another if you can

On the other hand, there is no "one true answer" with data; you've got to be flexible in terms of what your goals are and the shared knowledge

And no matter what you're doing, the utility of your analysis should guide you in your decisions

The first thing we want to cover in "Statistics in Data Science" is the principles of exploring data, and this video is just designed to give an exploration overview

So we like to think of it like this: the intrepid explorers, they're out there exploring and seeing what's in the world

You can see what's in your data; more specifically, you want to see what your dataset is like

You want to see if your assumptions are right so you can do a valid analysis with your procedure

Something that may sound very weird, but you want to listen to your data

If something's not working out, if it's not going the way you want, then you're going to have to pay attention, and exploratory data analysis is going to help you do that

Now, there are two general approaches to this

First off, there's graphical exploration, so you use graphs and pictures and visualizations to explore your data

The reason you want to do this is that graphics are very dense in information

They're also really good, in fact the best, for getting the overall impression of your data

Second to that, there is numerical exploration

Let me make it very clear: this is the second step

Do the visualization first, then do the numerical part

Now you want to do this because it can give greater precision, and it is also an opportunity to try variations on the data

You can actually do some transformations, move things around a little bit, try different methods, and see how that affects the results, see how it looks

So, let's go first to the graphical part

There are very quick and simple plots that you can do

Those include things like bar charts, histograms and scatterplots, very easy to make and a very quick way of getting to understand the variables in your dataset

In terms of numerical analysis, again after the graphical method, you can do things like transform the data, that is, take the logarithm of your numbers

You can do Empirical estimates of population numbers, and you can use robust methods

And I'll talk about all of those at length in later videos

But for right now, I can sum it up this way

The purpose of exploration is to help you get to know your data

And also you want to explore your data thoroughly before you start modelling, before you build statistical models

And all the way through, you want to make sure you listen carefully so that you can find hidden or unexpected details and leads in your data

As we move on in our discussion of "Statistics and Exploring Data", the single most important thing we can do is Exploratory Graphics

In the words of the late great Yankees catcher Yogi Berra, "You can see a lot by just looking"

And that applies to data as much as it applies to baseball

Now, there are a few reasons you want to start with graphics

#1 is to actually get a feel for the data

I mean, what's it distributed like, what's the shape, are there strange things going on

Also, it allows you to check the assumptions and see how well your data match the requirements of the analytical procedures you hope to use

You can check for anomalies like outliers and unusual distributions and errors, and also you can get suggestions

If something unusual is happening in the data, that might be a clue that you need to pursue a different angle or do a deeper analysis

Now we want to do graphics first for a couple of reasons

#1 is they are very information dense, and fundamentally humans are visual

It's our single highest-bandwidth way of getting information

It's also the best way to check for shape and gaps and outliers

There are a few ways that you can do this if you want to, and the first is with programs that rely on code

So you can use the statistical programming language R, or the general purpose language Python

You can actually do a huge amount in JavaScript, especially D3.js

Or you can use Apps that are specifically designed for exploratory analysis; that includes Tableau, both the desktop and public versions, Qlik, and even Excel is a good way to do this

And finally, you can do this by hand

John Tukey, who's the father of Exploratory Data Analysis, wrote his seminal book, a wonderful book, where it's all hand graphics, and actually it's a wonderful way to do it

But let's start the process for doing these graphics

We start with one variable

That is, univariate distributions

And so you'll get something like this; the fundamental chart is the bar chart

This is when you are dealing with categories and you are simply counting however many cases there are in each category

The nice thing about bar charts is they are really easy to read

Put them in descending order, and maybe have them vertical, maybe have them horizontal

Horizontal could be nice to make the labels a little easier to read

This is about psychological profiles of the United States; this is real data

We have most states in the friendly and conventional group, a smaller number in the temperamental and uninhibited group, and the least common in the United States is relaxed and creative

Next you can do a Box plot, sometimes called a box and whiskers plot

This is when you have a quantitative variable, something that's measured, and you can say how far apart scores are

A box plot shows quartile values, and it also shows outliers

So for instance, this is Google searches for modern dance

That's Utah at 5 standard deviations above the national average

That's where I'm from, and I'm glad to see that there

Also, it's a nice way to show many variables side by side, if they are on approximately similar scales
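To make concrete what a box plot is summarizing, here is a minimal Python sketch that computes the quartiles and flags outliers. Everything here is an assumption for illustration: the scores are made up (they are not the Google search data from the video), the helper name `box_plot_summary` is my own, and the 1.5 × IQR fence is the common convention for box plot whiskers rather than something stated in the talk.

```python
import statistics

def box_plot_summary(values):
    """Return the quartiles and the points a box plot would draw as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical standardized scores with one extreme value at 5.0.
scores = [-1.2, -0.8, -0.5, -0.3, 0.0, 0.1, 0.4, 0.6, 0.9, 5.0]
summary = box_plot_summary(scores)
print(summary["outliers"])  # [5.0]
```

The single score far above the rest is the kind of point that shows up as an isolated dot above the whisker.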

Next, if you have quantitative variables, you are going to want to do a histogram

Again, quantitative, so interval or ratio level, or measured variables

And these let you see the shape of a distribution and potentially compare many

So, here are three histograms of Google searches on Data Science, Entrepreneur and Modern Dance

And you can see they are, for the most part, normally distributed, with a couple of outliers

Once you've done one variable, or the univariate analyses, you're going to want to do two variables at a time

That is, bivariate distributions or joint distributions

Now, one easy way to do this is with grouped plots

You can do grouped bar charts and box plots

What I have here is grouped box plots

I have my three regions, Psychological Regions of the United States, and I'm showing how they rank on openness; that's a psychological characteristic

As you can see, the relaxed and creative are high and the friendly and conventional tend to be the lowest, and that's kind of how that works

It's also a good way of seeing the association between a categorical variable, like psychological region of the United States, and a quantitative outcome, which is what we have here with openness

Next, you can also do a Scatterplot

That's where you have two quantitative variables, and what you're looking for here is: is it a straight line? Is it linear? Do we have outliers? And also the strength of association

How closely do the dots all come to the regression line that we have here in the middle

And this is an interesting one for me, because we have openness across the bottom, so more open as you go to the right, and agreeableness up the side

And what you can see is there is a strong downhill association

The states that are the most open are also the least agreeable, so we're going to have to do something about that

And then finally, you're going to want to go to many variables, that is, multivariate distributions

Now, one big question here is: 3D or not 3D? Let me make an argument for not 3D

So, what I have here is a 3D Scatterplot of 3 variables from Google searches

Up the left, I have FIFA, which is for professional soccer

Down there on the bottom left, I have searches for the NFL, and on the right I have searches for the NBA

Now, I did this in R, and what's neat about this is you can click and drag and move it around

And you know, that's kind of fun; you kind of spin it around, and it gets kind of nauseating as you look at it

And this particular version, I'm using plotly in R, allows you to actually click on a point and see, let me see if I can get the floor in the right place

You can click on a point and see where it ranks on each of these characteristics

You can see, however, that this thing is hard to control, and once it stops moving, it's not much fun, and truthfully most 3D plots I've worked with are just kind of nightmares

They seem like they're a good idea, but not really

So, here's the deal

3D graphics, like the one I just showed you, because they are actually being shown in 2D, have to be in motion for you to tell what is going on at all

And fundamentally they are hard to read and confusing

Now it's true, they might be useful for finding clusters in 3 dimensions; we didn't see that in the data we had, but generally I just avoid them like the plague

What you do want to do, however, is see the connection between the variables, and for that you might want to use a matrix of plots

This is where you have, for instance, many quantitative variables; you can use markers for group membership if you want, and I find it to be much clearer than 3D

So here, I have the relationship between 4 search terms: NBA, NFL, MLB for Major League Baseball, and FIFA

You can see the individual distributions, you can see the scatterplots, you can get the correlation

Truthfully, for me this is a much easier chart to read, and you can get the richness that we need from a multidimensional display

So the questions you're trying to answer overall are: Number 1, do you have what you need? Do you have the variables that you need, do you have the ability that you need? Are there clumps or gaps in the distributions? Are there exceptional cases or anomalies that are really far out from everybody else, spikes in the scores? And of course, are there errors in the data? Are there mistakes in coding, did people forget to answer questions? Are there impossible combinations? And these kinds of things are easiest to see with a visualization that really kind of puts it there in front of you

And so in sum, I can say this about graphical exploration of data

It's a critical first step; it's basically where you always want to start

And you want to use the quick and easy methods, again

Bar charts and scatter plots are really easy to make and they're very easy to understand

And once you're done with the graphical exploration, then you can go to the second step, which is exploring the data through numbers

The next step in "Statistics and Exploring Data" is exploratory statistics, or numerical exploration of data

I like to think of this as going in order

First you do visualization, then you do the numerical part

And a couple of things to remember here

#1, you are still exploring the data

You're not modeling yet, but you are doing a quantitative exploration

This might be an opportunity to get empirical estimates, that is, of population parameters, as opposed to theoretically based ones

It's a good time to manipulate the data and explore the effect of manipulating the data, looking at subgroups, looking at transforming variables

Also, it's an opportunity to check the sensitivity of your results

Do you get the same general results if you test under different circumstances?

So we are going to talk about things like Robust Statistics, resampling data and transforming data

So, we'll start with Robust Statistics

This, by the way, is Hercules, a robust mythical character

And the idea with robust statistics is that they are stable: even when the data varies in unpredictable ways, you still get the same general impression

This is a class of statistics, an entire category, that's less affected by outliers, skewness, kurtosis, and other abnormalities in the data

So let's take a quick look

This is a very skewed distribution that I created

The median, which is the dark line in the box, is right around one

And I am going to look at two different kinds of robust statistics, the Trimmed Mean and the Winsorized Mean

With the Trimmed Mean, you take a certain percentage of data from the top and the bottom, you just throw it away, and you compute the mean for the rest

With the Winsorized Mean, you take those same extreme scores and, instead of throwing them away, you move them in to the nearest score that is not an outlier

Now the 0% is exactly the same as the regular mean, and here it's 1.24, but as we trim off or move in 5%, the mean shifts a little bit

Then at 10% it comes in a little bit more, and at 25% we are throwing away 50% of our data

25% on the top and 25% on the bottom

And we get a trimmed mean of 1.03 and a Winsorized mean of 1


When we throw away 50% from each end, or we trim 50%, that actually means we are leaving just the median, only the middle scores are left

Then we get 1


What's interesting is how close we get to that, even when we have 50% of the data left, and so that's an interesting example of how you can use robust statistics to explore data, even when you have things like strong skewness
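
To make the trimmed and Winsorized means concrete, here is a minimal sketch in Python. The function names and the scores are made up for illustration; this is not the dataset behind the chart, and it uses only the standard library.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Discard `proportion` of the scores at each end, then average the rest."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)          # number of scores cut from each tail
    return mean(data[k:len(data) - k])

def winsorized_mean(values, proportion):
    """Pull the extreme scores in to the nearest retained score, then average."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)
    low, high = data[k], data[len(data) - k - 1]
    return mean([min(max(x, low), high) for x in data])

# Hypothetical, positively skewed scores (an assumption for illustration only).
scores = [1, 1, 1, 1, 2, 2, 2, 3, 4, 23]

for p in (0.0, 0.10, 0.25):
    print(p, trimmed_mean(scores, p), winsorized_mean(scores, p))
print("median:", median(scores))
```

With 0% trimming both functions give the ordinary mean, and as the percentage goes up both move in toward the median, which is the same pattern as in the chart.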

Next is the principle of resampling

And that's like pulling marbles repeatedly from the jar, counting the colors, putting them back in, and trying again

That's an empirical estimate of sampling variability

So, sometimes you get 20% red marbles, sometimes you get 30, sometimes you get 22, and so on

There are several versions of this; they go by the names jackknife, bootstrap, and permutation

And the basic principle of resampling is also key to the process of cross-validation; I'll have more to say about validation later
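
Here is a rough sketch of the bootstrap version of that idea in Python. The jar contents, the number of resamples, and the function name are all assumptions for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_means(values, n_resamples=2000, seed=0):
    """Resample `values` with replacement many times and return each resample's mean."""
    rng = random.Random(seed)                # fixed seed so the run is repeatable
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]

# Hypothetical handful of 100 marbles: 1 = red, 0 = any other color (25% red).
marbles = [1] * 25 + [0] * 75

proportions = bootstrap_means(marbles)
print("lowest share of red: ", min(proportions))
print("highest share of red:", max(proportions))
print("spread (standard deviation) of the shares:", stdev(proportions))
```

The spread of those resampled proportions is the empirical estimate of sampling variability; nothing about the population had to be assumed beyond the sample in hand.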

And then finally there's transforming variables

Here are our caterpillars in the process of transforming into butterflies

But the idea here is that you take a difficult dataset and then you apply what's called a smooth function

There are no jumps in it, and it's something that allows you to preserve the order and work on the full dataset

So you can fix skewed data, and if in a scatter plot you have a curved line, you can fix that

And probably the best way to look at this is with something called Tukey's ladder of powers

I mentioned before John Tukey, the father of exploratory data analysis

He talked a lot about data transformations

This is his ladder, starting at the bottom with -1/x², up to the top with x³

Here's how it works: this distribution over here is a symmetrical, normally distributed variable, and as you start to move in one direction and you apply the transformation, take the square root, you see how it moves the distribution over to one end

Then the logarithm, and then you get to the end, to this minus 1 over the square of the score

And that pushes it way, way, way over

If you go the other direction, for instance you square the score, it pushes the distribution the other way, and then you cube it, and you see how it can move the distribution around in ways that allow you to actually undo the skewness and get back to a more symmetrical distribution
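
As a small sketch of how the ladder moves skewness around, the Python below applies the rungs named above to some made-up, positively skewed scores and reports a simple moment-based skewness for each. The data and the `skewness` helper are assumptions for illustration, and the scores must be positive for the logarithm and the reciprocal to work.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: average cubed deviation divided by the cubed SD."""
    m, s = mean(values), pstdev(values)
    return sum((x - m) ** 3 for x in values) / (len(values) * s ** 3)

# Rungs of the ladder mentioned in the talk, from bottom to top.
ladder = {
    "-1/x^2": lambda x: -1 / x ** 2,
    "log(x)": math.log,
    "sqrt(x)": math.sqrt,
    "x": lambda x: x,
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
}

# Hypothetical positively skewed scores (long tail on the high end).
scores = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15, 40]

skew_by_rung = {name: skewness([f(x) for x in scores]) for name, f in ladder.items()}
for name, value in skew_by_rung.items():
    print(f"{name:>8}: skewness = {value:.2f}")
```

For data like this, going down the ladder (square root, then log) pulls the long upper tail in, going up (square, cube) stretches it further, and the bottom rung can overshoot and flip the skew the other way.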

And so these are some of the approaches that you can use in the numerical exploration of data

In sum, let's say this: statistical or numerical exploration allows you to get multiple perspectives on your data

It also allows you to check the stability, to see how it works with outliers, skewness, mixed distributions, and so on

And perhaps most important, it sets the stage for the statistical modeling of your data

As a final step of "Statistics and Exploring Data", I'm going to talk about something that's not usually thought of as exploring data: basic descriptive statistics

I like to think of it this way

You've got some data, and you are trying to tell a story

More specifically, you're trying to tell your data's story

And with descriptive statistics, you can think of it as trying to use a little data to stand in for a lot of data

Using a few numbers to stand in for a large collection of numbers

And this is consistent with the advice we get from good ol' Henry David Thoreau, who told us: Simplify, simplify

If you can tell your story with more carefully chosen and more informative data, go for it

So there are a few different procedures for doing this

#1, you'll want to describe the center of your distribution of data; that is, if you're going to choose a single number, use that

#2, if you can give a second number, give something about the spread, or the dispersion or variability

And #3, give something about the shape of the distribution

Let me say more about each of these in turn

First, let's talk about center

We have the center of our rings here

Now there are a few very common measures of center, or location, or central tendency of a distribution

There's the mode, the median, and the mean

Now, there are many, many others, but those are the ones that are going to get you most of the way

Let's talk about the mode first

Now, I'm going to create a little dataset here on a scale from 1 to 11, and I'm going to put in individual scores

There's a one, and another one, and another one, and another one

Then we have a two, a two, then we have a score way over at 9 and another score over at 11

So we have 8 scores, and this is the distribution

This is actually a histogram of the dataset

The mode is the most commonly occurring score, or the most frequent score

Well, if you look at how tall each of these goes, we have more ones than anything else, and so one is the mode

Because it occurs 4 times and nothing else comes close to that

The median is a little different

The median is looking for the score that is at the center if you split the data into two equal groups

We have 8 scores, so we have to get one group of 4, that's down here, and the other group of four, this really big one because it's way out, and the median is going to be the place on the number line that splits those into two groups

That's going to be right here at one and a half

Now the mean is going to be a little more complicated, even though people understand means in general

It's the first one here that actually has a formula, where M for the mean is equal to the sum of X (that's our scores on the variable), divided by N (the number of scores)

You can also write it out with Greek notation if you want, where a capital sigma is the summation sign: M = ΣX / N, the sum of X divided by N

And with our little dataset, that works out to this: one plus one plus one plus one plus two plus two plus nine plus eleven

Add those all up and divide by 8, because that's how many scores there are

Well, that reduces to 28 divided by 8, which is equal to 3.5

If you go back to our little chart here, 3.5 is right over here

You'll notice there aren't any scores really exactly right there

That's because the mean tends to get very distorted by outliers; it follows the extreme scores

But a really nice way to see it, and I say it's more than just a visual analogy, is that if this number line were a seesaw, then the mean is exactly where the balance point or the fulcrum would be for these to be equal

People understand that

If somebody weighs more, they've got to sit in closer to balance someone who weighs less, who has to sit further out, and that's how the mean works
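
If you want to check those three numbers yourself, here is a minimal sketch in Python using the standard `statistics` module and the same eight scores; the variable names are just for illustration.

```python
from statistics import mean, median, mode

# The eight scores from the little dataset: four 1s, two 2s, a 9, and an 11.
scores = [1, 1, 1, 1, 2, 2, 9, 11]

print("mode:  ", mode(scores))     # most frequent score -> 1
print("median:", median(scores))   # midpoint between the 4th and 5th scores -> 1.5
print("mean:  ", mean(scores))     # 28 / 8 -> 3.5

# The seesaw idea: deviations below the mean exactly balance those above it.
deviations = [x - mean(scores) for x in scores]
print("sum of deviations from the mean:", sum(deviations))   # -> 0.0
```

That last line is the balance point in numbers: the deviations on the low side and the high side cancel out only at the mean.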

Now, let me give a bit of the pros and cons of each of these

The mode is easy to do; you just count how common each score is

On the other hand, it may not be close to what appears to be the center of the data

The median splits the data into two same-size groups, the same number of scores in each, and that's pretty easy to deal with, but unfortunately it's pretty hard to use that information in any statistics after that

And finally the mean: of these three it's the least intuitive, and it's the most affected by outliers and skewness, and that really may count against it, but it is the most useful statistically, and so it's the one that gets used most often

Next, there's the issue of spread; spread your tail feathers

And we have a few measures here that are pretty common also

There's the range, there are percentiles and the interquartile range, and there's the variance and standard deviation

I'll talk about each of those

First, the Range

The Range is simply the maximum score minus the minimum score, and in our case that's 11 minus 1, which is equal to 10, so we have a range of 10

I can show you that on our chart

It's just that line on the bottom from the 11 down to the one

That's a range of 10

The interquartile range, which is usually referred to simply as the IQR, is the distance between Q3, which is the third quartile score, and Q1, which is the first quartile score

If you're not familiar with quartiles, those are the same as the 75th percentile score and the 25th percentile score

Really what it is, is you're going to throw away some of the data

So let's go to our distribution here

The first thing we are going to do is throw away the two highest scores, there they are, they're greyed out now, and then we are going to throw away two of the lowest scores, they're out there

Then we are going to get the range for the remaining ones

Now, this is complicated by the fact that I have this big gap between 2 and 9, and different methods of calculating quartiles handle that gap differently

So if you use a spreadsheet, it's actually going to do an interpolation process and it will give you a value of 3.75, I believe, for the third quartile

And then down to one for the first quartile, so it's not so intuitive with this graph, but that is how it usually works

If you want to write it out, you can do it like this

The interquartile range is equal to Q3 minus Q1, and in our particular case that's 3.75 minus 1

And that of course is equal to just 2.75, and there you have it
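
Here is a short sketch of the range and the IQR in Python for the same eight scores. I'm assuming the spreadsheet's interpolation corresponds to the `"inclusive"` method of `statistics.quantiles`; other quartile methods will treat the gap between 2 and 9 differently and give other values.

```python
from statistics import quantiles

scores = [1, 1, 1, 1, 2, 2, 9, 11]

# Range: maximum score minus minimum score.
score_range = max(scores) - min(scores)

# Quartiles by linear interpolation between neighboring scores
# (the "inclusive" method; an assumption about what the spreadsheet does).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print("range:", score_range)   # 11 - 1 = 10
print("Q1:", q1, "Q3:", q3)    # 1.0 and 3.75
print("IQR:", iqr)             # 3.75 - 1 = 2.75
```

Notice that Q3 lands at 3.75, inside the gap where there are no actual scores, which is why it is not so intuitive on the chart.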

Now our final measure of spread, or variability, or dispersion, is two related measures, the variance and the standard deviation

These are a little harder to explain and a little harder to show

But the variance, which at least has the easiest formula, is this: the variance is equal to the sum (the capital sigma is the sum) of X minus M, which is how far each score is from the mean; you take each deviation and you square it, you add up all the squared deviations, and then you divide by the number of scores

So the variance is the average squared deviation from the mean

I'll try to show you that graphically

So here's our dataset and there's our mean right there at 3 and a half

Let's go to one of these twos

We have a deviation there of 1.5, and if we make a square that's 1.5 points on each side, well, there it is

We can do a similar square for the other score at two

If we go down to one, then it's going to be 2.5 squared, and it's going to be that much bigger, and we can draw one of these squares for each one of our 8 points

The squares for the scores at 9 and 11 are going to be huge and go off the page, so I'm not going to show them

But once you have all those squares, you add up the areas, divide by the number of scores, and you get the variance

So, this is the formula for the variance, but now let me show the standard deviation, which is also a very common measure

It's closely related to this; specifically, it's just the square root of the variance

Now, there's a catch here

The formulas for the variance and the standard deviation are slightly different for populations and samples, in that they use different denominators

But they give similar answers, not identical but similar; if the sample is reasonably large, say over 30 or 50, then it's really going to be just a negligible difference
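
As a sketch of that calculation, here is the variance worked out by hand in Python for the same eight scores, next to the standard library's population and sample versions. The variable names are just for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

scores = [1, 1, 1, 1, 2, 2, 9, 11]
m = mean(scores)                                   # 3.5

# One "square" per score: the squared deviation from the mean.
squared_deviations = [(x - m) ** 2 for x in scores]

population_variance = sum(squared_deviations) / len(scores)        # divide by N
sample_variance = sum(squared_deviations) / (len(scores) - 1)      # divide by N - 1

print("sum of squares:     ", sum(squared_deviations))   # 116.0
print("population variance:", population_variance)        # 116 / 8 = 14.5
print("sample variance:    ", sample_variance)            # 116 / 7
print("population SD:      ", population_variance ** 0.5)
print("sample SD:          ", sample_variance ** 0.5)

# The library functions use the same two denominators.
print(pvariance(scores), variance(scores), pstdev(scores), stdev(scores))
```

With only 8 scores the two denominators give noticeably different answers; with a few dozen scores or more the gap becomes small.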

So let's do a little pro and con of these three things

First, the Range

It's very easy to do; it only uses two numbers, the high and the low, but it's determined entirely by those two numbers

And if they're outliers, then you've got a really bad situation

The Interquartile Range, the IQR, is really good for skewed data, and that's because it ignores extremes on either end, so that's nice

And the variance and the standard deviation, while they are the least intuitive and the most affected by outliers, are also generally the most useful, because they feed into so many other procedures that are used in data science

Finally, let's talk a little bit about the shape of the distribution

You can have symmetrical or skewed distributions, unimodal, uniform, or U-shaped

You can have outliers; there are a lot of variations

Let me show you a few of them

First off is a symmetrical distribution, pretty easy

They're the same on the left and on the right

And this little pyramid shape is an example of a symmetrical distribution

There are also skewed distributions, where most of the scores are on one end and they taper off

This here is a positively skewed distribution, where most of the scores are at the low end and the outliers are on the high end

This is unimodal, our same pyramid shape

Unimodal means it has one mode, really kind of one hump in the data

That's contrasted, for instance, with bimodal, where you have two modes, and that usually happens when you have two distributions that got mixed together

There is also the uniform distribution, where every response is equally common, and there are U-shaped distributions, where people tend to pile up at one end or the other with a big dip in the middle

And so there are a lot of different variations, and you want to get the shape of the distribution to help you understand the numerical summaries, like the mean and the standard deviation, and put those into context

In sum, we can say this: when you use descriptive statistics, that allows you to be concise with your data, to tell the story and tell it succinctly

You want to focus on things like the center of the data, the spread of the data, and the shape of the data

And above all, watch out for anomalies, because they can exercise really undue influence on your interpretations; but this will help you better understand your data and prepare you for the steps to follow

As we discuss "Statistics in Data Science", one of the really big topics is going to be Inference

And I'll begin that with just a general discussion of inferential statistics

But I'd like to begin, unusually, with a joke, you may have seen this before; it says "There are two kinds of people in the world

1) Those who can extrapolate from incomplete data and, the end"

Of course, because the other group is the people who can't

But let's talk about extrapolating from incomplete data, or inferring from incomplete data

The first thing you need to know is the difference between populations and samples

A population represents all of the data, or every possible case in your group of interest

It might be everybody who's a commercial pilot, it might be whatever

But it represents everybody, or every case, in that group that you're interested in

And the thing with the population is, it just is what it is

It has its values, it has its mean and standard deviation, and you are trying to figure out what those are, because you generally use those in doing your analyses

On the other hand, samples, instead of being all of the data, are just some of the data

And the trick is they are sampled with error

You sample one group and you calculate the mean

It's not going to be the same if you do it a second time, and it's that variability in sampling that makes Inference a little tricky

Now, also in inference there are two very general approaches

There's testing, which is short for hypothesis testing, and maybe you've had some experience with this

This is where you assume a null hypothesis of no effect is true

You get your data and you calculate the probability of getting the sample data that you have if the null hypothesis is true

And if that value is small, usually less than 5%, then you reject the null hypothesis, which says really nothing's happening, and you infer that there is a difference in the population

The other most common version is Estimation

Which, for instance, is characterized by confidence intervals

That's not the only version of Estimation, but it's the most common

And this is where you sample data to estimate a population parameter value directly, so you use the sample mean to try to infer what the population mean is

You have to choose a confidence level, you have to calculate your values, and you get high and low bounds for your estimate that work with a certain level of confidence

Now, what makes both of these tricky is the basic concept of sampling error

I have a colleague who demonstrates this with colored M&M's: what percentage are red, and you get them out of the bags and you count

Now, let's talk about this with a population of numbers

I'm going to give you just a hypothetical population of the numbers 1 through 10

And what I am going to do is sample from those numbers randomly, with replacement

That means I pull a number out, it might be a one, and I put it back, so I might get the one again

So I'm going to sample with replacement, which actually may sound a little bit weird, but it's really helpful for the mathematics behind inference

And here are the samples that I got; I actually did this with software

I got a 3, 1, 5, and 7

Interestingly, that is almost all odd numbers, almost

My second sample is 4, 4, 3, 6 and 10

So you can see I got the 4 twice

And I didn't get the 1, the 2, the 5, 7, or 8 or 9

The third sample, I got three 1's! And a 10 and a 9, so we are way at the ends there

And then my fourth sample, I got a 3, 9, 2, 6, 5

All of these were drawn at random from the exact same population, but you see that the samples are very different

That's the sampling variability, or the sampling error

And that's what makes inference a little trickier

And let's just say again why the sampling variability matters

It's because inferential methods like testing and estimation try to see past the random sampling variation to get a clear picture of the underlying population
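
You can try the same little experiment yourself. Here is a minimal sketch in Python that draws four samples of five, with replacement, from the numbers 1 through 10; the seed and the sample size are assumptions, so the draws will not match the ones I showed.

```python
import random
from statistics import mean

population = list(range(1, 11))      # the numbers 1 through 10
rng = random.Random(42)              # fixed seed (an arbitrary choice) for repeatability

# random.choices samples WITH replacement, so a number can come up more than once.
samples = [rng.choices(population, k=5) for _ in range(4)]

print("population mean:", mean(population))
for i, sample in enumerate(samples, start=1):
    print(f"sample {i}: {sample}  mean = {mean(sample)}")
```

Every sample comes from the exact same population, and yet the sample means bounce around the population mean of 5.5; that bouncing is the sampling error.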

So in sum, let's say this about Inferential Statistics

You sample your data from the larger population, and as you try to interpret it, you have to adjust for error, and there are a few different ways of doing that

And the most common approaches are testing, or hypothesis testing, and estimation of parameter values

The next step in our discussion of "Statistics and Inference" is Hypothesis Testing

A very common procedure in some fields of research

I like to think of it as put your money where your mouth is and test your theory

Here's the Wright brothers out testing their plane

Now the basic idea behind hypothesis testing is this: you start out with a question

You start out with something like this: What is the probability of X occurring by chance, if randomness or meaningless sampling variation is the only explanation? Well, the response is this: if the probability of that data arising by chance when nothing's happening is low, then you reject randomness as a likely explanation

Okay, there are a few things I can say about this

#1, it's really common in scientific research, say for instance in the social sciences; it's used all the time

#2, this kind of approach can be really helpful in medical diagnostics, where you're trying to make a yes/no decision: does a person have a particular disease?

And #3, really anytime you're trying to make a go/no-go decision, which might be made for instance with a purchasing decision for a school district or implementing a particular law, you base it on the data and you have to make a yes/no decision

Hypothesis testing might be helpful in those situations

Now, you have to have hypotheses to do hypothesis testing

You start with H0, which is shorthand for the null hypothesis

And what that is, in lengthier terms, is that there is no systematic effect between groups, there's no effect between variables, and random sampling error is the only explanation for any observed differences you see

And then contrast that with HA, which is the alternative hypothesis

And this really just says there is a systematic effect, that there is in fact a correlation between variables, that there is in fact a difference between two groups, that this variable does in fact predict the other one

Let's take a look at the simplest version of this, statistically speaking

Now, what I have here is a null distribution

This is a bell curve; it's actually the standard normal distribution

It shows z-scores and their relative frequency, and what you do with this is you mark off regions of rejection

And so I've actually shaded off the highest 2.5% of the distribution and the lowest 2.5%

What's funny about this is that even though I draw it from -3 to +3, and it looks like it reaches 0 at the ends

It's actually infinite and asymptotic

But the highest and lowest 2.5% collectively leave 95% in the middle

Now, the idea is then that you gather your data, you calculate a score for your data, and you see where it falls in this distribution

And I like to think of that as you have to go down one path or the other; you have to make a decision

And you have to decide whether to retain your null hypothesis, maybe it is random, or reject it and decide no, I don't think it's random
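
Here is a minimal sketch of that decision rule in Python, using the standard normal distribution from the standard library. The function name and the example z-scores are hypothetical; the point is only to show where the rejection regions start and how the retain/reject decision falls out.

```python
from statistics import NormalDist

def two_tailed_z_decision(z, alpha=0.05):
    """Compare a z-score with the rejection regions of the standard normal curve."""
    std_normal = NormalDist()                       # mean 0, standard deviation 1
    critical = std_normal.inv_cdf(1 - alpha / 2)    # cuts off the top alpha/2
    p_value = 2 * (1 - std_normal.cdf(abs(z)))      # area in both tails beyond |z|
    return {"critical": critical, "p_value": p_value, "reject_null": abs(z) > critical}

# Hypothetical z-scores calculated from two different datasets.
for z in (1.0, 2.5):
    result = two_tailed_z_decision(z)
    decision = "reject H0" if result["reject_null"] else "retain H0"
    print(f"z = {z}: critical = ±{result['critical']:.2f}, "
          f"p = {result['p_value']:.3f} -> {decision}")
```

With 2.5% shaded in each tail, the cutoffs sit at about plus and minus 1.96, so a z-score of 1.0 lands in the middle 95% and a z-score of 2.5 lands in a rejection region.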

The trick is, things can go wrong

You can get a false positive, and this is when the sample shows some kind of statistical effect, but it's really randomness

And so for instance, in this scatterplot I have here, you can see a little downhill association, but this is in fact drawn from data that has a true correlation of zero

And I just kind of randomly sampled from it; it took about 20 rounds, but it looks negative when really there's nothing happening

The trick about false positives is that they're conditional on rejecting the null

The only way to get a false positive is if you actually conclude that there's a positive result

It goes by the highly descriptive name of a Type I error, but you get to pick a value for it, and .05, or a 5% risk if you reject the null hypothesis, is the most common value

Then there's a false negative

This is when the data looks random, but in fact it's systematic, or there's a relationship

So for instance, this scatterplot looks like there's pretty much a zero relationship, but in fact this came from two variables that were correlated at .25; that's a pretty strong association

Again, I randomly sampled from the data until I got a set that happened to look pretty flat

And a false negative is conditional on not rejecting the null

You can only get a false negative if you get a negative, if you say there's nothing there

It's also called a Type II error, and this is a value that you have to calculate based on several elements of your testing framework, so it's something to be thoughtful about
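
Both kinds of error can be seen in a small simulation. This is only a sketch under assumptions I'm choosing for illustration: a one-sample z-test of "the population mean is zero" with a known standard deviation of 1, samples of 25, and a true effect of 0.3 for the false-negative case. It is not the correlation example from the scatterplots.

```python
import random
from statistics import NormalDist, mean

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, trials=2000, seed=1):
    """Share of simulated samples in which H0 (population mean = 0) is rejected."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = mean(sample) / (sigma / n ** 0.5)      # z-score of the sample mean under H0
        if abs(z) > critical:
            rejections += 1
    return rejections / trials

# Nothing is happening (true mean is 0): every rejection is a false positive.
false_positive_rate = rejection_rate(true_mean=0.0)

# Something is happening (true mean is 0.3): every non-rejection is a false negative.
false_negative_rate = 1 - rejection_rate(true_mean=0.3)

print("false positive (Type I) rate: ", false_positive_rate)
print("false negative (Type II) rate:", false_negative_rate)
```

The false positive rate hovers around the .05 you picked, while the false negative rate depends on the size of the effect, the sample size, and the variability, which is why it has to be calculated rather than chosen.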

Now, I do have to mention one thing, big securitynotice, but wait

The problem with Hypothesis Testing; there are a few

#1, it's really easy to misinterpret it

A lot of people say, well, if you get a statistically significant result, it means that it's something big and meaningful

And that's not true, because it's confounded with sample size and a lot of other things that don't really matter

Also, a lot of other people take exception to the assumption of a null effect, or even a nil effect, that there's zero difference at all

And that, in certain situations, can be an absurd claim, so you've got to watch out for that

There's also bias from the use of a cutoff

Anytime you have a cutoff, you're going to have problems where you have cases that would have been slightly higher or slightly lower

It would have switched the dichotomous outcome, so that is a problem

And then a lot of people say it just answers the wrong question, because what it's telling you is "What's the probability of getting this data at random?" That's not what most people care about

They want it the other way, which is why I mentioned Bayes' theorem previously, and I'll say more about that later

That being said, Hypothesis Testing is still very deeply ingrained, very useful for a lot of questions, and has gotten us really far in a lot of domains

So in sum, let me say this

Hypothesis Testing is very common for yes/no outcomes and is the default in many fields

And I argue it is still useful and informative despite many of the well-substantiated critiques

We'll continue in "Statistics and Inference" by discussing Estimation

Now as opposed to Hypothesis Testing, Estimation is designed to actually give you a number, give you a value

Not just a yes/no, go/no-go, but an estimate for a parameter that you're trying to get

I like to think of it sort of as a new angle, looking at something from a different way

And the most common approach to this is Confidence Intervals

Now, the important thing to remember is that this is still an Inferential procedure

You're still using sample data and trying to make conclusions about a larger group or population

The difference here is that instead of coming up with a yes/no, you focus on likely values for the population value

Most versions of Estimation are closely related to Hypothesis Testing, sometimes seen as the flip side of the coin

And we'll see how that works in later videos

Now, I like to think of this as an ability to estimate any sample statistic, and there are a few different versions

We have Parametric versions of Estimation and Bootstrap versions; that's why I've got the boots here

And that's where you just kind of randomly sample from the data, in an effort to get an idea of the variability

You can also have central versus noncentral Confidence Intervals in Estimation, but we are not going to deal with those

Now, there are three general steps to this

First, you need to choose a confidence level

Anywhere from, say, well, you can't have zero; it has to be more than zero and it can't be 100%

Choose something in between; 95% is the most common

And what it does is give you a range, a high and a low

And the higher your level of confidence, the more confident you want to be, the wider the range is going to be between your high and your low estimates

Now, there's a fundamental trade-off in what's happening here, and it involves accuracy, which means you're on target, or more specifically that your interval contains the true population value

And the idea is that leads you to the correct Inference

There's a tradeoff between accuracy and what's called Precision in this context

And precision means a narrow interval, a small range of likely values

And what's important to emphasize is that this is independent of accuracy; you can have one without the other! Or neither, or both

In fact, let me show you how this works

What I have here is a little hypothetical situation: I've got a variable that goes from 10 to 90, and I've drawn a thick black line at 50

If you think of this in terms of percentages and political polls, it makes a very big difference if you're on the left or the right of 50%

And then I've drawn a dotted vertical line at 55 to say that that's our theoretical true population value

And what I have here is a distribution that shows possible values based on our sample data

And what you get here is not accurate, because it's centered on the wrong thing

It's actually centered on 45 as opposed to 55

And it's not precise, because it's spread way out, from maybe 10 to almost 80

So in this situation the data is really no help at all

Now, here's another one

This is accurate because it's centered on the true value

That's nice, but it's still really spread out, and you see that about 40% of the values are going to be on the other side of 50%, which might lead you to reach the wrong conclusion

That's a problem! Now, here's the nightmare situation

This is when you have a very, very precise estimate, but it's not accurate; it's wrong

And this leads you to a very false sense of security and understanding of what's going on, and you're going to totally blow it all the time

The ideal situation is this: you have an accurate estimate, where the distribution of sample values is really close to the true population value, and it's precise, it's really tightly knit, and you can see that about 95% of it is on the correct side of 50, and that's good

If you want to see all four of them here at once, we have the precise two on the bottom, the imprecise ones on the top, the accurate ones on the right, the inaccurate ones on the left

And so that's a way of comparing it

But no matter what you do, you have to interpret the confidence interval

Now, the statistically accurate way that has very little interpretation is this: you would say the 95% confidence interval for the mean is 5.8 to 7

Okay, so that's just kind of taking the output from your computer and sticking it in sentence form

The Colloquial Interpretation of this goes like this: there is a 95% chance that the population mean is between 5.8 and 7

Well, in most statistical procedures, specifically frequentist as opposed to Bayesian, you can't do that

That implies the population mean shifts, and that's not usually how people see it

Instead, a better interpretation is this: 95% of confidence intervals for randomly selected samples will contain the population mean

Now, I can show you this really easily with a little demonstration

This is where I randomly generated data from a population with a mean of 55, and I got 20 different samples

And I got the Confidence Interval from each sample, and I charted the high and the low

And the question is, did it include the true population value?

And you can see that of these 20, 19 included it; some of them barely made it

If you look at sample #1 on the far left, it barely made it

Sample #8, it doesn't look like it made it; sample 20 on the far right barely made it on the other end

Only one missed it completely; that's sample #2, which is shown in red on the left

Now, it's not always just one out of twenty; I actually had to run this simulation about 8 times, because it gave me either zero, or 3, or 1, or two, and I had to run it until I got exactly what I was looking for here

But this is what you would expect on average
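
Here is a rough sketch of that kind of demonstration in Python. The population mean of 55 comes from the example; the standard deviation of 10, the sample size of 50, and the normal-approximation interval are assumptions of mine, so treat this as an illustration rather than the simulation behind the chart.

```python
import random
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, level=0.95):
    """Large-sample (normal approximation) confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    m = mean(sample)
    return m - half_width, m + half_width

def coverage(true_mean=55, sd=10, n=50, n_samples=20, seed=0):
    """Share of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        low, high = confidence_interval(sample)
        hits += low <= true_mean <= high
    return hits / n_samples

print("20 samples:  ", coverage(n_samples=20))     # varies a lot from run to run
print("2000 samples:", coverage(n_samples=2000))   # settles close to 0.95
```

With only 20 samples the share that capture the true mean jumps around, just as it did when I reran my chart; with a couple of thousand samples it settles near 95%.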

So, let's say a few things about this

There are some things that affect the width of a Confidence Interval

The first is the confidence level, or CL

Higher confidence levels create wider intervals

The more certain you have to be, the bigger the range you're going to give to cover your bases

Second, the Standard Deviation: larger standard deviations create wider intervals

If the thing that you are studying is inherently really variable, then of course your estimate of the range is going to be more variable as well

And then finally there is the n, or the sample size

This one goes the other way

Larger sample sizes create narrower intervals

The more observations you have, the more precise and the more reliable things tend to be

I can show you each of these things graphically

Here we have a bunch of Confidence Intervals, where I am simply changing the confidence level from .50 at the low left side to .999, and as you can see, it gets much bigger as we increase

Next one is Standard Deviation

As the sample standard deviation increases from 1 to 16, you can see that the interval gets a lot bigger

And then we have sample size going from just 2 up to 512; I'm doubling it at each point

And you can see how the interval gets more and more and more precise as we go through
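
Those three effects all show up in one small formula. As a sketch, assuming the simple normal-approximation interval (width = 2 × z × SD / √n), the Python below changes one ingredient at a time; the baseline values are hypothetical, and for very small samples a t-based interval would be wider than this approximation suggests.

```python
from statistics import NormalDist

def ci_width(level, sd, n):
    """Width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)    # critical value for this confidence level
    return 2 * z * sd / n ** 0.5

# Hypothetical baseline: 95% confidence, SD of 4, sample of 32.
widths_by_level = [ci_width(cl, 4, 32) for cl in (0.50, 0.90, 0.95, 0.99, 0.999)]
widths_by_sd = [ci_width(0.95, sd, 32) for sd in (1, 2, 4, 8, 16)]
widths_by_n = [ci_width(0.95, 4, n) for n in (2, 4, 8, 16, 32, 64, 128, 256, 512)]

print("higher confidence level ->", [round(w, 2) for w in widths_by_level])
print("larger SD               ->", [round(w, 2) for w in widths_by_sd])
print("larger n                ->", [round(w, 2) for w in widths_by_n])
```

The first two lists grow and the third shrinks, and because n sits under a square root, you have to quadruple the sample size to cut the width in half.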

And so, let's say this to sum up our discussion of estimation

Confidence Intervals, which are the most common version of Estimation, focus on the population parameter

And the variation in the data is explicitly included in that Estimation

Also, you can argue that they are more informative, because not only do they tell you whether the population value is likely, but they give you a sense of the variability of the data itself, and that's one reason why people will argue that confidence intervals should always be included in any statistical analysis

As we continue our discussion on "Statistics and Data Science", we need to talk about some of the choices you have to make, some of the tradeoffs, and some of the effects that these things have

We'll begin by talking about Estimators, that is, different methods for estimating parameters

I like to think of it as this, "What kind of measuring stick or standard are you going to be using?" Now, we'll begin with the most common

This is called OLS, which is short for Ordinary Least Squares

This is a very common approach, it's used in a lot of statistics and is based on what is called the sum of squared errors, and it's characterized by an acronym called BLUE, which stands for Best Linear Unbiased Estimator

Let me show you how that works

Let's take a scatterplot here of an association between two variables

This is actually the speed of a car and the distance to stop, from about the 1920s, I think

We have a scatterplot and we can draw a straight regression line right through it

Now, the line I've used is in fact the Best Linear Unbiased Estimate, and the way that you can tell that is by getting what are called the Residuals

You take each data point and draw a perfectly vertical line up or down to the regression line, because the regression line predicts what the value would be for that value on the X axis

Those are the residuals

Each of those individual, vertical lines is a Residual

You square those and you add them up, and this regression line, the gray angled line here, will have the smallest sum of the squared residuals of any possible straight line you can run through it
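
To make the "smallest sum of squared residuals" claim concrete, here is a small sketch. The data are synthetic (an assumption, not the car-stopping data from the scatterplot), and the slope and intercept come from the standard closed-form least squares formulas.

```python
import random

rng = random.Random(1)
# Synthetic x/y pairs, made up for illustration only
x = [float(i) for i in range(5, 26)]
y = [3.0 * xi - 10.0 + rng.gauss(0, 8) for xi in x]

def sse(slope, intercept):
    """Sum of squared vertical distances (residuals) from the points to a line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Closed-form ordinary least squares estimates for a single predictor
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

best = sse(slope, intercept)
# Nudge the line in several directions: every alternative does worse
nearby = [sse(slope + ds, intercept + di)
          for ds in (-0.5, 0.0, 0.5) for di in (-5.0, 0.0, 5.0)
          if (ds, di) != (0.0, 0.0)]
```

Every value in `nearby` is larger than `best`, which is what it means for the fitted line to minimize the sum of squared residuals.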

Now, another approach is ML, which stands for Maximum Likelihood

And this is when you choose parameters that make the observed data most likely

It sounds kind of weird, but I can demonstrate it, and it's based on a kind of local search

It doesn't always find the best; I like to think of it like the person here with a pair of binoculars, looking around them, trying hard to find something, but who could theoretically miss something

Let me give a very simple example of how this works

Let's assume that we're trying to find parameters that maximize the likelihood of this dotted vertical line here at 55, and I've got three possibilities

I've got my red distribution which is off to the left, blue which is a little more centered and green which is far to the right

And these are all identical, except they have different means, and by changing the means, you see there that the one that is highest where the dotted line is, is the blue one

And so, if the only thing we are doing is changing the mean, and we are looking at these three distributions, then the blue one is the one that has the maximum likelihood for this particular parameter

On the other hand, we could give them all the same mean, right around 50, and vary their standard deviations instead, and so they spread out different amounts

In this case, the red distribution is highest at the dotted vertical line and so it has the maximum value

Or if you want to, you can vary both the mean and the standard deviations simultaneously

And here green gets the slight advantage

Now this is really a caricature of the process because obviously you would just want to center it on the 55 and be done with it

The question is when you have many variables in your dataset

Then it's a very complex process of choosing values that can maximize the association between all of them

But you get a feel for how it works with this
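
The three-curve comparison can be imitated in code. The means and standard deviations below are invented so that the ordering matches the description of the red, blue and green curves; they are not read off the original figure.

```python
from statistics import NormalDist

def most_likely(candidates, observed=55):
    """Name of the candidate distribution with the highest density at the observed value."""
    return max(candidates, key=lambda name: candidates[name].pdf(observed))

# Same spread, different means (hypothetical values)
by_mean = {
    "red": NormalDist(mu=40, sigma=10),
    "blue": NormalDist(mu=50, sigma=10),
    "green": NormalDist(mu=70, sigma=10),
}

# Same mean of 50, different spreads (hypothetical values)
by_sd = {
    "red": NormalDist(mu=50, sigma=5),
    "blue": NormalDist(mu=50, sigma=2),
    "green": NormalDist(mu=50, sigma=15),
}

winner_by_mean = most_likely(by_mean)
winner_by_sd = most_likely(by_sd)
```

With these made-up parameters the blue curve wins when only the mean changes and the red curve wins when only the spread changes. A real maximum likelihood routine searches over all parameter values rather than picking among three candidates.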

The third approach, which is pretty common, is MAP, for Maximum A Posteriori

This is a Bayesian approach to parameter estimation, and what it does is add the prior distribution and then go through a sort of anchoring and adjusting process

What happens, by the way, is that stronger prior estimates exert more influence on the estimate, and that might mean, for example, a larger sample or more extreme values

And those have a greater influence on the posterior estimate of the parameters

Now, what's interesting is that all three of these methods connect with each other

Let me show you exactly how they connect

Ordinary least squares, OLS, is equivalent to maximum likelihood when it has normally distributed error terms

And maximum likelihood, ML, is equivalent to Maximum A Posteriori, or MAP, with a uniform prior distribution

To put it another way, ordinary least squares or OLS is a special case of Maximum Likelihood

And then maximum likelihood or ML is a special case of Maximum A Posteriori, and just in case you like it, we can put it into set notation

OLS is a subset of ML, which is a subset of MAP, and so there are connections between these three methods of estimating population parameters
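
One way to see ML as a special case of MAP is the simplest textbook setting: estimating a normal mean with a known standard deviation and a normal prior. This setting, the function name `map_mean` and all the numbers are assumptions picked for illustration; the lecture does not work through a specific model.

```python
def map_mean(data, sigma, prior_mean, prior_sd):
    """MAP estimate of a normal mean with known sigma and a normal prior.

    The estimate is a precision-weighted average of the prior mean and the
    sample mean, where precision is 1 / variance.
    """
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    return ((prior_precision * prior_mean + data_precision * sample_mean)
            / (prior_precision + data_precision))

data = [52.0, 55.0, 57.0, 54.0, 58.0]   # hypothetical observations
sample_mean = sum(data) / len(data)      # the maximum likelihood estimate

strong = map_mean(data, sigma=5, prior_mean=40, prior_sd=1)      # confident prior
moderate = map_mean(data, sigma=5, prior_mean=40, prior_sd=5)
weak = map_mean(data, sigma=5, prior_mean=40, prior_sd=1e6)      # nearly flat prior
```

The strong prior pulls the estimate well toward 40, the moderate prior pulls it less, and the nearly flat prior gives back the sample mean, which is the maximum likelihood answer.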

Let me just sum it up briefly this way

The standards that you use, OLS, ML, MAP, affect your choices and they determine which parameters best estimate what's happening in your data

Several methods exist, and there's obviously more than what I showed you right here, but many are closely related and under certain circumstances they're all identical

And so it comes down to exactly what your purposes are and what you think is going to work best with the data that you have, to give you the insight that you need in your own project

The next step we want to consider in our "Statistics and Data Science" are choices that we have to make

This one has to do with Measures of Fit, or the correspondence between the data that you have and the model that you create

Now, it turns out there are a lot of different ways to measure this, and one big question is how close is close enough, or how can you see the difference between the model and reality

Well, there's a few really common approaches to this

The first one is what's called R-squared (R²)

It has kind of a longer name: the coefficient of determination

There's a variation, adjusted R², which takes into consideration the number of variables

Then there's minus 2LL, which is based on the likelihood ratio, and a couple of variations

The Akaike Information Criterion or AIC and the Bayesian Information Criterion or BIC

Then there's also chi-squared; it's actually a Greek chi (χ), it looks like an x, but it's chi, and it's chi-squared

And so let's talk about each of these in turn

First off is R², this is the squared multiple correlation or the coefficient of determination

And what it does is it compares the variance of Y, so if you have an outcome variable, it looks at the total variance of that and compares it to the residuals on Y after you've made your prediction

The scores on R² range from 0 to 1 and higher is better
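
As a minimal sketch of that comparison, R² can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The outcome values below are hypothetical.

```python
def r_squared(observed, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

outcome = [3.0, 5.0, 4.0, 8.0, 10.0]          # hypothetical outcome values
mean_only = [sum(outcome) / len(outcome)] * len(outcome)

perfect_fit = r_squared(outcome, outcome)      # predictions match exactly
no_better_than_mean = r_squared(outcome, mean_only)
```

Predictions that match the data exactly give the top score of 1, while a model that only ever predicts the mean gives 0.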

The next is the -2 log-likelihood; that's the likelihood ratio, or like I just said, the -2 log-likelihood

And what this does is compare the fit of nested models, where you have a subset, then a larger set, then the largest set overall

This approach is used a lot in logistic regression when you have a binary outcome

And in general, smaller values are considered better fit

Now, as I mentioned there are some variations of this

I like to think of variations of chocolate

For the -2 log-likelihood there's the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), and what both of these do is adjust for the number of predictors

Because obviously if you have a huge number of predictors, you're going to get a really good fit

But you're probably going to have what is called overfitting, where your model is tailored too specifically to the data you currently have and doesn't generalize well

These both attempt to reduce the effect of overfitting
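
The adjustment is easy to see in the usual textbook formulas: AIC adds 2 for every estimated parameter to the -2 log-likelihood, and BIC adds the natural log of the sample size for every parameter. The -2LL values, parameter counts and sample size below are invented for illustration.

```python
import math

def aic(minus_2ll, k):
    """Akaike Information Criterion: -2LL plus a penalty of 2 per parameter."""
    return minus_2ll + 2 * k

def bic(minus_2ll, k, n):
    """Bayesian Information Criterion: -2LL plus a penalty of ln(n) per parameter."""
    return minus_2ll + k * math.log(n)

n = 100  # hypothetical number of cases
small_model = {"minus_2ll": 210.0, "k": 3}    # fewer predictors, slightly worse raw fit
large_model = {"minus_2ll": 205.0, "k": 10}   # more predictors, slightly better raw fit

aic_small, aic_large = aic(**small_model), aic(**large_model)
bic_small, bic_large = bic(n=n, **small_model), bic(n=n, **large_model)
```

On raw -2LL the larger model looks better, but once the penalties are added both criteria prefer the smaller model in this made-up case, and BIC penalizes the extra predictors more heavily than AIC does.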

Then there's chi-squared again

It's actually a lowercase Greek chi (χ), which looks like an x, and chi-squared is used for examining the deviations between two datasets

Specifically, between the observed dataset and the expected values or the model you create, where you expect this many frequencies in each category
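
A minimal sketch of the chi-squared statistic for observed versus expected frequencies follows. The category counts are hypothetical, and turning the statistic into a p-value (which needs the chi-squared distribution) is left out.

```python
def chi_squared(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed_counts = [18, 22, 30, 30]   # hypothetical counts in four categories
expected_counts = [25, 25, 25, 25]   # what an equal-frequency model would expect

statistic = chi_squared(observed_counts, expected_counts)
```

The statistic is zero when the observed counts match the expected ones exactly and grows as they drift apart.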

Now, I'll just mention that, as when I go into the store, there's a lot of other choices, but these are some of the most common standards, particularly the R²

And I just want to say, in sum, there are many different ways to assess the fit or correspondence between a model and your data

And the choices affect the model, you know, especially: are you getting penalized for throwing in too many variables relative to your number of cases? Are you dealing with a quantitative or binary outcome? Those things all matter, and so the most important thing, as always, my standing advice, is keep your goals in mind and choose a method that seems to fit best with your analytical strategy and the insight you're trying to get from your data

"Statistics and Data Science" offers a lot of different choices

One of the most important is going to be feature selection, or the choice of variables to include in your model

It's sort of like confronting this enormous range of information and trying to choose what matters most

Trying to get the needle out of the haystack

The goal of feature selection is to select the best features or variables, get rid of uninformative or noisy variables, and simplify the statistical model that you are creating, because that helps avoid overfitting, or getting a model that works too well with the current data and works less well with other data

The major problem here is Multicollinearity, a very long word

That has to do with the relationship between the predictors in the model

I'm going to show it to you graphically here

Imagine here, for instance, we've got a big circle to represent the variability in our outcome variable; we're trying to predict it

And we've got a few predictors

So we've got Predictor #1 over here and you see it's got a lot of overlap, that's nice

Then we've got Predictor #2 here; it also has some overlap with the outcome, but it also overlaps with Predictor #1

And then finally down here, we've got Predictor #3, which overlaps with both of them

And the problem arises from the overlap between the predictors and the outcome variable

Now, there's a few ways of dealing with this, and some of these are pretty common

So for instance, there's the practice of looking at probability values in regression equations, there's standardized coefficients, and there's variations on sequential regression

There are also newer procedures for disentangling the association between the predictors

There's something called Commonality Analysis, there's Dominance Analysis, and there are Relative Importance Weights

Of course there are many other choices in both the common and the newer, but these are just a few that are worth taking a special look at

First is P values or probability values

This is the simplest method, because most statistical packages will calculate probability values for each predictor and they will put little asterisks next to them

And so what you're doing is you're looking at the p-values, the probabilities for each predictor, or more often the asterisks next to them, which sometimes gives it the name of Star Search

You're just kind of cruising through a large output of data, just looking for the stars or asterisks

This is fundamentally a problematic approach for a lot of reasons

The problem here is you're looking at each predictor individually, and that inflates false positives

Say you have 20 variables

Each is entered and tested with an alpha, or false positive rate, of 5%

You end up with nearly a 65% chance of at least one false positive in there
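
The arithmetic behind that figure is short enough to check directly. The sketch assumes the 20 tests are independent, which is what the simple calculation requires; with correlated predictors the exact number would differ.

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

p_any = chance_of_any_false_positive(alpha=0.05, n_tests=20)  # about 0.64
```

Each test on its own keeps a 5% false positive rate, but the chance that all 20 stay clean is only 0.95 to the 20th power, roughly 36%, which leaves about a 64% chance of at least one false alarm.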

That's also distorted by sample size, because with a large enough sample anything can become statistically significant

And so, relying on p-values can be a seriously problematic approach

A slightly better approach is to use Betas, or standardized regression coefficients, and this is where you put all the variables on the same scale

So, usually they're standardized around zero, and then either to a range of minus 1 to plus 1 or to a standard deviation of 1

The trick is, though, they're still in the context of each other and you can't really separate them, because those coefficients are only valid when you take that group of predictors as a whole

So, one way to try and get around that is to do what they call stepwise procedures

Where you look at the variables in sequence; there's several versions of sequential regression that'll allow you to do that

You can put the variables into groups or blocks and enter them in blocks and look at how the equation changes overall

You can examine the change in fit in each step

The problem with a stepwise procedure like this is that it dramatically increases the risk of overfitting, which again is a bad thing if you want to generalize from your data

And so, to deal with this, there is a whole collection of newer methods; a few of them include commonality analysis, which provides separate estimates for the unique and shared contributions of each variable

Well, that's a neat statistical trick, but the problem is it just moves the problem of disentanglement to the analyst, so you're really not better off than you were, as far as I can tell

There's dominance analysis, which compares every possible subset of predictors

Again, sounds really good, but you have the problem known as the combinatorial explosion

If you have 50 variables that you could use, and there are some datasets that have millions of variables, with 50 variables you have over 1 quadrillion possible combinations; you're not going to finish that in your lifetime

And it's also really hard to get things like standard errors and perform inferential statistics with this kind of model

Then there's also something that's even more recent than these others, and that's called relative importance weights

And what that does is create a set of orthogonal predictors, uncorrelated with each other, basing them off of the originals, and then it predicts the scores, and then it can predict the outcome without the multicollinearity because these new predictors are uncorrelated

It then rescales the coefficients back to the original variables; that's the back-transform

Then from that it assigns relative importance, or a percentage of explanatory power, to each predictor variable

Now, despite this very different approach, it tends to have results that resemble dominance analysis

It's actually really easy to do with a website; you just plug in your information and it does it for you

And so that is yet another way of dealing with the problem of multicollinearity and trying to disentangle the contribution of different variables

In sum, let's say this

What you're trying to do here is choose the most useful variables to include in your model

Make it simpler, be parsimonious

Also, reduce the noise and distractions in your data

And in doing so, you're always going to have to confront the ever-present problem of multicollinearity, or the association between the predictors in your model, with several different ways of dealing with that

The next step in our discussion of "Statistics and the Choices you have to Make" concerns common problems in modeling

And I like to think of this as the situation where you're up against the rock and the hard place, and this is where the going gets very hard

Common problems include things like Non-Normality, Non-Linearity, Multicollinearity and Missing Data

And I'll talk about each of these

Let's begin with Non-Normality

Most statistical procedures like to deal with nice symmetrical, unimodal bell curves; they make life really easy

But sometimes you get a really skewed distribution or you get outliers

Skews and outliers, while they happen pretty often, are a problem because they distort measures; the mean, for instance, gets thrown off tremendously when you have outliers

And they throw off models because the models assume the symmetry and the unimodal nature of a normal distribution

Now, one way of dealing with this, as I've mentioned before, is to try transforming the data, taking the logarithm, or trying something else
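
Here is a small sketch of both points: how a single outlier drags the mean away from the median, and how taking logarithms pulls the two back together. The eight values are made up for illustration.

```python
import math
import statistics

values = [20, 22, 25, 27, 30, 31, 35, 900]   # hypothetical data with one extreme outlier

raw_mean = statistics.mean(values)
raw_median = statistics.median(values)

# Log transform: compresses large values much more than small ones
logged = [math.log(v) for v in values]
log_mean = statistics.mean(logged)
log_median = statistics.median(logged)
```

On the raw scale the mean sits several times higher than the median because of the one large value; on the log scale the mean and median are much closer, which is why a transformation is often the first thing to try with skewed data.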

But another problem may be that you have mixed distributions; if you have a bimodal distribution, maybe what you really have here is two distributions that got mixed together, and you may need to disentangle them through exploring your data a little bit more

Next is Non-Linearity

The gray line here is the regression line; we like to put straight lines through things because it makes the description a lot easier

But sometimes the data is curved, and here you have a perfect curved relationship, but a straight line doesn't work with that

Linearity is a very common assumption of many procedures especially regression

To deal with this, you can try transforming one or both of the variables in the equation, and sometimes that manages to straighten out the relationship between the two of them

Also, using Polynomials

Things that specifically include curvature, like squared and cubed values, can help as well
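
A tiny illustration of why a squared term helps: with a perfectly curved (here, exactly quadratic) relationship, a straight line in x explains nothing, while a straight line in x² explains everything. The data are synthetic and noise-free on purpose, and `r_squared` here is simply the squared correlation between one predictor and the outcome.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R² for a one-predictor straight-line fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)

x = [float(v) for v in range(-5, 6)]     # synthetic predictor, symmetric around zero
y = [v ** 2 for v in x]                  # a perfect curved relationship

r2_straight = r_squared(x, y)                          # straight line in x
r2_with_square = r_squared([v ** 2 for v in x], y)     # straight line in x squared
```

Real data will rarely be this clean, but the contrast shows how a transformed or polynomial predictor can capture curvature that a straight line misses.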

Then there's the issue of multicollinearity, which I've mentioned previously

This is when you have correlated predictors, or rather the predictors themselves are associated with each other

The problem is, this can distort the coefficients you get in the overall model

Some procedures, it turns out, are less affected by this than others, but one overall way of dealing with this might be to simply try and use fewer variables

If they're really correlated maybe you don't need all of them

And there are empirical ways to deal with this, but truthfully, it's perfectly legitimate to use your own domain expertise and your own insight into the problem

To use your theory to choose among the variables that would be the most informative

Part of the problem we have here is something called the Combinatorial Explosion

This is where combinations of variables or categories grow too fast for analysis

Now, I've mentioned something about this before

If you have 4 variables and each variable has two categories, then you have 16 combinations; fine, you can try things 16 different ways

That's perfectly doable

If you have 20 variables with five categories each, and again that's not too unlikely, you have 95 trillion combinations; that's a whole other ball game, even with your fast computer
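
These counts are just powers, so they are easy to verify. The helper name `combination_count` is made up for this sketch; the last line checks the earlier figure for subsets of 50 variables, where each variable is either in or out.

```python
def combination_count(n_variables, n_categories):
    """Number of distinct combinations when each variable takes one of n_categories values."""
    return n_categories ** n_variables

small = combination_count(4, 2)      # 16
large = combination_count(20, 5)     # 95,367,431,640,625 (about 95 trillion)
subsets_of_50 = 2 ** 50              # 1,125,899,906,842,624 (over 1 quadrillion)
```

Adding one more five-category variable multiplies the total by five, which is why the numbers get out of hand so quickly.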

A couple of ways of dealing with this; #1 is with theory

Use your theory and your own understanding of the domain to choose the variables or categories with the greatest potential to inform

You know what you're dealing with, rely on that information

Second is, there are data-driven approaches

You can use something called a Markov chain Monte Carlo model to explore the range of possibilities without having to explore each and every single one of your 95 trillion combinations

Closely related to the combinatorial explosion is the curse of dimensionality

This is when you have phenomena, you've got things that may only occur in higher dimensions or variable sets

Things that don't show up until you have these unusual combinations

That may be true of a lot of how reality works, but the project of analysis is simplification

And so you've got to try to do one or two different things

You can try to reduce

Mostly that means reducing the dimensionality of your data

Reduce the number of dimensions or variables before you analyze

You're actually trying to project the data onto a lower-dimensional space, the same way you try to get a shadow of a 3D object

There's a lot of different ways to do that

There's also data driven methods

And the same method here, a Markov chain Monte Carlo model, can be used to explore a wide range of possibilities

Finally, there is the problem of Missing Data, and this is a big problem

Missing data tends to distort analysis and creates bias if it's a particular group that's missing

And so when you're dealing with this, what you have to do is actually check for patterns in missingness; you create new variables that indicate whether or not a value is missing, and then you see if that is associated with any of your other variables

If there are no strong patterns, then you can impute missing values

You can put in the mean or the median, you can do Regression Imputation, something called Multiple Imputation, a lot of different choices
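
The two steps, flagging missingness and then imputing, might look like this in plain Python. The records, field names and helper functions are hypothetical, and median imputation is used only because it is the simplest of the options mentioned.

```python
import statistics

# Hypothetical records; None marks a missing income value
rows = [
    {"age": 23, "income": 31.0},
    {"age": 35, "income": None},
    {"age": 41, "income": 52.0},
    {"age": 29, "income": 38.0},
    {"age": 62, "income": None},
    {"age": 50, "income": 61.0},
]

def add_missing_flag(rows, field):
    """Return copies of the rows with a 0/1 indicator for missingness in field."""
    return [{**row, field + "_missing": int(row[field] is None)} for row in rows]

def impute_median(rows, field):
    """Return copies of the rows with missing values replaced by the observed median."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = statistics.median(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

flagged = add_missing_flag(rows, "income")

# Step 1: is missingness related to another variable, such as age?
age_if_missing = statistics.fmean(r["age"] for r in flagged if r["income_missing"])
age_if_present = statistics.fmean(r["age"] for r in flagged if not r["income_missing"])

# Step 2: only if no strong pattern shows up, fill in the gaps
completed = impute_median(flagged, "income")
```

If the two age averages differed a lot in real data, that would be a warning that a particular group is missing and that simple imputation could bias the results.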

And those are all technical topics, which we will have to talk about in a more technically oriented series

But for right now, in terms of the problems that can come up during modeling, I can summarize it this way

#1, check your assumptions at every step

Make sure that the data have the distribution that you need, check for the effects of outliers, check for ambiguity and bias

See if you can interpret what you have, and use your analysis, use data-driven methods, but also your knowledge of the theory and the meaning of things in your domain, to inform your analysis and find ways of dealing with these problems

As we continue our discussion of "Statistics and the Choices that are Made", one important consideration is Model Validation

And the idea here is that as you are doing your analysis, are you on target? More specifically, the model that you create through regression or whatever you do, your model fits the sample beautifully, you've optimized it there

But will it work well with other data? Fundamentally, this is the question of Generalizability, also sometimes called Scalability

Because you are trying to apply it in other situations, and you don't want to get too specific or it won't work in other situations

Now, there are a few general ways of dealing with this and trying to get some sort of generalizability

#1 is Bayes; a Bayesian approach

Then there's Replication

Then there's something called Holdout Validation, then there is Cross-Validation

I'll discuss each one of these very briefly in conceptual terms

The first one is Bayes, and the idea here is you want to get what are called Posterior Probabilities

Most analyses give you the probability of the data given the hypothesis, so you have to start with an assumption about the hypothesis

But instead, it's possible to flip that around by combining it with a special kind of data to get the probability of the hypothesis given the data

And that is the purpose of Bayes' theorem, which I've talked about elsewhere
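
As a reminder of how the flip works, here is Bayes' theorem for a single hypothesis, with the prior probability playing the role of the extra information that gets combined with the data. All three input probabilities are hypothetical.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(hypothesis | data) from the prior and the two likelihoods, via Bayes' theorem."""
    evidence = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / evidence

# Hypothetical inputs: a 10% prior, data that are likely if the hypothesis
# is true (0.80) and unlikely if it is false (0.05)
p_h_given_data = posterior(prior=0.10, p_data_given_h=0.80, p_data_given_not_h=0.05)
```

With these numbers the probability of the hypothesis rises from 0.10 before seeing the data to 0.64 afterward, which is a different quantity from the 0.80 probability of the data given the hypothesis.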

Another way of finding out how well things are going to work is through Replication

That is, do the study again

It's considered the gold standard in many different fields

The question is whether you need an exact replication or a conceptual one that is similar in certain respects

You can argue both ways, but one thing you do want to do when you do a replication is actually combine the results

And what's interesting is the first study can serve as the Bayesian prior probability for the second study

So you can actually use meta-analysis or Bayesian methods for combining the data from the two of them

Then there's hold out validation

This is where you build your statistical model on one part of the data and you test it on the other

I like to think of it as the eggs in separate baskets

The trick is that you need a large sample in order to have enough to do these two steps separately

On the other hand, it's also used very often in data science competitions, as a way of having a sort of gold standard for assessing the validity of a model
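
The splitting step of holdout validation can be sketched as follows. The function name `holdout_split`, the 25% test fraction and the stand-in data are all assumptions for illustration; any model fitting would happen on the training part only.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

cases = list(range(100))                 # stand-in for 100 rows of data
train_rows, test_rows = holdout_split(cases)
```

The model is built on `train_rows` and then scored once on `test_rows`, which it has never seen; that separation is what makes the test score a fair check on generalizability.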

Finally, I'll mention just one more, and that's Cross-Validation

Where you use the same data for training and for testing or validating

There's several different versions of it, and the idea is that you're not using all the data at once, but you're kind of cycling through and weaving the results together

There's Leave-one-out, where you leave out one case at a time, also called LOO

There's Leave-p-out, where you leave out a certain number at each point

There's k-fold, where you split the data into, say for instance, 10 groups, and you leave out one and you develop it on the other nine, then you cycle through

And there's repeated random subsampling, where you use a random process at each point

Any of those can be used to develop the model on one part of the data and test it on another, and then cycle through to see how well it holds up in different circumstances
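
The cycling in k-fold cross-validation can be sketched with row indices alone. The generator name `k_fold_indices` and the choice of 50 cases are illustrative assumptions; fitting and scoring a model inside the loop is left out.

```python
import random

def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists so that each case is held out exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal groups
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(50, k=10))
```

Each of the 10 splits develops the model on nine groups and tests it on the remaining one. Setting k equal to the number of cases gives leave-one-out as a special case.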

And so in sum, I can say this about validation

You want to make your analysis count by testing how well your model holds up from the data you developed it on to other situations

Because that is what you are really trying to accomplish

This allows you to check the validity of your analysis and your reasoning, and it allows you to build confidence in the utility of your results

To finish up our discussion of "Statistics and Data Science" and the choices that are involved, I want to mention something that really isn't a choice, but more an attitude

And that's DIY, that's Do It Yourself

The idea here is, you know really you just need to get started

Remember, data is democratic

It's there for everyone, everybody has data

Everybody works with data either explicitly or implicitly

Data is democratic, so is Data Science

And really, my overall message is You can do it! You know, a lot of people think you have to be this cutting edge, virtual reality sort of thing

And it's true, there's a lot of active development going on in data science, there's always new stuff

The trick however is, the software you can use to implement those things often lags

It'll show up first in programs like R and Python, but as far as it showing up in a point-and-click program, that could be years

What's funny, though, is often these cutting edge developments don't really make much of a difference in the results or the interpretation

They may in certain edge cases, but usually not a huge difference

So I'm just going to say analyst beware

You don't have to necessarily do them, it's pretty easy to do them wrong, and so you don't have to wait for the cutting edge

Now, that being said, I do want you to pay attention to what you are doing

A couple of things I have said repeatedly is "Know your goal"

Why are you doing this study? Why are you analyzing the data, what are you hoping to get out of it? Try to match your methods to your goal, be goal directed

Focus on the usability; will you get something out of this that people can actually do something with

Then, as I've mentioned with that Bayesian thing, don't get confused with probabilities

Remember that priors and posteriors are different things, just so you can interpret things accurately

Now, I want to mention something that's really important to me personally

And that is, beware the trolls

You will encounter critics, people who are very vocal and who can be harsh and grumpy and really just intimidating

And they can really make you feel like you shouldn't do stuff because you're going to do it wrong

But the important thing to remember is that the critics can be wrong

Yes, you'll make mistakes, everybody does

You know, I can't tell you how many times I have to write my code more than once to get it to do what I want it to do

But in analysis, nothing is completely wasted if you pay close attention

I've mentioned this before, everything signifies

Or in other words, everything has meaning

The trick is that meaning might not be what you expected it to be

So you're going to have to listen carefully and I just want to reemphasize, all data has value

So make sure you're listening carefully

In sum, let's say this: no analysis is perfect

The real question is not whether your analysis is perfect, but can you add value? And I'm sure that you can

And fundamentally, data is democratic

So, I'm going to finish with one more picture here and that is just jump right in and get started

You'll be glad you did

To wrap up our course "Statistics and Data Science", I want to give you a short conclusion and some next steps

Mostly I want to give a little piece of advice I learned from a professional saxophonist, Kirk Whalum

And he says "There's Always Something To Work On"; there's always something you can do to try things differently to get better

It works when practicing music, it also works when you're dealing with data

Now, there are additional courses here at datalabb.cc that you might want to look at

They are conceptual courses, additional high-level overviews on things like machine learning, data visualization and other topics

And I encourage you to take a look at those as well, to round out your general understanding of the field

There are also, however, many practical courses

These are hands-on tutorials on these statistical procedures I've covered, and you learn how to do them in R, Python and SPSS and other programs

But whatever you're doing, keep this other little piece of advice from writers in mind, and that is "Write what you know"

And I'm going to say it this way

Explore and analyze and delve into what you know

Remember when we talked about data science and the Venn Diagram, we've talked about the coding and the stats

But don't forget this part on the bottom

Domain expertise is just as important to good data science as the ability to work with computer coding and the ability to work with the numbers and quantitative skills

But also, remember this

You don't have to know everything, your work doesn't have to be perfect

The most important thing is just get started, you'll be glad you did

Thanks for joining me and good luck!