NTBB: Stats



VoodooMike
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

koadah wrote:If we run the stripped down stats and they show that the rosters fit the tiers then that would be good enough for me.
With very few exceptions that is already the case, assuming we're applying the win% to the overall data. I don't think that's a good way to do it, but if we DO do it that way then most teams fall right where they should. That doesn't mean that some aren't outside of their tier range at low TVs, or some are outside of it at high TVs, making them "unbalanced" in that range... which is why I think we should be looking to keep them inside their range, when facing similar TV teams, all the time. When there are large TV differences, we'll expect to see higher variation.
koadah wrote:If we are concentrating on early games then is there any evidence that even the CPOMB changes are necessary/desired?
There has never been any evidence that changes are necessary, regardless. The best argument for making changes to that combo is that it's a bit silly that players have a higher chance of being sent off the field if they're standing and getting hit by it, than if they're on the ground being fouled. Statistically, we haven't seen much effect from CPOMB.
koadah wrote:If we leave the longer term data in then it looks to me as though orcs need a buff not a nerf.
Isn't the longer term data the 'overall' that you objected to earlier? Do you mean longer term low TV only?
I mean that we know TV and TV difference have an effect on win%... number of games played does not... or rather, analysis will remove games played from the calculation if you do a stepwise regression because games played mildly correlates to TV, which is what has the effect on win%. What I am saying is that we should be looking at the vicissitudes of win% across the TV ranges and making changes based on that - bring the highs down, the lows up, such that we stay within the tier range as best as possible. You want short-term play then you look at low TV range play... if you try to use games played you're using a variable that has no direct effect on the variable we're looking at (win%), only a third-party variable relationship, which means you're going to end up with more variation than we actually have to deal with using the same data.
koadah wrote:I didn't think that we were looking at a TV range for NTBB. I thought it was number of games. Some teams will grow some won't. If zons min/max but can't beat the bigger teams then do they really need a nerf? I mention the ranges because that is what I have to hand.
Number of games is a dumb thing to subdivide the games by, for the reason I mentioned above... and above that, and above that, and above that. I'm not sure why people have such a hard time understanding this. Probably because they're not TRYING to understand it, they're just trying to find things to support what they already think.

As to whether the amazons need a nerf... I think they do need a low-TV nerf, and a higher-TV buff. They're a nightmare at low TVs, and no threat at all at high TVs - they should be a moderate threat at all TVs. It comes back to the hyperbolic example I used in the previous posts... a team that has a 100% win rate at 1000 TV, then gets steadily worse until they drop to 0% after about 1200. That may well average out to a 50% win rate, and thus, seem "balanced" in tier1... but what you really have is a roster that has no reason to be represented in 1200+ TV play, and which will make everyone frothingly unhappy at TV 1000 (other than the people playing them). Having a goal of a specific win% range using the total data is a meaningless goal if you know that there's very heavy variation along the TV path.
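The "average hides the shape" point above can be sketched in a few lines of Python. All the numbers here are invented for illustration (the hyperbolic 100%-to-0% roster, equal games per bucket):

```python
# Hypothetical numbers: a roster that wins everything at TV 1000 and
# nothing past TV 1200 can still average ~50% across its whole lifespan.
tv_buckets = [1000, 1050, 1100, 1150, 1200, 1250]
win_rate   = [1.00, 0.80, 0.60, 0.40, 0.20, 0.00]  # win% per TV bucket
games      = [500] * len(tv_buckets)               # games per bucket

overall = sum(w * g for w, g in zip(win_rate, games)) / sum(games)
print(f"overall win rate: {overall:.0%}")  # 50% -- 'balanced' on paper
```

The overall figure lands squarely in the tier 1 band even though the roster is never actually balanced at any single TV.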
koadah wrote:I don't think many people do call that min/maxing.
Funny, because that's pretty much all the FUMBBL people could talk about when we were discussing min/maxing. Seems like plenty of your brethren over there believe it is. I just can't get behind that concept - on the one hand they want to talk about "sweetspotting" for rosters, but if that so-called "sweet spot" happens to be at a low TV, then that team is just redefined as minmaxing for trying to stay at their "sweet spot". How dare coaches try to win?
koadah wrote:Retiring players who get a 2nd skill to keep TV low is what most are on about I think. Going 10/11 line girls to get blodgers for 70k instead of 90k for blitzers.
I don't see the issue... for an amazon team that 2nd skill *is* bloat, and linemen ARE better than blitzers in that regard. I don't disagree that the roster is badly designed because it can easily accomplish all that, but I don't think the coach is being a flower of a nondescript sort for playing the roster that way, nor is he picking on lowbie teams in some dickish fashion... his team is a lowbie team, its just that his roster happens to be ridiculously good in that fashion. Of course, that's back to the 100%/0% thing, right? Amazons still average out to being more or less inside their tier range, herp derp...
koadah wrote:In the Fumbbl [L]eague division I see amazons 52% overall and still only 56.25% under 1200 TV, compared to 59.38 & 62.89 in Box. And another shocker 59.99 & 63.16 in ranked. Cherry picking worse than min/maxing?
Forgive me if I'm wrong here, but... can't people invent their own everything in League, and set whatever rules they feel like? I'd be pretty skeptical of data pulled from a source that isn't required to play using the standard array of rosters and standard rules. Are you sure that the difference between the overall and sub1200 is actually statistically significant between Box and Ranked?

Hitonagashi
Star Player
Posts: 664
Joined: Mon Mar 07, 2011 5:11 pm

Re: NTBB: Stats

Post by Hitonagashi »

VoodooMike wrote:
koadah wrote:In the Fumbbl [L]eague division I see amazons 52% overall and still only 56.25% under 1200 TV, compared to 59.38 & 62.89 in Box. And another shocker 59.99 & 63.16 in ranked. Cherry picking worse than min/maxing?
Forgive me if I'm wrong here, but... can't people invent their own everything in League, and set whatever rules they feel like? I'd be pretty skeptical of data pulled from a source that isn't required to play using the standard array of rosters and standard rules. Are you sure that the difference between the overall and sub1200 is actually statistically significant between Box and Ranked?
While your general point is true (they can), in the last 2 years or so most of the joke leagues have died down to no membership.

I'd say a good 70-80% (based on personal observation of games played when I'm online) of the games played in L are in the 'competitive' leagues. I've heard more than a few top coaches say that the most competitive environment on FUMBBL is the leagues. I know when I was running in the WIL premiership for a while, I would be playing a top coach (the sort you find in last 16 of Majors) every game.

VoodooMike
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

Hitonagashi wrote:While your general point is true (they can), in the last 2 years or so most of the joke leagues have died down to no membership.
What we know is that [L]eague is not required to use CRP rules or rosters. You speculate that it's mostly CRP now. Do you think that's sufficient to treat it as accurate data on which to base conclusions about CRP?
Hitonagashi wrote:I'd say a good 70-80% (based on personal observation of games played when I'm online) of the games played in L are in the 'competitive' leagues. I've heard more than a few top coaches say that the most competitive environment on FUMBBL is the leagues. I know when I was running in the WIL premiership for a while, I would be playing a top coach (the sort you find in last 16 of Majors) every game.
And 60% of the time you're right every time? Your personal observation isn't serious support for the data - the entire point of this is that the numbers quite often do NOT agree with people's personal observations. This isn't about you specifically - I wouldn't take my OWN personal observations as serious data either. Many of those "top coaches" as you call them, honestly believe CPOMB is having a numeric effect on the success of teams that can field the combination... and such a thing has never been found.

Box and Ranked both have enforced CRP compliance, and they both report the same figures for Amazons in this case. The numbers are so close for those two datasets that there's a very good chance the difference isn't even statistically significant, much less practically significant, and that speaks volumes given that Ranked is not TV matching. I'd love to see the 95% CI on those figures, though if Koadah isn't calculating them from the raw data there's a decent chance they won't be available.
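A quick way to sanity-check the Box-vs-Ranked gap is a two-proportion z-test. The win% figures below are the ones quoted in the thread; the game counts are invented placeholders, since the real n isn't given here:

```python
from math import sqrt

def two_prop_z(p1, n1, p2, n2):
    """z statistic for H0: both samples share one underlying win%."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Box 59.38% vs Ranked 59.99% overall Amazon win%.
# n = 5000 games each is HYPOTHETICAL -- the real counts would be needed.
z = two_prop_z(0.5938, 5000, 0.5999, 5000)
print(f"z = {z:.2f}")  # |z| well under 1.96 -> not significant at 95%
```

Even with a few thousand games on each side, a 0.6-point gap doesn't come close to statistical significance.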

Hitonagashi
Star Player
Posts: 664
Joined: Mon Mar 07, 2011 5:11 pm

Re: NTBB: Stats

Post by Hitonagashi »

Box and Ranked also have enforced TV matching, which skews the data in its own way (due to sweetspotting). It does get relaxed in R, but the relaxation is more than countered by the tendency towards only accepting matchups that are not damaging to your team.

All I'm saying is that if you want a non-TV matched version of the data-set (which deals with what a real league would probably be like), then taking the top 5-10 leagues on FUMBBL and analysing the data from their games is almost certainly going to provide a better view on the races than taking R/B data.

If you look at WIL (the first that comes to mind), then you'll see a wide variety of teams/builds/TV, in an environment where every game counts.

plasmoid
Legend
Posts: 5334
Joined: Sun May 05, 2002 8:55 am
Location: Copenhagen
Contact:

Re: NTBB: Stats

Post by plasmoid »

Hi Mike,
VoodooMike wrote:The point is that there will be no change in the results by adding your data to it... by all means go do it, and take a look at the effect it has on the results (basically zero). An analogy would be having the average score of 1000 students in a school district, and someone saying "but wait, Suzy here scored really well... lets put her score in the mix and see if that changes things!". Do it, but thinking that it'll make things "interesting" just suggests you don't understand how aggregation works.
Interesting analogy as always.
Interesting how you grabbed 1 in 1000 out of thin air to make it sound better too.
In reality it would be 20-40 students.
But just to be absolutely clear: Is there any good reason to ignore one 40th of the data?
That rather jars with your call for more data below.
VoodooMike wrote:Unless the win% rises steadily as TV increases, then the roster has to have peaks. I'm unconvinced they're meaningful, especially since not every roster has such a pattern.
Not meaningful how?
If most (or all) rosters have peaks, and coaches try to get to those peaks, won't they have an advantage over teams they face at the same TV, if the opposing race has its peak somewhere else?
VoodooMike wrote:Riddle me this, plasmoid... if people run the lower power statistics and it is shown that the win%s for rosters do not support any alteration in order to bring them into line with the tiers they're placed in, or even that the lower tiers 95% confidence range extends higher than their tier would normally restrict them to... are you going to abandon NTBB?
First, why would I abandon NTBB completely? It has different elements.
But if we get reliable data that one (or more) teams are neither outside of their tier in overall stats, nor in the stats for short term play, then why would I want to use the modified rosters?
I have no vested interest in any one particular change.
VoodooMike wrote:We both know the answer is "no". You're looking for ways to cook the numbers to support what you already think, not to use the numbers to guide your actions... that's not statistics, that is deliberate deception.
You're totally entitled to your opinion.
VoodooMike wrote:You already asked about lowering the confidence interval... you're looking for confirmation of what you think, not information on what is actually happening.
Actually, as I think I explained, I was trying to understand how 'margin of error' works.
That way I can lay out on the website what can truthfully be said about the numbers and let visitors make up their own minds.
VoodooMike wrote:Number of games is a dumb thing to subdivide the games by, for the reason I mentioned above... and above that, and above that, and above that. I'm not sure why people have such a hard time understanding this.
I suspect because you don't make a very compelling case.
I'm interested in short term play because I believe - as you do - that some teams start out strong and finish weak, and those teams will look fine in the 'overall' stat, but could very well be short term "broken" (for lack of a better word).

You want to examine all TV-ranges. Fine. I wish you the best of luck. Someone statistically challenged like myself doesn't know how much data you'd need to gather for that to generate significant numbers, but you obviously do.

I'm trying for something less ambitious. I want to examine short term league play - (which I think has a loose relation to tournament play).
Examining short term play, I can't for the life of me understand why you'd muck up the data with long term play data that you just hope is going to behave in a similar way. Yes, I understand that it would add volume. And that it would add 'power' if - and only if - your added low-TV data happens to mimic short term play accurately. If not, it would destroy the reliability of the data. Adding data like that sounds like just another way to cook the numbers.
I think you'd first need to prove that long term low-TV teams are equally as powerful as short term low-TV teams before adding it into the model.

Here's where it breaks down for me:
You've previously said that the TV system is by no means an accurate portrayal of a team's power. I completely agree.
I assume you mean by that that, say, TV1250 isn't always the same thing. Again I agree.
IMO, a lot of teams are a lot less powerful when they first hit TV1250 (in what, 5 games) than if they take the time to refine their team, stack skills on the right players, trim away excess rerolls, players and even staff. It doesn't take many blodgers or (c)pombers to leave a 5-game opponent in a very tight spot.
So comparing an early play TV1250 team to a TV1250 groomed team - or either of those to a TV1500 that took a beating and randomly dropped to 1250TV seems to me to be a very poor idea.

Cheers
Martin

PS - oh, did you see the numbers I added up on page 1 near the bottom?
It has Undead and Amazon above tier 1. It stands to reason that if their overall is higher than 55% then they are extremely likely to have isolated periods in their lifespan even higher than that. As you know I don't know how to calculate margins of error, and I don't have the google-fu to pretend I know everything, so, how wide or narrow a margin comes with 21521 games and 13424 games?

Narrow Tier BB? http://www.plasmoids.dk/bbowl/NTBB.htm
Or just visit http://www.plasmoids.dk instead
VoodooMike
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

Hitonagashi wrote:Box and Ranked also have enforced TV matching, which skews the data as to the teams in it's own way (due to sweetspotting). It does get relaxed in R, but the relaxation is more than countered by the tendency towards only accepting matchups that are not damaging to your team.
You present a lot of your suspicions as facts, hitonagashi. You can't "counter" the difference between Box and Ranked - you can have different forces influencing the win%, but it's still quite interesting to note that despite those differences, the figures are less than a single percent different.

Now, since you and plasmoid clearly can't wrap your head around why it's better to use TV-matched data for this purpose, let me try to give you both a very quick lesson in statistics, such that you might finally understand why what you keep imagining would be "better" data for this, is actually significantly worse data... not to mention non-existent data.

First, lets introduce the hero of our story, the normal curve:

[image: the normal curve]

The normal curve (also called the "bell curve" by students) represents the distribution of all data within sampling. The mean of our sampling values is the center of the curve, represented in a standardized unit called the "Standard Deviation", meaning a standard unit of measurement of deviation from the mean. There's a formula to translate actual values into these SD units (also called "Z scores") but we both know you don't care what the formula is, so lets move on.

Variance, Power, and Error

In social sciences (basically anything that isn't drug testing, when it comes down to it) we talk about using a 95% Confidence Interval. What this means is that we've calculated that the ACTUAL VALUE for a population (which is to say, what we're looking for when we sample from that population) falls within certain bounds around our sample mean. The quick and dirty way to do this is to say that we know it falls within 2z of the mean (+/- 2 SD). That's how dode does it, and bookies, because it's very fast and takes no real calculation past figuring out the SD... it represents a bit more than 95% of the area, but its close enough that outside of science nobody cares.

What does this really mean? It means that we're 95% certain that the real value is in that range. The Mean value is not gospel, it is simply where our sample has clustered. The mean value is NOT the actual value, it is our sample mean and it serves primarily to anchor our range. If someone reports only the mean of a sample then the number is utterly useless as far as the population is concerned.

What causes the range? Why are there Standard Deviations from the mean at all? In samples, there is variance.. individual data points are rarely all the same, and even groups of datapoints will rarely be the same. This is what is called "statistical error" - it doesn't mean we did anything wrong, it just means these are sources of variance that we can't get rid of. We can't make all students identical. We can't make every day of the year give the same weather. We just deal with these fluctuations by calculating the range, and trying to let the variation even itself out over time.

The "over time" thing is really a way of saying "across a lot of sampling". Lets say we have a single die, but we have no idea how many sides it has, and we want to know what the average roll on that dice is. We know that when we roll a given die, it will give us a number from 1 to N, where N is the number of sides it has. This fact is a source of error because it contributes to the variation in the data points we're going to have... so how do we deal with it? We roll it *a lot*, and know that while each individual roll has plenty of expected variation, over time that variation will balance itself out. How much time? Depends on how accurate you want things to be.

Increasing your sample size, or controlling for variation, increases your statistical power... which really means your ability to FIND effects. Error of all sorts decreases your power, and decreases your ability to find effects. In simpler terms, more power means that each SD/Z represents a smaller number... more error means each SD/Z represents a larger number. Power decreases our range, error increases our range.
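The mystery-die example and the power/error trade-off can be sketched together in Python. The 8-sided die is an arbitrary stand-in for the unknown die:

```python
import random
from math import sqrt

random.seed(0)
HIDDEN_SIDES = 8  # the "experimenter" doesn't know this

def mean_with_ci(n):
    """Roll the mystery die n times; return sample mean and 95% CI half-width."""
    rolls = [random.randint(1, HIDDEN_SIDES) for _ in range(n)]
    m = sum(rolls) / n
    var = sum((r - m) ** 2 for r in rolls) / (n - 1)  # sample variance
    return m, 1.96 * sqrt(var / n)

for n in (100, 10_000, 1_000_000):
    m, hw = mean_with_ci(n)
    print(f"n={n:>9,}: mean = {m:.3f} +/- {hw:.3f}")
```

More rolls means more power means a narrower interval closing in on the true mean of 4.5; the per-roll variation never goes away, it just balances out across the sample.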

Statistical Significance

If you've watched statistics discussions before, you will have seen mention of things like "null hypothesis" and statistical significance. I'll explain these in terms of this particular discussion, and what has been explained above.

Lets say we're examining the Amazon roster's overall win%. As a Tier 1 team, it is stated that their win% should be between 45% and 55%. Since we're looking at whether or not this is true such that we can justify altering the roster we are really seeking to DISprove the existing assumptions. The existing assumption is the null hypothesis - that the amazon win% is, in fact, between 45% and 55%... we're going to see if the data supports the idea or not.

Now, lets say we find that the average win% in our data for Amazon teams is 59%... someone might say "well there we go, that proves its above the tier 1 range", but as I mentioned above, the straight mean is mean-ingless without the confidence interval. If the data happens to be 59% +/- 5%, then the lower part of our data's range is actually inside the 45-55% range, and we cannot say that the data supports the idea that the actual win% for the amazon roster is outside the expected range for a tier 1 team.. in statistical parlance, we'd say we cannot reject the null hypothesis.. because the difference between the expected value, and our value, is NOT statistically significant.

Only if the expected value (according to the null hypothesis) falls completely outside our range can we reject the null and say the data supports the idea that the roster does not fall within expected tier 1 values. That means that the difference between the expected value and the value in our data has achieved statistical significance.

In our example with 59% +/- 5% the 5% is our (arbitrarily chosen for this example, of course) 95% confidence interval or "margin of error" as some people have been calling it. What we're really saying is that our data says the win% for amazons is somewhere in 54% to 64%. The mean itself has no inherent value and should NOT be used for anything on its own.
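The whole null-hypothesis check can be sketched end to end. The counts here are invented (5,900 wins in 10,000 games, so the interval comes out much tighter than the ±5% of the running example):

```python
from math import sqrt

# INVENTED counts for illustration: 5,900 wins in 10,000 games (59%).
wins, games = 5_900, 10_000
p = wins / games
hw = 1.96 * sqrt(p * (1 - p) / games)   # binomial 95% CI half-width
lo, hi = p - hw, p + hw

tier_lo, tier_hi = 0.45, 0.55           # tier 1 expected win% band
reject = lo > tier_hi or hi < tier_lo   # is the whole CI outside the band?
print(f"95% CI: [{lo:.3f}, {hi:.3f}]  reject null: {reject}")
```

At this sample size the half-width is under 1%, so the entire interval sits above 55% and the null is rejected; shrink games to a few hundred and the interval widens back over the band, leaving nothing to act on.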

As we increase error by, for example, including new sources of variance... that 5% will become 6% then 7%, and so on. As we increase power the 5% will drop to 4%, then 3%, and so on.

Threats to Validity

This is where most of the arguing happens when we discuss statistics and statistical models. Confounds are things about our data that might make it more or less useful for examining whatever it is we're examining. The most hyperbolic example would be using the average number of apples eaten each day by our focus group in Iceland, to predict the number of oranges eaten each day in Australia. First off... apples aren't oranges, so we can wonder if apple consumption really relates in any way to orange consumption... and Iceland is not Australia, so we can wonder if the fruit consumption habits in each country are going to be similar.

Now, such things do not mean that the data is irrelevant - they are simply points for academic debate, and possibly for further study. There are, in fact, different types of validity.. but for our purposes we're mostly talking about "external validity", which is to say, how well will our model relate to the outside population we're trying to draw conclusions about. If there are too many confounds, then the conclusions you come to will likely be ignored as unsupported by the model. Think "vaccinations cause autism" - someone did publish a paper that "showed a link", but on examination of his dataset it was found that it was ridiculously flawed due to multiple major confounds.

Let me assure you, however, that the biggest, most damning confound of them all is "you have no data to support your claim", which applies to everything starting with statements like "well, in my experience..." or "I've played a lot of games, and I think...". Punch yourself in the mouth any time those words try to escape your lips.

Confounds are debate points... error is not... you can't debate away variation... it's going to affect your numbers whether you believe in it or not, which is why error is a lot easier to talk about.

------

Now, if you've managed to understand what has been said above, lets move on to the argument at hand:


Why MM data is better than League data

Sources of error and amount of data, plain and simple. Leagues have two additional sources of error that MM play lacks - composition, and TV difference. Now read what I said.. these are not confounds, they are sources of error, meaning they will increase the variation in the data and in doing so, widen the confidence interval. I will explain why each of those is a source of error, briefly.

There are 24 rosters in CRP (well, as far as we're concerned). Leagues are in no way required to contain all of those rosters, and when they do NOT contain all the rosters, we're adding variation to our data... because the win% in that league is UNaffected by the presence of the omitted rosters. If you have a league that lacks any Dwarf or Chaos Dwarf teams (the two rosters Amazons have the roughest time against) then the Amazon win% in that league will go up. In a league of all dwarf and chaos dwarf teams except for a single amazon team, their win% will probably be wretched. Across enough leagues and enough games, this will work itself out... but it requires proportionally more data to do so.

Lets look at "WIL" as Hitonagashi suggested. 19 Rosters are represented in the league (no Elf, Pact, Halfling, Goblin, or Ogre teams), and several rosters that are represented have only one or two teams (Chaos Dwarf, Dwarf, Khemri, Norse, Nurgle, Slann, Underworld, Vampire, Wood Elf). The win% of a given roster in WIL, then, is unaffected by the missing rosters, and more heavily affected by the skill of the coach of the poorly represented rosters. This will result in even wider intervals.

So far every examination of TV difference has displayed an inverse relationship between TV difference and underdog win%, across all rosters. This means, in essence, that all examinations of TV difference have suggested that the win% for a roster goes up as their TV superiority over an opponent increases, and goes down as the TV difference favours the other guy. This is quite obviously a source of variation in win% (by definition!) which means it is a source of error. The more TV difference you allow in your data, the more variation you are introducing. Again, if you use enough datapoints this will work itself out in the end, but it requires more data to do so WITH higher TV differences, than it does with lower TV differences.

The more sources of error you include, the wider your intervals will be. The fewer datapoints you have, the wider your intervals will be. League data introduces more sources of error, and there is exponentially LESS data available. Your intervals are going to be very, very wide, which means you're very very unlikely to find any statistical significance, and thus will have no statistical justification for making any changes whatsoever. This applies to alterations WITHIN tiers as well, as you'd have to demonstrate that there's significant differences in the win% of two same-tier rosters, in order to justify nerfing one and/or buffing the other.

We know that the data you (plasmoid and hitonagashi) want to use will be much lower in power/higher in error. You think that MM data is confounded, but have no data to support that it produces different outcomes. This would be a good reason to take no action at all, but it is in no way an excuse to not use numbers and instead change things by feel. That is the equivalent of saying "Well, unless you can tell me conclusively why the universe came into being, I'm right in saying it was God".

By all means, go run the data (assuming you know how) and see what happens. I've been trying to tell you why your models will fail you... but as I say, it is no skin off my ass if you want to run them anyway, they'll just give you results that will not support the idea that NTBB is in any way necessary because you are almost certain to lack enough statistical power to find effects worthy of fixes.

If you have no properly collected and calculated data to support what you want to do, you're making shit up, plain and simple. While you can do that, some people (myself included) will point out that you are just making shit up, especially when you try to claim you're working with real numbers and data (which you aren't).

Hitonagashi
Star Player
Posts: 664
Joined: Mon Mar 07, 2011 5:11 pm

Re: NTBB: Stats

Post by Hitonagashi »

Great writeup.

I knew most of that at some level (I did a maths/CS degree, but my entire maths half concentrated on number theory and other 'useless' pursuits), but it was interesting to see it laid out. Thanks for putting the time in to write it up.

I could write a lot more on this topic, making arguments and counterpoints but I feel, even reading that, you've missed my entire point (as you no doubt feel about me).

If the source itself has a natural bias, is the data-set it produces valid for drawing conclusions about a source with no bias? For example, if you had a d6 that had a 22% chance of rolling a 2, would 10 million rolls of that dice give you enough information to tell you what the distribution of an unbiased dice would be?

This is the argument that me and Plasmoid are making (that it can't). If it can, then I'm interested to see the techniques used.

VoodooMike
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

Hitonagashi wrote:I could write a lot more on this topic, making arguments and counterpoints but I feel, even reading that, you've missed my entire point (as you no doubt feel about me).
Your "point" is that there are potential confounds - I didn't miss it, I addressed it. The confounds are speculation with no data to support that they make a difference - maybe it has a significant effect, maybe it does not. You feel it will, but what does that really mean? If you want to demonstrate that it does, you can absolutely collect and run comparison models on data you feel is less confounded, and try to show a statistically significant difference between that and the MM data.

You're feeling your way through statistics the same way plasmoid is feeling his way through game design.
Hitonagashi wrote:If the source itself has a natural bias, is the data-set it produces valid for drawing conclusions about a source with no bias? For example, if you had a d6 that had a 22% chance of rolling a 2, would 10 million rolls of that dice give you enough information to tell you what the distribution of an unbiased dice would be?
Assumed bias. Again, you need to wrap your head around the difference between what you think is true, and what is verifiably true.

As to your analogy, it is actually very useful, but in the opposite direction you imagined. If we were looking at the average dice roll of a d6, then the average dice roll of a so-called "unbiased d6" would actually be irrelevant, and outside the scope of inquiry. The population in that case would be all the rolls ever made with the d6 we were using (past, present, and future), while our sample would be all the rolls we'd recorded. After a sufficiently high number of rolls, we would note that the average roll was NOT 3.5, and eventually the confidence interval would shrink (because our sample size would increase, increasing our statistical power) to the point that the 95% CI did not cover 3.5, at which point we could say that there is a statistically significant reason to say that the dice itself is not a typical, balanced 1d6 (since we know the theoretical mean for such a creature).
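The biased-die argument can be run directly. The 22%-on-a-2 weighting comes from the example above; the remaining faces sharing the leftover probability equally, and the roll count, are arbitrary choices:

```python
import random
from math import sqrt

random.seed(1)
faces   = [1, 2, 3, 4, 5, 6]
weights = [0.156, 0.22, 0.156, 0.156, 0.156, 0.156]  # P(2) = 0.22

n = 200_000
rolls = random.choices(faces, weights=weights, k=n)
m = sum(rolls) / n
var = sum((r - m) ** 2 for r in rolls) / (n - 1)
hw = 1.96 * sqrt(var / n)

print(f"mean = {m:.3f} +/- {hw:.3f}")      # true mean is 3.404, not 3.5
print("CI covers 3.5?", m - hw <= 3.5 <= m + hw)
```

With enough rolls the interval stops covering 3.5, so we can reject "this is a typical, balanced d6" - and at no point did the data need an unbiased die to exist; the theoretical 3.5 was all the comparison required.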

What we could not justify doing is say "we don't have enough data, so lets assume the average is 2... we must fix/discard this die!" - that's what has happened with NTBB thus far, and is why the numbers people balk at NTBB's changes as related to its stated goals.

Finding that the dice does not give an average roll of 3.5 is exactly the same as what we'd be looking for in regard to the (supposed) principles guiding NTBB - we know what the value is supposed to be IN THEORY, but someone thinks that the theoretical value is incorrect, which is what statistical analysis is all about. The more power your model has, the better your ability to prove yourself correct in that thinking. The weaker the power of your model, the less ability you'll have to find a difference even if one exists.

So, you're trying to say that because you can envision a confound in the data, it should be automatically disregarded. Without supporting evidence your point is purely an academic exercise, and does not actually negate the value of the data in question. I'd be fascinated if you could support your theory with some numbers, but we both know you have none at the moment.
Hitonagashi wrote:This is the argument that me and Plasmoid are making (that it can't). If it can, then I'm interested to see the techniques used.
You and plasmoid have different arguments, actually. Plasmoid imagines that reducing the amount of data will provide a clearer picture, and that is what I attempted to explain away in the long post above. You think the data should be discarded for this purpose because it doesn't exactly match every aspect of the environment we'd like to be examining... and that's simply not a justified position (the discarding part... it's justified to look at all the differences between the sample and the population). Again, we have no data to support that theory, so it's just your feeling at the moment...

Would it be better to have 600,000,000 league games to use as data? Maybe! We don't have that. We have 100-200k of MM games to work from, and a pitifully small number of league games to work from... given that league data has more sources of error, you need even more data of that type, to get equal statistical power to the MM data.

So, work from whatever data you'd like, but understand that if you work from a small sample of high variance data, you'll basically never be able to statistically justify taking any action at all... which is why I asked plasmoid if he'd throw NTBB away if the numbers, using the (unnecessarily) subdivided data he's asking for, failed to show a statistically significant difference from the expected values. If not then what he's really trying to do is find numbers that support what he feels, and whenever the numbers fail to, he'll discard them. That's not research, that's religion.

Hitonagashi
Star Player
Star Player
Posts: 664
Joined: Mon Mar 07, 2011 5:11 pm

Re: NTBB: Stats

Post by Hitonagashi »

You didn't quite get my meaning there (though you got the point I was trying to make). Given two populations, if you have a model that describes population A, you cannot take the uncorrelated population B and use the data from that population to validate your model of A. It's like trying to measure dogs to work out facts about cats! Obviously, to some extent, this is reductio ad absurdum. MM is, at the heart of it, Blood Bowl, as are Leagues, so they have a core data set they share in common.

However, I feel that TV based matchmaking is fundamentally broken and not how the designers envisaged the game being played. I also think that, because of the ease of predicting opposing builds (or in the case of R, choosing favourable ones), any attempt to fix it will be easily gamed as well. As you say though, I haven't evidence for this, and to be quite honest, it's a thought experiment for me. My hypothesis is that if the L data was collected, the win percentages would be quite drastically different to R/B (by up to 5-10% in the case of the sweetspotted races such as Zons).

I would say there are around 9,000 CRP league games on FUMBBL to use, running an estimate by taking teams*seasons*games_in_season on the popular CRP leagues. There are also at least 5,000 Major games (probably nearer 10k), which I would class as competitive (but they have their own bias due to the lack of teams below 1700).

To me, inducements aren't an error bar, they are part of the game. Most of the inducements don't affect a single game...but they do affect an entire season (they inflict damage, rather than win games). To be honest though, what Plasmoid wants is up to him. To put this in a way that makes the 'numbers people' happier, I think before thousands of hours are put into analysing the B/R data-set in full for NTBB, the data should be compared to the L data to see if there is any statistically significant difference. I think there will be.

User avatar
VoodooMike
Emerging Star
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

Hitonagashi wrote:You didn't quite get my meaning there (though you got the point I was trying to make). Given two populations, if you have a model that describes population A, you cannot take the uncorrelated population B and use the data from that population to validate your model of A. It's like trying to measure dogs to work out facts about cats! Obviously, to some extent, this is reductio ad absurdum. MM is, at the heart of it, Blood Bowl, as are Leagues, so they have a core data set they share in common.
And again, your feelings stated as though they were facts. Do you have any data whatsoever to support the idea that, in spite of the fact that League play and MM play both use the same rules and the same rosters, there is NO COVARIANCE AT ALL across... well, whatever unspecified changing variable you're talking about (since correlation requires one)? You don't. You're talking shit. Again.

Do you have any data that shows that there is a statistically significant difference between results in League play, using unchanged CRP rules and rosters, and MM play? If your answer is no then everything you're saying and have been saying is just assumption based on your gut feelings.
Hitonagashi wrote:As you say though, I haven't evidence for this, and to be quite honest, it's a thought experiment for me. My hypothesis is that if the L data was collected, the win percentages would be quite drastically different to R/B (by up to 5-10% in the case of the sweetspotted races such as Zons).
You think the win%s will end up being statistically significant in their difference? I think it's highly unlikely. You'll have much larger variance in league play, which will make the CI range much wider. The mean may end up different, but as I explained above the mean has no inherent meaning without the CI ranges, and if the CI ranges for two groups overlap then the difference is, by definition, not statistically significant and we cannot, with our agreed upon level of confidence, say the two populations are at ALL different in that respect.
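As a sketch of the overlap point, with invented win counts (none of this is real FUMBBL data): a small, high-variance league sample gets a much wider interval than a big MM sample, so even a two-point gap in the means fails the overlap test.

```python
import math

def win_ci95(wins, games):
    """Normal-approximation 95% CI for a win proportion."""
    p = wins / games
    half = 1.96 * math.sqrt(p * (1 - p) / games)
    return p - half, p + half

# Big low-variance MM sample vs. a small league sample, two points apart.
mm_lo, mm_hi = win_ci95(26_000, 50_000)   # 52.0% over 50,000 games
lg_lo, lg_hi = win_ci95(1_080, 2_000)     # 54.0% over 2,000 games

# The small sample's interval is wide enough to overlap the big one's, so
# the 2-point gap between the means is not statistically significant here.
overlap = mm_lo <= lg_hi and lg_lo <= mm_hi
print(overlap)
```

CI-overlap is a conservative stand-in for a proper two-proportion z-test, but it is the criterion being used in the argument above.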

That said, I bet you won't lift a finger to collect or analyze data. This isn't a thought experiment, this is just random speculation on your part... thought experiments involve pure logic and deduction in the deliberate absence of pragmatic effect and data... that's not what you're doing.
Hitonagashi wrote:I would say there are around 9,000 CRP league games on FUMBBL to use, running an estimate by taking teams*seasons*games_in_season on the popular CRP leagues. There are also at least 5,000 Major games (probably nearer 10k), which I would class as competitive (but they have their own bias due to the lack of teams below 1700).
Awesome... serve 'em up. I'm certainly interested in what the actual data says. Not at all interested in what you or plasmoid guess.
Hitonagashi wrote:To me, inducements aren't an error bar, they are part of the game.
You obviously didn't read what I said in the long post above. Error is just another name for variance, and we know for a fact that TV difference results in variance. Regardless of what you "feel" or how things are "to you", statistically speaking they are error. Anything that causes variance is a source of error if it isn't EXACTLY the variable we're looking at.
Hitonagashi wrote:To be honest though, what Plasmoid wants is up to him.
Absolutely. If he were to call this "Plasmoid's BB rules" then that'd be fine. What he's doing, however, is claiming that he's making changes to adjust the incorrect numbers.... without actually using the numbers. It's dishonest and ridiculously arrogant on his part to assume that he can intuit the needs of the game better than the designers, without bothering to use data (because he doesn't know how, and refuses to learn).
Hitonagashi wrote:To put this in a way that makes the 'numbers people' happier, I think before thousands of hours are put into analysing the B/R data-set in full for NTBB, the data should be compared to the L data to see if there is any statistically significant difference. I think there will be.
Don't care what you think, only what you have numbers to support. B/R data has already been collected and analyzed in detail. I certainly have no objection to people collecting league data and running comparisons with MM data, but you need to have SOME data that supports the idea that the rosters you plan to alter actually require alteration in the first place, so any of this data collection and comparison quite rationally belongs in the preparation stages, not the "after I'm done making the changes" stage.

User avatar
VoodooMike
Emerging Star
Emerging Star
Posts: 434
Joined: Thu Oct 07, 2010 8:03 am

Re: NTBB: Stats

Post by VoodooMike »

plasmoid wrote:Interesting how you grabbed 1 in 1000 out of thin air to make it sound better too.
In reality it would be 20-40 students.
But just to be absolutely clear: Is there any good reason to ignore one 40th of the data?
What a ridiculous straw man. First off, the real-world stuff is utterly irrelevant. Second, what school district has only 40 students... !censored!, Tibet? Third, the point is that small amounts of data added to much larger amounts of data will have little relevant effect. If the average of 999 people was 50%, then even if you tossed in one person with 0% or 100%, the most you can shift that average by is a twentieth of a percentage point. Imagining that such a scenario will create a major shift is a failure to comprehend basic arithmetic.
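The arithmetic being gestured at, as a one-liner:

```python
# 999 datapoints averaging 50%, plus one extreme newcomer at 100%:
# the combined mean barely moves.
old_mean = 0.50
new_mean = (999 * old_mean + 1 * 1.00) / 1000
print(new_mean)  # 0.5005 - a shift of one twentieth of a percentage point
```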
plasmoid wrote:Not meaningful how?
I don't have to explain how something lacks meaning - that's asking someone to prove a negative. What meaning do they have? Especially considering not all teams have demonstrated such a spot.
plasmoid wrote:If most (or all) rosters have peaks, and coaches try to get to those peaks, won't they have an advantage over teams they face at the same TV, if the opposing race has it's peak somewhere else?
Not necessarily, no. You're making the (baseless) assumption that each roster's peak is higher than anything but the peak of another roster - there's no data to support that idea. Instead, these peaks represent TV ranges where the roster tends to have the highest win% for that roster... it doesn't mean that peak win% is higher than the win% for another roster at the same TV, even if that other roster's highest peak is elsewhere.
plasmoid wrote:First, why would I abandon NTBB completely? It has different elements.
If the data doesn't support the idea that the teams are significantly different from their tier's stated win% range, or that any given roster within a tier is different in its win% than other members of that tier, then there will quite literally be no need to "narrow the tiers" and, indeed, you'll have no data on which to base any tier-changing alterations. Certainly you can just dick around with the rules to suit your fancy, but let's call a spade a spade... even now, what you call NTBB is "plasmoid's dicking around + what plasmoid imagines will reshape the tiers" - you'd just be dropping off everything after the plus.
plasmoid wrote:You're totally entitled to your opinion.
It's an opinion I will continue to express while you claim that you're working with real data, and making logical, data-supported changes to change the numbers... which you're not using in the first place.
plasmoid wrote:I suspect because you don't make a very compelling case.
Just because you don't get it, doesn't mean it was stated unclearly.
plasmoid wrote:Examining short term play, I can't for the life of me understand why you'd muck up the data with long term play data that you just hope is going to behave in a similar way. Yes, I understand that it would add volume. And that it would add 'power' if - and only if - your added low-TV data happens to mimic short term play accurately. If not, it would destroy the reliability of the data. Adding data like that sounds like just another way to cook the numbers.
I think you'd first need to prove that long term low-TV teams are equally as powerful as short term low-TV teams before adding it into the model.
You...have...no...data. I repeat: you...have...no...data. We have a lot of MM data that has been collected and analyzed. To justify changes made to rosters with the idea of "fixing" their place in tiers and so on, you need to have data that shows they're not already there. If you can find statistically significant differences in the data that actually exists, then you have at the very least a foundation on which to base your changes. You're categorically rejecting the data that exists in favor of having no data at all, and imagining that's better. It's a mindlessly stupid position.
plasmoid wrote:Here's where it breaks down for me: blah blah blah blah
Ok, try to understand this: unless you think that the result will be a very significantly different mean (which is unlikely) then all you're going to do is widen the CI, which will make it harder for you to find statistical significance in any difference between that mean and the expected value. You're not going to make things "more accurate", you're going to make them less accurate, because you'll be casting a much wider statistical net... you're almost certain to catch the expected value and thus, lose statistical justification for making any changes. It...doesn't...help...you.
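A sketch of why shrinking the sample hurts, with invented numbers (per-game win/loss outcomes have a standard deviation near 0.5): the observed mean is identical in both cases, but only the large sample's CI is narrow enough to exclude the expected value.

```python
import math

def mean_ci95(mean, sd, n):
    """Normal-approximation 95% CI for a mean with known per-game SD."""
    half = 1.96 * sd / math.sqrt(n)
    return mean - half, mean + half

expected = 0.55   # the win% the tier says we should see
observed = 0.52   # the same observed mean in both cases

# Large pooled sample: the CI excludes 0.55, so the shortfall is demonstrable.
lo_big, hi_big = mean_ci95(observed, sd=0.50, n=20_000)
# Small subdivided sample: the CI swallows 0.55 - no change is justifiable.
lo_small, hi_small = mean_ci95(observed, sd=0.50, n=800)

print(lo_big <= expected <= hi_big, lo_small <= expected <= hi_small)
```

The big sample prints False (0.55 excluded: a statistically significant difference); the small one prints True (0.55 covered: no statistical justification for any change).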
plasmoid wrote:PS - oh, did you see the numbers I added up on page 1 near the bottom.
It has Undead and Amazon above tier 1. It stands to reason that if their overall is higher than 55% then they are extremely likely to have isolated periods in their lifespan even higher than that.
Without the confidence intervals, means are meaningless... you can draw no conclusions whatsoever from them. Imagine the following conversation:

The President: Have you decided on our proportional response to the attack on our embassy?
Gen. Plasmoid: Yes sir. We are dropping a bomb on their intelligence HQ in Baghdad.
The President: Good, what's the blast radius?
Gen. Plasmoid: Sir?
The President: How much area are we talking about? The destructive radius.
Gen. Plasmoid: I don't follow...
The President: 10 feet? 100 miles?
Gen. Plasmoid: Sure, it could be one of those.
The President: Well it's sort'v important that we know, don't you think?
Gen. Plasmoid: Why? We know exactly where the bomb will land.
The President: Well, if the radius is 10 feet we're probably accomplishing nothing. If it's 100 miles, we're killing millions of civilians.
Gen. Plasmoid: Ok....
The President: How do you not see this as important?
Gen. Plasmoid: Well, sir, you're not making a very compelling case...

plasmoid wrote: As you know I don't know how to calculate margins of error, and I don't have the google-fu to pretend I know everything, so, how wide or narrow a margin comes with 21521 games and 13424 games?
You can't base it on the totals alone; it's based on the collective variation in the data. Without the raw data nobody can calculate it for you, nor even guess.

User avatar
Shteve0
Legend
Legend
Posts: 2479
Joined: Thu May 07, 2009 10:15 am
Location: Wellington, New Zealand

NTBB: Stats

Post by Shteve0 »

VoodooMike wrote:
plasmoid wrote:First, why would I abandon NTBB completely? It has different elements.
If the data doesn't support the idea that the teams are significantly different from their tier's stated win% range, or that any given roster within a tier is different in its win% than other members of that tier, then there will quite literally be no need to "narrow the tiers" and, indeed, you'll have no data on which to base any tier-changing alterations. Certainly you can just dick around with the rules to suit your fancy, but let's call a spade a spade... even now, what you call NTBB is "plasmoid's dicking around + what plasmoid imagines will reshape the tiers" - you'd just be dropping off everything after the plus.
Boom.

User avatar
spubbbba
Legend
Legend
Posts: 2271
Joined: Fri Feb 01, 2008 12:42 pm
Location: York

Re: NTBB: Stats

Post by spubbbba »

VoodooMike wrote:So far every examination of TV difference has displayed an inverse relationship between TV difference and underdog win%, across all rosters. This means, in essence, that all examinations of TV difference have suggested that the win% for a roster goes up as their TV superiority over an opponent increases, and goes down as the TV difference favours the other guy. This is quite obviously a source of variation in win% (by definition!) which means it is a source of error. The more TV difference you allow in your data, the more variation you are introducing. Again, if you use enough datapoints this will work itself out in the end, but it requires more data to do so WITH higher TV differences, than it does with lower TV differences.
But where did you get that data from?

In R, B and MM it’s very rare to play matches with big TV differences, so surely those would have to come from league play. By your own standards, do we have enough games to show the advantages a higher TV gives, and whether this is mitigated by inducements and, if so, by how much?

Also I’m not sure how you’d factor in the differences in the metagame between TV-based matching and league play. The CRP rules were written for league play and NTBB is primarily concerned with this too.
The trouble with matching by TV and open play is that each game is effectively a one-off. If you play a 2000 vs 2000 game and win but lose a bunch of players, so end up at 1500 TV, your next game won’t be against another 2000 team but against someone roughly equal to you. That is why many argue that min-maxing isn’t an issue in leagues.

It’s why I don’t think tabletop tournaments are much use for looking at low TV balance, not only do they reset every game but also have lots of house rules. Things like being able to assign skills really makes a huge difference to some races, Lizardmen instantly spring to mind.

Another potential issue is that the open leagues have imbalances of teams just like leagues do. If you look at B it is dominated by bashers, so how would that be factored in?

koadah
Emerging Star
Emerging Star
Posts: 335
Joined: Fri Mar 25, 2005 5:26 pm
Location: London, UK

Re: NTBB: Stats

Post by koadah »

Nice posts, Mike. There is probably a little bit more understanding now but even so, I don't think things have moved on a whole lot from where I came in.
I don't know how many people actually have more faith in all this maths than in 'gut feeling, suck it and see'.
;)

Here's some more data though if anyone is interested. ;)

plasmoid
Legend
Legend
Posts: 5334
Joined: Sun May 05, 2002 8:55 am
Location: Copenhagen
Contact:

Re: NTBB: Stats

Post by plasmoid »

Hi VoodooMike,
thanks for the write-up on statistics. Very constructive.
What a ridiculous straw man. First off, the real-world stuff is utterly irrelevant. Second, what school district has only 40 students... ![snip] Imagining that such a scenario will create a major shift is a failure to comprehend basic arithmetic.
You honestly thought I was making a comment about school districts? OK...
Well I wasn't.
Let me summarize the conversation so far:
Me: I'll add up My Data (aka the BBRC data) + FOL + MM.
You: You'll do it wrong, meaning you're on drugs. And if you do it right, it's like 1/1000th of the data and won't make a difference.
Me: I've done it right already. And it's between a 20th and a 40th, depending on the team. Roughly a 30th in total - very far from a 1000th.
You: You don't understand arithmetic.

I think my asking "But just to be absolutely clear: Is there any good reason to ignore one 40th of the data?" was a serious clue.
The combined data is still there, 3rd post from the bottom of page 1, should you ever wish to check it out.
Plasmoid imagines that reducing the amount of data will provide a clearer picture, and that is what I attempted to explain away in the long post above.
Again, I'm a bit surprised that's what you got out of it.
I'm saying that if I add data from an entirely different population than the one I'm trying to make a conclusion about - and I expect the 'booster' population to have a different mean, then I'll be adding a factor (a confound?) that would give me distorted data.

Like trying to get a more precise number for the risk of being the subject of a violent crime in Copenhagen by adding in statistics for Baltimore.

Or to put it differently, I could borrow your lovely story about President VoodooMike and General Plasmoid:
it continues like this:
...
Gen. Plasmoid: OK. So how do we figure out the blast radius of the bomb on the plane ahead of time, then?
President: Easy, we blow up all the different bombs we have of that size.
Gen. Plasmoid: Really? The same size? So it doesn't matter if it's TNT, C-4 or Nuclear?
President: Hell no, we'll just calculate the average and go from there.
Gen. Plasmoid: Sir, that sounds wrong. The bomb on the plane is TNT. Shouldn't we focus on TNT then?
President: General Plasmoid, you're an idiot.
You...have...no...data. I repeat: you...have...no...data. We have a lot of MM data that has been collected and analyzed.
You think the win%s will end up being statistically significant in their difference? I think it's highly unlikely.
I think it is highly likely. Because the TV-system is flawed.
I'd rather collect the data that I know to be relevant, rather than the data you speculate is similar.
In fact, since you're the one wanting to add in a second population that is 'long term' to describe something that is 'short term', why don't you show some significant and statistically reliable data that it's all the same?

To Hitonagashi you said:
You obviously didn't read what I said in the long post above. Error is just another name for variance, and we know for a fact that TV difference results in variance. Regardless of what you "feel" or how things are "to you", statistically speaking they are error. Anything that causes variance is a source of error if it isn't EXACTLY the variable we're looking at.
Like Hitonagashi I'm not interested in one particular variable, I'm interested in the on-table total performance.
If a team is consistently a slow starter, not gaining skills, not getting good (rerolled) winnings then it will more often become an underdog. And will have to rely on inducements. That's how the game works. And if it consistently underperforms as such, then that is a balance problem. In league play.
Similarly, if a team is super strong out of the gates, rerolling winnings and picking up early powerful skill combos, getting the overdog advantage and overperforming then that's a balance problem. In league play. Not MM (as you'd just be getting different opponents).

(Coincidentally, I think the impact of inducements can be seriously overstated. With a TV difference of 1-9 the impact of inducements is rather limited - and we'll have a lot of those in short term play.)

I intended to use FOL and Box MM data, provided that both teams were in their first 10 games. I think that would be quite a lot of games. Such data could be similar to short term league games, so they might be relevant. But thinking things through, I do worry that those numbers would not reflect proper league play, even if the games could in principle have occurred in a short term league.
Either way I most certainly will not use data for teams that have spent 30+ games to morph into a very different version of a low-TV team than what the team looked like when it reached the same TV the first time around.
First, why would I abandon NTBB completely? It has different elements.
If the data doesn't support the idea that the teams are significantly different from their tier's stated win% range, or that any given roster within a tier is different in its win% than other members of that tier, then there will quite literally be no need to "narrow the tiers" and, indeed, you'll have no data on which to base any tier-changing alterations.
Even if the data for all 5 überteams were to put them within 45-55 for short term play, there would still be the CRP+10 list of house rules and the changes to the tier 2 and tier 3 teams.
If he were to call this "Plasmoid's BB rules" then that'd be fine.
I already said I'll give the site a look.
I'd have no problem sticking 'Plasmoid's' in the title. Why would I?
After all, these are house rules for anyone whose experience of the game matches mine.
I never said or intended for these rules to replace the CRP rules (other than in a house rule way).

Finally - this is what interests me:
You can't base it on total, it's based on the collective variation in the data. Without the raw data nobody can calculate it for you, nor even guess.
Right. You know the total number, and you know that all data is either W, T or L. And you know the mean.
What else do you require then? The specific win, draw and loss numbers? I've got those too.
Something else?
[I'm not being flippant. I'm asking. What would be required to calculate the SD/z value?]
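For what it's worth: if win% is tallied the usual way, with a tie counting as half a win, then the W/T/L counts ARE the raw data for that statistic, and the SD, CI and z-value fall straight out of them. A sketch, using a made-up split of the 21,521 games (the real W/T/L counts would go where these placeholders are):

```python
import math

def wtl_z(w, t, l, expected=0.50):
    """Mean win rate, 95% CI and z-score vs. an expected value, computed
    from win/tie/loss counts alone, scoring each game 1 / 0.5 / 0."""
    n = w + t + l
    m = (w + 0.5 * t) / n
    # Exact sample variance of the 1 / 0.5 / 0 outcomes from the counts.
    var = (w * (1 - m) ** 2 + t * (0.5 - m) ** 2 + l * m ** 2) / (n - 1)
    se = math.sqrt(var / n)
    return m, m - 1.96 * se, m + 1.96 * se, (m - expected) / se

# Hypothetical split of 21,521 games - NOT the actual record.
m, lo, hi, z = wtl_z(11_000, 2_000, 8_521)
print(round(m, 4), round(lo, 4), round(hi, 4), round(z, 1))
```

With the real numbers plugged in, |z| > 1.96 corresponds to significance at the 95% level against the expected tier value.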

Cheers
Martin

Narrow Tier BB? http://www.plasmoids.dk/bbowl/NTBB.htm
Or just visit http://www.plasmoids.dk instead