-
cloudops-ansible
data/missioncontrol build #191 deployed to dev
-
wlach
standups: fighting with csp settings, victory is mine
-
standups
Ok, submitted #56192 for standu.ps/user/wlach
-
wlach
chutten: do you have any theories on why content_shutdown_crashes is higher on beta than content_crashes?
data-missioncontrol.dev.mozaws.net/#/beta/windows irccloud.mozilla.com/file/b0ESinVt/image.png
-
wlach
possibly related to bug 1413172?
-
firebot
bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
-
chutten
It should never happen unless someone's messing with the system. content_shutdown_crashes should be a strict subset of content_crashes (modulo reporting delay)
-
chutten
(( I was pretty sure both C and S were reported on the same exact pings, but I suppose it's possible that they are reported on different pings ))
-
wlach
chutten: I will file a bug to investigate
-
frank
wlach: fyi filed bugs 1449281 and 1449283
-
wlach
I think I figured out the negative crash counts, it's the stupid negative aggregate we had a couple weeks back
-
wlach
er, s/negative crash counts/the case where content_shutdown was greater than content/
-
wlach
frank: awesome
-
wlach
nope, check that, lots of individual instances where content_shutdown is greater than content (same query)
-
wlach
chutten: is there anything we can do to identify and exclude content_shutdown crashes from the content crashes measure at the aggregation level?
-
chutten
wlach: On the client side we may be able to accumulate content-process crashes_with_dumps only so long as the notes don't contain an ipc_channel_error of "ShutDownKill"
-
chutten
But I'm not 100% certain of the order... whether we see SUBPROCESS_CRASHES_WITH_DUMP first or after SHUTDOWN_KILL_HARD
-
wlach
chutten: having to do this magic math on the client side is pretty frustrating, in addition to the data issues we seem to be having
-
chutten
By "client side" I mean "on Firefox"
-
wlach
yep, I got your meaning :)
-
chutten
If we can learn whether, for a given shutdownkill happening, the SUBPROCESS_CRASHES_WITH_DUMP and SHUTDOWN_KILL_HARD appear on the same ping or not, that would be huge
-
wlach
how does crash stats distinguish them? it looks like the numbers on the awsy dashboard (the left one) exclude content-shutdown
-
chutten
Because then, at aggregation, you could discard any pings that have more SHUTDOWN_KILL_HARD/ShutDownKill than SUBPROCESS_CRASHES_WITH_DUMP/content
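-
annotation
A minimal sketch of the aggregation-time filter chutten describes, assuming each ping has already been reduced to a dict of counts. The keys `content_crashes` and `content_shutdown_crashes` borrow the measure names used earlier in this log; the dict shape and both function names are hypothetical. The filter only makes sense if both counts land on the same ping, which is still an open question at this point in the conversation.

```python
def keep_consistent_pings(pings):
    """Keep only pings whose shutdown-kill count does not exceed their
    content-crash count. Missing keys are treated as zero (an assumption)."""
    kept = []
    for ping in pings:
        content = ping.get("content_crashes", 0)
        shutdown = ping.get("content_shutdown_crashes", 0)
        if shutdown <= content:
            kept.append(ping)
    return kept


def content_crashes_excluding_shutdown(pings):
    """Sum content crashes minus shutdown kills over the consistent pings."""
    return sum(
        ping.get("content_crashes", 0) - ping.get("content_shutdown_crashes", 0)
        for ping in keep_consistent_pings(pings)
    )
```

Discarding whole pings loses their legitimate content crashes too, so this trades coverage for consistency; whether that trade is acceptable depends on how many pings fail the check.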
-
chutten
The ones on the left are based on submitted crash reports
-
chutten
And magic
-
wlach
so we don't submit content_shutdown?
-
chutten
Users do. That's why IPC ShutDownKill is usually one of the top signatures.
-
wlach
so why are the numbers on awsy so low?
-
chutten
Because at most 4% of crashes are reported
-
wlach
at most 4% of the content shutdown crashes?
-
chutten
at most 4% of crashes, period. I'd have to dig up my old numbers, but that's the ratio of people who decide to hit "submit crash report" on the dialog that shows up.
-
chutten
I have no idea how many shutdown crashes we don't even pop a dialog for due to them happening at OS shutdown or something
-
wlach
chutten: that doesn't explain why browser crashes are actually higher than content; given the higher volume of the latter I would expect the reverse if what you're saying is true. unless there's something I'm missing
-
chutten
I really don't want to spend too much effort explaining the LHS of arewestableyet. It's the number of socorro-received crash _reports_ divided by ADI
-
chutten
Crash _reports_ are submitted when the user hits the "submit" button on the dialog or the in-content UI, or checks "always submit" in the prefs
-
chutten
The number of crash reports divided by the number of crash pings for the same period was at most 4%
-
chutten
(though it varied based on type of crash and day of week)
-
chutten
Oops, I made a mistake
-
chutten
It wasn't divided by the number of crash pings. I did that analysis before we had crash pings. It was divided by the crash telemetry we had at the time (SUBPROCESS_CRASHES_WITH_DUMP, SHUTDOWN_OK/false, and friends)
-
wlach
chutten: so to give you some context, the reason this is a problem is that I'm trying to get relman to transition off of awsy to use mission control. this week we put the mc numbers side-by-side with the awsy numbers, and we found that they're basically uncomparable due to the aforementioned content vs. content_shutdown crash issue
-
chutten
With the LHS of AWSY?
-
wlach
yes
-
wlach
that's what they're using right now
-
chutten
Good grief, they'll never line up
-
chutten
(unless users all, simultaneously, check the box that says "always submit crash reports")
-
chutten
MC is a meaningless number, by the way. MC-S is the relevant one. No one cares about shutdown kills besides you and me (evidence: the content shutdown crash rate of Fx59)
-
wlach
I don't think they need to line up exactly, but shouldn't they be roughly proportional?
-
chutten
Nope. When left to users' whims, they will submit a crash or not based on the flip of a coin
-
chutten
Wait. Which "they" should be roughly proportional?
-
wlach
I guess we can't assume that they will be
-
chutten
LHS of AWSY vs MC-S? or MC and MC-S?
-
wlach
for simplicity let's just say "browser crashes" on awsy vs. "main_crashes" in mission control
-
wlach
"browser crashes" on the LHS of MC
-
wlach
LHS of awsy I mean. too many acronyms!
-
chutten
Oh, and now it occurs to me that we're talking about two different "MC" as well
-
chutten
So. "browser crashes" from the LHS of AWSY vs. "main_crashes" in mission control
-
chutten
I _suppose_ a case could be made that a prevalent browser crash would be reported more... but users are fickle, so they may amplify crashes that don't happen frequently but instead make users -want- to submit them
-
wlach
yes. I understand that these are collected using different means and the denominator is different (ADI vs. usage_hours), but all things being equal I would expect the quantities to be proportional
-
chutten
Or, conversely, users could under-report prevalent crashes that they don't care about (see shutdown crashes)
-
chutten
But, yes, theoretically for worse product quality both should elevate and for more stable releases both should depress
-
chutten
And they should do so roughly in-step with each other. (though how long it takes crash reports and crash pings and main pings to all get to their respective endpoints is another long, boring discussion)
-
wlach
right, that's ok (they do comparisons over the entire length of a release, typically)
-
chutten
So, what are the problems? I see 1) Relman is holding on to AWSY, 2) Content Shutdown Crashes are insufficiently-well-explained
-
chutten
What problems do you see?
-
wlach
chutten: (1) is really more "wlach needs to convince relman that missioncontrol data is at least as accurate as what they're seeing on the left hand side of awsy".
-
wlach
so if it's showing something different from awsy, we need to be able to explain why
-
wlach
which brings us to (2)
-
wlach
I don't really know how to explain the fact that there are so many aggregates where content_shutdown is greater than content. it seems like we're not measuring something correctly
-
chutten
Yup. Thus my investigations into bug 1413172
-
firebot
bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
-
chutten
Since leaving ddurst's team I haven't really been able to prioritize this sort of analysis work
-
wlach
chutten: so basically I can't see us being able to transition to using mission control for measuring this stuff until that's either resolved or we determine that awsy is also giving completely bogus numbers
-
wlach
since I (think) content crashes is something we actually do care about, and it sounds like we don't have an accurate way of measuring it
-
chutten
Well, if your problems can be solved by showing how bad AWSY is... compare the count of PROCESS_CRASH_SUBMIT_SUCCESS/content-crash == True to the count of SUBPROCESS_CRASHES_WITH_DUMP/content
-
chutten
According to mzl.la/C0jvih about 5k content-crash crash reports were submitted
-
chutten
According to mzl.la/CV6ssD about 57.2k content processes crashed
-
chutten
This is Nightly, so I'm not surprised the ratio's closer to 10%
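-
annotation
The arithmetic behind "closer to 10%" can be checked directly from the two counts chutten quotes (about 5k submitted content-crash reports against about 57.2k content process crashes). This is only a worked example with those rounded figures; the function name is hypothetical.

```python
def submission_ratio(reports_submitted, crashes_counted):
    """Fraction of counted crashes that produced a submitted crash report."""
    if crashes_counted <= 0:
        raise ValueError("crashes_counted must be positive")
    return reports_submitted / crashes_counted


# Rounded figures quoted in the log for Nightly.
nightly_ratio = submission_ratio(5_000, 57_200)
print(f"{nightly_ratio:.1%}")  # prints 8.7%
```

That is roughly double the "at most 4%" figure quoted earlier, and it compares report submissions against SUBPROCESS_CRASHES_WITH_DUMP/content rather than against crash pings.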
-
chutten
So if they're okay with operating on a non-representative self-selected subsample of crashes, I guess we can shutter mission control and find something else to do?
-
wlach
I'm not sure, it's definitely a flawed methodology but might still be giving something closer to an accurate count than missioncontrol's
-
chutten
Certainly which crashes users care to tell us about is a useful filter to have
-
wlach
or at least an accurate proportion (since it will always only be counting a subset)
-
chutten
I mean, what does it matter if the browser crashes lots if users don't care?
-
chutten
(well, security will care, but...)
-
chutten
So... without all of my ranting... the tl;dr is we need to explain content shutdown numbers to ourselves and to others.
-
wlach
you mean figure out why they're sometimes greater than content?
-
chutten
That requires bug 1413172, which I'm not going to be able to work on for at least a week (by my current load)
-
firebot
bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
-
chutten
wlach: Yup. At this point I figure it's going to take a longitudinal study of some crashing clients' information flow
-
wlach
chutten: what a pain :( guess it'll be good to know what's going on though
-
chutten
So you have a shutdown crash. At what point do we see your crash pings? How many? On which main ping do we see SHUTDOWN_KILL_HARD? How many? On which main ping do we see SUBPROCESS_CRASHES_WITH_DUMP? How many?
-
chutten
I should write this in the bug
-
chutten
(er, SUBPROCESS_KILL_HARD, rather)
-
wlach
chutten: thanks for your help, I'll update some of the missioncontrol bugs accordingly
-
wlach
I'll also dump my query showing content_shutdown_crashes > content_crashes into the bug above