00:01:00 data/missioncontrol build #191 deployed to dev
00:04:30 standups: fighting with csp settings, victory is mine
00:04:30 Ok, submitted #56192 for https://www.standu.ps/user/wlach/
15:42:13 chutten: do you have any theories on why content_shutdown_crashes is higher on beta than content_crashes? https://data-missioncontrol.dev.mozaws.net/#/beta/windows https://irccloud.mozilla.com/file/b0ESinVt/image.png
15:42:35 possibly related to bug 1413172?
15:42:37 https://bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
15:59:23 It should never happen unless someone's messing with the system. content_shutdown_crashes should be a strict subset of content_crashes (modulo reporting delay)
16:00:01 (( I was pretty sure both C and S were reported on the same exact pings, but I suppose it's possible that they are reported on different pings ))
16:59:45 chutten: I will file a bug to investigate
18:42:18 wlach: fyi filed bugs 1449281 and 1449283
18:45:25 I think I figured out the negative crash counts, it's the stupid negative aggregate we had a couple weeks back
18:45:49 er, s/negative crash counts/the case where content_shutdown was greater than content/
18:45:51 https://sql.telemetry.mozilla.org/queries/52261/source
18:45:55 frank: awesome
18:57:48 nope, check that, lots of individual instances where content_shutdown is greater than content (same query)
19:00:05 chutten: is there anything we can do to identify and exclude content_shutdown crashes from the content crashes measure at the aggregation level?
19:03:41 wlach: On the client side we may be able to accumulate content-process crashes_with_dumps only so long as the notes don't contain an ipc_channel_error of "ShutDownKill"
19:04:25 But I'm not 100% certain of the order... whether we see SUBPROCESS_CRASHES_WITH_DUMP first or after SHUTDOWN_KILL_HARD
19:04:47 chutten: having to do this magic math on the client side is pretty frustrating, in addition to the data issues we seem to be having
19:04:59 By "client side" I mean "on Firefox"
19:05:11 yep, I got your meaning :)
19:06:35 If we can learn whether, for a given shutdownkill happening, the SUBPROCESS_CRASHES_WITH_DUMP and SHUTDOWN_KILL_HARD appear on the same ping or not, that would be huge
19:06:55 how does crash stats distinguish them? it looks like the numbers on the awsy dashboard (the left one) exclude content-shutdown
19:07:07 Because then, at aggregation, you could discard any pings that have more SHUTDOWN_KILL_HARD/ShutDownKill than SUBPROCESS_CRASHES_WITH_DUMP/content
19:07:25 The ones on the left are based on submitted crash reports
19:07:36 And magic
19:07:37 so we don't submit content_shutdown?
19:07:58 Users do. That's why IPC ShutDownKill is usually one of the top signatures.
19:08:29 so why are the numbers on awsy so low?
19:08:53 Because at most 4% of crashes are reported
19:09:22 at most 4% of the content shutdown crashes?
19:10:20 at most 4% of crashes, period. I'd have to dig up my old numbers, but that's the ratio of people who decide to hit "submit crash report" on the dialog that shows up.
19:11:14 I have no idea how many shutdown crashes we don't even pop a dialog for due to them happening at OS shutdown or something
19:12:09 chutten: that doesn't explain why browser crashes is actually higher than content; given the higher volume of the latter I would expect the reverse if what you're saying is true. unless there's something I'm missing
19:14:46 I really don't want to spend too much effort explaining the LHS of arewestableyet. It's the number of socorro-received crash _reports_ divided by ADI
19:15:24 Crash _reports_ are submitted when the user hits the "submit" button on the dialog or the in-content UI, or checks "always submit" in the prefs
19:15:55 The number of crash reports divided by the number of crash pings for the same period was at most 4%
19:16:19 (though it varied based on type of crash and day of week)
19:17:24 Oops, I made a mistake
19:18:18 It wasn't divided by the number of crash pings. I did that analysis before we had crash pings. It was divided by the crash telemetry we had at the time (SUBPROCESS_CRASHES_WITH_DUMP, SHUTDOWN_OK/false, and friends)
19:18:51 chutten: so to give you some context, the reason this is a problem is that I'm trying to get relman to transition off of awsy to use mission control. this week we put the mc numbers side-by-side with the awsy numbers, and we found that they're basically uncomparable due to the aforementioned content vs. content_shutdown crash issue
19:19:23 With the LHS of AWSY?
19:19:28 yes
19:19:34 that's what they're using right now
19:19:35 Good grief, they'll never line up
19:20:16 (unless users all, simultaneously, check the box that says "always submit crash reports")
19:21:22 MC is a meaningless number, by the way. MC-S is the relevant one. No one cares about shutdown kills besides you and me (evidence: the content shutdown crash rate of Fx59)
19:21:38 I don't think they need to line up exactly, but shouldn't they be roughly proportional
19:22:03 Nope. When left to users' whims, they will submit a crash or not based on the flip of a coin
19:22:33 Wait. Which "they" should be roughly proportional?
19:22:34 I guess we can't assume that they will be
19:22:44 LHS of AWSY vs MC-S? or MC and MC-S?
19:23:30 for simplicity let's just say "browser crashes" on awsy vs. "main_crashes" in mission control
19:24:21 "browser crashes" on the LHS of MC
19:24:52 LHS of awsy I mean. too many acronyms!
19:24:56 Oh, and now it occurs to me that we're talking about two different "MC" as well
19:25:17 So. "browser crashes" from the LHS of AWSY vs. "main_crashes" in mission control
19:26:25 I _suppose_ a case could be made that a prevalent browser crash would be reported more... but users are fickle, so they may amplify crashes that don't happen frequently but instead make users -want- to submit them
19:26:47 yes. I understand that these are collected using different means and the denominator is different (ADI vs. usage_hours), but all things being equal I would expect the quantities to be proportional
19:26:58 Or, conversely, users could under-report prevalent crashes that they don't care about (see shutdown crashes)
19:27:26 But, yes, theoretically for worse product quality both should elevate and for more stable releases both should depress
19:28:51 And they should do so roughly in-step with each other. (though how long it takes crash reports and crash pings and main pings to all get to their respective endpoints is another long, boring discussion)
19:29:33 right, that's ok (they do comparisons over the entire length of a release, typically)
19:30:32 So, what are the problems. I see 1) Relman is holding on to AWSY, 2) Content Shutdown Crashes are insufficiently-well-explained
19:30:37 What problems do you see?
19:32:13 chutten: (1) is really more "wlach needs to convince relman that missioncontrol data is at least as accurate as what they're seeing on the left hand side of awsy".
19:32:48 so if it's showing something different from awsy, we need to be able to explain why
19:32:53 which brings us to (2)
19:34:08 I don't really know how to explain the fact that there are so many aggregates where content_shutdown is greater than content. it seems like we're not measuring something correctly
19:34:34 Yup. Thus my investigations into bug 1413172
19:34:36 https://bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
19:35:05 Since leaving ddurst's team I haven't really been able to prioritize this sort of analysis work
19:35:25 chutten: so basically I can't see us being able to transition to using mission control for measuring this stuff until that's either resolved or we determine that awsy is also giving completely bogus numbers
19:36:47 since I (think) content crashes is something we actually do care about, and it sounds like we don't have an accurate way of measuring it
19:38:44 Well, if your problems can be solved by showing how bad AWSY is... compare the count of PROCESS_CRASH_SUBMIT_SUCCESS/content-crash == True to the count of SUBPROCESS_CRASHES_WITH_DUMP/content
19:39:46 According to https://mzl.la/C0jvih about 5k content-crash crash reports were submitted
19:40:20 According to https://mzl.la/CV6ssD about 57.2k content processes crashed
19:40:42 This is Nightly, so I'm not surprised the ratio's closer to 10%
19:41:39 So if they're okay with operating on a non-representative self-selected subsample of crashes, I guess we can shutter mission control and find something else to do?
19:43:28 I'm not sure, it's definitely a flawed methodology but might still be giving something closer to an accurate count than missioncontrol's
19:43:49 Certainly which crashes users care to tell us about is a useful filter to have
19:43:49 or at least an accurate proportion (since it will always only be counting a subset)
19:44:11 I mean, what does it matter if the browser crashes lots if users don't care?
19:44:22 (well, security will care, but...)
19:45:46 So... without all of my ranting... the tl;dr is we need to explain content shutdown numbers to ourselves and to others.
19:46:10 you mean figure out why they're sometimes greater than content?
19:46:13 That requires bug 1413172, which I'm not going to be able to work on for at least a week (by my current load)
19:46:14 https://bugzil.la/1413172 — ASSIGNED, chutten⊙mc — Analyze counts of shutdownkill crash pings
19:46:46 wlach: Yup. At this point I figure it's going to take a longitudinal study of some crashing clients' information flow
19:47:04 chutten: what a pain :( guess it'll be good to know what's going on though
19:47:36 So you have a shutdown crash. At what point do we see your crash pings? How many? On which main ping do we see SHUTDOWN_KILL_HARD? How many? On which main ping do we see SUBPROCESS_CRASHES_WITH_DUMP? How many?
19:48:00 I should write this in the bug
19:49:02 (er, SUBPROCESS_KILL_HARD, rather)
19:52:53 chutten: thanks for your help, I'll update some of the missioncontrol bugs accordingly
19:53:15 i'll also dump my query showing content_shutdown_crashes > content_crashes into the bug above
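
(Note appended to the log: a minimal sketch of the aggregation-level filter floated at 19:07:07, i.e. discard any ping that reports more shutdown kills than content crashes with dumps, then subtract shutdown kills from content crashes on what remains. It only makes sense if both counts turn out to arrive on the same ping, which is exactly the open question in bug 1413172. The `PingCounts` type, its field names, and the helper functions are hypothetical and do not reflect the real main-ping schema or missioncontrol's code.)

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PingCounts:
    """Hypothetical per-ping summary; field names are illustrative only."""
    content_crashes: int         # stands in for SUBPROCESS_CRASHES_WITH_DUMP/content
    content_shutdown_kills: int  # stands in for the shutdown-kill count (ShutDownKill)


def is_consistent(ping: PingCounts) -> bool:
    # Shutdown kills are expected to be a subset of content crashes,
    # so a ping with more kills than crashes violates that invariant.
    return ping.content_shutdown_kills <= ping.content_crashes


def split_pings(pings: Iterable[PingCounts]) -> Tuple[List[PingCounts], List[PingCounts]]:
    """Return (kept, discarded) according to the subset invariant."""
    kept: List[PingCounts] = []
    discarded: List[PingCounts] = []
    for ping in pings:
        (kept if is_consistent(ping) else discarded).append(ping)
    return kept, discarded


def content_crashes_excluding_shutdown(pings: Iterable[PingCounts]) -> int:
    """Sum content crashes minus shutdown kills over already-filtered pings."""
    return sum(p.content_crashes - p.content_shutdown_kills for p in pings)


if __name__ == "__main__":
    sample = [PingCounts(3, 1), PingCounts(0, 2), PingCounts(2, 2)]
    kept, discarded = split_pings(sample)
    print(len(kept), len(discarded), content_crashes_excluding_shutdown(kept))
```

(The size of the discarded list is worth tracking on its own: if it is large, the two counts are probably landing on different pings and the filter would throw away real data rather than noise.)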