Posts filed under Research (207)

June 21, 2025

Reeling them in

Q: One News says fishing can improve your mental health!

A: That sounds fairly plausible, actually. Did they say how they know?

Q: “research from the UK”

A: A bit non-specific, innit?

A: I think it’s this paper. The number matches (“Almost 17% less likely”), it’s from the UK, and there doesn’t seem to be a better match.

Q: And people who fished more had less mental illness?

A: People who fished more often had less history of depression, suicidal thoughts, and self-harm. People who fished longer had more suicidal thoughts.

Q: How often did people have to fish to be in the “17% less likely” group?

A: It’s not clearly described. The model in the paper actually has 17% more likely, so maybe it’s a model for “not mental health problem”. If the 17% is for a one-step difference in the survey question then it’s a surprisingly large effect of a very small difference: 5-6 times a week is a different category from 3-4 times; once every two weeks is different from once per month.

Q: Could the anglers just be healthier anyway, or richer or something? Did they collect that information?

A: They did collect it, but they didn’t use it in the analysis, at least in this paper.

Q: How did they recruit the people?

A: “an online survey that was advertised through the Instagram, Facebook, and Twitter accounts of Angling Direct and Tackling Minds. Angling Direct also sent the survey link to their mailing list, and the link was distributed via the Anglia Ruskin University Twitter account, as well as the authors’ own networks.”

Q: That … sounds like it might not be perfectly representative

A: 98% of the respondents were men, for example. And 40% were in the top 20% of household income nationally.

Q: Would I be right in guessing that Angling Direct is some sort of fishing magazine?

A: It’s actually a chain of fishing supply stores in the UK. It claims to be the UK’s leading fishing-tackle retailer.

Q: Ok, and Tackling Minds is maybe some sort of fishing education thing?

A: It’s a charity that uses fishing as a mental health intervention.

Q: Couldn’t that have some impact on the correlations between fishing and mental health in the sample?

A: Indeed it could

March 29, 2019

Ihaka Lectures – videos for your viewing pleasure

By Atakohu Middleton

We’re three weeks into the month-long Ihaka Lecture Series, and it has been well received – thank you to those who have turned up in person and online.

Our final speaker, Robert Tibshirani, right, is up on Weds April 3 at the University of Auckland (details here). Robert is Professor of Statistics and Biomedical Data Science at Stanford University.

He is best known for proposing the ‘lasso’, a sparse regression estimator, and describing its relationship to the idea of boosting in supervised classification. He will talk about modern sparse supervised learning approaches that extend the lasso.

In the meantime, you might like to check out the films of the last three speakers. First up on March 13 was Bernhard Pfahringer, left, who is Professor of Computer Science at the University of Waikato.

He is a member of the Weka project, New Zealand’s other famous open-source data science contribution, and here talks about the design and development of Weka and more recent projects.

Next was supposed to be JJ Allaire, the founder and CEO of RStudio, and the author of R interfaces to Tensorflow and Keras. However, ill-health prevented him coming, and our very own Professor Thomas Lumley stepped in.

Thomas talked entertainingly about deep learning, in particular how deep convolutional nets are structured and how they can be remarkably effective, but can also fail, as he puts it, “in remarkably alien ways”.

Following was Dr Kristian Lum, Lead Statistician at the Human Rights Data Analysis Group. Her research has concretely demonstrated the potential for machine learning-based predictive policing models to reinforce and, in some cases, amplify historical racial biases in law enforcement.

She talked about algorithmic fairness, and about ways in which policy, rather than data science, influences the development of these models and their choice over non-algorithmic approaches.

February 2, 2019

Meet Simon Goodwin, Statistics summer scholar

By Atakohu Middleton

Every summer, the Department of Statistics offers scholarships to high-achieving students so they can work with staff on real-world projects. Simon Goodwin, below, is working with Dr Jesse Goodman on random graph dynamics and hitting times.

Simon’s summer scholarship is related to the study of random graphs, looking at how to investigate networks that look as if they are random or pseudo-random, like social networks, family trees or the global flight network. His task in particular is looking at the nodes in these structures that are hard to reach by moving randomly, and what this means for the structure of the graph as a whole.

You can conceptualise it like this: Produce a random graph by connecting pairs of vertices uniformly at random. Then run a random walk on this random graph: at each step, move to a uniformly chosen neighbour of the current position. The hitting time is the number of steps needed to reach a particular target vertex, and it varies in a particular way depending on the size of the random graph.

Simon’s work looks at the effect of changing the random graph. Between each random walk step, he might “rewire” some edges: pick a fraction of edges, disconnect the vertices on either side, then randomly reconnect those vertices to see if these graph dynamics make it faster (or slower) to reach the target vertex.
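The set-up described above can be sketched in a few lines of Python. This is only an illustrative toy, not Simon's code: the function names, the graph size, the rewiring fraction and the step cap are all assumptions made for the example. It also allows repeated edges and self-loops after rewiring, which a more careful version might exclude.

```python
import random


def random_graph(n, m, rng):
    """Return m edges, each joining a uniformly chosen pair of distinct vertices.

    Assumption for this sketch: repeated edges are allowed.
    """
    return [tuple(rng.sample(range(n), 2)) for _ in range(m)]


def adjacency(edges, n):
    """Neighbour lists; a repeated edge simply appears more than once."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def rewire(edges, fraction, rng):
    """Disconnect a fraction of the edges and randomly re-pair their endpoints.

    Every vertex keeps its degree; self-loops and repeated edges may appear.
    """
    edges = list(edges)
    k = int(round(fraction * len(edges)))
    chosen = set(rng.sample(range(len(edges)), k))
    kept = [e for i, e in enumerate(edges) if i not in chosen]
    stubs = [v for i in sorted(chosen) for v in edges[i]]
    rng.shuffle(stubs)
    return kept + list(zip(stubs[0::2], stubs[1::2]))


def hitting_time(edges, n, start, target, rng, rewire_fraction=0.0,
                 max_steps=100_000):
    """Number of random-walk steps to reach target, or None if the cap is hit."""
    if start == target:
        return 0
    position = start
    adj = adjacency(edges, n)
    for step in range(1, max_steps + 1):
        if adj[position]:  # a vertex with no neighbours stays where it is
            position = rng.choice(adj[position])
        if position == target:
            return step
        if rewire_fraction > 0:
            edges = rewire(edges, rewire_fraction, rng)
            adj = adjacency(edges, n)
    return None


if __name__ == "__main__":
    rng = random.Random(2019)
    n, m = 100, 200  # hypothetical sizes, chosen only to keep the demo fast
    for fraction in (0.0, 0.1):
        times = []
        for _ in range(10):
            graph = random_graph(n, m, rng)
            t = hitting_time(graph, n, 0, n - 1, rng, fraction, max_steps=20_000)
            if t is not None:
                times.append(t)
        mean = sum(times) / len(times) if times else float("nan")
        print(f"rewire fraction {fraction}: mean hitting time {mean:.1f}")
```

Comparing the two printed averages gives a rough feel for whether rewiring helps or hinders the walker, though a real comparison would need far more runs and a range of graph sizes.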

“Looking at the structure of these random theoretical objects we can learn about vast real-world networks that have no clearly apparent structure,” Simon explains. “The results I am trying to find would also have theoretical applications in the study of random graphs.”

Simon is about to start his third year studying maths and statistics: “My main interest is in pure maths, but I am also very interested in theoretical statistics, mainly in probability. I am intrigued by all things random.”

In fact, he dropped physics for statistics last year, “and I haven’t regretted it for one moment – sorry physics! I am mainly interested in probability but I have also enjoyed learning about data analysis and I have an interest in statistical computing.”

He adds, “Probability is such an interesting field, as it has a strong theoretical backing while also having many obvious applications such as games with dice and cards, as well as many less obvious applications, from financial-market analysis to quantum physics.”

Simon is hoping to become an academic: “I hope to continue into postgraduate study and then spend the rest of my life studying and teaching what I love.” When he’s not studying, Simon loves playing video games and roleplaying games like Dungeons and Dragons, as well as walking around the scenic spots of Auckland.

For general information on University of Auckland summer scholarships, click here.

January 31, 2019

Meet Statistics summer scholar Grace Namuhan

By Atakohu Middleton

Every summer, the Department of Statistics offers scholarships to high-achieving students so they can work with staff on real-world projects. Grace Namuhan, below, is working with Professional Teaching Fellow Anna Fergusson on the design of interactive data visualisation tools for large classes.

Stage one Statistics courses are enormously popular at the University of Auckland – there are more than 2,000 students per semester, and single lectures may contain up to 600 students. Anna Fergusson, who is part of the stage one teaching team, is a keen developer of in-class web apps to engage these students. For example, you might get students to respond to questions via their own devices, with the data collected to a Google sheet that can then be analysed in class. Working alongside Anna, Grace has been exploring the principles of designing such data visualisation interactives for large-scale learning.

In particular, she is working on an interactive to collect finer-grained data on how students carry out a hypothesis test – specifically, a Chi-square test for independence. This app is not for live analysis – rather, she is tracking every point, click, and selection students make as they work through the interactive.

She’s had to work out what data to collect and how to store it, and also develop a plan to analyse this very rich and complex data set – even this one app involves thousands and thousands of rows of data. She also has to consider what an educator would want to know from the data.

Grace, a third-year Bachelor of Science undergraduate majoring in Data Science, says the project is exercising what she has learned so far, “which are my programming skills for creating the interactive and statistical skills for analysing the information extracted from the interactive”.

However, Grace didn’t start out her undergraduate studies in statistics – she did a year of biomedical science “but I didn’t really enjoy it. Data science just came out as a new major when I wanted to change my major – it involves half statistics courses and half computer science courses, so I thought it would be a really suitable major for me.”

Statistics appeals to Grace as she is “quite a practical person; turning what might look meaningless data into something useful is really fascinating. There are a lot of invisible data around us in our daily lives; being a data interpreter makes me feel like I am useful”.

For general information on University of Auckland summer scholarships, click here.

To find out more about Anna’s work in developing resources for large-class teaching, click here.

January 29, 2019

Meet summer scholar Monica Merchant

By Atakohu Middleton

Every summer, the Department of Statistics offers scholarships to high-achieving students so they can work with staff on real-world projects. Monica Merchant, right, is working with Professor Chris Wild on iNZight, the free data visualisation and analysis software he developed.

Monica, a BCom (Hons)/BSc student, is working on developing the predictive analytics module of the iNZight software, a toolbox that allows users to build their own predictive model from a real-world dataset of their choice.

The module – whose interface is menu-driven and doesn’t require any knowledge of R, the environment in which iNZight is developed – guides the user through the model-building process, from data pre-processing and model training to tuning and validation.

Most importantly, says Monica, the module goes beyond traditional modelling methods by giving the user access to the full suite of machine learning algorithms available in R. Users can apply multiple algorithms to the data to explore differences in fit, predictive performance and generalisability.

This project is useful, says Monica, because it gives us another way to make sense of the data around us: “There is a lot of it and not all of it is created equal. We need ways to intelligently convert these large volumes of data into meaningful insights and actionable knowledge.”

She adds, “This is where machine learning comes in – the basic idea is to let the machine iteratively learn from the data to uncover underlying relationships and patterns or predict outcomes.”

Monica points out that while machine learning as a concept isn’t new – much of the theoretical groundwork behind many of its algorithms was laid in the mid-to-late 20th century – it has been only in recent years that advances in computational power have enabled us to make large-scale use of these algorithms in the real world.

Today, these algorithms are used everywhere, from bioinformatics and medical diagnosing to software engineering, financial markets, agriculture, astronomy and self-driving cars – but as Monica says, “this barely scratches the surface – check out Google Brain and DeepMind”.

Monica started her university career studying a Bachelor of Commerce in Finance, Accounting and Economics. Her Honours dissertation looked at the predictive power of option-implied risk-neutral densities, which sparked an interest in statistical computing. She added a BSc in Statistics.

Asked why statistics appeals, she says, “a degree in statistics is powerful since it offers a diverse and nearly limitless range of applications. I don’t have to limit myself to any one industry.” Monica describes herself as inquisitive by nature, “so using data to solve real-world problems is always very rewarding”.

For general information on University of Auckland summer scholarships, click here.

January 23, 2019

Meet Statistics Summer Scholar Xin Qian

By Atakohu Middleton

Every summer, the Department of Statistics offers scholarships to high-achieving students so they can work with staff on real-world projects. Xin Qian, in the picture, is working with Dr Ben Stevenson, an expert in statistical methods for estimating animal populations.

How can you work out how many creatures inhabit a space when they are elusive, small and have lots of places to hide? Sitting in the bush for months and trying to count what you hear won’t be accurate – and it’s probably not a good use of time.

Another way to estimate animal abundance is through acoustic surveys, which use microphone arrays to record animal chirps and calls; statistical techniques are then used to estimate the population. This is called spatial capture-recapture (SCR), and at present we have several ways of analysing the data.

That’s where summer student Xin Qian comes in. He is working with SCR expert Dr Ben Stevenson on a simulation project that compares two ways of analysing acoustic data. They are using statistics gathered from surveys of the rare moss frog, which exists only on South Africa’s Cape Peninsula.

“We want to find out which is the best method for providing an accurate and stable estimation of frog density, factoring in the time each method takes,” says Xin. The existing method, he explains, requires that you go and collect independent data about how often individual frogs chirp in order to estimate animal density, which takes time.

However, the new method, developed by Ben Stevenson’s former MSc student Callum Young, promises to estimate both call rates and therefore animal density from the main survey alone. Says Ben: “This can save time, but may possibly leave you with a less accurate answer. What we are hoping to do is resolve the trade-off. How is the precision of our estimates affected if we switch to the new method? My guess is that it will be worse. Is this sacrifice worth the saving in fieldwork time?”

For this work they are using R, a programming language for statistical computing and graphics developed in the Department of Statistics in the mid-1990s and now used all over the world.

The project is ideal for Xin, a third-year University of Auckland BSc student majoring in Statistics and Information Systems. “It is always interesting to get information from data; it makes me feel like I am having some secret conversation with data that people can’t hear,” he says. “I normally won’t get bored dealing with numbers, and I prefer things having a logic or a reason behind them.”

Xin was born and raised in China, in the small east-coast city of Jiaxing near Shanghai. After finishing secondary school in China, he moved to New Zealand to pursue tertiary studies, starting his degree in 2016.

The University of Auckland appealed to him “because of its good reputation and ranking.” Although education rather than environment drew him to this country, he says that “New Zealand is a beautiful place with splendid natural views, and most people here are nice and welcoming; I have made lots of friends here. I have also became more outgoing and willing to try various outdoor activities that I wouldn’t get a chance to try if staying in my hometown.”

For general information on University of Auckland summer scholarships, click here.

March 8, 2018

“Causal” is only the start

By Thomas Lumley

Jamie Morton has an interesting story in the Herald, reporting on research by Wellington firm Dot Loves Data.

They then investigated how well they all predicted the occurrence of assaults at “peak” times – between 10pm and 3am on weekends – and otherwise in “off-peak” times.

Unsurprisingly, a disproportionate number of assaults happened during peak times – but also within a very short distance of taverns.

The figures showed a much higher proportion of assault occurred in more deprived areas – and that, in off-peak times, socio-economic status proved a better predictor of assault than the nearness or number of bars.

Unsurprisingly, the police were unsurprised.

This isn’t just correlation: with good-quality location data and the difference between peak and other times, it’s not just a coincidence that the assaults happened near bars, nor is it just due to population density. The closeness of the bars and the assaults also argues against the simple reverse-causation explanation: that bars are just sited near their customers, and it’s the customers who are the problem.

So, it looks as if you can predict violent crimes from the location of bars (which would be more useful if you couldn’t just cut out the middleman and predict violent crimes from the locations of violent crimes). And if we moved the bars, the assaults would probably move with them: if we switched a florist’s shop and a bar, the assaults wouldn’t keep happening outside the florist’s.

What this doesn’t tell us directly is what would happen if we dramatically reduced the number of bars. It might be that we’d reduce violent crime. Or it might be that it would concentrate around the smaller number of bars. Or it might be that the relationship between bars and fights would weaken: people might get drunk and have fights in a wider range of convenient locations.

It’s hard to predict the impact of changes in regulation that are intended to have large effects on human behaviour — which is why it’s important to evaluate the impact of new rules, and ideally to have some automatic way of removing them if they didn’t do what they were supposed to. Like the ban on pseudoephedrine in cold medicine.

February 19, 2018

Ihaka Lecture Series – live and live-streamed in March

By Atakohu Middleton

The theme of this year’s Ihaka Lecture Series is “A thousand words: Visualising statistical data”. The distillation of data into an honest and compelling graphic is an essential component of modern (data) science, and this year, we have three experts exploring different facets of data visualisation.

Each event begins at 6pm in the Large Chemistry Lecture Theatre, Building 301, 23 Symonds Street, Central Auckland, with drinks, nibbles and chat – just turn up – and the talks get underway at 6.30pm. Each one will be live-streamed – details will be on the info pages, the links to which are given below.

On March 7, Professor Dianne Cook from Monash University (right) looks at simple tools for helping to decide if the patterns you think you see in the data are really there. Details. Statschat interviewed Di last year about the woman behind the data work, and it was a very popular read. It’s here. Di’s website is here.

On March 14, Associate Professor Paul Murrell from the Department of Statistics, The University of Auckland (left) will embark on a daring statistical graphics journey featuring the BrailleR package for visually-impaired users, high-performance computing, te reo, and XKCD. Details. Paul was a student when R was being developed by Ross Ihaka and Robert Gentleman, and has been part of the R Core Development team since 1999.

On March 21, Alberto Cairo, the Knight Chair in Visual Journalism at the University of Miami (below right) teaches principles so we all become more critical and better informed readers of charts. This lecture is non-technical – if you have any journalist friends, let them know. Details. His website is here.

The series is named after Ross Ihaka, Associate Professor in the Department of Statistics at the University of Auckland. Ross, along with Robert Gentleman, co-created R – a statistical programming language now used by the majority of the world’s practicing statisticians. It is hard to over-emphasise the importance of Ross’s contribution to our field, so we named this lecture series in his honour to recognise his work and contributions to our field in perpetuity.

January 18, 2018

Maps and models

By Thomas Lumley

This spectacular map from the National Geospatial-Intelligence Agency was circulating yesterday on Twitter. I got it from Christopher Jackson (@seis_matters). It shows antineutrino emissions from around the earth.

Our local (sub)continent of Zealandia shows up nicely at the bottom right. The black dots are nuclear reactors, and the dark smudge is just the immense rock mass of the Himalayas.

This next map is a style you’ve seen before. It shows New Zealand’s winds at the moment: the storm is passing over.

What these maps have in common is a very high ratio of model to actual data. The ‘live’ wind map isn’t based on detailed live reports from a fine grid of weather stations. There aren’t any, especially out in the Pacific. It’s a map of the NOAA Global Forecast System, but forecasting the very near future rather than the long range. It isn’t going to give you more up-to-date information than the Met Service.

The antineutrino map is even more model-based. In the scientific paper I was struck by the sentence

Recently, the blossoming field of neutrino geoscience, first proposed by Eder¹⁵, has become a reality with 130 observed geoneutrino interactions¹²,¹³ confirming Kobayashi’s view of the Earth being a “neutrino star”¹⁶.

It looks like the map has well over a million pixels per observed geophysical neutrino. When it comes to nuclear reactors, the paper says “These exciting geophysical capabilities have significant overlap with the non-proliferation community where remote monitoring of antineutrinos emanating from nuclear reactors is being seriously considered”. That is, the reactors are black dots on the map because they know where the reactors are and how many neutrinos they’d make, not because they measured them. The observations do go into the model, and they probably provide actual information about the deeper bits of the earth’s crust, but the map is of the model, not the observations.

December 15, 2017

Jenny Bryan: “You need a huge tolerance for ambiguity”

By Atakohu Middleton

Jenny Bryan @JennyBryan was one of several leading women in data science who attended this week’s joint conference of the New Zealand Statistical Association, the International Association of Statistical Computing (Asian Regional Section) and the Operations Research Society of New Zealand at the University of Auckland, so we couldn’t miss the opportunity to talk with her (Jenny’s conference presentation, titled “Zen and the aRt of workflow maintenance”, is here). A brief bio: Jenny is a software engineer at RStudio while on leave from her role as Associate Professor in Statistics at the University of British Columbia, where she was a biostatistician. Jenny serves in leadership positions with rOpenSci and Forwards and is a member of The R Foundation. She takes special delight in eliminating the small agonies of data analysis.

Statschat: When did you first encounter statistics as a young person? Jenny: I was an economics major which had exactly one required statistics paper, which I took, and then continued to try and make that degree as un-quantitative as I possibly could. I had started out thinking I would major in some form of engineering, and therefore was taking math and physics and the technical track.

I was one of very few women in the course, and the culture of the course was to pull an all-nighter once a week [to do the weekly problem set]. The average mark on the exam would be 20 out of 100, and I was mentally not prepared for this type of sort of stamina-driven culture.

Was it a macho culture? That’s how it felt to me, and you needed enough innate confidence to never worry about the fact that you were getting marks you had never seen before in your life – everyone failed miserably all the time. After the first semester or two of this, I decided it wasn’t for me and declared my major to be German literature, which I saw through. But in the last two years at university, I realised I needed to be employable when I graduated, so I added economics as a means to making sure I could make a living later.

I worked as a management consultant for a couple of years and that’s where I learned that I was actually at my happiest when they locked me in a room by myself with a huge spreadsheet and I had some data task ahead of me … and so then I gradually worked my way back to what I think I’m really good at.

Did you pursue statistics qualifications? I did. After my two years of management consulting, the normal track would be to be sent off to business school. But thanks to what I learned about myself, I was pretty sure that wasn’t the right track for me. But I had learned how to give talks, how to extract questions from people and go and make it quantitative and then translate my solution back into their language. So the management consulting experience was super-useful.

At that point, I had met my husband, and I followed him to his first postdoc with no particular plans. He’s a mathematician – he knew he wanted to be a mathematician when he was 6. I never had that kind of certainty about what I was meant to do! It took me a lot longer to figure it out.

So I followed him, and basically played a lot of tennis at first (laughs) while we were living in Southern California … I decided some form of statistics would be ideal for me, but I didn’t have enough of a math background to take the specialised math exams in the US, called the GREs [Graduate Record Examinations], that a lot of statistics departments want to see. So I started taking as many prerequisites as I could at the university where he was doing his postdoc. I did well and started working as a teaching assistant in these classes as well.

Then we moved together, two years later, for him to start his second postdoc and for me to start biostatistics grad school. Also during this time, I supported myself doing fancy Excel work as a temp … so I did a PhD in Biostatistics at Berkeley in five years – the first two years are the masters, and three years of writing the thesis.

What’s your academic career path been since then? I got my job at University of British Columbia before I graduated, and I was there until I went on leave earlier this year. I’ve since been working in Hadley Wickham’s group at RStudio. My title is software engineer, which I still find a bit peculiar.

Why? Because I feel I should have more formal training in engineering to have that title, but I’m getting more comfortable with it.

What’s the essence of your role there? I spend about two-thirds of my effort on package development and package maintenance. Hadley is starting to gradually give maintainership of his packages to other people … so I took over readxl. I already had an existing line of work in making R talk to Google APIs [application programming interfaces], so I worked with an intern this summer and we created a package from scratch so that you can use Google Drive from R. Now I’m revisiting some general tools for authenticating with Google APIs, and I have another package that talks to Google spreadsheets. I also do quite a bit of talking and teaching.

You put a lot of your work on the internet. Why do you feel that is important to share it this way? I decided this was how I was going to interpret what it meant to be a scholar. Several years ago, I decided that teaching people about the process of data analysis was super-important to me, and was being completely undertaught, and I was going to dedicate a lot of my time to it. Luckily, I already had tenure at that point, but it still looks a bit like career suicide to make this decision, because it means that you’re not producing conventional statistical outputs like methodological papers. I also felt like putting my stuff out there and having a public course webpage and pushing things out would be my defence against [any suggestion] that I wasn’t doing anything.

You’re clearly not satisfied that the current academic system is serving the subject well. Not at all! We have a really outdated notion that only publications matter, and publications where there’s novel methodology. I think that’s leaving a ton of value on the table – making sure that statistical methods that exist are actually used, or used correctly. But the field is not set up to reward that – the majority of papers are not widely read and cited, and many of these methods are not used or implemented in any practical way …. it’s been enshrined that academic papers are what counts, but they’re not a directly consumable good by society. We need knowledge-translation activity as well.

So you’re rebelling. Well, I felt that the only way you could do it was to start doing the things you thought were valuable. Being able to put your course material online, to have a dialogue with people in your field on Twitter … you can finally remove a lot of these gatekeepers from your life. They can keep doing their thing, but I know people care and read this stuff. Since I was able to wait until I had security of employment, I decided that if that meant I didn’t go from associate to full [professor], I could live with that. It’s not that my department isn’t [supportive] – it’s either neutral or positive on all this. But it’s true that everyone else I was hired with is a full professor and I’m not.

Does that bug you? Yes and no. I think I could have pushed harder. But every time you push on these things, you’re basically asked, “Well, can you make what you do look more like a statistics publication? Each package that you write, can you write a stats paper around it?” and I’ve decided the answer is, “No. Can we agree that is not a helpful way to evaluate this work? The only reason to repackage it in that way is to check some box.”

Academics are becoming increasingly dissatisfied with academic publishing structures. Do you think that perhaps data scientists might take the lead in dismantling structures that aren’t helping the subject? Maybe, and I think things are changing. But I decided that it’s like turning the Titanic and it’s not going to happen on a time-scale consistent with my career. I can’t wait for academia to gradually reshape itself.

Is that one of the reasons you went off to RStudio? Oh, absolutely. I feel the things I do are tolerated in academia, and often found very useful, [but that said], I lost my grant funding the more applied I became. It’s harder to get promoted. You’re pressured to sell your work as something it’s not, just because that’s what the status quo rewards. Working at RStudio, I’m actually allowed to say what I do is what I do, and be proud of it, and be told that you are excellent at it, which is not currently possible in academic statistics.

So tell me about your typical day, working for RStudio. It’s a remote company. There is an office in Boston and a large enough group in Seattle that they rent a space, but the rest of us are on our own. So it’s just me alone at home working on my projects. We use Slack as a communication channel; the team I’m on maintains two channels for two separate groups of packages. We might have a group conversation going and it can be completely silent for three days, or we can have 100 messages in a morning. It really depends when someone raises an issue that other people care about, or can help out with. And then, I have private one-off conversations with Hadley or other members of the group, and similarly, they can be very quiet or suddenly light up.

Who do you live with? My husband’s a professor, so he’s mostly on campus but sometimes he’s around – we both like working at home and being alone together. The kids are all at home; they go to school from 9am until 3pm or 4pm. My oldest is 14 and I have twins who are about to turn 12.

So how do you manage work-life balance, given that you work from home? Well, I work when they are not there, then I try to work from 3pm to 6pm, or 4pm to 6pm, with mixed success, I would say. Then there are a couple of hours which are explicitly about driving people here and there. I do a second shift from 9pm to 1am or 2am.

Are you a night owl? Yeah, which I don’t love, but that’s just how things are in my life right now. I have to do it that way. I have one productive shift while the children are at school, then one productive shift after they go to bed.

Let’s talk about women in data science. I have the impression that maths remains male-dominated and that statistics is less so, but that data science appeals to women and that the numbers are quite good. What’s your take on that? The reason I liked statistics, and particularly liked applied statistics, is I was never drawn to math for maths’ sake, or the inherent beauty of math. I enjoyed doing it in the service of some other thing that I care about … I think it’s possible that there’s something about me that’s typical of other women, where having that external motivation is what makes you interested in, or willing to do, the math and the programming. For its own sake, it never really appealed to me that much. Programming appeals to me more on its own than math does. Programming actually can motivate me just because I love the orderliness of it and accomplishing these little concrete tasks – I love checking lists (laughs) and being able to check my work and know that it is correct … When you combine it with, “This is going to enable us to answer some question”, then it’s really irresistible.

So it’s the real-world nature of it that is really appealing to you. Yeah – I care about that a lot.

What skills and attributes make a good data scientist? I think being naturally curious, doing something for the sake of answering the question versus a “will-this-be-in-the-test?” mentality – just trying to do the minimum.

You need a huge tolerance for ambiguity. This is a quality I notice that we’re spending a lot of time on in our Master of Data Science programme at UBC. Half the students have worked before and about half are straight out of undergrad, and the questions they ask us are so different. The people straight out of undergrad school expect everything to be precisely formulated, and the people who’ve worked get it, that you’re never going to understand every last thing; you’re never going to be given totally explicit instructions. Figuring out what you should be doing is part of your job. So the sooner you develop this tolerance for ambiguity [the better] – that makes you very successful, instead of waiting around to be given an incredibly precise set of instructions. Part of your job is to make that set of instructions.

How much room for creativity is there in data science? I think there’s a ton. There’s almost never one right answer – there’s a large set of reasonable answers that reasonable people would agree are useful ways of looking at it. I think there’s huge scope to be creative. I also think being organised and pleased by order frequently makes this job more satisfying. People come to you with messy questions and messy data, and part of what you’re doing is this sort of data therapy, helping them organise their thoughts: “What is your actual question? Can the data you have actually answer that question? What’s the closest we can get?” Do that, then package it nicely, you do feel like you’ve reduced entropy! It feels really good.

You work from home and that suits you, but not every woman is able to do that. What needs to change to help women scientists progress through life and career, balancing what they need to balance? I don’t know how specific this is to data science, but three things were helpful to me. One is I live in Canada, where we have serious maternity leave – you can take up to a year, and because that’s what the Government makes possible, that means it’s normal. In both cases, I took between six and nine months – I was begging to come back before a year! But having a humane amount of time for maternity leave is important.

Also, what’s typical in Canada, and what UBC does, is that they pause any sort of career clock for a reasonable amount of time. So every time I went on maternity leave it added one year to my tenure clock.

You don’t end up out of synch with people who hadn’t been away. Yeah. It [parenthood] still slows your career down, but this helps immensely. So there are the structural policies.

Secondly, I do have a really supportive spouse. I feel like maybe I was lead parent when the kids were little, but since I made this career pivot and became much more interested in my work, he’s really taken the lead. I feel that there were many years where I was the primary parent organising the household, and now it’s really the other way around … that’s huge.

Third, I’m in my mid-late 40s now and I’m embarking on what feels to me like a second career; certainly, a second distinct part of my career and focusing more on software development. I think you also have to be willing to accept that women’s careers might unfold on a different time-scale. You might lose a few years in your 30s to having little kids … but you often find awards that are for people within five years of their PhD or for young investigators and they assume that you don’t have all this other stuff going on. I think another thing is [employers] being willing to realise that someone can still be effective, or hasn’t reached their peak, in their 40s. The time-frame on which all of this happens needs to be adjusted. You need to be flexible about that.

