Assistant professor at the Institute for Computational Biomedicine, Weill Cornell Medicine, New York
Krumsiek lab @Cornell
Discussed paper by Krumsiek et al.
Gender-specific pathway differences in the human serum metabolome
Getting started with Gaussian graphical models (GGMs)
MoDentify R package
Modular metabolomics pipeline
maplet R toolbox
Alice: In today’s episode of the podcast, I am joined by Jan Krumsiek. Jan, I’ll start by introducing you, because some members of the audience might not know you, but I’m sure some of them are familiar with your work.
You studied bioinformatics at the Technical University of Munich in Germany, and then you did a PhD at the Helmholtz Center, also in Munich. You stayed at Helmholtz for a few more years as a team leader and junior group leader in systems medicine of diabetes. In 2018 you joined Weill Cornell Medicine in New York, where you are now an assistant professor. Your group focuses on developing new bioinformatics tools to analyze metabolomics and other omics data, and to integrate datasets. Would you like to tell us a bit more about the work of your group at Cornell?
Jan: Yeah. So when I came here, I think metabolomics still wasn’t used as much, or people were using it as a tool for metabolic research, but they didn’t necessarily have the methods. Together with a colleague from Qatar, Karsten Suhre, we convinced the department here that they needed a computational metabolomics person. That’s sort of the reason I came here and, as you said, we are developing methods all the way from pre-processing data and quality control down to pathways and networks. The only thing we don’t do is work on peaks and spectra directly from the mass spec[trometer] – usually we rely on the person on the platform for that and take it up right after that.
Alice: And which platforms do you use to develop those tools?
Jan: In the very beginning, a lot of our work was based on MATLAB. It was not even a specific choice – it was because my PI back then came from the signal processing physics field and he was just a MATLAB person, so we were all doing that. If I could go back in time, I don’t know if I would do it the same way, simply because MATLAB is very expensive and the toolboxes are not necessarily made for metabolomics analysis. So these days I would say we do 95% R, maybe 4% Python, and the rest is whatever we need in a specific scenario. The majority of the work is R-based because it’s free and it is supported by the community a lot.
Alice: You worked on diabetes, on cancer, and also on other diseases. How do you choose which disease you work on? Is it based on the collaborations you have, or is there also your own driving force towards certain diseases?
Jan: A little bit of both, but I would say a lot of times it is driven by the opportunity of collaboration. We have to admit that. We worked in diabetes and then cancer and also Alzheimer’s disease. So what else? It is almost all of the diseases – that sounds like a bit much, but they do come together at the level of metabolism, where metabolomics methods can be used for 80% of the same things. Right? But then the biomedical application is driven by who’s there to work with you. Here at Weill Cornell in New York, we have a lot of oncologists, oncology researchers and clinicians, and that matters: you need people around you who are specialized in the field, whom you can convince to obtain patient samples for you and who are ready to run a project with you.
Alice: So let’s move to the first topic I would like to discuss with you. It’s based on one of your papers from 2015, entitled “Gender-specific pathway differences in the human serum metabolome”, and I wanted to discuss this paper with you for two reasons. One is the methods: you use a combination of pathway enrichment and Gaussian graphical models, or GGMs. The second reason is the very topic of the paper, which is sex differences in biology, and specifically in metabolomics. In this paper you use two methods, one that I would call biology driven and one more data driven. This is how I would categorize them. From what I gathered from the methods, there is a pathway enrichment method, which is a kind of home-brewed method from what I understood. So did you design your own pathways to which metabolites belong and then do the calculations?
Jan: First of all, I like that: biology driven and data driven. We always call it knowledge driven and data driven, but it’s the same concept as what you were referring to. So for the pathway methods, the assignments of metabolites to pathways actually come from the Metabolon platform, as they deliver them to us, which is a very laborious process. They have teams in the background making those choices that you later work with.
And there are a lot of questions, as we know. You could do it differently. [For example:] Why is a certain glycolysis metabolite in the carbohydrate pathway and not the energy pathway, and so on. Those choices sometimes seem arbitrary. And you also get criticism from reviewers for that, but at least it’s something we have – that’s always our counter argument. At least we can work with it. Then we can still ask questions later.
Alice: I think as long as you’re open about what’s in each list, then people can also make up their own mind about it. If you look at the map of the metabolome, you see that everything is connected to everything. So where do you draw the line? Where does one pathway stop and the next one begin? Sometimes it’s not so clear.
Jan: Exactly. We have also worked with some versions, partially from them and partially from other databases, where you don’t have the constraint that you have to annotate each metabolite with a single pathway, but can assign many – but life doesn’t always get better with that. It gets really complicated if you have all these assignments to work with. So, the method that you were referring to: let’s say we have all the annotations and we can believe which metabolite comes from which pathway. What we found is that classical pathway enrichment analysis, as we know it from genes, transcripts and so on, doesn’t work that well in metabolomics, in my opinion. It’s being used, and it’s this classical idea of enrichment: Is there a pathway that has more hits in it? We’ve also had a PhD student work on that topic. It doesn’t work that well if your background is small. We only measure a couple of hundred metabolites, maybe a thousand, and it’s not the entire genome as in the transcriptomic case, and that creates some statistical artifacts. That would be a bit too much to talk about here, but those [artifacts] matter; they make a difference. So we ran into all these problems that these enrichment methods had when we applied them in this case – the male versus female analysis. So we came up with this new way, which was to aggregate them, to create a pathway score. That in itself is not new. Just the way we did it was new. So the idea is, instead of asking, let’s say, “Does the TCA cycle have an increased number of hits?”, we ask “Is the average concentration of the TCA cycle higher?”. Slightly different question, but it matters. And that gave us the results that are described in the paper.
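To make the difference between "counting hits" and "averaging the pathway" concrete, here is a minimal sketch in Python. It is not the aggregation used in the paper, whose details are not given in the episode; the metabolite names, group sizes, effect size and the choice of a mean of z-scores followed by a Welch t-statistic are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical toy data (not from the paper): three metabolites annotated to one
# pathway, measured in two groups of 40 samples. Every metabolite is shifted by
# a modest 0.4 SD in group B.
PATHWAY = ["citrate", "succinate", "malate"]
N = 40
group_a = {m: [random.gauss(0.0, 1.0) for _ in range(N)] for m in PATHWAY}
group_b = {m: [random.gauss(0.4, 1.0) for _ in range(N)] for m in PATHWAY}


def zscore_pooled(a, b):
    """Scale one metabolite to mean 0, SD 1 across both groups together."""
    pooled = a + b
    mu = statistics.fmean(pooled)
    sd = statistics.stdev(pooled)
    return [(x - mu) / sd for x in a], [(x - mu) / sd for x in b]


def welch_t(a, b):
    """Welch t-statistic for mean(b) - mean(a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(b) - statistics.fmean(a)) / se


# Step 1: put all metabolites on a common scale.
scaled = {m: zscore_pooled(group_a[m], group_b[m]) for m in PATHWAY}

# Step 2: one pathway score per sample = mean of its scaled metabolite values.
score_a = [statistics.fmean(scaled[m][0][i] for m in PATHWAY) for i in range(N)]
score_b = [statistics.fmean(scaled[m][1][i] for m in PATHWAY) for i in range(N)]

# Step 3: test the aggregated score instead of counting per-metabolite hits.
per_metabolite_t = {m: welch_t(*scaled[m]) for m in PATHWAY}
pathway_t = welch_t(score_a, score_b)

print("per-metabolite t:", {m: round(t, 2) for m, t in per_metabolite_t.items()})
print("pathway score t: ", round(pathway_t, 2))
```

The point of the design is that a hit-counting enrichment test only sees metabolites that pass a significance threshold on their own, while the aggregated score can pick up a shift that is small in each metabolite but consistent across the pathway.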
Alice: Okay. And then you had your other approach, the GGMs. Could you explain for a broad audience what GGMs are, and what they bring compared to traditional pathway analysis or other types of data-driven analysis that is especially interesting in metabolomics?
Jan: Yes. GGMs are first and foremost a statistical tool that doesn’t know anything about biology or biochemistry, but works with data. So what you put into a GGM could be anything. The idea behind GGMs is correlation-based analysis. It’s also been referred to as “guilt by association”. We do that all the time. So, if two things correlate (they go up and down at the same time) across samples, they must have something to do with each other – that’s the idea of any co-expression or correlation-based analysis. The thing that GGMs add to that is that they get rid of confounding effects: using regression-based methods, they try to tease out statistically what the direct correlations are, as opposed to the correlation partner of another correlation partner of another correlation partner somewhere distant in the pathway.
That method is not new – we didn’t invent it. It has been out there for a long time and there are books about it. What was new was that we applied it to metabolomics data. That part is easy. It’s a click of a button. What our research in the 10 years after the first paper (and including the first paper) was about is proving that these correlation-based networks that come out are not just pretty to look at, but that these statistical (not biological) models actually reflect pathways. We could show that across platforms, across tissues and across species, and that was the big contribution of GGMs.
And then you asked about the advantage over regular pathway methods. The interesting thing is that for a GGM, you only need the data! For example, one of the biggest problems in metabolomics, as we all know, is unidentified peaks (you have a peak that you see consistently but don’t know what it is). You call it X-something, or there are different naming versions. You can’t do pathway analysis or anything with it. A GGM will just put it in there. It’s in the network and you can still use it. You don’t know what it is, but at least you see it in context and can ask questions later.
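The difference between an ordinary correlation and a "direct" one can be sketched on a simulated three-step chain A → B → C. This is an illustration only: the data are simulated, the variable names are made up, and the first-order partial correlation formula is used because with three variables it coincides with conditioning on "all other variables", which is what a GGM does in general (typically via the inverse covariance matrix).

```python
import math
import random

random.seed(1)


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Simulated chain: A drives B, B drives C. A and C are linked only through B.
n = 2000
a = [random.gauss(0.0, 1.0) for _ in range(n)]
b = [ai + random.gauss(0.0, 0.5) for ai in a]
c = [bi + random.gauss(0.0, 0.5) for bi in b]

r_ac = pearson(a, c)          # strong, but indirect
p_ac = partial_corr(a, c, b)  # close to zero once B is accounted for
p_ab = partial_corr(a, b, c)  # the direct A-B link survives

print(f"Pearson r(A, C)        = {r_ac:.2f}")
print(f"partial r(A, C | B)    = {p_ac:.2f}")
print(f"partial r(A, B | C)    = {p_ab:.2f}")
```

In a network drawn from plain correlations, A and C would be connected; in the partial-correlation network only the A–B and B–C edges remain, which is the "direct versus distant correlation partner" distinction described above.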
Alice: And so one of my questions would be: How do you interpret the results? Is there a biological reason that could explain why they ended up in the same network?
Jan: Yes. We also use the information in the networks (and that was a paper we had later, if that’s what you’re referring to) to actually predict what they [the metabolites] are. It turns out to be so precise that you can read them from the context. There are some inaccuracies; it reduces the number of candidates rather than telling you exactly what it is, but it’s precise enough to show you what the metabolites are. And we found that particularly interesting, not just from a metabolomics-centric view (we want to identify metabolites that no one knows about), but also because this is in blood, right? It’s not in a cell or in the liver or in a biopsy, it’s in blood, and the pathway footprints that we can find with statistics are still there, so that we can recover all these pathways.
Alice: From your explanation of how the work happens under the hood, I was wondering how you take confounding effects into account. Does that mean that if you don’t specify that the two sexes are two different groups, this would be kind of blended into the mix?
Jan: That’s a good question. So what we’ve found over time, and this is more of a summary of published work, some unpublished work and work by other authors, is that these networks that we reconstruct from the data are pretty stable across conditions in humans, for example.
So now you might say: can’t we make a network for males and one for females, two of them, and compare? Or could we make a diabetes and a non-diabetes network, or a cancer network? And at least from what we’ve found so far, it turns out that the networks themselves are remarkably stable, which at first was disappointing because we wanted to find the differential networks.
But it’s actually also interesting, because that means I don’t really have to worry about it. I can use the same network, as we did in the paper, for men and women alike. We just put it all together. And these confounding factors relate more to other metabolites – because metabolites tend to all go up and down at the same time with each other. That creates a lot of correlation. That’s real, but it’s not direct. Whereas across different genders or age groups or disease groups, it seems as though those structures are actually not confounded as much and are very stable, which is interesting, and it kind of makes sense. Metabolism is hardwired, and then you just make changes.
Alice: Okay. Like any tool, I expect GGMs have limitations. So what would you say are the primary limitations to the application of GGMs to metabolomics?
Jan: Yeah, absolutely. There are two big limitations. The first one is sample size, and we get that question a lot. You cannot do this with 20 samples. How many exactly you need is a very difficult question, because it also depends on how many metabolites you have and how correlated they are. We’ve seen decent results with fifty to a hundred samples. Below that, I don’t know if I would use the method. That’s also why it’s used more often in human studies than, for example, in mouse studies, because there you have very specific groups.
The other limitation is the first G in GGM (Gaussian). It requires normal distributions. That sounds very statistical, but it matters, especially when you have clearly non-continuous data such as a binary variable or an ordinal variable. For example, you cannot statistically put gender into the network as a node. It would be interesting to see gender float around with the metabolites in the network and see where it attaches and what it does, but you cannot do that. You need a completely different type of method – MGMs (mixed graphical models) – which is much more complicated from a statistics and computation point of view.
So that is a limitation that is real and needs to be taken care of if you have mixed data.
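One simple way to look at the Gaussian requirement before fitting a GGM is to check how skewed each variable is. The sketch below uses simulated log-normal values as a stand-in for right-skewed concentration data and compares the sample skewness before and after a log transform. The log transform is a common pre-processing choice and an assumption here, not something prescribed in the episode, and a skewness near zero is only a rough indication, not a formal normality test.

```python
import math
import random
import statistics

random.seed(2)


def skewness(values):
    """Sample skewness: mean of cubed standardized values (0 for symmetric data)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sd) ** 3 for v in values)


# Simulated, strictly positive, right-skewed "concentrations" (hypothetical data).
raw = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before log transform: {skew_raw:.2f}")
print(f"skewness after log transform:  {skew_log:.2f}")
```

A transform like this cannot rescue a binary or ordinal variable such as sex; for those, the mixed-model route mentioned above is still needed.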
Alice: Okay. For people who are interested in GGMs, do they have to call you to try and collaborate with you? Or can they play around by themselves already? Is there other software or code available for people to try this on their own?
Jan: There are a couple of packages out there. If you just google partial correlation, there’s one called ppcor. There is another famous paper and package called GeneNet from the lab of Korbinian Strimmer, a German statistician, which is widely used, and those two toolboxes are really one click (both packages are linked in the show notes). Again, you have to take care of all those things; the Gaussian distribution, it’s on you to figure out that that’s all okay. But then executing the calculation and getting the network from it is something that an undergraduate R coder could easily do. No problem.
Alice: Okay. I also have the question: what pushed you to study this topic in the first place? Was it looking at data, and the experience with data, that made you say: okay, there is something here, someone should really describe these differences in the metabolome? Or where did it come from?
Jan: I wish that was the answer. What happened is, and I think that is an interesting anecdote: I was one of the first PhD students working with metabolomics data at Helmholtz. Back in the day we had Biocrates data on the KORA study, and I was just trying stuff. I didn’t really know what I was looking for. And then my boss said: “Why don’t you try this thing? What is it called? GGMs? Just try it out and see what happens.”
That is actually how it started. I just pressed that button and looked at the Excel sheet, and then we inspected it for a while and I was like, wait a minute. The first hits, those are all known pathway reactions. It seems like there’s something here. So I have to admit that the story we tell (that you can use GGMs to reconstruct the pathways from the data) is not a hypothesis for which we picked a method; it was really the other way around. And that’s how it happened. I actually remember that Excel sheet from the very beginning of my PhD.
Alice: And about the relevance of sex differences in metabolomics. Do you implement that in your work now?
Jan: The interesting thing about sex is that it’s the simplest variable in your data – the most conceivable, one of the easiest to assess and to keep track of. So you always have it, usually with very few errors. We can even, for example, in those thousands of samples, genetically verify that participants crossed the right box on the questionnaire, and we have maybe one error in 1800 – so it’s very easy to keep track of!
But as one of the biggest confounders of all, it really makes a difference. Men and women have very different metabolism. So the idea behind tackling that topic was this: say you want to understand something complicated, like cancer (not saying that gender isn’t complicated, just the variable isn’t complicated), for example cancer outcomes over time, or diabetes complications, or stages of Alzheimer’s…
Alice: You have a very good recent example with Alzheimer’s, in the paper from 2020.
Jan: Yes, exactly. And that is the dimorphism in the associations between the metabolome and Alzheimer’s parameters that the paper you’re referring to shows. If you want to go to that level, you first have to understand what the baseline differences between the two sexes are. The other motivation, of course: there’s always a pharmaceutical idea behind it, in that metabolites do probe the metabolization products of drugs as well. They can show you how good you are at metabolizing something. And we know that a lot of medications these days are still dosed for adults, youths and children – not like men should take 800 mg and women should take 600 mg, for example. And for that, too, you need some baseline research on general metabolism, in this case in an elderly, mostly healthy population.
Alice: I liked this a lot in the [Alzheimer’s] paper by Matthias Arnold: There’s a very elegant demonstration of the power of stratifying the data based on sex and also on ApoE status for Alzheimer’s patients. There’s one figure where you just compare Alzheimer’s versus control, with no distinctions made whatsoever, and there’s no difference (proline was the metabolite that was looked at, a very basic amino acid). And, as I said, you think: okay, it’s nothing special. And then you start stratifying and you see that male and female look a bit different. And then when you combine the ApoE phenotype and genotype and the sex, you suddenly see that this actually becomes relevant for women who have this specific genotype. This was a beautiful example where sex is the point where the difference is made.
In this case you would fail to address the metabolome of the female population with this specific genotype if you group everyone together. In the same way, if you have a response that is so strong that it drives the overall result but comes from only a small part of the population, then you’re going to generalize it to everyone – and the majority is going to do something that’s useless for them, because the response is so strong in a small part of the population. So this was a beautiful example from the paper. I liked it a lot.
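The dilution effect described here can be reproduced with a small simulation. The sketch below is purely illustrative and does not use data or effect sizes from the Alzheimer’s paper: it assumes four equally sized strata (sex by a hypothetical carrier status), plants a case-versus-control shift of one standard deviation in a single stratum, and compares the pooled difference with the per-stratum differences.

```python
import random
import statistics

random.seed(3)

# Hypothetical strata: sex x carrier status of some risk genotype.
STRATA = [
    ("female", "carrier"),
    ("female", "non-carrier"),
    ("male", "carrier"),
    ("male", "non-carrier"),
]
TARGET = ("female", "carrier")  # the only stratum with a true effect (assumption)
N = 200                         # cases and controls per stratum


def simulate(stratum, is_case):
    """Metabolite level: shifted by 1 SD only for cases in the target stratum."""
    shift = 1.0 if (is_case and stratum == TARGET) else 0.0
    return random.gauss(shift, 1.0)


records = [
    (stratum, is_case, simulate(stratum, is_case))
    for stratum in STRATA
    for is_case in (True, False)
    for _ in range(N)
]


def mean_difference(rows):
    """Mean level in cases minus mean level in controls."""
    cases = [value for _, is_case, value in rows if is_case]
    controls = [value for _, is_case, value in rows if not is_case]
    return statistics.fmean(cases) - statistics.fmean(controls)


pooled_diff = mean_difference(records)
stratum_diff = {s: mean_difference([r for r in records if r[0] == s]) for s in STRATA}

print(f"pooled case-control difference: {pooled_diff:.2f}")
for s, d in stratum_diff.items():
    print(f"  {s[0]:6s} / {s[1]:11s}: {d:.2f}")
```

With these settings the pooled difference is roughly a quarter of the planted effect, because three quarters of the samples contribute nothing to it. In a real, noisier dataset that is often the difference between a detectable and an undetectable signal.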
Jan: I’m glad you liked it. One of the major challenges, one of the true challenges of the entire field, is these stratifications. In this case, the factors by which we analyzed the data were clear. Gender was a good candidate to work on, and ApoE genotype is a known major factor in Alzheimer’s disease. So the factors by which the data were stratified were sort of obvious, but what if we are talking about something else? Something like some age group or a very specific genotype group that we’ve never thought of? That becomes a real statistical problem; we can’t test all of those combinations of stratifications. I personally believe (though many people are working on those methods, of course) that there are a lot of those hidden and cryptic associations out there that we simply don’t know about. And I don’t know what to do about that. Maybe we need really big data sets like the biobanks.
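A quick back-of-the-envelope count shows why testing every stratification is a statistical problem. All numbers below are hypothetical: 20 candidate binary factors, strata defined by one factor or by a pair of factors, 500 measured metabolites, and a plain Bonferroni correction as the simplest possible multiple-testing adjustment.

```python
import math

n_factors = 20       # hypothetical number of candidate binary factors
n_metabolites = 500  # hypothetical panel size
alpha = 0.05

# Strata defined by one factor (2 levels) or by a pair of factors (2 x 2 levels).
strata_single = math.comb(n_factors, 1) * 2
strata_pairs = math.comb(n_factors, 2) * 4
n_strata = strata_single + strata_pairs

# One case-control test per metabolite in every stratum.
n_tests = n_strata * n_metabolites
bonferroni_threshold = alpha / n_tests

print(n_strata)                       # 800
print(n_tests)                        # 400000
print(f"{bonferroni_threshold:.3g}")  # 1.25e-07
```

At the same time each stratum contains only a fraction of the samples, so the required significance level gets stricter while the power per test drops, which is one reason very large cohorts are attractive for this kind of question.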
Alice: Is there something else you would like to point out about the sex differences paper?
Jan: There was one interesting story in the paper that shows how complicated it can be to analyze this type of data. We found differences between men and women in piperidine, which is a component of black pepper spice. Right. And it’s higher in men. And as always with correlation and causation, it’s easy to find but not easy to explain. It could be, for example, that men eat more pepper; that’s conceivable. But it could also be that the metabolization is different, because there are known differences in the cytochrome P450 metabolization of xenobiotics. That could also be it, and we don’t know the final answer. Even in the paper we had to say: well, it could be this, or it could be that. And maybe no one cares about black pepper that much. But if this were a drug, it would really matter, right? It really matters what the origin of those differences is, and that is really hard to tell from observational studies. That’s still a major challenge for sure.
Alice: I did my PhD and my academic work in the world of toxicology. For me it would always be interesting because it could be anything you’re exposed to. Just from what the chemical is, it could even be aftershave or something that has this pepper scent, and then you put that on more when you’re a man than when you’re a woman, and it’s possibly present with other chemicals that you expose yourself to without really thinking about it. It’s a really interesting field.
I know you’ve worked a lot with the combination of metabolomics with GWAS as well, but you probably work with other types of omics. Which types of omics datasets do you work with, other than metabolomics?
Jan: Over time we have worked with all of them, if that makes sense. The standards like transcriptomics, proteomics, genomics and metabolomics, and then also epigenomics for sure, a very important topic. And some that are a little more specialized, such as glycomics (that was again technology driven, with our collaboration partner Gordan Lauc in Croatia), and some other more specialized aspects. But the big set, the central dogma of biological omics, I would say all of them in some capacity.
Alice: I’m just thinking about this now on the fly: What is your view of epigenomics? How do you use it?
Jan: What I found most difficult working with epigenomics, and I’m assuming everyone who’s done epigenomic research will have encountered this, is that while the marks on the DNA themselves are supposedly binary, you don’t have the local inheritance structure like with SNPs.
With a SNP I can be somewhat certain that my neighboring SNP is very highly correlated with me, which is linkage disequilibrium. That doesn’t necessarily hold at all in epigenetics, because it’s a chemical modification. So one mark could mean a lot and the mark three base pairs down could mean nothing. That really makes it complicated. We had one study in the context of type 1 diabetes and HLA methylation, and the functional interpretation, summarization and aggregation of results across many marks on the DNA I found much harder in epigenetics compared to SNPs (which are also not easy, but I thought that [epigenetics] was way more complicated).
Alice: Then going back to metabolomics, which you integrated with other omics. Do you have an opinion on which one plays best with metabolomics, or is it all the same for you and you’re happy to combine it with anything?
Jan: The first thing we have to say, or sort of explore, is that integration is a word that we use and it could mean a lot of things. Right. When, for example, a collaboration partner approaches us with a pure metabolomics study with two groups, we do have a one-size-fits-all approach for that. We have our standard analysis pipeline with the pre-processing and so on, and we spit out the pathways and then they can work with that. In multi-omics I’m also asked a lot: What is your standard approach? How do you integrate this metabolomics data with the transcriptomics data that I’ve measured? And it turns out there is no standard, one-size-fits-all solution because there is no standard, one-size-fits-all question.
What are you asking? Do the metabolites correlate with the transcripts? Okay, then we can design an analysis for that. It could also be something way more complicated, like “Are these enzymes regulating that metabolic pathway?” or “Does glycosylation of a protein make a difference in metabolites?”, for which, by the way, we don’t have any evidence. I think it comes down to this: you need to know the question, at least somewhat. Let’s stick with the correlation part; I think that’s the most intuitive thing to do. The question is just “Do they go together, yes or no?”, and then maybe also at the level of pathways. I think the omics technology that fits best with metabolomics is proteomics. One reason may be that it’s exactly the next partner in the cascade, while transcriptomics is one step further away. The other reason is that we often measure in blood, and blood proteomics, like blood metabolomics, is something whole-body: it comes from everywhere. Blood transcriptomics is something very different. It is mostly immune cell, white blood cell transcriptomics. So you profile a very specific compartment, and that’s important. So when you say, let’s say, “I do blood transcriptomics and metabolomics”, you picture them as just two steps away, you know, transcriptomics, proteomics, metabolomics next to each other, but they’re really not. They are actually [spatially] compartments away. It’s more like liver-influenced metabolites versus immune cell transcripts.
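For the simplest version of the question, "do they go together?", the analysis is a cross-correlation table between the two data types on matched samples. The sketch below uses simulated data with made-up feature names (`m1`, `m2`, `t1`, `t2`) and one planted metabolite–transcript link; a real analysis would add multiple-testing correction and covariate adjustment, which are left out here.

```python
import math
import random

random.seed(4)


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical matched samples: the same 200 individuals in both data sets.
n = 200
transcripts = {
    "t1": [random.gauss(0.0, 1.0) for _ in range(n)],
    "t2": [random.gauss(0.0, 1.0) for _ in range(n)],
}
metabolites = {
    # m1 follows t1 plus noise (planted link); m2 is independent of everything.
    "m1": [t + random.gauss(0.0, 0.5) for t in transcripts["t1"]],
    "m2": [random.gauss(0.0, 1.0) for _ in range(n)],
}

# Cross-correlation table: every metabolite against every transcript.
cross_corr = {
    (m, t): pearson(m_values, t_values)
    for m, m_values in metabolites.items()
    for t, t_values in transcripts.items()
}
strongest = max(cross_corr, key=lambda pair: abs(cross_corr[pair]))

for pair, r in cross_corr.items():
    print(pair, f"{r:+.2f}")
print("strongest pair:", strongest)
```

The key practical requirement is that the rows really are matched, that is, both measurements come from the same individuals and, as discussed above, ideally from compartments that can plausibly talk to each other.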
Alice: In tissues, the picture might be different.
Jan: Yes. In tissue, the correlation is generally higher. We have an unpublished cancer dataset with our colleagues from Memorial Sloan Kettering Cancer Center. There we are exploring the metabolome-transcriptome correlations in cancer tissue. Even there, it’s not as simple as you might think. It’s not always the case that the enzyme goes with the substrate and the product metabolite, as you would picture it.
Alice: This often comes as a surprise to people who have never done this work before. That is interesting to see. Then I guess you have to do a lot of explaining when you work with people who are new to this field.
Jan: Yes. And to ourselves, too! Sometimes you wonder how it is that an enzyme shows a lot of variation. It looks like there is something, but it does not go with a substrate or a [reaction] product. And I think that just goes to show that dynamic regulation of biological systems is more complicated than the arrows we draw on a piece of paper.
Alice: So in terms of tools now, are there any software tools or programs with the code available that you would recommend for people who are interested in integrating metabolomics with other omics?
Jan: I would always encourage people to go out and check the most recent reviews, because as we record this today, tomorrow there will be five new methods. There are a couple of really interesting methods: for example, the MOFA method from the Oliver Stegle lab, which goes more in the direction of what I was referring to earlier (the cross-correlation analysis). And then there are several methods out there that attempt to do joint pathway enrichment analysis. The MetaboAnalyst platform is famously now integrating multi-omics datasets as well. But again, it really depends on the actual question at hand to then go out there and pick the right method.
Alice: From what I remember, the most complicated thing for us was to find the common language between the different data sets. Is translation the first step and maybe the most time consuming step in some sense?
Jan: Absolutely. So if you get a new dataset you can’t just go ahead, unless you know that someone has already worked on the data so much that you can ignore pre-processing and just take the data and work with it.
But even then, you find some result or something you need to understand better. Maybe a protein is not really what the label says it is; you have to think about how the platform measures it; you’ve got to learn every platform, every method, and all of its problems. And we’ve had this discussion internally in the lab: more often than not we hope we can use the data blindly, but almost every single time we have to walk back to the platform and talk to them. What’s going on here? What does this mean? Is this maybe a misannotation? I don’t understand this. Why do I have all these zeros in my dataset?
Our take is: You cannot ignore it at almost any level, even though you hope that someone has processed the data to the point for you, that you can just use it. At least in my experience, that’s never true. You always have to go back.
Alice: We often discuss metabolomics in the context of the microbiome as the combination of the host and the microbiome pool of metabolites. And as I was preparing for this discussion with you, I was wondering, since we’re talking about multi-omics: Do you know of any tools that combine several omics considering there might be more than one species at work? I’ve never seen this.
Jan: It’s a very active field of research just from the two-omics side (the metabolome and the microbiome). I think a whole new set of methods will be required. And it’s maybe a good example of what we talked about earlier, that there are not always one-size-fits-all approaches. For the microbiome with the metabolome, that’s one of the major topics. So let’s say you have stool samples, and you have metabolites and microbes. Anyone who’s ever worked with that knows, and even if you don’t know much about it and just picture it: it’s extremely difficult to really understand what’s going on. You have a biomass that processes metabolites and interacts with the bloodstream from the gut. Somehow, some product comes out, and trying to interpret how it got there and which organism made it is extremely complicated, and a good example of a different set of methods being needed. If you want to do pathway analysis in microbes, there is an entire field just working on the question: “Can I find the pathways that are active in certain microbes that, in combination with other microbes, then lead to the production of a certain metabolite in the gut that might or might not be beneficial for the human?” And again, those methods are completely different from when you look at the metabolome or the proteome. It’s just something else. It’s a different biological hypothesis. And this is why I think, in answer to your question, that a method that integrates a lot of these, I don’t know if that exists or can logically exist at this point in time. I don’t even know how anything plays together here.
Alice: That’s what I expected, but I was curious to see if maybe you knew something that I hadn’t heard about.
Jan: Call me if you know anything… [laughing]
Alice: Yeah, stay in touch [laughing]. About multi-omics analysis, were there other points that you wanted to bring up?
Jan: Yes. I think the manual integration of data, let’s say into a figure or into a chart, is sometimes undervalued – those pathway methods do not give you the final answer; they do not write the paper for you!
Alice: I am happy to hear you say this. I believe that but it’s really nice to hear it from different sources, too.
Jan: Absolutely. An example was the latest Alzheimer’s paper we wrote: Yes, we are going through the standard process of clean statistical analysis at large scale; then the pathway integration methods that we talked about, all of that needs to be done as a first view off what’s going on. – But our final pathway about near transmitters that we were personally interested in is a figure that my post-doc Richard worked on and no computer method could make that figure. We picked it. And, of course, you have to be careful not to make it biased and so on, but I think we shouldn’t sell what we can do manually under value. You still need the compensation part, but the final interpretation is not going to come out of an enrichment algorithm.
Alice: Do you think we’ll ever get there?
Jan: I think it will always be the case that better and better computational tools will give you more interesting views of the data, things you hadn’t expected, but it’s always more like the first page of Google for you. You’ve still got to pick the right one and go through it and interpret it and figure it out. Only you know the field; the computer doesn’t know the context of the field, the other studies, and what your question is. So I think the methods will get better at condensing those complex data sets into candidate results.
Alice: I fully agree. So then we get to the more generic questions about metabolomics and about interpretation projects. First, do you see particular pitfalls, especially for beginners in metabolomics or people who are diving into their first interpretation projects, that they can avoid in the preparation of the data or in the actual running of the analysis?
Jan: I think there are a couple of aspects to this. First of all, and it sounds like a dry standard statement: the pre-processing really matters. I would also encourage people, students and also new researchers in the field, to think about pre-processing not just as this chore that you have to do, this must-do step that you just want to get past as soon as possible. It’s actually really interesting. There are a lot of statistics and a lot of computational methods involved on the side of pre-processing the data. I know that’s not what we want to do. We want results for our question, not to spend our time massaging the data, but it’s absolutely necessary. You can’t skip it.
Alice: Do you also see it as a way of getting to know the data? One of the advantages of doing this manual work myself in the past was that, even though I might have shown only about 5% of the data at the end of the paper, you get a knowledge of the data that is really deep. And that also points you in the right direction.
Jan: Very good point. For example, you might exclude an outlier, but if you’ve really spent some time, you know that sample or that particular mouse, and maybe what was going on there, what happened. You really know it at that point, and that time needs to be spent. I only know two ways: either we spend the time and then go forward, or we go forward too fast and then go back and spend the time.
As for the challenges and pitfalls in the statistical analysis, I think that is what is tough in the field. And that is not even metabolomics-specific. You have to be a somewhat trained statistician to run the data with all the problems that come up. Can I use a method that assumes normality on non-normal data? Is that just some kind of math statement that doesn’t matter in my actual application case? Or do I really need to take care of this?
Alice: Hm. Alternatively, you can collaborate with statisticians.
Jan: Yes. And I think you must. Whenever you are running anything, even if you’re running it on the online toolboxes, unless they really take care of everything, you must have someone on board who has an understanding of the statistics or is willing to acquire it, go through the tutorials, and go through all these things that we might not find wonderful, such as plots of the distribution of the data.
Again, what I really want is to go forward with my project, right? But you have to. It can be something very small that makes a large difference in the outcome, such as: did I take the logarithm of my data? Did I scale my data before doing a PCA?
Alice: A beautiful example: do I log everything or not? Then you do your whole analysis, your whole interpretation, and then towards the end you learn that, oh, you should have done this, and you have to do everything anew. This is quite dramatic.
Jan: Exactly. You asked for pitfalls to avoid, and I think it’s not even possible to name them one by one, because unfortunately you have to know how to statistically analyze quantitative data, which could also be weather data or stock data. There are going to be similar questions. You don’t necessarily need a degree in it, but you have to have someone on board who has an understanding of those concepts, or you must train someone to get to that point using online resources and so on.
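To make the log-and-scale question concrete, here is a minimal Python sketch; it is an illustration, not something from the conversation. It assumes three made-up metabolites (`met_A`, `met_B`, `met_C`) with simulated log-normal abundances on very different scales, and it reports how much of the total variance each one holds under raw, log-transformed, and log-transformed plus autoscaled data. Because PCA looks for directions of maximal variance, a variable that holds almost all of the variance will largely determine the first component.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical data: 50 samples of 3 made-up metabolites with log-normal
# abundances on very different scales (names and parameters are illustrative).
params = {"met_A": (10.0, 0.5), "met_B": (4.0, 0.5), "met_C": (1.0, 0.5)}
data = {
    name: [random.lognormvariate(mu, sigma) for _ in range(50)]
    for name, (mu, sigma) in params.items()
}


def variance_share(table):
    """Fraction of the total variance contributed by each column."""
    variances = {name: statistics.variance(vals) for name, vals in table.items()}
    total = sum(variances.values())
    return {name: v / total for name, v in variances.items()}


def log_transform(table):
    """Natural logarithm of every value (all values must be positive)."""
    return {name: [math.log(v) for v in vals] for name, vals in table.items()}


def autoscale(table):
    """Center each column to mean 0 and scale it to unit variance."""
    scaled = {}
    for name, vals in table.items():
        mean = statistics.mean(vals)
        sd = statistics.stdev(vals)
        scaled[name] = [(v - mean) / sd for v in vals]
    return scaled


raw_share = variance_share(data)
log_share = variance_share(log_transform(data))
scaled_share = variance_share(autoscale(log_transform(data)))

for label, share in [
    ("raw", raw_share),
    ("log", log_share),
    ("log + autoscale", scaled_share),
]:
    print(label, {name: round(s, 3) for name, s in share.items()})
```

In this simulated setting the raw data are dominated by the most abundant metabolite, while after log transformation and autoscaling each variable contributes equally. Whether either step is appropriate for a real data set remains a statistical judgment that depends on the platform and the question.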
Alice: When it comes to the interpretation of the data, what would you say is the most time-consuming step?
Jan: I prepared for this when I read your questions, and that one I’ve thought about for a while. It is a good question. I came up with an answer that I think is very practical. What happens to a lot of people, and what we do a lot of the time in data analysis, in my opinion, is that we use the tools we have, try them first, and ask questions later. And then later we realize it doesn’t actually fit our biological question. So we did a correlation analysis between metabolites and transcripts in our new multi-omics data set, because that seems logical; that’s what we do. And then when we look at the results we realize: “Was that even the question that we asked?” Then it becomes an iterative process. We go back into our methods. The execution of a scientific project should, of course, be a linear process: we make a hypothesis, we pick the statistical method to answer it, then we answer it, and then we write a paper. In reality, and everyone knows that, it’s a very iterative process, and that is driven by interpretation, right? It’s not just that you don’t get the results you want; you don’t even know what to do with the results of the statistical method that you actually got. So you go back and adapt it and iterate and go on. And every time, a postdoc or a PhD student has to write 200 lines of code and debug it to get that new result done. That, I think, is the most time-consuming step in all of it, combined with the more fun part that we discussed earlier: the manual interpretation. As we said, the pathway method doesn’t get you to the last figure in the paper, so you must put in your own interpretation. That is time-consuming, but I think very fruitful. The iterative part, where we switch between methods and hypotheses constantly, is I think not very productive and needs to be taken care of.
Alice: Do you still strive to one day begin with your question, run in a straight line, and get to the end, or do you know that this will never happen?
Jan: I have to acknowledge that while that’s what we want, and it’s always easy to think about it in hindsight, in reality that never occurs. But in our lab and in our discussions, we’re trying to move more toward asking a question and then picking the method for it, instead of taking the method that we know so well, using it first, and then asking the question.
Alice: This really makes sense because sometimes also you have the tools that you have, and they might be very good to answer certain questions. So why not just go for that?
Jan: We are changing our ways a little bit. I always ask in my lab: “Who cares?” Not in an offensive way, but in a “who actually cares?” way. So who is really interested in the result that you produce? If you don’t have an answer to that, then you should immediately drop everything and rethink first.
Alice: That’s a good way to go. Yes, that’s true. Your work in developing tools requires a lot of creativity. Do you also see a need for creativity in the interpretation part?
Jan: I think yes and no. I am interpreting the question also in terms of statistical and computational tools that aid in interpretation. And I think all of us who work in computational research can confirm that the most interesting new methods, the new toolbox that you publish later, all originate in some question and some creativity: “I don’t know what to do with this data. Maybe we can try this.” For example, we had this with the networks in the MoDentify toolbox. It started as a “couldn’t you try this thing?” for the PhD student, and she thought about it and came up with a creative idea of how to make these modules in networks, and then it ended up being a toolbox. It is a new data interpretation toolbox that came from creative brainstorming about how to go forward. I think therefore the answer to your question, “Is there room for creativity in data interpretation?”, must be yes, because that is literally what we do.
Alice: Otherwise we end up with the old tools, because then you don’t create anything that gives you, as you said before, the beginning of the answer, but you still have to do the work of connecting it together. So we need creativity for the tools, for how to apply them, and then for how to understand what they’re saying to us and why that makes sense.
Jan: Exactly. Especially in exploratory studies, where you don’t have an outcome that is clear. If you know what the outcome is, there’s no creativity. You just have to pick the right statistical tests, like in a cancer study. Not that they’re easy to do, but the outcome doesn’t need any creativity. But for a new omics data set that you give to a PhD student and say “here, figure it out”, you definitely need creativity.
Alice: To finish, I have two straightforward questions. Which one is your favorite metabolite? And why?
Jan: Yeah, I had to laugh when I saw that question. I don’t want to pick favorites.
Alice [laughing]: I want to love all my metabolites equally.
Jan: A colleague many years ago also said, after working with metabolites (and at the time we only had a couple of hundred), that you know each and every one of them personally. I think for me, one of the most fascinating ones is 2-hydroxyglutarate. It’s a metabolite that is relevant in cancer.
And why I find it fascinating is that it’s a naturally occurring compound, but with a gain-of-function mutation that happens to the TCA cycle in many different cancers, it suddenly makes the TCA cycle sort of spin out of control and produce masses of this metabolite. It is really interesting that you could change the enzyme to produce something very similar, but a little different.
And 2-HG is being debated as one of the first onco-metabolites, that is, the ones where it is not just a side product of the pathogenesis of cancer that occurs, but contributes to it. Its presence has epigenetic effects and contributes to cancer, and the pathway spirals out of control so badly that you can actually see the metabolite elevated a hundredfold even in the bloodstream. I think that story is fascinating, and I worked on it.
Alice: Thank you. This concludes our conversation about metabolomics. Thank you, Jan. And I look forward to all the wonderful new tools that you will develop for us. Thank you.
Jan: Thank you very much. And I’m looking forward to your exciting podcast.