mGWAS & metabolite ratios

In this episode, Alice talks to Prof. Karsten Suhre about the added value of combining genomics with metabolomics in mGWAS, tips and tricks to find confounders, and the power of computing metabolite ratios.

Karsten Suhre

Professor of Physiology and Biophysics

Director of Bioinformatics Core at Weill Cornell Medicine-Qatar

Suhre Lab @ Weill Cornell
Virtual metabolomics core facility

Discussed paper by Gieger et al.
Genetics Meets Metabolomics: A Genome-Wide Association Study of Metabolite Profiles in Human Serum

Karsten’s blog about metabolomics, genomics, and where the two fields meet

mGWAS server (collaboration project)

Other resources discussed in the podcast on GitHub

Sign up for The Metabolomist e-mail list

Episode Transcript

Alice: Welcome to this podcast. Thank you for talking with me today.

I will start by introducing you with a short biography, and then we can discuss your work with metabolomics in further detail. It was really interesting to look through what you did over the years. I was quite impressed. The range of things you’ve done is very eclectic.

So you studied mathematics and physics at the University of Osnabrück in Germany. Then suddenly you had a PhD in atmospheric chemistry and meteorology.

Karsten: Yes, but that wasn’t sudden. During my studies I spent half a year in England and studied fluid dynamics there. Fluid dynamics led to meteorology, and my specialization within meteorology was atmospheric chemistry.

And in the end, honestly, it’s not very different from what I do today. Atmospheric chemistry is like the metabolome of just one organism, which is the globe.

Alice: That makes sense. So it’s all about the analytical methods. I get it now.

Karsten: Well, I could pretend that but in reality it’s a lot of personal reasons.

Alice: So you did your PhD in Toulouse, France. And then a few years later, in 2004, you habilitated in bioinformatics and structural biology at the University of Aix-Marseille.

You stayed in France for a few more years until in 2006 you were appointed professor for bioinformatics at the University of Munich and at the Helmholtz Center in Munich, Germany.

Karsten: Yeah

Alice: And in 2011, you joined the Department of Physiology and Biophysics at Weill Cornell as a full professor and became the director of the bioinformatics core at the Cornell campus in Qatar.

Karsten: That’s right.

Alice: As I was reading through your profile page on the Cornell website, one sentence caught my attention. You wrote, “I identify as a bioinformatician and system biologist.” If you remember writing this, I would like to ask you first: why did you choose that verb?
Why do you identify as a bioinformatician? I mean, you’re a professor in bioinformatics. I found this really interesting.

Karsten: It’s because, in a way, I never really liked being boxed in. When I was a physicist, I always said, “No, I’m more of a theoretical physicist and a mathematician.” I’m into computational stuff, and I was always moving on from the point where I had arrived. Also, the definition of bioinformatics changes a lot between countries.
For some people it’s just the computer guys; for others it’s more the interpretation. The definition I like is that it’s basically the study of information content in biology. I think that’s the definition the NCBI gives on their webpage, and that’s something I identify with.

Alice: Something that fits. I love this choice of words, because it’s also nice to see that you get the degrees and the job titles, but you can still identify as whatever you like. Another thing I found interesting going through your bio is that it changed quite a lot.

Karsten: What drove this was curiosity about how biology actually works. When I studied physics, biologists were frowned upon. Then the human genome was sequenced, and that changed everything. Suddenly biology was really mechanistic and very much at the forefront. I totally changed my view.

Alice: So this is what pushed you towards biology then also. When the genome was….

Karsten: I mean, it was a chance event. I was with CNRS. I could not go back. So I went to industry in my hometown for personal reasons and did engineering for two years. I saw that engineering and research are not the same thing. The engineers know exactly what they want and optimize things, and the researchers don’t.
I didn’t fit in there. Luckily I was in CNRS and could go back. In theory, I would have gone back to atmospheric science. Just by chance I ran across an article in Le Monde on the publication of the human genome, just when I was applying for a position to go back to the Marseille region.
I said, well, maybe I’ll just ask him. And it turned out that in his lab all the bioinformaticians were physicists or astrophysicists. He was a physicist himself. So at that time there were basically no bioinformaticians. They were all converted biophysicists, and they were really the driving force.

Alice: That developed later then. The job of the bioinformatician. That label didn’t even exist probably at the time or it was really a small niche thing. Or…

Karsten: Yeah, I mean, it was really just the discovery phase. The problem was really solving the sequencing problem. They had all the sequencing, but doing the alignment was what bioinformatics initially was.

Alice: Hm. Genomics is still a big topic now. But metabolomics is a topic that grew a bit later. It seems like for the last 15 years you’ve been quite interested in metabolomics.
And that this still keeps you interested. Is there a reason for this?

Karsten: Yes. It’s the functional side. I started biology, or bioinformatics, in 2002, when I went to Marseille and learnt it from scratch. Already then I had a colleague who was actually the first author on the first paper on the KEGG metabolic pathway map.
He was working in Marseille at that point.
With him, I learned to analyze the metabolic capabilities of bacteria. We sequenced different bacteria and looked at how intracellular bacteria live: which genes they lose as they become dependent on the host cell. You could analyze that computationally, and that was really his area.
You just took the million base pairs of whatever you had in the bacterium and broke that down into all the enzymes. Then he could show which ones are missing,
and which bacterium uses what from its host. These were all obligate intracellular bacteria. That was already metabolomics in a certain way, although we didn’t know it at the time.

And the other thing is the link to function: what does the genome really do? How does it really function, and how do you interpret that? I think these were the first steps toward interpreting function in bacteria. That later carried on when we went to Munich, where things moved into humans.
Working on genome-wide association studies and then doing the GWAS with metabolomics. I think I was lucky to be at the right place, at the right time, with the right colleagues around me and everything.

Alice: Then this started a series of really nice papers. This is the main thing I wanted to discuss with you today. I’m not going to discuss the detailed statistics behind it.
I think it’s interesting for people to have a global understanding of how this is done. Because genomics and metabolomics are two very different data types. So it’s interesting to discuss this. But also to see what metabolomics brings in that environment.
Maybe I can give you my very superficial view of it.
And then you can give a bit more detail. One of the most interesting things for me was that there seem to be people who start from the genomics and the SNPs, then put their phenotypic trait on top of it, whether it’s metabolomics or something else, and look at where the associations are.
And there seems to be another technique where you start from the phenotype, whether it’s metabolomics or something else, and find what’s interesting there.
Then you find what is associated with it in the genome. First, did I understand this correctly, and can you comment on it?

Karsten: I think we should maybe go further back, to GWAS itself. I already mentioned the Human Genome Project being at the beginning of everything.
The Human Genome Project already showed that there are genetic variants between people, and it wasn’t the genome of one person (rather, at least 10 different genomes were put together). The promise of the Human Genome Project was: we just go and find all the variation, and then we say what the outcome is.
And once we have that, we can treat everything. Of course, things turned out to be much more complex. Some people say it didn’t work. (I would not agree with that. Absolutely not.) Some people say it wasn’t worth it. In any case, GWAS came along, and initially people thought, “We just find the gene for diabetes and the gene for this and that”, but they realized it’s not true.

You can explain only one, two, three percent of the heritability you expect to explain. Nowadays we know much more: there are lots of variants with small effect sizes and some rare variants with larger effect sizes.
The other thing that came up was the question: what do we do with GWAS? What’s the purpose? There is the misunderstanding that you can predict something. You can’t predict that someone will die of this or get that disease.
Today, it’s much more about understanding the pathways.
And, I think, from a pharmaceutical point of view, it’s also about target validation or target interpretation. If they want to work on a molecule inhibiting a certain protein, it is important for them that there’s genetic evidence that if they tinker with this protein, something happens. And that’s where metabolomics comes in.
This is the concept of the intermediate phenotype.
Florian Kronenberg from Innsbruck brought it up pretty early on, in our first GWAS. The hypothesis was that these intermediate phenotypes are the link between the genome and the disease. And it’s not only metabolomics.
I also think there’s no priority for metabolomics over any other omics. There is the whole chain from the genetic variant, which is the starting point of the GWAS, to the end point, whether someone gets a disease. In between is all the other variation that’s influenced by a particular genetic variant. You can measure that in metabolites and proteins; you can do GWAS on glycosylation, on lipidomics, on micro-RNAs, on anything you want.

Alice: Yeah. And this grows as the techniques grow as well. Like as the technology improves for glycosylation or things like this, then it’s more applicable to associate with other omics, I guess.

Karsten: Right. And that is the value of metabolomics. Metabolomics is measuring the true end points of biological processes, although calling them the true end points is maybe a bit of an exaggeration. But the idea was that you go from the gene to the RNA to the protein and then to the metabolites, where things happen. Of course diabetes is glucose and gout is urate, so in the end it’s true. But there are so many other things as well.

Alice: Yeah, but this is the old paradigm as well. More and more we see interconnections between the different levels, and then it’s even difficult to see the levels anymore, because everything is everything now. Is it correct that in your papers you usually start from the genome and then add the other methods, or at least the metabolomics? Or do you do both versions? Is there any advantage to one or the other?

Karsten: No, it’s always together. Association study means one is the dependent and one is the independent variable. But which one you take as the dependent or the independent variable is just artificial. In the end, it’s correlation.

Alice: Yeah. But let’s say if you’re looking at cohorts with diabetes or without diabetes. Then you might use one dataset to focus on the differences between those two groups and then add on the second layer. That would change some of the results. Wouldn’t it?

Karsten: Yeah. But that’s a different kind of study. When you do metabolomics, you should distinguish what you’re studying. Now we started out with GWAS and GWAS from metabolomics. Because that’s a bit what I was focusing on. But in general, you don’t do metabolomics for genetics.
There are two kinds of studies. One kind is the population studies. That’s where we were working, with studies like KORA, Framingham, or UK Biobank, where you take everybody who is more or less normal. Then you collect as much phenotype and genotype information as you can, and then you can do all against all.
The other kind is clinical studies, e.g. “I want a cohort of diabetics to look at case versus control”, or “kidney rejection”, and things like that.

Alice: That makes sense.

Karsten: And in terms of data acquisition that also makes a huge difference. In the one case (population studies), you generate the data once and for all, in high throughput, for thousands of samples. In the other case, you spend less money because there are fewer samples, but you enrich them in cases and you have more detailed phenotyping. You generate much more detail on the patient metadata. And in an ideal world, like with UK Biobank, you have both. You have so much detail that you can do whatever you want. You just filter out the diabetics and non-diabetics from UK Biobank and do your case-control study, or similar kinds of things.

Alice: Okay. So let’s go back to the first mGWAS paper (by Gieger et al.). Could you tell us a bit about that paper and maybe your role also at the time? I’m interested in the people who do the interpretation and in that paper there is something really interesting that is the use of these ratios of metabolites rather than the association to the pure metabolites. Can you say, how you contributed to that paper or to that story and then tell us a bit especially about the ratios. I’m really interested in that.

Karsten: Yeah, I’m pretty fond of that paper, because it goes along with my move to Munich. I had a professorship in bioinformatics there, and it wasn’t really clear what I would be doing other than teaching bioinformatics. I went to the Helmholtz Center, at the time still called GSF. That’s an interesting place because they host the KORA population study. There are different institutes involved. I was at the Institute of Bioinformatics, but there was an Institute of Epidemiology where Christian Gieger was working.
And also his colleague, Thomas Illig was involved in that. And then there was a core facility where they were actually running metabolomics samples.
We had them measured by biocrates, because we were in the process of setting up the platform. I think the kit version wasn’t even officially on sale yet, but we knew it was coming. Jurek Adamski was setting up the platform, and with KORA we said: let’s generate some data, pay a fee for service, and measure 300 samples. And that’s what we did. We got a ‘huge’ dataset. I mean, 300 samples was huge at the time. It’s not huge today.

Alice: At the time it was huge. Wasn’t it?

Karsten: It was. Especially the kind of data. I was very skeptical. For me it was like: how can you measure a drop of blood in such detail, and how can that be precise? And then take another drop of blood and get 500,000 gene variants out of that, do a correlation between them, and find something meaningful. For me it was a total surprise that this works.

Alice: Makes you feel very small. Doesn’t it?

Karsten: Yeah. And suddenly there were a lot of things in the data, some of which were probably also not correct. Later on we found a lot of misannotations of metabolites, which is totally normal. You have to know that. We ran the metabolomics against all other phenotypes of the KORA cohort, not only the genes.

I still remember, we had a project meeting at some point where we had, I don’t know, 15 or 20 different phenotypes, and we split them up between all the postdocs and the people interested: one would be working on smoking, one on alcohol consumption, and one on diabetes.
All the big topics that were in KORA. We split them up, and I computed the p-values and shared them with everyone.

I think there are a lot of papers that came out of that time. You can look them up. For the first time we saw associations between phenotype and metabotype. Many of them made sense, but many of them were also complicated to analyze. I mean, we had papers on coffee consumption; we found some variant of nutrition style somewhere. Sometimes these were a bit of a long shot, but it was interesting to see. My central project was the GWAS together with Christian Gieger.

The interesting thing for that project was that we needed more computer resources. The compute center in Munich had the 10th fastest supercomputer in the world at that time, and they were very proud of that. But they needed users. Normally physicists went there and used all the time, but they needed non-physicist users. So at some point I got compute power from them and I wondered, “How can I use that? How can I spend hundreds of thousands of compute hours on this machine?”

And that’s when this idea with the ratios came up. We had already done the GWAS, we had so much compute time, and we already had the idea with the ratios. We had it before, when we looked at the original mouse data we got from biocrates. We just tried it out because we had the compute time and it needed to be burned. And amazingly it worked! It generated tons of data.

Alice: So then with the computer there, you computed all the possible ratios between the metabolites. Okay. And then you checked which ones associated well with the data.

Karsten: Yes. And then you look at what we call the P gain.
So whether the P value really gets significantly stronger. Because of course you can always do ratios, and if you do enough testing you will always get a little bit of an improvement in the P value. But what we were looking for, and found, was something like this: we had a p-value of 10 to the minus eight in the GWAS.

And then we went to 10 to the minus 22. And the very surprising thing was which metabolites were in the ratios. I tested all against all, and some people came and asked, “Why are you testing all against all? You should think before you do this.” I really loved that, because the biology actually tells you what’s right and what’s wrong. It’s not me coming up and saying, oh, I know that this and this metabolite are linked together. I really tested everything, including everything that didn’t make sense. And the ones that made sense actually stood out.
So whenever I looked for something with a P gain, also in subsequent papers, it really made sense in the end. Further on, we even had studies with unknown metabolites, where we didn’t know what one of them was. With the P gain we could say: “This is probably linked to the other metabolite in the ratio”.
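The P gain idea described in this answer can be sketched in a few lines of Python. This is only an illustration on simulated data, not the pipeline used for the paper. It assumes the P gain is the smaller of the two single-metabolite p-values divided by the p-value of the log ratio, and the function name `p_gain`, the simulated genotype, and all effect sizes are hypothetical.

```python
import numpy as np
from scipy import stats


def p_gain(genotype, met_a, met_b):
    """Return (p_a, p_b, p_ratio, gain) for one SNP and one metabolite pair.

    Assumed definition: gain = min(p_a, p_b) / p_ratio, where each p-value
    comes from a linear regression of the log-scaled trait on genotype.
    """
    p_a = stats.linregress(genotype, np.log(met_a)).pvalue
    p_b = stats.linregress(genotype, np.log(met_b)).pvalue
    p_ratio = stats.linregress(genotype, np.log(met_a / met_b)).pvalue
    return p_a, p_b, p_ratio, min(p_a, p_b) / p_ratio


# Simulated example (all numbers are made up for illustration).
rng = np.random.default_rng(0)
n = 300                                               # samples
genotype = rng.integers(0, 3, size=n).astype(float)   # 0, 1 or 2 allele copies
shared = rng.normal(0.0, 0.5, size=n)                 # variation common to both metabolites

# Each allele copy raises the substrate and lowers the product a little.
log_a = 1.0 + shared + 0.10 * genotype + rng.normal(0.0, 0.1, size=n)
log_b = 1.0 + shared - 0.10 * genotype + rng.normal(0.0, 0.1, size=n)

p_a, p_b, p_ratio, gain = p_gain(genotype, np.exp(log_a), np.exp(log_b))
print(f"p(A)={p_a:.2e}  p(B)={p_b:.2e}  p(A/B)={p_ratio:.2e}  gain={gain:.2e}")
```

In this simulation the shared variation cancels out in the ratio while the genetic effect adds up, so the ratio associates far more strongly than either metabolite alone. Running `p_gain` over every metabolite pair would correspond to the all-against-all scan mentioned above.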

Alice: So you were building alternative pathways based on this, with the few ratios that you did not expect. Did you look into those? And did you find the biology behind them?

Karsten: Yeah. There was the paper with Jan Krumsiek on mining the unknowns, which derived from this work. He was also one of the people involved at the time. We could reconstruct pathways from partial correlations between metabolites, and also from ratios, and reconstruct the underlying biology. I think one of the phrases in our Gieger paper was that if the function of FADS1 (one of our top hit genes) had not been known, we could actually have inferred it from the data alone. That’s something I found pretty pleasing, and something that very often replicated later in larger studies.

But it was already there in the first study. And I think that’s also why I like the first paper most, because you describe things for the first time. Then we had another paper in Nature later. The first one we submitted to Science, but they didn’t want it, so it went to PLoS Genetics. The second time was easier. I would have said (as a reviewer): “Wait, you already said that in your paper in PLoS Genetics”.

It went to Nature maybe also because the study was bigger, of course. After that, GWAS became a bit like generating more and more findings, which means there was not so much novelty anymore at the concept level. In terms of understanding biology, of course, it was contributing more and more.
And even now, if one day we had a GWAS on metabolomics in the UK Biobank, it would conceptually not be something new. But individually, the function of each gene, the overlap with diseases, all of that would be a good reason to do it.

Alice: And you think in the case of this type of studies you have really powerful statistics behind it, like holding the story together. But you find that it makes it more difficult to publish when you have new approaches to, like a new problem and new approaches to that problem. It’s not a good start. Even though that’s what everyone is trying to do. Isn’t it?

Karsten: Yeah. Sometimes it’s a bit too early for an idea to get right away into journals like Nature and Science. PLoS Genetics was good, and I’m pretty happy that the paper has almost as many citations as our following papers.

Alice: And the nice thing is that it’s available to everyone as well! You don’t need to have a license to read this which is nice.

Karsten: Yeah ok. If you want to put it like this – Yes!

Alice: Well, open access is growing, but Nature and Science are not famous for letting people read their papers freely. So for us, the audience, it is nice.
You mentioned, that before computing all the ratios, you had already had the idea for the ratios. How did that come? Was it because you were interested in, for example, for this one and then you looked at the metabolites that were related to it. And then you saw a pattern there or how did you find out the first ratios? The ones that the computer didn’t find?

Karsten: Honestly, I don’t even remember in detail. But it was on this mouse data that we had from biocrates. And it was Elisabeth Altmaier, who was doing a PhD with us at the time. I think our question was, “How can we make sense out of this data?” Let’s just play with it in every way a bioinformatician can come up with. Later on we had a paper where we thought about what ratios really are and what they mean.

But at the beginning it was just like poking in the dark and saying, oh, we could do ratios, let’s see what happens. And suddenly something came up, and then you say, oh, I can explain it. I think the rationalization came later. I should say that a ratio could be a measure of the throughput of a reaction, of the reaction rate.
That’s one thing. But it could also be a normalizing factor, like when you normalize by creatinine in urine. There can be additional variation in your data set, which Fabian Theis from the Systems Biology group called the French fries factor.

So if some people eat a lot of French fries and others eat little, then there’s a certain lipid in your blood, and if you normalize by that, you reduce the variance. And once you reduce the variance, the P values get better. That’s the thing. It’s the same with creatinine normalization. You have a signal in the urine, but if you have different dilutions of the urine, you lose the signal.
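The variance argument above can be illustrated with simulated urine data. The sketch below is a toy model under stated assumptions, not an analysis from any of the studies discussed. Each sample gets a random dilution factor, a hypothetical `creatinine` value tracks that dilution, and a hypothetical `marker` carries a modest case-control difference. All names and numbers are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
is_case = np.repeat([0, 1], n // 2)          # 100 controls, then 100 cases

# Assumed model: every concentration in a sample is scaled by its dilution.
dilution = rng.lognormal(mean=0.0, sigma=0.8, size=n)
creatinine = dilution * rng.lognormal(0.0, 0.1, size=n)   # unrelated to case status
marker = dilution * np.exp(0.3 * is_case) * rng.lognormal(0.0, 0.2, size=n)


def case_control_p(values):
    """Two-sample t-test on log-scaled values, cases versus controls."""
    logv = np.log(values)
    return stats.ttest_ind(logv[is_case == 1], logv[is_case == 0]).pvalue


p_raw = case_control_p(marker)
p_normalized = case_control_p(marker / creatinine)

print(f"log-variance raw:        {np.log(marker).var():.3f}")
print(f"log-variance normalized: {np.log(marker / creatinine).var():.3f}")
print(f"p raw: {p_raw:.2e}   p normalized: {p_normalized:.2e}")
```

Dividing by the normalizer removes the dilution term from the variance while leaving the case-control difference in place, which is why the normalized p-value comes out smaller in this toy setting.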

Alice: Absolutely. Did you start out with a very defined workflow of how you wanted to analyze the data? For example, for the Gieger paper, did you have a series of things you wanted to try out, or was it trial and error and a kind of iterative process where you go: this could be improved with this and this; maybe we could do it slightly differently because the results are not really what we expected? Is this something that happens a lot in your work, or do you manage to go in a straight line because you know exactly what’s going to happen?

Karsten: I would like to say it’s the latter, but that’s not true.

Alice: I don’t think that’s real. But I prefer to ask. You never know.

Karsten: The one thing which is really good with genetics is that you know that the genetic variant is causal. The genetic variant cannot be confounded. In other studies you can always be misled. You find an association, but these are just associations. Say you want to go for cancer: you do a case-control study and you find something. Later on you find out that the metabolite you’re looking at is derived from orange juice, and that cancer patients get orange juice at the clinic.

So that is not your marker for cancer. In genetics, there is almost no way that something would confound your genetics. So, if in genetics and GWAS I tune parameters to improve the association, like scaling or normalizing the data, then every time I repeat a fit and look at the P value, I have done another test. So if you’re honest with yourself, you should not go for 0.05.
You should have a list and make a tick mark. Every time you try something, you should make a tick mark and lower the P value you need to hit. Let us be realistic about that. But what I’m saying is this: suppose you know an association is true, like for instance FADS1, and later on you replicate it in another data set, and then you ask, oh, should I log-scale my data? Should I filter this way or that way?

If I tune the selection of parameters to optimize the already known association, without looking at everything else, and then use that as the adjustment for the rest of the associations, in my view that would be the correct way of doing it. Because then I wouldn’t be biased by the outcome of the test. The problem here is to be disciplined. You shouldn’t tune all the parameters until you find the most hits possible. In this respect you have to be honest with yourself.
That’s a dangerous thing, because it is very frustrating to write a paper and say, I’ve found something spectacular, and then it doesn’t replicate, and in the end it comes back and bites you.
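The tick-mark bookkeeping described here can be written down as a small helper. This is a hypothetical sketch: the class name `AnalysisLog` and the example p-values are invented, and it assumes that "lowering the P value you need to hit" is done with a Bonferroni-style division of 0.05 by the number of attempts, which is one common choice rather than something stated in the interview.

```python
class AnalysisLog:
    """Counts every analysis variant tried and tightens the threshold accordingly."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.attempts = []          # one entry per tick mark

    def record(self, description, p_value):
        """Make a tick mark: store what was tried and the p-value it gave."""
        self.attempts.append((description, p_value))

    @property
    def threshold(self):
        # Bonferroni-style: divide alpha by the number of tick marks so far.
        return self.alpha / max(len(self.attempts), 1)

    def significant(self):
        """Attempts that still pass once all tick marks are counted."""
        return [(d, p) for d, p in self.attempts if p < self.threshold]


# Made-up example: three ways of preprocessing the same association test.
log = AnalysisLog()
log.record("raw concentrations", 0.04)
log.record("log scaled", 0.03)
log.record("log scaled, outliers removed", 0.012)

print(log.threshold)      # 0.05 divided by 3 attempts
print(log.significant())  # only the last variant passes the corrected threshold
```

With a single attempt, 0.04 would have passed at 0.05. After three tick marks, only the result below 0.05 / 3 survives, which is the discipline the answer asks for.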

Alice: And this is something I also liked again in the Gieger paper. There was one part where – I don’t know if this was the true chronology or if it was just written nicely – the power of association was not good enough and the p-values were too high for this kind of studies. So you say, “this would have ended here. But we found the ratios”. It looks like, “maybe we would have been honest and we would have stopped, but, you know we found the ratios and suddenly it was super powerful”.
I liked this because it’s a nice picture of a kind of limitation that you have sometimes when you do the analysis and you find this sometimes in papers where people go, “We tried this and it didn’t work”, but there are all these experiments or all those tests that are done that we never hear about that are part of the work and that are part of the analysis and that don’t make it in the paper.
Do you have an idea of the time you spend trying things that lead nowhere, as opposed to the time it took to do the actual thing that ended up in the paper?

Karsten: Yeah. I would make a distinction here between genetic association studies and clinical association studies. The genetics is pretty straightforward. The way you do that is now pretty established. You can discuss how you scale your metabolites and things like that. You can maybe do more advanced things, Bayesian or whatever association methods. But that’s more routine now.

I think in metabolomics you have stronger signals. So in metabolomics GWAS we normally don’t go for the borderline-significant hits, although they could still carry information. There’s just no point in doing that, because the next GWAS with a larger sample size will catch these. For clinical studies, of course, it is quite different. Here you always start with an idea: I want to analyze my metabolome against this or that endpoint. And then you do that, and that’s okay.
But then you start asking yourself, what is confounding? And you may find an association and you realize there is a weird metabolite coming up or this doesn’t really make sense. I have an example where we did a diabetes case control study. And we took the controls from a dermatology department at the same time as the cases, which was a good idea.

We eliminated the batch effects of taking samples at different places. But then, in the end, some metabolites came up where we said, “Are these really markers of the controls?” They were not really controls. They were people at the dermatology department.

Alice: The ones who all got the same lotion from the dermatologist or something like that.

Karsten: Yeah, probably not, because it was very heterogeneous. There was one marker that was actually associated with melanoma or something. Although later on it replicated in other studies that were not set up like that. So maybe I was wrong, wrongly thinking it was wrong.

Alice: This is also the thing with naming of, especially of genes.
This can really be confusing. If genes are not studied much, there may be one or two papers that just led to the gene’s name, but the name is not based on much. And then you think everything is about the brain. Do you have this too?

Karsten: Every name is derived from brain or cancer. That’s a problem with the genes. With the metabolites it’s a little bit less so, although I think there are some cases. We jumped from one topic to another, but this also comes up in the interpretation of metabolite associations. I mean this kind of pathway analysis. That’s something I’m always a bit reluctant about, although it’s well liked. It is tricky because you have so many pathways, and people just say, oh, this is a metabolite of oxidative stress.

There is not one metabolite of oxidative stress. It could be an indicator of oxidative stress, but it could also be a lot of other things. And I think that is what maybe makes metabolomics harder to interpret than, for instance, proteomics.
Proteins also have multiple functions, but not so many. Some metabolites could be at the basis of everything and to say, this is a metabolite for xy is very hard to pinpoint. Especially when you go to amino acids or more sensible nucleotides or lipids.

And then you also have the problem that metabolites come from everywhere. So they could come out of the liver. They could come out of the kidney, from the fat system.
The metabolites are in the blood for a purpose, at least most of them. The organism puts them into the bloodstream to carry them elsewhere. But of course, you also have others that are there because a cell died and they leaked out, or because of other processes.

Alice: And they should be excreted later on.

Karsten: Exactly. Blood is a convoluted medium, and you have other media, right? We have done things in saliva, where we found a marker for diabetes that we also find in blood. So you don’t really know why it appears there.

There are the urine studies, and people are doing CSF studies now. There’s a lot of interest in this, but every time you have this problem of confounding with almost everything.

Alice: With metabolomics there’s this extra layer of what we eat. That’s especially true if you look in the blood, but also in the other matrices. As you said, every organ contributes to the whole pool, but then there’s also what we eat, and if the same person eats differently, you might get different things. Which makes it extremely interesting but also very complex.
Do you ever use microbiome data or diet related data in your work with metabolomics?

Karsten: With microbiome, personally less, because we just don’t have the studies for it.

People are starting to do that. Especially in the twin study, they already have separate papers with microbiome. So microbiome, I think I could go on forever about as well. I think there are a lot of catchy things there as well. Concerning nutrition: yes, there’s a study we did with the same team in Munich years back. It’s a human study where we had 15 healthy male volunteers.
They were closed into the technical university study center for four days; they fasted for 36 hours and they got controlled meals and everything. Gabi Kastenmüller’s group (The Metabolomist Gabi Kastenmüller) is just preparing the web server that now has all this data integrated.
There’s also Biocrates kit data, the Metabolon data, data from urine, and from blood of course. It also informs you about which metabolites are stable over the course of the day (where you wouldn’t have to bother about fasting) and others where you probably would have to bother about fasting.
So there’s a lot of confounding there as well. For instance, a diabetic person is more likely not to be fasting than a non-diabetic person, because a non-diabetic person could go without food. A diabetic person would probably be careful to balance their nutritional intake, and then you would already have a confounding factor.

Alice: Yeah, absolutely. But this principle of making databases of metabolomes or other ‘Omes’, is a really interesting thing as you’re working a lot on this as well to combine data about specific diseases or about specific topics to make it available to the community.

Karsten: I didn’t mention that in the introduction. We are working in a virtual group together with Gabi Kastenmüller’s group in Munich, and she is doing all these web server things. They have an Atlas of Alzheimer’s that they presented, where they connect all of that. And there is the group of Jan Krumsiek in New York; they are very much into the systems biology of it, like the Gaussian graphical modeling and the networks, and how we can put P values not only on a single association but on a pathway or on a part of the network (The Metabolomist Jan Krumsiek).

Alice: Would you like to discuss some specific tools that are out there, especially ones that help make sense of the metabolomics data or integrate them with other omics?

Karsten: I think in the end, it all comes down to people who are working in R and RStudio. And there are so many packages out there. I think the new generation is also doing Python, which probably comes to the same thing. I’m not a Python person. I always stick with R. But I think both are pretty close to each other.
It is certainly a good thing to look at what other people publish, because there are so many new tools out there. A lot of it is about how you visualize your data and how you present it nicely.
So going away from these hairy balls that come out, with a network here and a network there. You just see it and say, okay, what do I make out of it?
I’m always thinking about my medical collaborator. Would he really pull something out of what I’m putting into the paper? Or is it just to show that I have big data and I can manage it and nobody else can? I think there’s a big gap between what the bioinformaticians and the systems biologists can do and what the clinicians (who actually are our clients in many cases) really do with it.

They say they want to do a metabolomics study, and we run the whole thing and they have their data; you give it back to them and then they’re frustrated. They say, “So, what next?” “PC AA 36.4 associates with smoking. What does it mean?” “Does it have implications?” And many people walk away pretty frustrated, to be honest. That’s something we still have to really work on, to nail things down: what is really a takeaway message, more than just saying this metabolite goes with that and that metabolite goes with this.

Alice: This is exactly the purpose of this project and also the purpose of this podcast: to give people clues. Think of the clinician or the research scientist who is not necessarily an expert in metabolomics and wants to understand these results. So, what are ways to help people jump that gap?

Karsten: Let me just list a few things. That’s more like throwing buzzwords out there. I think one keyword is certainly Mendelian randomization, which is using GWAS data to build causal relationships or confirm causal relationships.

So, if you have a lot of GWAS data, big GWAS data with much power, then you could in a way show that a certain metabolite is on a causative pathway to the disease. In terms of causality, that would mean that if I tinker with that metabolite, I would change the outcome. And that’s what you want, right?
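
To make the idea concrete, here is a minimal sketch of the simplest Mendelian randomization estimator, the Wald ratio, using a single genetic variant as the instrument. This is not code from the episode: the function name and all numbers are hypothetical, and the standard error uses a simple first-order approximation that ignores uncertainty in the variant-metabolite effect.

```python
import math

def wald_ratio(beta_snp_metabolite, beta_snp_disease, se_snp_disease):
    """Wald ratio: causal effect of a metabolite on a disease from one variant.

    beta_snp_metabolite: effect of the variant on the metabolite (from an mGWAS)
    beta_snp_disease:    effect of the same variant on the disease (from a GWAS)
    se_snp_disease:      standard error of beta_snp_disease
    """
    estimate = beta_snp_disease / beta_snp_metabolite
    # First-order approximation: treat the variant-metabolite effect as exact.
    se = abs(se_snp_disease / beta_snp_metabolite)
    z = estimate / se
    # Two-sided P value under a normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, p

# Hypothetical summary statistics, for illustration only.
estimate, se, p_value = wald_ratio(
    beta_snp_metabolite=0.50,
    beta_snp_disease=0.10,
    se_snp_disease=0.02,
)
print(f"causal estimate = {estimate:.2f} (SE {se:.2f}), P = {p_value:.1e}")
```

The estimate is only meaningful if the variant is a valid instrument, meaning it affects the disease only through the metabolite. In practice, many variants are combined and the assumptions are tested with dedicated methods.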

Alice: When you work with this kind of association, do you build on a better annotation of the genome to try to understand where the metabolites fall?

Karsten: Imagine I have an association with a disease like diabetes and an association with a metabolite, and I try to figure out: is the metabolite something that causes the disease? So, if I bring down a certain value, would that improve the disease? Or is it a consequence? Then it might be a good biomarker, but it doesn’t make sense to target it and try to change that value.

So that is, for instance, something that I think is big at the moment, and it really requires even larger GWASs on metabolomics. I think that alone justifies the effort.
Second, there are of course the ratios, which I still support strongly.
Third, Gaussian graphical models (GGM), also called partial correlation networks.
This is really interesting because if you just do correlation networks, you end up with a hairy ball of everything connecting to everything. So that’s something I would look at. You can also predict variation in metabolites in people based on their genome (only the genetic part, of course). So, we could do that and link that to the disease.
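
The difference between a plain correlation network and a partial correlation network can be sketched with simulated data. The example below is an illustration under stated assumptions, not the method used in any of the discussed papers: three hypothetical metabolites form a chain A → B → C, and partial correlations are computed by inverting the correlation matrix, which is the basic idea behind a Gaussian graphical model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical pathway: A is converted to B, and B is converted to C.
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)
c = b + rng.normal(scale=0.5, size=n)
data = np.column_stack([a, b, c])

# Plain correlations: every pair is strongly correlated (the "hairy ball").
corr = np.corrcoef(data, rowvar=False)

# Partial correlations from the inverse correlation (precision) matrix:
# rho_ij = -P_ij / sqrt(P_ii * P_jj)
precision = np.linalg.inv(corr)
d = np.sqrt(np.diag(precision))
partial = -precision / np.outer(d, d)
np.fill_diagonal(partial, 1.0)

labels = ["A", "B", "C"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{labels[i]}-{labels[j]}: correlation = {corr[i, j]:.2f}, "
              f"partial correlation = {partial[i, j]:.2f}")
```

In this simulation A and C are strongly correlated, but their partial correlation is close to zero once B is accounted for, so only the direct links A-B and B-C would remain as edges in the network.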

What I haven’t mentioned yet: in the meantime we have also done EWAS (epigenome-wide association studies). And that’s something which in a way complements GWAS. Because GWAS is genetic, that’s from birth. You’re set to be a fast or slow metabolizer for this, at this gene. EWAS, as I see it, is like this: there are certain parts in the genome that get methylated, and those are switches to switch genes on or off.

And in some cases, they reflect what your body does.
So, there’s a very strong association of TXNIP with diabetes. And I think it’s a read out of how much of TXNIP the body actually needs to cope with the diabetes. It is not necessarily creating the diabetes. This could be the reverse, it could be showing you what the body actually does at the moment to cope with the disease. And that’s the tricky thing.

Is it a sign of disease or is it a sign of the body coping with the disease? But it can give you ideas on where you would go and try to treat the body. And then also a way of maybe early markers. Because if your body is adjusting and fighting diabetes (but you’re not diabetic yet), you might already have your transcription (epigenetic) profile changed.

Alice: You are already being challenged. Yeah.

Karsten: We did an EWAS paper later on in 2016 with Ann-Kristin Petersen, where we did an EWAS on metabolomics with the same data as before. We found 20 different genes where the CpG was associated with metabolites, independent of genetics. (There were others where the genetics was confounding.)

A large part of them were associated with smoking. I think smoking is a very good signal there. And then we had a few isolated ones, where initially I didn’t know what they had in common. I think the message for people analyzing data is that we were always looking at the top hit, and that was an error. Because it’s not only the strongest association. It is all the associated metabolites, what we call the metabotype, that you should look at.

And when we looked at it again later, we did an EWAS with BMI and diabetes in a Qatar cohort. And it turned out there were actually three of these genes (all with the same metabotype and marker of hypoglycemia). And they had been reported in the paper before to be the metabotype of diabetes.
All three genes had diabetes-associated metabotypes. And only later people discovered that these genes were actually associated with diabetes and obesity. Our EWAS actually had the information already. We just didn’t see it.

Alice: No.

Karsten: It is also a good thing, because there’s a lot of information in your data set. You could be looking for it. I think, especially for people working on metabolomics, this should be very motivating, because there’s a lot of stuff hidden in the data that is waiting to be dug out.

Alice: …and that you can’t really see yet if you don’t work with associations. But you work just based on the previous knowledge of biology. You just can’t right now because it hasn’t been discovered yet.

Karsten: There’s another thing which I always like to do. When you have data, normally you collect covariates such as age, gender etc. Always look at associations that you know of and use them as a positive control. This way you can be sure that your data contains the information you are looking for.
For example, from a previous study I know the strength of an association with age. If suddenly my association is weaker or stronger, I know that something is going on. I can also use this effect in reverse if I know something does not associate. If you do an association with a random number, it should always come out negative.
The same goes if you do an association with the day of data collection or other technical covariates. It’s very important to not just throw them in as covariates but to look at them and to make sense out of them.
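
The positive and negative control idea can be sketched on simulated data. Everything here is hypothetical: the metabolite is constructed to depend on age, the helper `correlation_test` is an illustrative function, and its P value uses a normal approximation that is reasonable only for large sample sizes.

```python
import math
import numpy as np

def correlation_test(x, y):
    """Pearson correlation with an approximate two-sided P value."""
    r = float(np.corrcoef(x, y)[0, 1])
    n = len(x)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2))  # normal approximation for large n
    return r, p

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(30, 70, size=n)

# Hypothetical metabolite with a known age dependence (the positive control).
metabolite = 0.05 * age + rng.normal(size=n)

# A random number per sample (the negative control).
random_number = rng.normal(size=n)

r_age, p_age = correlation_test(age, metabolite)
r_rand, p_rand = correlation_test(random_number, metabolite)

print(f"positive control (age):    r = {r_age:.2f}, P = {p_age:.1e}")
print(f"negative control (random): r = {r_rand:.2f}, P = {p_rand:.2f}")
```

If the age association comes out much weaker than in earlier studies, or if the random number, the collection day, or another technical covariate shows a strong association, that is a reason to inspect the data before interpreting anything else.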

Alice: Yeah, and this is a very strong point for annotating the data with as much information as possible and not removing the metadata that is given to us. Maybe it does matter if it was a Tuesday!

Karsten: There are interesting examples. We did a multi-center study on kidney rejection. And there were center-specific metabolites. They were only measured in one of the clinics. And when I looked at the thing it was totally perfect because the different clinics did the initiation of the immunosuppression with different drugs.

So the metabolomes actually (only) measured the different drugs that the clinics use. I could have reverse engineered how different clinics initiated the immunosuppression. In another study, we saw a metabolite appear (or disappear) at some point in the order of sample collection.

And when we looked at that, a clinician told us that they had just changed the urine tube or the labels on the blood tubes. It doesn’t mean that your data is bad, but metabolomics is very sensitive. For the urine tube case it turned out that they had some kind of preservative in there. And probably the other tube had a different preservative. You just have to be sure.

Alice: You need to know that.

Karsten: And you mustn’t use the one tube on the cases and the other tube on the controls, but you have to know about these changes.

Alice: As I mentioned before, I try to make a point for the place of creativity in interpretation of data. Do you see a place for creativity in your work? Do you think it’s something important for a scientist in general and for the type of work that you do in particular? What’s your view on this?

Karsten: I think creativity, especially creativity in visualizing data, is important. Because the most important thing is that you have to see things. I mean, you may have very strong P values, but once you look at the data the association is just driven by three data points.
You can check that with a simple plot. There’s a new generation of informaticians who really take data presentation tools from totally different fields and bring them over into science, like these interactive Java visualization tools and things like that. And that’s another thing: the second point here is to have your data interactive, in a way that non-bioinformatician users can play around with the data easily.
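
How three data points can drive an association is easy to reproduce on simulated data. In this sketch, which uses made-up numbers, 100 samples have no relationship between x and y, and three extreme samples are added; the correlation is then computed with and without them. A scatter plot of the same data would show the pattern at a glance.

```python
import numpy as np

rng = np.random.default_rng(4)

# 100 samples with no real relationship between x and y.
x_bulk = rng.normal(size=100)
y_bulk = rng.normal(size=100)

# Three extreme samples (hypothetical outliers) far away from the bulk.
x_out = np.array([8.0, 9.0, 10.0])
y_out = np.array([8.0, 9.0, 10.0])

x_all = np.concatenate([x_bulk, x_out])
y_all = np.concatenate([y_bulk, y_out])

r_all = float(np.corrcoef(x_all, y_all)[0, 1])
r_clean = float(np.corrcoef(x_bulk, y_bulk)[0, 1])

print(f"all 103 samples:        r = {r_all:.2f}")
print(f"without the 3 outliers: r = {r_clean:.2f}")
```

With 103 samples, a correlation of the size seen in the full data set would give a very small P value, even though nothing is going on in the bulk of the samples.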

Alice: That can also help to communicate your results. If you have very dynamic study structures, such as different time points or different stages of a disease, that can really help as well. Do you spend a lot of time playing around with visualization once you have the data, or maybe right at the beginning?

Karsten: I personally don’t find so much time for that anymore. But that’s really where the PhD students and the postdocs come in.

I think that is also where they can make their mark: look, here is really something that speaks to you and is convincing. It shouldn’t be a black box what you’ve done before. But in the end, the idea is how you synthesize a message out of your data. Of course you can call that creativity if you want. It depends on how strict you are. Some statisticians would hate that word.

Alice: Yeah, but as you said in the visualization you can be very creative. For me it is creativity to go from a black and white table with little stars to a graph that maybe has only one or two colors. But is putting the data in the light where you can actually see the differences where you don’t have to compute it in your head to see it.

Karsten: You should create a hypothesis; it should teach you something. Then ideally you would either carry on with another experiment to prove that the hypothesis you generated is true or, depending on what the study is, at least do a replication, especially in GWAS. I can be as creative as I want in the discovery, but what I come up with has to replicate in an independent study.

Alice: I have this in my own experience of analyzing data or looking at datasets, where I did one analysis and then a different analysis, and then I looked at it from different angles, and sometimes I would forget to stop. I consider this a symptom of perfectionism, in a way.
You want to get to this beautiful aha moment where it feels like you’ve finally explained biology. But it is unlikely that you’ll ever get there, especially with one study. Are you familiar with this? Can you maybe comment on this?

Karsten: Yeah, I think that’s a very risky thing. Especially if you don’t have a fixed position. It could be that in the end you would never publish the thing and just get out of research. Because in the end it’s publish or perish in a way. I think it is a thought that I don’t see that dire but the thing is you need to be there and tell a story to someone. You don’t do it to file it away and when you retire you just say, oh, I have everything in my drawer but I never told anybody about that.

Alice: And do you think that statistics and significance alone can tell the story or you need something more?

Karsten: I think the most important is the statistics. A P value always asks: what’s the likelihood that you see the signal that you’re seeing by chance? And if whatever you have in front of you could just be in front of you by chance, it’s not worth considering further. So, in the end, everything you do should in one way or the other be supported by statistics.

Alice: Yes.

Karsten: Something I like to do very often is to just randomize the identifiers of my data. Because my biggest fear is that someone drops the box with the tubes and puts them back in the wrong order and I don’t know about it. That’s why I test for age or gender associations, to make sure that there’s not something generally wrong with my data.
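
Randomizing the identifiers amounts to a permutation test. The sketch below uses simulated data with a hypothetical sex-dependent metabolite: the metabolite values are shuffled relative to the recorded sex many times, and the observed difference is compared with the differences seen after shuffling. The variable names and effect size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
sex = rng.integers(0, 2, size=n)              # recorded sex, coded 0/1
metabolite = 1.0 * sex + rng.normal(size=n)   # hypothetical sex-dependent metabolite

def mean_difference(values, groups):
    """Difference in mean metabolite level between the two groups."""
    return values[groups == 1].mean() - values[groups == 0].mean()

observed = mean_difference(metabolite, sex)

# Shuffle the metabolite values relative to the sample identifiers.
n_perm = 2000
null = np.array([
    mean_difference(rng.permutation(metabolite), sex)
    for _ in range(n_perm)
])

# Empirical two-sided P value (with the usual +1 correction).
p_perm = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_perm)

print(f"observed difference = {observed:.2f}, permutation P = {p_perm:.4f}")
```

If a known association such as this one looks no different from the shuffled data, the link between samples and identifiers may have been broken somewhere, for example by tubes being put back in the wrong order.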

Alice: That’s a good point. I didn’t know about this. Of course, I’m used to looking for outliers in groups and things like this. And we’ve had cases like this when I was working in academia where we just didn’t have so many replicates for a signature that were clear enough. But at least we had clear signatures. When you don’t have this, it can be really risky.

Karsten: Yeah. And it’s important to have this kind of marker, especially of sample integrity. It happens very quickly, especially if you work as a bioinformatician and never see any sample tube. I make a point of seeing the tubes. Normally, you just tell someone to ship the samples and you don’t even see them. We had a case in the past with genomics (not metabolomics).

We gave away a box of samples and they turned it by 90 degrees and took the tubes in that direction. The error was spotted easily and we did know what happened. And then when we numerically turned the thing back by 90 degrees, suddenly everything matched. We checked the sex match at that point: it was totally unmatched before, and after we turned them they matched perfectly.
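
The rotated-box check can be reproduced numerically. This is a sketch with simulated data, assuming a square 9 x 9 box and a sex call per tube position; it tries all four orientations and reports how well the measured sex matches the sample sheet in each.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sex recorded on the sample sheet, by position in a hypothetical 9 x 9 box.
recorded_sex = rng.integers(0, 2, size=(9, 9))

# Simulate the mistake: the box was processed rotated by one quarter turn,
# so the sex inferred from the measurements arrives in rotated order.
measured_sex = np.rot90(recorded_sex, k=1)

def match_rate(k):
    """Fraction of positions where sex matches after k further quarter turns."""
    return float(np.mean(np.rot90(measured_sex, k=k) == recorded_sex))

rates = {k: match_rate(k) for k in range(4)}
best_k = max(rates, key=rates.get)

for k, rate in rates.items():
    print(f"{k} quarter turn(s): {rate:.0%} of positions match")
print(f"best orientation: {best_k} quarter turn(s)")
```

Three further quarter turns in the same direction complete a full turn and restore the original layout, so that orientation matches perfectly, while the unrotated data match only at about the chance level.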

Alice: Yeah that helps.

Karsten: And I think this kind of thing you would like to have for metabolomics, as well.
Of course, you would like to have replicates somewhere. Although if you go for this fee-for-service thing, it’s hard to just send the same tubes three times. Initially, we sometimes did that. We even sneaked in a few samples without telling the people that there were duplicates in there.
It’s not really that conclusive, because especially if you sneak them in, it’s hard to bring it up later. That’s the thing: if you have duplicate measures, it has an influence on your CV, the variance, technical things, everything like that.
When you work with core facilities and fee-for-service providers, you trust them to a certain point.

And once you create trust with them, you prefer working with their data rather than with others. Because then you come across this other paper that says they used mass spec metabolomics, but it comes from a platform I’ve never heard about, and I don’t know what to think about it. I would like to see all these standard things done first. Does the platform have the replicates and all the standard associations, so I can trust the data?

Alice: Yeah, it’s similar to comparing methods for other types of scientific research as well. For the omics, we always kind of expect any omic to be comparable to each other. And sometimes it is like this even though, of course, you have points where you can compare and you have the quality controls that it’s probably more robust than other kinds of biological research.

Karsten: And that’s also why we should not be too shy about replicating things. Everybody wants to be the first at something, and for the second paper the editor already says it’s not interesting. But I find that most of the time the second paper is the one that really confirms it. It’s safe to say. And you discuss what is really replicated, and not what the first paper just stretched to the end of how far you could go in your interpretation.

Also, there is more data coming out and the community is growing.
It’s a good time now to start in the field. And of course, in getting more standardization there’s also a gap in the community. It’s not really a gap, but I mean there are the people who are really the mass spec specialists, and then you have the people who analyze the data. The mass spec specialist is often not too concerned about what people do with the data later. They want to be as precise as possible. Having a 10% coefficient of variation is not bad for some things, but for a biochemist it might already be horrendous.

Alice: Yep. So very relative. Is there anything you would like to add on the topic, maybe a message you want to get out to the public?

Karsten: I end with my favorite quote, which I always put on my email: If you torture the data long enough, it will confess to anything. I always put it there to remind myself. You can find everything if you just look at the data long enough.

In the end, we are always responsible to the people who read our papers and who are not metabolomics people, who are critical of what we do. While we have to be creative in our methods, we have to be critical as well about what we claim out there.

Alice: It was delightful to talk with you. Thank you very much for your time.

Karsten: I wish you good luck with your book and everything.