Minimal genetic knowledge and a good Internet connection: this is all we need in order to hack into anonymous participants’ data in genetic research, according to Yaniv Erlich. The then Ph.D. student’s stunning proof came to the world’s attention in 2013. Although he is a renowned scientist and university professor today, Erlich became world-famous under the nickname the “genome hacker”.
In 2013 another one of Erlich’s results merited attention: he worked out the DNA-sequencing method known as “DNA-Sudoku”, with which tens of thousands of specimens could be examined together. A mutation could then be traced back with 97% accuracy to a single specimen. Among other things, this method dramatically increases the efficiency of identifying predisposition to hereditary diseases.
As a fellow of the New York Genome Center and Columbia University, Yaniv Erlich has participated in the worldwide DNA.Land project started last fall. The project’s aim is to collect and analyze the genetic data of several million people in a scientific manner. The database has nearly 20,000 gene maps. The maps’ owners had them prepared earlier by different genetic laboratories for their own purposes (such as genealogy or in connection with medical examinations).
Yaniv Erlich claims several dozen important scientific publications and two patents, and he is a regular presenter at the most renowned science conferences. The broader public can read up on his results in a number of newspaper articles, as the world’s most renowned professional and popular science periodicals and publications regularly feature his research.
The renowned geneticist lectured in Budapest by invitation of PAGEO, where he gave an exclusive interview to HUG magazine.
How would you summarize the basics of your research?
My research team and I attempt to develop algorithms and tools with which we can get to know the genetic basis of complex characteristics. Everyone knows that several characteristics, such as height, weight gain, susceptibility to cancer, and even political orientation, are influenced by genes and genetic characteristics. The question is how we can develop tools that make the study of these characteristics a lot faster.
Already in the early stages of our research we faced the fact that we have a tremendous amount of data at our disposal to get to know the genetic background. Thirty years ago, we thought that a particular gene is responsible for the development of schizophrenia, or for political orientation. Today we know that this is not the case. It seems rather that several different genes are together responsible for this or that characteristic. Thus, in order to be able to precisely determine interactions we need a tremendous amount of data. This is the background of the work which we take as our point of departure. My research team develops such methods with which, using data from community websites, we can map genetic characteristics much more thoroughly. We recognized already at the outset that if we have no access to a large amount of data then we will not get adequate research results. However, people often insist on the protection of their data, including their genetic data. We must overcome these obstacles with common solutions somehow. We invest a great deal of energy in this area.
You have two large projects, FamiLinx and DNA.Land running at this time. What are your most important results and what is their relevance?
Yes, these are the two largest projects, but there are several others in addition. These two overlap on some level. As I said, for our research we need data from a lot of people. For genetic investigations, we need to set up giant family trees. Do you know your second cousins? In all likelihood, you don’t know them all. Had I asked you about your third cousins, you could say even less. Therefore, in our project FamiLinx we applied an entirely different approach to draw large family trees. Instead of asking people or trying to bring them together via a kind of top-down approach, we started out in a different direction. We turned to the webpage Geni.com. This is a community website, one which expressly serves sharing genealogical data. People can enter their family trees on the site. Let’s assume both you and I upload our trees. If we have a common relative, the website sends a message: “Hello, you two are related! Perhaps we could integrate the two family trees.” In this way, together, people create one giant family tree. Thanks to the people running the website Geni.com, we could download all publicly available data in one piece.
Today, following detailed analysis of the data that took several years, through assigning demographic characteristics we understand several processes in the context of families. I could say that we have created a stratum with eighty million individuals, and we searched for connections between their characteristics. We can tell, for instance in the case of a husband and wife, what the distance is between their places of birth, and how this has changed over time (tracing back over generations). The historical variation of this distance shows how isolated or how mixed this population is, and this has an effect on their genetic characteristics.
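The spousal birthplace distance described above can be illustrated with a short calculation. The following is only a minimal sketch, not the FamiLinx team’s actual method: it assumes each couple is recorded as a year plus two (latitude, longitude) birthplaces, uses the standard haversine great-circle formula, and averages the distances per decade. The function names and the record layout are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_distance_by_decade(couples):
    """Average spousal birthplace distance per decade.

    couples: iterable of (year, (lat, lon), (lat, lon)) records
    (a hypothetical layout chosen for this sketch).
    """
    totals = {}
    for year, place_a, place_b in couples:
        decade = (year // 10) * 10
        dist_sum, count = totals.get(decade, (0.0, 0))
        totals[decade] = (dist_sum + haversine_km(*place_a, *place_b), count + 1)
    return {decade: s / n for decade, (s, n) in sorted(totals.items())}
```

Plotting the resulting averages against the decade would show how the typical distance between spouses’ birthplaces changes over time, which is the kind of trend the answer above refers to.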
IN TEN YEARS TO THE WORLD’S ELITE
Yaniv Erlich received his university degree in 2006, at the Department of Computational Neuroscience at Tel Aviv University, where he studied biology and psychology. He defended his Ph.D. thesis in 2010 at Cold Spring Harbor Laboratory, New York State. In his dissertation, he investigated how relatively cheap computing analysis methods can be used to analyze a great number of genetic specimens, so that rare genetic mutations can be identified. From 2010 to 2014 he held a fellowship at one of the world’s most renowned biotechnology research institutes, the Whitehead Institute. The title of his research project was “Harnessing Web 2.0 technologies in the field of statistical genetics”. He is a member of the forum Genomic Pioneers Gateway. The 36-year-old Israeli-born computational genetics researcher has been a Core Member at the New York Genome Center since January 2015, and he is an adjunct professor at the Department of Computer Science at Columbia University. Under the aegis of these two institutions he leads his own research team, named Erlich Lab.
THE FAMILY TREE OF 43 MILLION PEOPLE
The Erlich research team created a global scientific database under the name FamiLinx, which represents the common family tree of about 13 million people with roots reaching back all the way into the fifteenth century. This unique system contains genealogical, demographic and phenotype data on the basis of voluntarily uploaded information reaching back 500 years. FamiLinx received its data from Geni.com, which contains nearly 43 million profiles uploaded by ten million users: these comprise family tree information, photos, and family documents. With the permission of the firm MyHeritage, Geni.com’s owner, Erlich’s team sorted through, systematized and made the enormous data compilation publicly accessible and searchable.
However, this project also yielded such entertaining results as how far you would need to travel on average in order to encounter a living relative of yours. This is about what we do in the framework of the FamiLinx project. We attempt to create a very broad stratum of data related to genealogy. We already have at our disposal the “family tree stratum”, to which we can assign DNA information. On the website of DNA.Land people can upload their genetic map, their genetic data. Today about two million people have access to their genetic map (this is the number of people who had their genetic maps prepared). These genetic maps can be voluntarily uploaded to our website as a contribution to further scientific research.
On DNA.Land we treat two strata together: the family tree stratum and the genetic map stratum. Thus, those who upload their genetic maps can also acquire family tree data. The next step is adding a “health stratum”, or getting to know the health-related consequences. The challenge for us lies in the realization. How can we get people to share their health-related data? One possibility is preparing questionnaires. I provide a list of various diseases, and you merely need to circle what you have got. However, no one is interested in filling out surveys. After 10-20 questions you get tired, and the whole thing gets boring. Even if I were to put the world’s best questionnaire together, if someone receives an email with it, all they are going to think is this: not another survey to fill out! So, this was not an option, and we needed to come up with a much more effective way of connecting the data. We decided that people would be given an option to “offer up” their already available data on community sites for research purposes.
“Today about two million people have access to their digital genetic maps.”
Facebook stores a great deal more information about us than we might think. The kinds of text and images that we upload, the kinds of pages that we view and like, the amount of time we spend on community sites and at what intervals: all these interactions yield a kind of pattern that is suitable for research purposes. Studies prove that Facebook data and personality traits can be correlated, and with the help of Facebook data we can obtain results that are similarly reliable to those of an ordinary psychological test. Therefore, we are very interested in data generated by community websites.
What is the future of medical treatment in general? Certainly not that we need to go to doctors ever more often. In the future, an intermediary entity will appear, a kind of automated entity, one which browses through my email, my Facebook interactions and Google searches every day, and lets me know that “Yaniv, today you are not behaving as usual. From what I have experienced in the last few days I conclude that you may have contracted a cold…” On the basis of this, the entity proceeds to advise me, as if it were my mother: “Put on a warm sweater, drink a hot tea!” and so forth. In this project, we are mostly interested in such things.
Would you kindly share something from the newest results of your projects?
I’ll show you something that we just published a few days ago. It’s not connected to these two projects in particular, but it is quite hot off the press.
Today, mapping a DNA sequence and analyzing a DNA specimen take several days. In order for a DNA sample to be examined at all, it needs to be transported to the nearest laboratory, which might actually be quite far away. Now a British company (but they also have an office in the New York Genome Center, as it happens, located on the same hallway as mine) has developed a tool by the name Oxford Nanopore. This is, in effect, a portable DNA sequencing device. You take it in your hand, and connect it via a USB cable to your laptop. Then you can start to sequence DNA on your laptop in quite a simple way. Thus, instead of sending a specimen to the laboratory, you merely need to take it to the sequencer; all further operations can be carried out at home on your own. Obviously, we are very interested in this device.
We believe that it provides those students who are interested in genetic technology with a wonderful opportunity. You give the device to them and they can take it into their hands, so they are not dealing with some sort of abstract theoretical knowledge of what it means to sequence DNA.
I had my students at Columbia University try to isolate DNA from the lunch of a Ph.D. student. If he consumed beef, we looked at whether the food truly contained beef DNA, or whether it was contaminated with something else. I gave the specimen to the students, and in the course of sequencing they had to determine what sort of food the DNA originated in. (Beef with tomato.) Next time we are going to test a goulash soup. It is a wonderful thing that we have a device with which we can conduct real experiments, even in the classroom. Naturally the device is still a beta version, so there are functions that don’t quite work yet. It is, however, important that the students familiarize themselves with the instrument already in the testing phase. After all, we are not old, either, yet still we grew up with computers that ran DOS. We got used to turning the computer on, then we prepared breakfast while the system was booting. These kids – I love them dearly, but they grew up in a world where everything is ready and perfect. Let’s just take a smartphone as an example. If they touch the screen, and let’s say, nothing happens for half a second, they already complain that “my phone’s sooo slow!”, don’t they?
We work at the university, at the Computer Science department, we train the engineers of the future, and the trouble is when they learn some theory they already want to hold the device in their hands. They did not grow up in a world where something was not yet ready.
This device may frustrate them a little, but it is important that they learn that not everything is perfect. Engineers need to see things in that state, not just in the state that their smartphones are in. We published our work with this device only a few days ago, and it is our newest result.
What kinds of dangers, opportunities and ethical dilemmas does the collection, storage and research of genetic data create?
DNA data protection depends a great deal on the context, and it can vary greatly from individual to individual. The health consequences of my genetic map are, in all likelihood, quite boring. There is absolutely nothing of interest in there. I have a bit of asthma, and thanks to my ancestors I probably have greater proclivity to diabetes.
“Thanks to Facebook data we may gain similarly reliable results as with ordinary psychological tests.”
Nothing out of the ordinary. So, I am quite open about this; there is really nothing to hide there. I will not be harmed if the data leaks out. Quite the contrary: I would like everyone to know that I have asthma, so that I can be treated appropriately. However, there are people battling particular diseases who would feel stigmatized, and who could potentially be disadvantaged, if the whole world knew about their disease. In such cases, the loss of privacy is a serious issue. Another such issue is that many people would like to conceal their ancestry. Perhaps this is also relevant for Hungary. After all, in the course of its history several such instances may have occurred. For example, there were Jews who would have liked to keep their identity secret following World War II. However, it would be unambiguously revealed by their DNA. So, the whole question depends a great deal on what sort of society an individual happens to live in.
Therefore, it would be very important to reshape social views so that diversity, multiraciality, and bodily imperfections become acceptable. For if we live in an open society where these things are normal, then it will not be an issue. Quite the contrary: it may push individuals to get to know the genetic causes behind their disease, or to want to know more about their ancestors. So much for the risks.
“… the future of medicine as a whole? We would certainly not have more doctor’s appointments.”
Concerning the opportunities, this is of vital importance. After all, when people share their data, their histories and their “heritage” will also become accessible, and we may get to know various diseases better, and we could get much better medical treatments for these diseases. This is the basis of medical research, and genetic researchers may assist in this. I believe this needs to be considered when we take stock of the risks and opportunities of data protection.
Can technology keep up with the rapid pace dictated by this enormous quantity of data? Can researchers process and handle this data, or should we expect that too much data will come to hinder detailed and deep analysis? Is there a bottleneck in processing the research data?
Indeed, we are speaking of an enormous quantity of data, and working with such volumes is a very new field. In truth, training is an issue, too, since very few people are trained to handle data at this scale. One of the challenges is how we can move this mass of data. Let’s say I would like to share one of my studies with you, along with the terabyte or even 100 terabytes of data supporting my analysis. Downloading it through the Internet would probably take days. Sometimes it seems that it would be cheaper, and needless to say much faster, to copy the data to a storage device and have that shipped by courier, so even this seemingly simple practical problem can present a challenge in the course of research. Another great challenge of big data is that if you begin to analyze data and make a mistake someplace, that may come at a tremendous cost. Imagine the analysis as a linear chain, where one step leads to another and each step lasts months. If you then notice that an error occurred in the third step, you need to return to that step and restart the entire process from that point.
This has happened to me many times, and occasionally it took another entire month to rerun an analysis which had already been completed once – erroneously. A third problem is more of a conceptual question. Truly there is so much data about so many things that you need to select very carefully what you want to investigate to begin with. What exactly do you need an answer to, and how will you find it with the help of the data? The informational content of the human genome is very rich. We cannot run an infinite number of analyses; therefore, we need to be selective. We cannot just say: hey, here is this pile of data, let’s do something with it. We need to formulate a concrete question very precisely, and search for its answer.
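The data-transfer problem mentioned in this answer can be made concrete with simple arithmetic. The sketch below is illustrative only: it assumes decimal terabytes, a link that sustains its nominal speed for the whole transfer, and a courier delivery time supplied by the caller. The function names and the example link speeds are assumptions, not figures from the interview.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_terabytes, link_megabits_per_second):
    """Days needed to move the data, assuming the link runs at full speed throughout."""
    bits = size_terabytes * 1e12 * 8  # decimal terabytes converted to bits
    seconds = bits / (link_megabits_per_second * 1e6)
    return seconds / SECONDS_PER_DAY


def courier_is_faster(size_terabytes, link_megabits_per_second, courier_days):
    """True when shipping a storage device beats the network transfer."""
    return courier_days < transfer_days(size_terabytes, link_megabits_per_second)


if __name__ == "__main__":
    # Assumed example link speeds: 100 Mbit/s and 1 Gbit/s.
    for size_tb in (1, 100):
        for mbps in (100, 1000):
            days = transfer_days(size_tb, mbps)
            print(f"{size_tb} TB at {mbps} Mbit/s: {days:.2f} days")
```

Under these assumptions, 100 terabytes over a sustained 1 Gbit/s link takes a little over nine days, so a courier who delivers within two days would indeed be faster.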
We are witnessing, and indeed we can be active participants in the revolution of genetic genealogy via Web 2.0 technologies. What effect does this multidisciplinary science have on the world from a social, scientific, medical or health care aspect?
Indeed, we are living through a kind of explosion in the amount of data available, for instance through social media. We attempt to analyze our family tree data by linking it to genetic data, but this data could tell us about many other things: for instance, how far families live from one another geographically, who migrated in which direction, and what the patterns for women and men look like. Looked at from one point of view, this human data lets us draw health care conclusions; from another, it may reveal social patterns. In sum, the question is from what point of view we use the data. It can be useful for countless other fields of science, not only for my own research purposes.
written by: László Gere