In today’s episode, we interview Gregory Glatzer, an undergraduate student at The Pennsylvania State University’s College of Information Sciences and Technology, studying Applied Data Science.
He speaks about his work at the intersection of wildlife conservation and data science. Gregory discussed his latest research on understanding where wild animals, particularly elephants, settle. He used clustering on historical movement data to identify the locations where elephants tend to stop, which could help protect these animals from poachers. Gregory explained how they got the datasets for this study and why such data is often kept from the general public.
In the interview, Gregory explained that tracking collars were used to capture the elephants’ locations and movement. However, he needed more than a movement dataset: he also used temperature data, and he explained why it was equally critical. Where the tracking data lacked temperature, historical weather data was obtained from an API called Meteostat.
Gregory went on to discuss the algorithms he used for this task. He explained that a combination of two clustering algorithms was used: DBSCAN and K-means.
Gregory discussed at length why he used both algorithms in one project and the role each played. He also mentioned looking into another algorithm, OPTICS, to see how it would measure up against the first two.
Gregory also touched on the anomalies he observed in the data, caused by environmental factors beyond his control. In the end, however, the study results were impressive, though he acknowledged that this was just the beginning. He presented his results virtually at the TAWIRI conference, organized by the Tanzania Wildlife Research Institute.
He wrapped up by sharing his passion for full stack web development and how you can reach him at g1776.github.io.
Gregory Glatzer is a junior studying Applied Data Science at The Pennsylvania State University. Working with Penn State IST faculty, he has published research on the application of clustering algorithms to elephant movement data. He has also competed in the 2019 and 2020 Nittany AI Challenge. Gregory's fascination with the data science pipeline, especially the development of full stack web applications that utilize the power of real-time data, can be seen throughout his work. In his spare time, Gregory enjoys playing and listening to jazz on clarinet and saxophone, coding for fun, and spending time with friends and family.
Kyle: Welcome to Data Skeptic: K-Means Clustering, the podcast exploring the problem, the algorithms, enhancements, and use cases for K-means clustering.
Greg: Hello, my name is Gregory Glatzer. I am currently an undergraduate student at The Pennsylvania State University in the College of Information Sciences and Technology.
Kyle: And what are you studying there?
Greg: I’m studying Applied Data Science.
Kyle: So the project we’re going to talk about obviously includes some data science but also some, would you call it life sciences as well?
Kyle: And is that part of your career as well, or did you come at this purely from the numeric side of things?
Greg: From the numeric side of things. I was doing undergraduate research, and the faculty who were leading the research had some interest in the domain of wildlife conservation. So that’s where I got into this from.
Kyle: Broadly speaking, can you talk a little bit about what opportunities there are even beyond what you’re doing to overlap between conservation and data science?
Greg: Yes. So in conservation there’s what I’m doing, which is looking at the movement of animals and trying to understand that. As you can imagine, there’s lots of data there to do data science with, especially movement data. We also look a lot at vegetation and elevation data. And then there’s also the aspect of trying to prevent and catch poachers that are trying to poach animals in national parks. So there’s a couple of studies that have been done on that, doing things like computer vision, where they put drones in the sky and try to catch the poachers at night using machine learning.
Kyle: These are very virtuous things for sure. I know conservation is very important, but I don’t think of it as something that is rich in data, where I can, you know, tap into something and apply algorithms. What data sets do you have available?
Greg: The data sets in this domain are a little touchy because, as you can imagine, with the nature of conservation, the animals and the data related to them can be held very closely by the people who have that data. So with elephants, if you know where the elephants are, that’s obviously what we don’t want the poachers to know, right? So data sets are often held by universities or by the governments that run these national parks, and data about poachers and poacher activity is held even more carefully. So with that said, that was the first major hurdle I had to overcome in my research: finding data that I could use to do what I wanted to do. And I ended up finding a wonderful repository of animal movement data. It’s called Movebank, and on there are lots of different studies. They link to the studies and the data, and you can download tons of public data on animal movement.
Kyle: And what is that? I’m thinking of something like my fitness tracker where I get a GPS path. Do I get basically the equivalent but for an animal?
Greg: Yeah. So the way that this data was collected, as with lots of other movement studies, is they put some kind of tracking collar on the animal, or maybe a chip on smaller animals, and then they ping it at some interval, whether that’s every hour or every 10 minutes, whatever they want to do. And like you said, it’s just like a giant Fitbit for an elephant. I like to think, comically, that it’s a big collar around the elephant’s neck, but it’s probably not that. But yes, they’re pinging the elephant’s location and gathering it that way.
Kyle: And can we skip ahead to some of the goals? What do you want to achieve with this data?
Greg: So with the study that I did, the overall goal was identifying locations of interest for elephants. So when elephants move they exhibit what’s called shuttling motion where they travel for long distances and then they stop in an area and then they travel for long distances again and repeat. And by using clustering algorithms on this movement data our goal is to identify what these locations of interest are. Because if we can cluster and differentiate between those long strands of data points versus when they stop, we can identify those locations of interest and then say okay, is this a village maybe or is it a body of water? Is it shade? And that can give us more insight into why the elephants are moving the way they are and where they’re stopping and then down the road hopefully we can apply this to movement data from other animals as well.
Kyle: Even before there was such fine-grained tracking data, certainly some biologists or the right academics were doing at least anecdotal studies and had some intuition or rules of thumb about the movement of elephants. What in general was known before you started looking at the data?
Greg: So we knew that the shuttling motion was happening. Other studies had already figured that out. And I’m not an elephant expert; like I said before, I focus more on the data science side of stuff. But I can imagine park rangers had probably figured that out beforehand without doing fancy statistics and everything. So the nature of the elephants’ movement was known beforehand, and things like heat having an influence on the movement of elephants were also known from some past studies, which was definitely a strong starting point for the study we did.
Kyle: One thing my fitness tracker doesn’t tell me is the local temperature for wherever I was. How do you get it?
Greg: For the temperature, we did our study using a couple of different data sets, and one data set was from Kruger National Park. For that one, they were collecting temperature with each movement data point, because the study they were doing was specifically about how temperature affected elephant movement. So the tracker they had on the elephant also had a sensor to pick up heat, and we had that data point, call it a feature in the data, to use in our model. Temperature in the other data sets was a big problem in our study. We saw that temperature allowed us to cluster better. It was a good, should I say, predictor of the clustering movement of elephants, and it often explained it. So we wanted to have this temperature data in the other data sets. But the problem is that the other data sets didn’t have a reason to collect temperature, so they didn’t, and we couldn’t use that feature. So we needed to generate or collect it in some other way after the fact. What we ended up doing is we used a historical weather API called Meteostat. Using that, we were able to approximate the temperature on a given day by looking at the historical temperature records gathered from weather stations in the area. And then we could just do something like a table join on the timestamp, say this is what we think the temperature was, and then do our analysis.
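The timestamp join Greg describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `nearest_temperature`, the one-hour `max_gap` tolerance, and all sample readings are hypothetical, and the weather records are hard-coded here rather than fetched from Meteostat.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_temperature(fix_time, weather_times, weather_temps,
                        max_gap=timedelta(hours=1)):
    """Return the station temperature recorded closest to fix_time.

    weather_times must be sorted ascending. Returns None when the closest
    record is further away than max_gap (an assumed tolerance).
    """
    i = bisect_left(weather_times, fix_time)
    # Only the records just before and just after the fix can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(weather_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(weather_times[j] - fix_time))
    if abs(weather_times[best] - fix_time) > max_gap:
        return None
    return weather_temps[best]


# Hypothetical hourly station readings (degrees Celsius), sorted by time.
weather_times = [datetime(2020, 1, 1, hour) for hour in (0, 1, 2)]
weather_temps = [21.0, 20.5, 19.8]

# Hypothetical collar fixes for one elephant.
fix_times = [
    datetime(2020, 1, 1, 0, 10),
    datetime(2020, 1, 1, 1, 40),
    datetime(2020, 1, 1, 5, 0),
]

matched = [nearest_temperature(t, weather_times, weather_temps)
           for t in fix_times]
print(matched)
```

The last fix has no station reading within the tolerance, so it stays unmatched instead of silently borrowing a temperature from hours earlier.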
Kyle: So clustering was a technique or methodology you chose to pick up. There’s any number of other algorithms that might have been applied to this dataset. Why choose clustering? How could that inform your analysis?
Greg: So it’s not just clustering; the specific clustering algorithm we chose helped us perform this analysis in this specific domain. Elephants aren’t just being in one area and then magically jumping to another geographic area as another cluster. It’s not like a classification problem. Since elephants are physically moving, they need to move from one cluster to the next, so there’s going to be a trail of points to the next cluster. So we ended up using an algorithm that can deal with that kind of movement, called DBSCAN, which has a concept of noise in the data. By using that specific clustering algorithm, we were able to differentiate between where the elephants are clustering around some feature, whether it’s a village or a water source, and when they’re moving to the next area that they’re going to cluster around.
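To make the idea of noise concrete, here is a small, simplified DBSCAN written from scratch. It is only a sketch: the points are made-up planar coordinates rather than real latitude and longitude, the `eps` and `min_samples` values are arbitrary, and a real analysis would more likely use a library implementation with a distance suited to geographic data.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it joins no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing
        cluster += 1
    return labels


# Hypothetical track: a stop, a stretch of travel, then another stop.
stop_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
travel = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
stop_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]

labels = dbscan(stop_a + travel + stop_b, eps=0.3, min_samples=3)
print(labels)
```

In this toy track the two stops come out as separate clusters, while the sparse travel points between them are labeled as noise rather than being forced into either cluster.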
Kyle: So then the clusters you come up with they will be based on the latitude and longitude and I guess the temperature as well can be variable. That doesn’t include any of those features you had mentioned. How well do the calculated clusters align with the places that a human observer thinks of as hubs for elephants?
Greg: Yeah, so that was kind of the final step in our research. You do all this clustering and you’re staring at graphs all day long that are just latitude and longitude, not really having an idea of how well these clusters are performing. On its own that doesn’t mean anything, so we then overlaid our results onto human settlements and also bodies of water. And we found two things: elephants tend to cluster their movement around the rivers and around some campsites. With the campsites, we found a couple of instances, going by the website of the national park that the camp was in. National parks set up these areas for tourists to stay overnight when they’re in the national park on safari, right? And they will set up watering holes in the campsites to try to draw animals near them. So we saw that the clusters we calculated ended up being centered around campsites, especially ones with watering holes for the elephants to go to. So we took the results of the clustering, overlaid them on top, and just browsed through the data. I say browsing because it was a manual task where we needed to find which clusters were associated with some real-life thing of interest, like a body of water or a village. But that manual process of finding the real-world features a cluster is centered on, we wanted to automate that too. So I guess that brings us to the second form of clustering we used in the study. We first clustered on the movement of the elephant to find those clusters based on their latitude-longitude coordinates. But after that, we also wanted to automate the detection of which villages park rangers might want to focus on, to say elephants might be clustering their movement around this village versus another village that they don’t care too much about. And we did that with K-means. So with K-means, what we did is we have all these, what we called, elephant centroids.
It’s just the centroid of any given cluster that we calculated, and we wanted to see where there were a bunch of elephant centroids, maybe from different elephants or just a single elephant having these little mini clusters within its movement. We wanted to see which known campsites, because we know where the campsites are, elephants are clustering their movement around a lot versus others. So where K-means comes in here is that we set up the algorithm so that we take the location of each known human settlement, one of those campsites, and set that as the initialization of a centroid in the K-means algorithm. Right? Traditionally in K-means you run the algorithm, and it keeps updating the locations of the centroids until it converges, right? But we used K-means to serve a different purpose. By initializing the centroids as the locations of these campsites, we can then classify all the surrounding elephant centroids to these K-means centroids. By doing that, we associate a bunch of elephant centroids with each K-means centroid, and then we can count how many elephant centroids are in the different clusters calculated with K-means. And then we can just rank the different human settlements by how many elephant centroids are associated with each.
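The single-iteration use of K-means described here amounts to one assignment step with fixed centers, followed by a count. The sketch below illustrates that idea only: the settlement names, coordinates, and elephant centroids are invented, plain Euclidean distance on planar coordinates is an assumption, and `assign_to_settlements` is a hypothetical helper, not a function from the study.

```python
import math
from collections import Counter


def assign_to_settlements(elephant_centroids, settlements):
    """One k-means assignment step with fixed centers.

    Each elephant centroid is given the name of its nearest settlement.
    The settlement locations are never updated.
    """
    names = list(settlements)
    return [min(names, key=lambda name: math.dist(c, settlements[name]))
            for c in elephant_centroids]


# Hypothetical settlement locations, used as the fixed k-means centers.
settlements = {
    "camp_a": (0.0, 0.0),
    "camp_b": (10.0, 0.0),
    "camp_c": (0.0, 10.0),
}

# Hypothetical centroids of elephant movement clusters (e.g. from DBSCAN).
elephant_centroids = [
    (0.5, 0.2), (1.0, -0.3), (9.0, 1.0),
    (0.2, 0.9), (8.5, -0.5), (1.0, 9.0),
]

assignments = assign_to_settlements(elephant_centroids, settlements)

# Rank settlements by how many elephant centroids were assigned to them.
ranking = Counter(assignments).most_common()
print(ranking)
```

Because the centers never move, the result is a nearest-settlement lookup, which is why the ranking can be read directly as a count of elephant centroids per settlement.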
Kyle: Do you end up then with basically an affinity between the human settlements or human areas of interest and the particular elephant centroids you’re finding?
Greg: We don’t get, should I say, a distribution of how tightly correlated they are. But we do see how many of the elephant centroids there are near a human settlement. So we can get a single number saying there are 26 elephant centroids around this settlement, and then the next has 16. And you can pretty much just say, I want the top 10 settlements based on how many elephant centroids are around them. So we do get an association in that sense.
Kyle: I think there’s a lot of virtue in analysis just to better understand these animals and their movements. But what would be really great is if there’s some way you can inform the conservationists that would help them do their job cheaper, better, faster, easier, or something like that. Is there enough in the data that you can work with people on the ground and help make their job easier?
Greg: The one caveat I’d give with this is that you need a good amount of data on elephants’ past movement to understand it. It’s not really a real-time tool, but we can definitely learn from historical data. So what we can do to help the guys on the ground is they can look at our results and say, yes, from what we’ve been observing on the ground and our own knowledge of the movements of elephants within our park, these are the villages that we know elephants are displaying interest in. And maybe our algorithm might show one or two other villages that they didn’t realize elephants are expressing interest in, and they can go there, do a little more looking into it on their own, and see if our algorithm showed anything. But in terms of any real-time application or use, the way that this is designed is more of a historical analysis.
Kyle: Well, I think K-means is the poster child for clustering algorithms. And then there are many that are sort of variants of K-means. And then of course there’s DBSCAN, which is a relatively different idea altogether. Could you talk a little bit about the decision-making process and using both of these approaches?
Greg: DBSCAN. The reason why we chose that is because it has that concept of noise. If you were to do something like K-means, it’s going to try to put every single point in your data set into some cluster. But the problem with that, like I was saying before, is that elephants need to move to their next cluster. It’s not like each data point exists in a vacuum; to get from one data point to the next, you need to move there physically. So as a result of that, there are some points that you need to just classify as noise and say, okay, that’s not a point of interest for the elephant. That’s just the elephant wandering through the forest or something like that to get to its next destination. So that’s the main reason why we chose DBSCAN over a simpler algorithm like K-means. We also did a little bit of looking into another clustering algorithm called OPTICS. Now, considering that second algorithm was kind of an afterthought, to be honest. But if you don’t know, OPTICS is a very close cousin of DBSCAN, with a couple of differences in its parameters and the way that it works. It would definitely be interesting to look into other clustering algorithms that have a concept of noise and would be appropriate to this domain.
Kyle: Another criticism I’ll hear people make of K-means is that it really wants to work on Gaussian blobs of data. If you have something like a crescent shape or a half moon, it’s not necessarily ideal, whereas maybe DBSCAN does well in that scenario. Does the elephant data have to conform to geography in a way that might be inconvenient for K-means?
Greg: Yeah. So the elephants will be moving in long, like you were saying, blobs. If they’re moving along a river, for example, the river is the shape it is. The elephant doesn’t care about the data scientist who’s going to be looking at its movement two years down the line. So the elephant’s going to move along these geographic features like rivers. And I know elevation is another big influence on elephants’ movement. Elephants don’t want to travel uphill if they don’t need to. So one of the areas we were looking at was near Mount Kilimanjaro. As you can imagine, the foot of a big mountain like that, or any mountain range, will definitely restrict the elephants’ movement so they don’t go in that direction. Because of that, you start to see these weird shapes carved out of the data, where there’s almost this invisible force when you plot your data. It’s like, why are the points not there? And it’s because there’s some geographic feature that is pushing the movement of the elephant. And like you were saying, with something like K-means that causes problems when your data isn’t a nice circle.
Kyle: And then what about the human settlement data? What made it more amenable when you tried to apply K-means there?
Greg: Yes. So K-means, it’s interesting. I almost view our use of K-means not as clustering. Let me explain how I came to using K-means for that. We had these different human settlements, and my goal was to calculate, in some as yet unknown way, how many elephant centroids were near these different human settlements. So I needed some way to calculate the distance to these different settlements and then associate each of the elephant centroids with a settlement. So I treat K-means almost as a classification algorithm in that sense, where it calculates the distance from each point to each centroid, right? And then associates the points with those different centroids. And since we’re not updating the locations of the centroids but just running the algorithm for one iteration, what it’s doing is not finding the best centroids but rather taking a collection of points and saying, you belong to this centroid.
Kyle: I’m wondering if you’ll imagine with me that we have at least one listener who’s tied into one of these large financial giving organizations aimed at conservation, and they allocate money to different projects where they can do the most good. Is there an opportunity to do a lot of good with an investment like that? Would it be more equipment, better data? What are your thoughts on it?
Greg: I think it would be going to supporting the park rangers that are out there. When you look more at the end of preventing poaching, some of these national parks are huge. We’re talking hundreds, thousands of square kilometers for a park, and there might be 10 or 15 park rangers out there trying to catch the poachers in the park. And when we’re talking about poaching, the poachers put out these traps that are pretty much a giant glorified zip tie that they just lay out in the bush. The goal is for the elephant to step in it, and then the elephant becomes trapped and the poacher comes along. These traps are called snares. And the 15 guys that are patrolling this huge area are trying to find these snares hidden in bushes and so on before the endangered animals get caught in them. So if you were to have the ability to contribute financially to this cause, I would figure out how you can donate directly to national parks, to maybe incentivize more people to become park rangers or help park rangers invest in technologies, whether that is drones to help catch poachers with the image recognition I talked about in the beginning or some other yet-to-be-discovered technology to help out with that problem.
Kyle: Well what about your own next steps? Is this part of a larger effort in your ongoing data science journey or just one step along the way?
Greg: More or less, it’s one step along the way. When I did this research, I ended up presenting at the TAWIRI conference; TAWIRI is the Tanzania Wildlife Research Institute. They’re based right near Mount Kilimanjaro in Arusha, Tanzania. Because of COVID and various travel difficulties, I didn’t get to go in person, but I presented the paper virtually. After that, someone who saw my paper at the conference reached out to me, someone who is working with the Tewksbury Institute, and now I’m doing more elephant movement research on another study. But overall, in the big picture, this is just a stepping stone in my data science journey.
Kyle: Very cool. Well, neat projects. I’m glad to hear there’s still some ongoing effort. Gregory, is there anywhere people can follow you online?
Greg: For one thing, you can reach out to me and see some of my other work at my portfolio website, which you can find at g1776.github.io. And actually, besides data science, something I’m very interested in is full stack web development. So I built that portfolio website by myself from the ground up using React. And I’m definitely a big proponent of building applications that allow people to interact with data science and explore it in the real world. So that’s where my love for full stack development comes together with data science.
Kyle: Well, keep me posted as you’ve got new releases coming out. I’m very interested in that area. Well, Greg, thanks so much for coming on Data Skeptic.
Greg: Thank you Kyle.
Kyle: Thanks for listening to the second installment of Data Skeptic: K-Means Clustering. Vanessa Bly is our Guest Coordinator, Claudia Armbruster is our Associate Producer, and show notes are by David Obembe. I’ve been your host, Kyle Polich.