How Big Data and Automation Keep Malwarebytes Ahead of the Bad Guys
Darren Chinen: Our CEO and founder, Marcin Kleczynski, was
14 years old, playing video games, and caught a virus on his computer. He didn’t
want to tell his parents about it, so he went searching on the Internet for
some people who could help him clean his machine. And so, at the ripe old age
of 14, with the help of some people he now calls heroes who are actually still
working at this company, he was able to write his first Malwarebytes program,
and that kind of evolved. And eventually, by the time he was 18 years old, he
was able to found Malwarebytes.
And really, our mission has stayed the same from the very
beginning. We absolutely can’t stand malware. We imagine a world without
malware. And what I think is the best thing about what we do is we always talk
about — our goal is to put ourselves out of business. If there’s no malware in the
world, there’s no reason for us to exist anymore.
Joe: Could you maybe give us a little bit of a description
about the role of that team within Malwarebytes?
Darren: I lead three separate teams here. The first team is
the team that actually does the website, what we call the “dub-dub-dub team.” I
have an infrastructure team. And the team that we’ll kind of double-click on
today is the data science and engineering team. And that team really handles
all of the big data. We have some anonymous telemetry streams coming in that
pour out all kinds of IoT data. Think of it as sensors all around the world.
And they also do some of the data science and AI as well.
Joe: One of the things I think, certainly most people don’t draw
a connection — certainly not immediately — between malware and malware
detection and big data. Can you talk about how Malwarebytes is leveraging the
power of the data that you’re collecting?
Darren: Marcin, our CEO and founder, had this vision that
everyone has a right to a malware-free existence. He’s been giving out our
product for free — even to this day, you can download it for free. And one of
the things that we do is we collect anonymous data that act as sort of sensors
out there for us to understand the landscape of what’s happening in the world
and where attacks are happening.
It’s really interesting because, Joe, if you were to look at
some of the telemetry years ago, the world of malware was evolving at maybe a
monthly, or sort of a slower than monthly pace. And today, sort of with modern
technologies that help software engineering, things like CI/CD (continuous
integration and delivery), the world of malware is literally changing by the
hour.
So, it’s very, very, very important that we’re able to collect
and harvest and mine all of this data in an efficient manner so that we can
understand the landscape, and our company and our systems can adjust
appropriately to combat malware as it’s evolving on literally an intra-day
basis.
Joe: It’s really interesting to think about how malware
creators are using agile software methodologies to deliver malware more
quickly.
Darren: These are, unfortunately, world-class software engineering
organizations. The television image of guys in dark rooms with sunglasses
and a big bag of Doritos next to their desk — that’s probably far from the
truth. These are very, very sophisticated software engineering shops that are
in business because there is a path to cash, and they’re making money off of
things like ransomware. So, absolutely, for them, it makes sense to stay on
leveraging the most cutting-edge technologies. And really, to fight the war
against malware, we have to be as good as or better than whatever they’re
doing.
Joe: Recently I heard you talk about how your team is not
just using AI and machine learning, which again, I think so far, our
conversation has been chock full of things that are really top of mind for
everybody. But not only are you leveraging these cutting-edge technologies and
approaches to build malware detection, you’re doing it at
scale and industrializing it. Can you elaborate on that and maybe share with
our listeners how you’re doing that?
Darren: AI is one type of technique, a very effective
technique that we use. And we use it both in our endpoint technology as well as
— our behavioral EDR technology is coming out with an AI engine as well, and
we’ve been testing that.
One of the things that we recognize is that we can build
sort of anomaly detectors with AI. AI is actually decently good at anomaly
detection, as long as you can narrow the problem set. And one of the problems
with machine learning is that it’s a machine, right? So, if you think about
machines in general, they’re very good at specific tasks and specific problems.
When you walk into your kitchen, you have a machine that is
called a toaster, and all it does is toast, right? And you have another machine
called a waffle maker that only does the waffles, and you have another machine
that’s called a blender, and it just does blending. Why isn’t there one machine
— or the Jetsons’ robot, right? — that actually does everything? The answer is
because machines are good at very specific tasks. It’s actually really, really,
really difficult to build machines that sort of generically do everything in
the sense that a human can.
When you’re trying to figure out if a machine is infected,
you have to look at the behaviors of that machine, or that’s one of the
techniques that you can use. Now, the problem is that the way people use
machines or PCs today is wildly different. You can imagine a salesperson uses
their machine in a much different way than a marketing person, and the person
who really messes up their machine is probably an engineer or a DevOps guy, right?
And so, if I try and figure out what is anomalous behavior,
and I’m just looking at the machine logs of the activity of what’s happening on
these machines — if I’m not doing this correctly or if I’m not careful, I’m
going to either be too sensitive, and everything that the engineer and the
DevOps guy does, I’m going to think, oh my gosh, these guys are infected with
malware, or I’m not going to be sensitive enough, and I’m going to think that
what the engineer and the DevOps guy do is normal behavior. And when the sales guy
experiences a hack, I won’t recognize it, because it will look just like an
engineer.
So, it’s a signal-to-noise problem, and that is sort of one
of the key things that we have to cover in the world of malware detection. We
have to manage the tradeoff between doing a good job at detecting
malware and not making a mistake. We call that a false positive. And if
you’re in the security industry, a false positive is what we jokingly call an
extinction event. It’s one of those things that can put you out of business.
So, we have to be very careful about getting good detections, but also not making
catastrophic mistakes.
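The per-role sensitivity problem Darren describes can be illustrated with a minimal sketch. This is not Malwarebytes’ actual detector — the roles, metrics, and z-score threshold here are illustrative assumptions — but it shows why baselining anomaly detection per user group avoids both failure modes he mentions:

```python
from statistics import mean, stdev

def build_baselines(events):
    """Group an activity metric (e.g., process launches per hour) by role
    and compute a (mean, stdev) baseline for each role separately."""
    by_role = {}
    for role, count in events:
        by_role.setdefault(role, []).append(count)
    return {role: (mean(c), stdev(c)) for role, c in by_role.items()}

def is_anomalous(role, count, baselines, z_threshold=3.0):
    """Flag an event only if it is far from *that role's* normal behavior,
    not from a single global average."""
    mu, sigma = baselines[role]
    if sigma == 0:
        return count != mu
    return abs(count - mu) / sigma > z_threshold

# Engineers legitimately generate far more activity than salespeople;
# one global threshold would either flood alerts or miss the sales hack.
history = [("sales", c) for c in [10, 12, 11, 9, 10, 13]] + \
          [("engineer", c) for c in [200, 220, 210, 190, 205, 215]]
baselines = build_baselines(history)
```

With these baselines, 200 process launches is wildly anomalous for a salesperson (the hack that would otherwise "look just like an engineer") yet perfectly normal for an engineer.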
Joe: What kind of technology are you using, and what’s your
architecture like? What are some of the things that help you avoid false positives,
identify the signal that you’re seeing, and separate it from the
noise of normal activity?
Darren: Well, I think there’s a couple of phases to this.
The first one is how you harvest the data. So, there’s a lot of different types
of data that are coming in. So, we have this IOT type of data, what we call
telemetry data, and that data comes in through a public-facing API. We quickly
push that data into stream processing — we use Kafka and Kafka Streams. And
then we’re able to process that data, and then we leverage everything in a
public cloud, so we can leverage existing cloud services like AWS and some of
the things that they have, like S3 and ephemeral processing to process that
large amount of data.
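The per-message work in a pipeline like the one Darren describes — parse a raw telemetry event, normalize it, fill in defaults — might be sketched as below. In production the bytes would be polled from a Kafka topic; the field names here are hypothetical, purely for illustration:

```python
import json

def process_telemetry(raw: bytes) -> dict:
    """Parse one raw telemetry message and normalize it for downstream
    storage. In a real deployment this would run inside a Kafka Streams
    (or consumer-loop) processor; the schema here is an assumption."""
    event = json.loads(raw)
    return {
        "machine_id": event["machine_id"],
        "country": event.get("geo", {}).get("country", "unknown"),
        "detections": int(event.get("detections", 0)),
    }

# Simulated batch standing in for messages polled from a Kafka topic.
batch = [
    b'{"machine_id": "m-1", "geo": {"country": "US"}, "detections": 3}',
    b'{"machine_id": "m-2"}',
]
records = [process_telemetry(m) for m in batch]
```

Keeping the transform a pure function of one message is what makes it easy to run on ephemeral cloud workers and replay from S3, as the interview mentions.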
There are other types of data that come in as well, and data
that we need to harvest from external APIs. And for that, we use sort of a Java
framework that can go out there and harvest data from some of the APIs that we
need to get to, to enrich a lot of that data with some of the other — I would
call it more transactional type of data. We do some caching that happens in
Redis. And then, basically, our goal is to provide a platform for the data
scientists to actually go do their work.
Joe: It sounds like you are describing a lot of different
processes, like training models and things that have to be done in a repetitive
kind of fashion. Those sound like periodic, maybe long-running
orchestration tasks. Is that right? Can you maybe
talk a little bit about that component of the engineering and the delivery
process?
Darren: It’s a piece of AI that’s, I think, overlooked by a
lot of people. What happens is, scientists come up with this model and it
works. And then when they throw it over the fence, it works perfectly fine. But
over time, the underlying data changes and the models’ prediction accuracy
goes down. And so, you’ll find data scientists sort of looking over their
shoulder, and then they have to say, “Oh my gosh, my accuracy’s not as good.”
And then what they do is they have to go retrain their model manually, produce
another model, and it’s a big deploy.
We found that process is something that we can automate. And
it’s about keeping our production AI environment up and running smoothly and
constantly making good predictions based on the ever-changing data. The
analogy, right, is if I built an AI model around celebrity detection, and I fed
it all the celebrities — let’s say we’re in a time machine and we’re back in
1990. And I fed all the celebrities from 1990 into the model and it trained on
it, it would make good predictions in 1990.
But fast forward to today in 2019, if I try to run that
exact same model without feeding any new data, it couldn’t accurately predict
anything, right? So, it’s very important that you take as good or better care
of your production AI deployments and the ever-changing data under there as you
do with taking care of sort of building your data science labs.
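The drift check that triggers automated retraining can be reduced to a simple gate: score the live model on fresh labeled data and flag it when accuracy falls below a floor. This is a minimal sketch — the 0.9 floor is an assumed parameter, not anything stated in the interview:

```python
def needs_retraining(recent_predictions, recent_labels, min_accuracy=0.9):
    """Return True when the live model's accuracy on fresh labeled data
    has drifted below an acceptable floor, signaling the orchestrator
    to kick off a retraining job."""
    correct = sum(p == y for p, y in zip(recent_predictions, recent_labels))
    accuracy = correct / len(recent_labels)
    return accuracy < min_accuracy
```

A scheduler (Control-M, in Malwarebytes’ case) would run a check like this periodically instead of relying on a data scientist noticing the degradation by hand.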
Joe: Can you talk a little about maybe specifically what
tools and what components help you achieve that level of automation to make
sure that your models are operating properly?
Darren: Control-M has helped us to orchestrate everything
beautifully. We’ve been using Control-M, too, for all the big data. We also use
Control-M on Snowflake for some of the structured data. And it really
beautifully coordinates all of our ETL processing, the batch processing, all of
the ingest, sort of what we call the pre-feature builds. And then what it does
is — we actually take all of our AI models, and what we try to do is get them
scheduled, and if it can be done in a container, we’ll definitely do it in a
container.
Otherwise, what it will do is it will kick off a job on
Spark, and once that model is retrained, it’s tested, it goes through an
approval process where we take a look at the confusion matrix to make sure that
the accuracy has, in fact, improved with the new training, and then there’s a
process to actually promote that improved model into production. So, that’s
something that’s pretty much handled — the backbone of that is pretty much
handled by Control-M and that whole orchestration and the scheduling of all of
that. That’s all done in an automated fashion.
[Source] https://www.bmc.com/blogs/how-big-data-and-automation-keep-malwarebytes-ahead-of-the-bad-guys/
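The approval step Darren describes — inspect the retrained model’s confusion matrix and promote only if it genuinely improved — might look like the sketch below. The false-positive ceiling and binary labels are illustrative assumptions; the point is that promotion is gated on both accuracy gain and the "extinction event" metric:

```python
def confusion_matrix(preds, labels):
    """2x2 counts for a binary malware classifier: (tp, fp, fn, tn),
    where label 1 means 'malware'."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp, fp, fn, tn

def promote(old_preds, new_preds, labels, max_fp_rate=0.01):
    """Approve the retrained model only if accuracy improved AND its
    false-positive rate stays under a strict ceiling, since a false
    positive is the 'extinction event' the interview warns about."""
    def accuracy(preds):
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    _, fp, _, tn = confusion_matrix(new_preds, labels)
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy(new_preds) > accuracy(old_preds) and fp_rate <= max_fp_rate
```

In the pipeline described above, a gate like this would sit between the Spark retraining job and the deployment step, so only validated models reach production.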