At Box, we're proud to say that our developers are some of the best and brightest in the business. We strongly believe that every engineer has a unique perspective that they bring to their work, and it's this diversity and creativity that make our teams so successful. This is the idea behind Inside Eng, a regular segment where we sit down with software engineers to pick their brains on their experiences, the projects they've worked on, and their life outside of work.
We all know it's only a matter of time before robots rule the world (unless the apes beat them to it). To help me navigate the treacherous world of artificial intelligence, I sat down with Shivani Rao, a machine learning expert on the Content Classification team at Box. Passionate about salsa dancing and supporting women in technology, Shivani taught me that while machine learning today may not be as sexy as AI in the movies, it is an incredibly exciting field that is changing the way our computers interact with data.
Karna Warrior: So, have you always been interested in machine learning?
Shivani Rao: Yeah, computer vision and machine learning were the fields I started working in early on, and that's been my area of expertise ever since. I first fell in love with image processing when I built a custom C++ image editor, like an Adobe of my own, in the last year of my undergrad. That was the first thing I ever really built! I learned how to do contrast enhancements of images, negatives of images, edge detections; I just started playing around with stuff like that.
KW: Was that an assignment, or did you just build it for fun?
SR: Well, the assignment was just to do one of those things, so I basically just kept playing around with it and kept going until I built an entire library of things I could do.
KW: Looking back on that project now, how does it make you feel? Are you proud of yourself, or do you think about it more in terms of how much you've grown?
SR: I think about that time with a lot of gratitude and pride. The fact that I was even exposed to a computer — I didn't even own a computer back then. I lived in India and I went to a school which had a computer lab. I would go to the lab right after exams to get in while computers were still available, because it was full all the time. When I look back on that time, I'm amazed at how much tenacity and enthusiasm I had. I want to hold on to that part of myself! Even now, I typically spend my Saturdays in the library, reading up on machine learning literature and playing around with new algorithms on my IPython Notebook. I have grown a lot in terms of managing people, writing papers, and implementing code at a larger scale, but the essence of that pure pursuit of knowledge— I never want to lose that. If that part of me ever goes away, I think I'll be really scared [laughs].
KW: Yeah, I think that's a really key part of being a great developer: constantly taking new opportunities to learn. So you've been at Box for a couple of years now. How did you end up in Content Classification?
SR: I started off as a machine learning/computer vision scientist; the last year of my undergrad was mostly robot vision, and that's really where machine learning all began for me. During my master's I did a lot of research in computer vision and image processing. That's where I learned the fundamentals of machine learning. I joined the robot vision lab at Purdue University for my PhD, which at that time was getting projects that applied machine learning to text analytics. My dissertation focused on implementing machine learning and search algorithms to improve software engineers' experiences with code bases. I was building tools that would help them interact with code bases, like a question-answering system or search tools. When I came to Box, I was a perfect fit for the Analytics team. That's where I learned the whole open source stack: what's used in the industry vs. academia. There's the Kafka pipeline to bring the data from the front end to the back end, Hadoop, Elasticsearch (ES), HDFS, and all of these technologies that I got to learn. But that stint was actually really short-lived. I worked on that team for about three months and then joined the machine learning team, or content classification as we call it, when it was formed. I think we're solving one of the most challenging problems in the content management industry.
KW: Can you speak on that a little bit? What is content classification and why is it so powerful?
SR: Content management is not just about sharing files, or uploading and downloading. It's also about securely sharing files. It's about not sharing files that shouldn't be shared, or quarantining files that contain PII (Personally Identifiable Information) or PHI (Protected Health Information); it's very dangerous to hold that information. Currently the technology that's prevalent in the industry uses regular expressions to figure out if a certain file is confidential or classified. These are often riddled with false positives, so people end up quarantining files unnecessarily and losing access to more data than they intended. What we're trying to do is come up with content understanding, based on machine learning and text analytics, to compare files and see if one file marked as confidential by the customer can help us make predictions about files in the neighborhood of that file, in terms of textual similarity. So, for example, say I'm a lawyer and I have a bunch of legal files, like agreements or NDAs. These contain more or less similar textual information, but the end of each file contains a signature or some sort of PII, and that's the only thing that makes that file confidential. If I just tag one of the files as confidential, even a file which does not contain any signatures, then by using textual similarity our algorithm is automatically able to predict that the other files are also confidential, without the tagged file having to contain the PII. This is a very robust and simple solution to a problem that's prevalent in the industry. People don't want to have to tag every single NDA as a confidential file; they want that to be automatically detected. And regular expressions only go so far. This field of work is called Data Loss Prevention (DLP), and as far as I know there's very little research in the field right now, and there are very few players applying machine learning to it.
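To make the idea of labeling files "in the neighborhood" of a tagged file concrete, here is a minimal sketch in Python. It is not Box's algorithm: the interview does not say which similarity measure is used, so bag-of-words cosine similarity is swapped in as a common stand-in, and the function names, sample files, and the 0.6 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_label(tagged_text, candidates, threshold=0.6):
    """Return names of candidate files whose text resembles the tagged file.

    `threshold` is a hypothetical cutoff chosen only for this illustration.
    """
    tagged_vec = Counter(tokenize(tagged_text))
    return [
        name
        for name, text in candidates.items()
        if cosine_similarity(tagged_vec, Counter(tokenize(text))) >= threshold
    ]


# The tagged file contains no signature or PII; the customer simply marked it.
tagged_nda = (
    "This non-disclosure agreement is made between the parties "
    "to protect confidential information."
)
files = {
    "nda_acme.txt": (
        "This non-disclosure agreement is made between the parties "
        "to protect confidential information. Signed: J. Doe"
    ),
    "lunch_menu.txt": "Today's lunch menu includes soup, salad, and a pasta special.",
}
predicted_confidential = propagate_label(tagged_nda, files)
print(predicted_confidential)
```

In this toy setup the second NDA is flagged because it shares nearly all of its wording with the tagged one, while the unrelated menu is left alone. A regular expression looking for a signature pattern would have needed a separate rule for every kind of PII.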
KW: So, what gets you most excited about machine learning?
SR: Machine learning has definitely been successful, and it has been applied to a lot of problems, especially where there's a lot of data available, but in a lot of fields people are just waking up to the idea of using machine learning. The most exciting part for me is how we can learn from the applications of machine learning that have been prevalent in the industry and apply what we know to unknown domains. This content classification project has been exactly that: we took a machine learning algorithm that was already prevalent in a totally unrelated field and applied it to the problem that we were trying to solve. As the world grows more and more data heavy and more and more data is uploaded to the cloud, you're getting more data problems to solve, but that doesn't mean we need to come up with totally new algorithms; it means we need to learn how to apply what we already know to these new problems.
KW: Besides machine learning, you've been very involved in the Women in Technology group here at Box. Can you tell us a little about that?
SR: The Women in Tech (WIT) group is focused on bringing more awareness to the gender issues in the tech industry. We also focus on providing a mentoring and support system for women working at Box, so that they can seek our support when they need it, and ask the tough questions when they can't ask anyone else. We want to build a connection with the rest of the community and showcase the fact that we support and care about the issues women face in the workplace. Through Women in Tech, we've done a lot of outreach activities. I was involved in an activity last year where we went up to Berkeley and mentored a bunch of students who were in their first or second year, and sometimes it makes me wish that I had had that kind of support when I was finishing up school. I think that mentoring can make a huge difference in someone's life.
We also want to bridge the gap between mentoring women and sponsoring them. There's a difference between mentoring and sponsoring: mentoring means "I can give you advice and I'll help you," but sponsoring means "I will fight for your cause, I will fight for your next big project, because I have authority and I want to give you that chance." We've seen a lot of studies that have shown women are being over-mentored and under-sponsored compared to men, and there are a ton of other biases that people aren't even aware of. By creating a platform where people can talk about these issues and discuss these things openly, we want to increase awareness of the biases that people walk around with without even knowing about them.
We also send a group of women to the Grace Hopper Conference (GHC) each year, which is one of the largest networking conferences. In fact, I met one of the product managers here through Women in Tech at GHC. I went to a dinner with them, a whole group of women engineers, and I really felt like Box was a place that cared about women and women's issues in the workplace. I've seen that as well! I mean, there are so many managers at different levels who are women, and that was actually one of the reasons I joined Box.
KW: So what gets you excited outside of work?
SR: I actually picked up salsa dancing during the second year of my PhD, as a way to take an engaging break from work. I was also kind of pushed into running by a friend in my first year, and then got into it myself.
KW: So are you still salsa dancing?
SR: I am! I auditioned for a team last year and I got in, but I actually didn't join that year. There's another audition coming up, so I've been thinking about doing it this year. The dance teams are crazy: you have to dance four times a week, which is awesome, but they also do a lot of partying, and I'm not a party person. So over time I find that I'm the only one leaving after practice when everyone else is going out for drinks and stuff, which doesn't really fly well with the dance teams. For me it's more like dance, work, and sleep [laughs]. I really hope to do at least one more performance though.
Recently I've been picking up piano as well. I learned classical Indian music growing up, but in order to accompany a singer in Indian classical music you just learn with one hand, because you're using a harmonium, so your other hand is pumping air into the instrument. I never really learned Western piano, so I've been doing that for about a year now. But it's much slower. I think when you're older it's always slower to learn things.
KW: Are you playing mainly classical music?
SR: Yeah, mostly classical. I'm more of a pattern-based person though. I play by ear, and traditional music is very reading heavy; you need to be able to read music. I actually joined a program called Simply Music, which was originally developed for blind people. It's totally pattern based: they introduce you to patterns without even reading music for the first year. I'm not there with reading yet; I'm just starting to read now, but I'm really enjoying it, and that's been another staple in my life. I think it'll take another five years until I'm as good as a five-year-old [laughs], but learning is where the thrill is anyway.
Are you interested in joining the Box Engineering Team? Check out our jobs page, and look out for the next installment of Inside Eng, coming soon!