       ![Group of people watching someone speak at front of room](/sites/g/files/omnuum10826/files/styles/hwp_21_9__1920x825/public/2025-06/valery-tenevoy-vWjwXHGd13Y-unsplash.jpg?itok=vly0Zm2y) 

 



 

#  The Power of Community-Driven Small Language Models 

 





Episode Seventy



 

February 19, 2025

 

 

 [ Betsy Gardner ](/betsy-gardner) 

In this episode, host Stephen Goldsmith is joined by MIT Professor Sarah Williams and Boston CIO Santi Garces to explore the ways that generative AI is transforming how cities - and residents - use data. Williams shares insights from her work at the Civic Data Design Lab, discussing how GenAI can help make sense of vast amounts of qualitative data, from city council minutes to community feedback. Garces reflects on the opportunities and challenges of integrating AI into municipal decision-making and civic engagement. Together, they highlight the potential for community-driven, small language models that empower residents and make city services more transparent and effective.




 



*Listen here, or wherever you get your podcasts. The following is a transcript of their conversation.*

**Stephen Goldsmith:**

This is Stephen Goldsmith, professor of government at the Bloomberg Center for Cities at Harvard. Welcome back to the Data-Smart City Pod. I am today joined by two folks: one, Boston CIO Santi Garces, who also appeared in Episode 67, for those of you who count. And Professor Sarah Williams, associate professor of Technology and Urban Planning at MIT, director of both the Civic Data Design Lab and the Leventhal Center for Advanced Urbanism. Welcome, Sarah, and welcome back, Santi.

**Sarah Williams:**

Thanks for having us.

**Stephen Goldsmith:**

So our audience has already met Santi, but before we get into the substance of all this, Sarah, give us a brief overview about your background and your work at MIT and why Santi and I think you're so cool.

**Sarah Williams:**

Wow, that's awesome. Well, thanks for introducing me. I started at MIT running something called the Civic Data Design Lab, which really thinks about how we can use data and data analytics to create policy change. And one of the ways that our lab thinks about doing that is through data visualizations and communication strategies for policy experts. So we can have all of this open data in the world, but if we can't communicate it clearly to make the decisions we need to make, it's hard to use. And so we really try to make those processes and procedures easier to understand, easier to use. The Leventhal Center believes that cities are very interdisciplinary places, right? We need architects, we need urban planners, we need data scientists. We need all different fields to address the future problems of cities. And really what we try to do is make long-term relationships with cities.

So that's one of the things that we've started to think about in Boston. And Alan Leventhal who funded the Leventhal Center, really has an interest in Boston. The Leventhal family has really supported different urban projects throughout, but recently through the Leventhal Center and our 40 affiliated faculty, we've really gotten interested in how generative AI can be used and applied in cities and what are some of the caveats for doing that? And if we go back to my original interest really of data analytics in cities, what are some of the things that we need to think about? What are some of the biases and how do we really take the data that we put into generative AI and use it for action?

**Stephen Goldsmith:**

Well, let's talk a little bit about action. Last October, we had a meeting here at the Bloomberg Center that brought together leading experts from most of the major cities. And we discussed, as you know, because you and Santi both were there, generative AI, and how local governments could use it both for performance management but also for interaction with community members. How does generative AI improve the conversation and improve the insights? And Sarah, why don't we start with you and then we'll go to Santi about how you were thinking about that. I'm so old that a dozen years ago, while I was working for Mike Bloomberg in New York, the City Council passed an open data ordinance, and compliance back then was put up your ugly PDFs and check-mark the box. It was early days. And then we got away from PDFs, but how about an operating proposition today that there's so much data that is open that it limits its use. So how could we use generative AI to improve access and insights by community groups into city data?

**Sarah Williams:**

Well, I think one of the really great things about generative AI is its ability to take qualitative information and process it. And I'm going to sidestep your question and go into this qualitative data, which has always been so hard to open up for cities, so hard to interpret, so hard to make sense of. And so much of the data that we have in cities is qualitative data, whether it's city council minutes, community meeting transcripts, or even just records and documents. And I think generative AI holds a lot of potential for helping us synthesize and parse through a lot of the comments that come from communities. And that includes, let's say, the more traditional data sets like 311 as well. When we get into more traditional data sets and how generative AI might be really useful for them, I think it's by having a dialogue with generative AI systems about the questions that they may want to ask, looking at some of those qualitative data sets, right?

So highlighting some of the issues that might be coming up for residents and then thinking about how they can apply the open data to those problems. I think also one of the things that generative AI is really great at is helping you code and helping you advance visualization. So if we want to look at something quickly, our advanced programmers and our advanced data scientists that are already working in cities can take those questions really quickly and really be able to code with them and have answers in a much more rapid pace. So it's really an assistant in many ways to the work that we're doing. And I think it really needs to be seen as that kind of an assistant, because just like anything in the world, we need to check and verify its results. And having people that specialize, whether it's in working with communities or working with 311 data, having those checks and balances are really helpful.

**Stephen Goldsmith:**

Hey, Santi, in addition to the leadership work you do in Boston, I know you've watched this issue. Before we turn this podcast on, you asked Sarah some really sophisticated question that I kind of understood. Would you ask her that question again?

**Santi Garces:**

Absolutely. I was curious to hear from Sarah. A lot of the conversations that we tend to have around how to use data to improve the performance and the work of cities tend to be about work that is very top down. It's having experts going in and reviewing information that you can make sense of only with some deep expertise. But one of the things that seems promising about generative AI, and you started alluding to it, is the ability to have people without technical skills being able to do technical work, but there's also other things in that vein. We've talked a lot about what happens when residents are bringing their own experiences and their own data into the equation. How do you think that generative AI can start changing the dynamics when the public can be making analysis and assessments and consume information in ways that are as sophisticated as any manager or any policy expert that works within City Hall?

**Sarah Williams:**

Yeah, I mean, I think one of the things that's great is that we can now compile all of this qualitative data and allow residents to ask questions of it. I do think that one of the biggest issues in cities is closing that knowledge gap between who has information and who doesn't. And generative AI has the ability to potentially help us close that knowledge gap to understand why decisions are made on both ends of the spectrum, like why our communities are interested in a certain topic and why city officials respond in a certain way. I think one of the things that was so exciting about the work we did together, Santi, was bringing in those community meetings, bringing in those other documents, and then seeing how they played out in the Blue Hill Avenue plan. And what was exciting about that is you can see, the community can see, we listened to your feedback.

Here is the page number where we listened to that. And to be able to make that connection, I think, even for myself is hard, right? I write a strategic planning document, I outline it. I think about all the things that I listened to. I'm not always thinking about exactly what community meeting that's reflected from, and by bringing those community transcripts and the surveys and then connecting them to the final output, you could see those through lines between how the community was asking for things and how the city was responding, in a way that I think has been really obscure and difficult to do before. And I think to your second part of the question, it's been hard to get information from communities. This is something that cities struggle with always over time. We have surveys or we have community meetings, but there's certain voices that are coming with that.

And I think that one of the things we really need to think about is how creating, let's say, large language models of communities that are specific to their needs might help get more input from diverse groups. And what I mean by that is allowing communities to input documents or newsletters that are important to them, other kinds of materials like photos, images, and others that have meaning to them. And having those documents, let's just even say we have the community newsletter every month that we train into a large language model, we're really reflecting the values of the community in a way that was hard to process before. And so one of the things I've been really thinking about is can communities have these smaller language models or miniature models about what are their interests and feelings that we can then build off of when we're thinking about planning in cities?

**Stephen Goldsmith:**

So let me get back to Sarah in a second, Santi, but let me ask you kind of an intervening question. I think the aspiration that Sarah set out is terrifically important and possible, but what does it require from you as a CIO of Boston? Your data today, how much of it would be available to OpenAI open data inquiry? Most of it? Some of it? All? Let's say you're dealing with a question of why does flooding occur in my neighborhood more than any other neighborhood? And you want to use generative AI as a community leader to kind of explore the data. Have you made that possible or not?

**Santi Garces:**

I think starting from what we have, going to what we're missing, we have, I think because of the collective work and things that you led in New York and Indianapolis, I think that we have a breadth of open data, that quantitative understanding of our transactions and administrative procedures. We generally have also had a lot of data in the form of reports, assessments, analysis that we put on the website that we hope everybody reads, but we know that not a lot of people read. We have policies and process and procedures that are published, right? There's the law around, how is it, if you thought that we should be building a seawall, you might be able to go and find a lot of the information about what's preventing us from being able to do certain things. They're all online. So I think there's a lot of pieces to start there.

Generally, we continue to have favoring of the insider because if nothing else, you know where to look for the pieces that you need. But a lot of it is already online. When I hear Sarah speak, I think that it is very exciting because to this point, I think that there's this tension between what does government know that is going on and how government would react. But up to some point thinking about how is it that we can get better data and better information about what's going on and what people care about and how they're thinking about it and how could we communicate and get better feedback? And I think there's obviously a matter of trust for most of understanding how is it that we're going to use information that you're providing? I think, secondly, my sense is that there's a cultural understanding of government a little bit as an antagonist, if you want government to do something and the insider activists know how to pull what levers at what time to try to get us to do the thing that they want.

So that's why you get so many distortions of signal to noise in some cases. People know how to block something that they don't want or push something that they like. And I think that there's some gaps around governance as well, if we're really kind of creating this layer of information from people. But I know that some of the things that I'm getting from Sarah just thinking if we are to achieve this kind of beautiful and messy new version of what government could be with generative AI, what are the pieces that are missing? And ultimately, I think, the biggest question is all data are representation, and then there's a question of like, if the model tells me something about your preference, is it still as good as you expressing that preference? There's a lot of complicated stuff, but I think that there's a lot of stuff that we already have and then missing pieces that are, I think, a little bit complex but maybe really worth it to sort out in the next few years.

**Stephen Goldsmith:**

Sarah, I was always interested in the fact that city performance, stat programs, metrics, our interesting definition of customer satisfaction is basically we score our own results and we tell the customer how satisfied they should be. And even in New York where Mike Bloomberg had a pretty sophisticated approach where we had city employees drive around and inspect, but still it was government evaluating its own service. So let's take this conversation you and Santi are having, even though Boston is kind of one of the most progressive, let's make them the bad guy here in this conversation. So let's say that Santi is measuring his own performance. Could you tell your neighborhood organization just to drive around all day and take pictures of their sanitary conditions, upload them to OpenAI generative examination and score from the photographs and score from real life pictures of the sanitation? Couldn't the community actively add, to your earlier comment, couldn't they actively add to those metrics for real scores?

**Sarah Williams:**

Yeah, I mean, I think that's interesting. In a way, you're implying, could the community crowdsource their performance review? And I think it holds the same kind of problems of other kinds of crowdsourcing where you don't know who's contributing and who's not, and are there certain kinds of people who contribute more than others? And so I think that while the technology might exist and allow us to do that, we could maybe use people's doorbell videos or other things to start to assess certain levels of care. But again, it would be biased to those people who have those doorbell apps or those people who can contribute. So I think one nice thing about what a city provides is they really seek to go out and take that data that might have been missing, should it not otherwise be contributed. But on your line of performance evaluations and cities performing their own self-evaluations, I thought about this a lot because a lot of them come out as data visualizations.

And data visualizations can skew towards certain kinds of answers. And I think sometimes those data performance reviews do that, or they over complicate the reviews to a point where it's hard to really glean the information. So I think one thing that could be used is actually let's take those PDF evaluations and bring them into a generative AI, and maybe we can have a better interpretation of the results in them for, let's say, a community member. So I'm just thinking about, I worked with New York City Department of Sanitation about their complaints and their performance evaluation review sheet was the most complicated thing I've ever looked at. I really had a hard time knowing how they were doing based on some of the output from those Excel sheets. Could you then take those and make them easier to communicate? And I think that's where generative AI might be able to be of more assistance in that.

**Stephen Goldsmith:**

Santi, I want to go back to you, but let me just comment on Sarah's, so when I was deputy mayor and looked at the performance reports, what was it? Lake Wobegon, every neighborhood in New York City was rated by sanitation as over 97% in sanitation quality. So there was nobody who was inferior. So your story is appealing to me. Santi, what were you going to say and how are you using visualizations internally to drive action?

**Santi Garces:**

Well, I will go on a slight tangent and ask a follow-up of Sarah, because that's maybe one of the things that would be interesting. Again, the generalized models, we know, have some bias around how they're trained and they exacerbate existing issues. But imagine that a community, not only the city government but the city of Boston as a whole, thought that being able to qualify the quality of sanitation based on pictures was important. And we know that we get very sparse feedback, but we invested in building a small multimodal model that just basically, based on a picture, was able to tell us, did we do a good job or not? Is the street clean or not? And we accounted for bias by making sure that we had proportional representation in the creation of the model. Doesn't it become kind of interesting around the governance? There's clearly this market failure and there's a contradiction of, there's something of value that would help us run a better government.

The nice thing about these models is that we would have to build it once and then we could reuse it for a longer period of time. But there's this, still this kind of tragedy of the commons. How do you incentivize people to participate to build this thing in a way that would be unbiased? And I'd imagine, I think that one of the things that we keep talking about is ownership, and that's one of the big, more complicated things about the generative AI is you're getting access to knowledge to people that you might not necessarily be paying for the creation of that knowledge, which is kind of paying for the agglomeration of the knowledge and the model. So have you thought about how to bridge that kind of economic failure?

**Sarah Williams:**

Yeah, I mean, I think this is a good question. This gets to a conversation we had before, so maybe I'll share with everybody. One of the things that is problematic about using generative AI in cities is that large language models don't have the data yet to say a lot of things about cities. They don't have a lot of the kinds of things that we are interested in cities marked or tagged. So poor sanitation department, they keep getting brought up. But in the case of thinking about trash and what is a dirty street versus a clean street, these kind of things haven't been tagged or what does dirty and clean look like in each city hasn't been tagged. And so Santi and I have been talking about, we really need these databases that kind of have an understanding of what a city is made up of, and the value systems around it in order to really be able to use generative AI well in that city.

And so one of the things we've been talking about is how could we have the community build that with us and that we could build a collaborative, let's say, small language model, which is focused on the city of Boston or even a neighborhood in Boston that's built with the community, helping us tag it, helping us build that knowledge base that doesn't currently exist in generative AI. And related to that, what it means is the city can really co-own that generative model. And so instead of let's say that model being owned by Google or Amazon or OpenAI, this small language model of the city of Boston could be co-built by the residents that live there, be co-produced by the city of Boston. And that can be really useful and updated over time in order to ask those important questions about the city. But why the co-ownership is important to me, I think, is because we always talk about, who owns data has the power, and here the power would be shared.

And I think that's an important model. Also, I think that whatever benefit or let's say funding or possible funding that could be made with such a model should be distributed through that co-ownership model so that the residents benefit from the data that they tag and input into the model itself. And that's a hard thing to do. So I mean, of course, I would love to keep visioning this with Santi, and it's something we've been talking about, but it reminds me of really early conversations that we had when GDPR came out. And so we said under GDPR, everybody owns their own cell phone data and that we should be able to sell that data back to AT&T. But they already own the data. They already have it, that negotiation of selling your private data back. But in the case of generative AI models where we're really training them with the unique understanding of cities and their history, that model can be built with the citizen and co-owned now, and we should set up a model for thinking about how to do that now.

**Stephen Goldsmith:**

Sarah, one more quick question, just explanation, what is the it that would be co-owned and who are the co-owners of it? For our listeners, just a little more detail.

**Sarah Williams:**

It would be a small language model. So that's like a large language model except for small language models have specific context and information. That's what makes them different. So large language models are trained on everything that's available that OpenAI or Google can get its hands on. In the case of small language models, we train it on specific data and information, and that model tends to sit on a separate server somewhere. And that allows the owner of that model to control its use and access to it. It does use the power of large language models, and that's one of the incredible things, right?

Large language models were built by processing all of this data and then looking at statistical representations to inform results that are based really on the statistics of one word being next to another, so forth. Small language models really train on the separate, so they're using the benefit of that large language model, but they're closed to the community. So the it that I'm talking about is a small language model and who has access to it would be any community member that contributes to it, I think. And that's where we need to create kind of governance structure around those.
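As a rough illustration of the "one word being next to another" statistics described above, here is a minimal Python sketch of a bigram counter. It is a toy under stated assumptions, not how production language models work (those use neural networks over tokens), and the newsletter text and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_counts(text):
    """Count how often each word is directly followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical text standing in for a community newsletter (an assumption).
newsletter = (
    "the street is clean today "
    "the street is flooded again "
    "the street is clean this week"
)

counts = build_bigram_counts(newsletter)
print(most_likely_next(counts, "is"))
```

Trained only on this small, community-specific text, the counter's guesses reflect what that community wrote, which is the intuition behind training a smaller model on specific data.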

**Stephen Goldsmith:**

Santi?

**Santi Garces:**

I think, going back to the sanitation example, I think with language, sometimes I struggle more like what would be specifically the small language model for the city of Boston or for a particular neighborhood. But I think about the question of sanitation. Is the street clean? What is trash? And you could think about, again, if we're going to train a model to determine these things from pictures, we would start with a large collection of pictures and then we would go and determine what are the elements in the picture, and we would label. And we would say, this is trash or it's not trash. And eventually you could start ascribing some language that this is what, the language piece, you'd say this is a clean street or it's not a clean street. But to train a model that works well, you want to have very heterogeneous data. You want to have pictures taken in sunlight, pictures taken in the snow, pictures taken from different angles, that will make the model more robust.
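The labeling step and the need for heterogeneous pictures can be sketched in Python as a simple coverage check. This is a minimal sketch under assumptions: the record fields, the neighborhood and condition names, and the `coverage_gaps` helper are all hypothetical, and no actual model training is shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPhoto:
    """One resident-labeled street photo (all fields are hypothetical)."""
    photo_id: str
    neighborhood: str
    condition: str   # e.g. "sunlight" or "snow"
    is_clean: bool   # the label supplied by a resident


def coverage_gaps(photos, neighborhoods, conditions, min_per_cell=1):
    """List (neighborhood, condition) pairs that have too few labeled photos."""
    seen = Counter((p.neighborhood, p.condition) for p in photos)
    return [
        (n, c)
        for n in neighborhoods
        for c in conditions
        if seen[(n, c)] < min_per_cell
    ]


# Made-up example data, for illustration only.
photos = [
    LabeledPhoto("p1", "north", "sunlight", True),
    LabeledPhoto("p2", "north", "snow", False),
    LabeledPhoto("p3", "south", "sunlight", True),
]

gaps = coverage_gaps(photos, ["north", "south"], ["sunlight", "snow"])
print(gaps)
```

A check like this would flag which neighborhoods or conditions still need labeled pictures before training, which is one simple way to watch for the uneven participation discussed earlier.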

And I think the part that's interesting, I get excited talking with Sarah about these things, right now, a lot of the people that are contributing the data to enable these generative models are not getting compensated. And if the people of Boston had the best way of assessing whether a city of Boston street was clean or not, it would be them labeling and them saying, this is what success looks like, so that then we could automate the process of assessing quality of sanitation, the picture against that model. And I find it so interesting thinking about this because it gets a little bit to the frontier about how do you pay for a public good that is contributed by private people. And then I'm getting three steps ahead, I'm like, is it the kind of thing that would make sense to enable through taxation? Is this a public good and should we tax everybody because everybody gets to benefit from it or should we pay the contributors that Sarah was alluding to? There's so many interesting, there are technical issues, but also governance and economics issues that make this very, very interesting.

**Stephen Goldsmith:**

Are you an academic?

**Santi Garces:**

I think that we live in one of the most interesting points in time, thinking about how new tools can help us make government work better.

**Sarah Williams:**

I mean, I love the example that you used because you put it in real terms. Like this is tagging images for our city done by city residents. They benefit from it in multiple ways. And right now when we talk about the labor associated with generative AI, we kind of pass over and buzz over it. But in order to make generative AI work, somebody is labeling everything in those images. And that labor tends to be under-recognized, underpaid, and exploitative. And here what we're saying is by doing that labor you co-benefit so that it doesn't have those kinds of processes. But if we think about the labeling of the images, that's one really concrete way of doing it. But we can also think about bringing in texts, bringing in newsletters. And one of the things that I talk a lot about with communities is not wanting to contribute those kinds of archival data that make their communities so great because they're worried about where it's going to go, and who's going to have access to it, and can they retain their privacy around that information? Can they retain their control of that?

And so one of the things that I love about small language models is whoever contributes can co-own and they can determine who accesses and controls that information. So obviously you would create a governance structure for that so that not one individual can block everybody, but something that you can buy into. And this is something that Eden Medina and I are working on right now at MIT, is thinking about what are governance structures for these smaller language community-owned models? If you think about it, I like to think about the nodes on a web and that we can have lots of these small language models that can collectively tell us a lot about diverse and different cities and almost a constellation, a mesh of models as it were that people can be connected to and access.

**Stephen Goldsmith:**

Okay, you two have more capability of talking than our listeners do of listening, so let's think about how to handle this. Santi, give us a last word. What are the next couple most important steps that you would need to do to enable Sarah's vision?

**Santi Garces:**

I think that we are at a point that, my hypothesis is that we need to validate the value of doing some of these things, again, like these distributed data models. I think for me, Sarah is trying to solve this impedance issue. There's mismatch between government and the people that government is supposed to serve. And then the question is how can we demonstrate the value of the necessary investment and infrastructure to be able to solve it? I think that in the time being, there's some workarounds leveraging things that we already have, to try to bypass some of the things that would actually make it work better.

So again, I think, if nothing else, my takeaway is that talking with Sarah is a great provocation and an invitation for us to step up our work. Knowing that we don't have it all figured out and that there's this really sticky issue with generative AI, there's a whole new set of tools and approaches that we can have in trying to bridge that gap. So if nothing else, I end the conversation provoked to try more, to figure out what is it that we already have that we could use to try to prototype and do these things. And if nothing else, to try to find the resources to validate that there's value in building this kind of new way of thinking about data and urban models for government.

**Stephen Goldsmith:**

That was a terrific answer, Santi. Sarah, your closing comment on what would be needed either from a larger community group with resources and capacity or from the city to reach this new level of collaborative insight?

**Sarah Williams:**

Well, I think the first step is really experimentation and diving in. And so I think that it would be fun to pick a small problem, a small community of need, and really try to see what building out a model like that would entail. Because I think it's through doing that we often understand the nuances behind some of our theory here. So not just making it theory or not being just the professor, as you put it, Stephen, but actually applying that theory towards a specific topic is what I would be super excited about. So whether the sanitation department's ready for us or not, or maybe another community agenda, whether it's thinking about issues around flooding or even the case of Blue Hill Avenue, I think if we can strategize a particular place that we might want to experiment with, I think we could really see the benefits and potential caveats that would be in place and be a leader in thinking about this, because I do think there's real potential in this new area of study.

**Stephen Goldsmith:**

This is Steve Goldsmith from the Bloomberg Center for Cities at Harvard with a terrific conversation with Santi Garces from Boston and Sarah Williams from MIT on the future of how cities are going to get better using generative AI tools. Thanks very much to both of them.

**Santi Garces:**

Thank you.

**Sarah Williams:**

Thank you.



 

 

 

##  About the Author 

### Betsy Gardner

   ![Headshot of Betsy Gardner](/sites/g/files/omnuum10826/files/styles/hwp_1_1__100x100_scale/public/2025-05/Betsy%20Headshot%20resize.jpg?itok=k2OsSp1g) 

 

Betsy Gardner is the editor of Data-Smart City Solutions and the producer of the Data-Smart City Pod. Prior to this, Betsy worked in a variety of roles in higher education, focusing on deconstructing racial and gender inequality through research, writing, and facilitation. She also researched government spending and transparency at the Lincoln Institute of Land Policy. Betsy holds a master’s degree in Urban and Regional Policy from Northeastern University, a bachelor’s degree in Art History from Boston University, and a graduate certificate in Digital Storytelling from the Harvard Extension School.



 

 



 

See also:

- [ Artificial Intelligence ](/topics/artificial-intelligence)
- [ Civic Analytics Network ](/topics/civic-analytics-network)
- [ Civic Engagement ](/topics/civic-engagement)
 
 
