Reduce, Reuse, Recycle: Data Benchmarking and Accessibility for Faster Research With the Catalyst Property Database Webinar

This is the text version for the Reduce, Reuse, Recycle: Data Benchmarking and Accessibility for Faster Research With the Catalyst Property Database webinar.

Erik Ringle, National Renewable Energy Laboratory:
Well, hello, everyone, and welcome to today’s webinar, Reduce, Reuse, Recycle: Data Benchmarking and Accessibility for Faster Research With the Catalyst Property Database. I’m Erik Ringle from the National Renewable Energy Laboratory. Before we get started, I’d like to cover some housekeeping items so you know how you can participate in the event today. You’ll be in listen-only mode during the webinar. You can select audio connection options to listen to your computer audio or dial in through your phone. For the best connection, we do recommend calling in through your phone line. You may submit questions for our speaker today using the Q&A panel. If you’re in full-screen view, click the question mark icon, which is located in the floating toolbar at the lower right side of your screen. That will open the Q&A panel. If you’re in split-screen mode, the Q&A panel is already open, and that’s located also at the lower right side of your screen.

To submit your questions, simply select “all panelists” in that Q&A dropdown menu, type in your question or comment, and press enter on your keyboard. That’s all it takes. You can send in those questions at any time. We will then collect these and, time permitting, address them during the Q&A session at the end. If you have technical difficulties or just need help during today’s session, I want to direct your attention to the chat section. This chat section is different from the Q&A panel and appears as a comment bubble in your control panel. Your questions or comments in that chat section only come to me, so make sure you use that Q&A panel and not the chat section for content questions for our speaker. We are also recording the webinar. It will be posted on the ChemCatBio website at a later date along with the slides. Please see the URL provided on the screen here.

Now, a quick disclaimer. This webinar, including all audio and images of participants and presentation materials, may be recorded, saved, edited, distributed, used internally, posted on the U.S. Department of Energy’s website, or otherwise made publicly available. If you continue to access this webinar and provide such audio or image content, you consent to such use by or on behalf of DOE and the government for government purposes and acknowledge that you will not inspect or approve, or be compensated for, such use.

Our speaker today is Kurt Van Allsburg, who is a scientist in the catalyst research program at the National Renewable Energy Lab. His research focuses on understanding pathways to commercialization for catalytic processes such as biomass conversion to fuels and chemicals, and using those commercialization insights to accelerate technology development. We are excited to hear what you have to tell us today, Kurt. And with that, feel free to take it away.

Kurt Van Allsburg, National Renewable Energy Laboratory:
Great. Thanks, Erik. I’m just going to get set up here. OK.

Erik:
That looks good.

Kurt:
Good. And I’m just going to grab the chat, few other set up. Oh, while we’re getting started, I wanted to clear up a common misconception. This is not me, even though this is a similar looking person. Unfortunately this is a much smarter researcher. You’re going to have to settle for me today. I’m excited to be talking with everyone today, though, about this Catalyst Property Database that we’ve been developing as part of the ChemCatBio consortium. And, Erik, do you think we should get started, or are we going to wait a little bit longer?

Erik:
Kurt, I think we’re good to go.

Kurt:
Great. Well, thanks everyone for attending and excited to talk with you today about my topic of reduce, reuse, recycle, working on better uses of data for faster research through the Catalyst Property Database. Probably many of you are already familiar with the Chemical Catalysis for Bioenergy Consortium—ChemCatBio—that this work is done under. But just for those who aren’t, our goal is to accelerate the catalyst and process development cycle. You can learn more about our consortium at ChemCatBio.org. And many of you probably found out about this webinar through our newsletter, The Accelerator, but if any of you aren’t subscribed and are interested, please go to our website and you can sign up.

So in ChemCatBio, as I mentioned, what we’re seeking to do is accelerate catalyst discovery. And we’re responding to the observation that the path to catalyst deployment is slow and difficult. And that’s sort of represented by all of these barriers to catalyst maturation that are shown at the top as you’re on your pathway from discovery to scale-up and eventual commercialization. And what we’re trying to do is provide resources that reduce the cost and the time of maturation, basically flattening this path to make it faster. And we do that through all these resources that we provide as a consortium. Today I’m going to be focusing on tools that improve research efficiency.

And we’ve already released several of those, and one is the CatCost tool, which I was heavily involved with. CatCost is a free catalyst cost estimator that we released in 2018, and we just did a major update of the tool this year in May. And we also added a mailing list, so if you use CatCost or are interested and you want to get updates—infrequent updates—on new features and so forth, please check it out at catcost.chemcatbio.org. Some things about CatCost that are cool is that it considers detailed capital and operating cost for making a catalyst. It has both Excel and web app versions, and we have a Python tool to convert between those. And we have some really cool advanced visualizations that are built right into the tool to help you to visualize your data and better understand it.

Today’s webinar is going to be focused on the Catalyst Property Database, a tool that we’ve developed and released more recently. It’s a free database for catalyst research, and we released it last year, and we just opened it to external contributions last month. So that’s available at cpd.chemcatbio.org, and I’m going to be talking more about that. So the first question is why build a catalyst database like this. Well, the way that I think about this is it’s really about how slow and difficult it is to do a literature search to find catalysis data. So, as you can see up here at the top, it can be very slow and cumbersome to find catalysis data in the literature. First you have to find the right papers to read from all the papers that are out there. That’s a slow process of searching databases and setting up literature alerts and things like that. Then, once you’ve got a set of papers, you have to carefully read them to confirm that the methods that were used in the paper match those that you need for your study. And then finally, often—too often—the way that you actually are able to get that data and then use it is by hand-copying the data out of the journal PDF into something else for your application.

So because of all those challenges and all those slow steps that make it hard to find reliable and directly comparable data sets, it’s often easier to just compute or collect the data yourself from scratch. Even if it existed out there somewhere else. It’s just too hard to find. So this results in a lot of duplication of effort, a lot of spent money and time that perhaps didn’t need to be. So our goal is to—this sort of cheeky title of reduce, reuse and recycle data. So what we’re trying to do is make it easier to find the data that’s already out there so that you can focus on doing what’s new in your research and ultimately accelerate your progress.

So how we’re doing that is we want to make catalysis data more accessible. And we’re doing that with the Catalyst Property Database. The idea is to bring a bunch of data into one place so it’s centralized, and that will enable faster comparisons because you can take a whole bunch of data and look at it all together. It’s also searchable, meaning that instead of having to pore over those journal PDFs to try and figure out whether the methods match, you can search those metadata fields. We have dozens of metadata fields. You can search them directly so you can find the right results faster. And then it is a publicly accessible database, and we just opened it to contributions from the public, as I mentioned. I’m going to mention a little bit more about the quality control measures we have in place to ensure everything is high quality.

The data types that we want to include in the database sort of fall into three different categories, and we’re initially focused on computational catalysis. And that’s just because the metadata for computational catalysis is really quite well defined and easier to implement, we found, compared to other options. So that’s where we’ve focused our first efforts. But we’d also be interested in incorporating a bunch of catalyst characterization data and ultimately reaction performance. And all three of these together can provide a lot of value and opportunities for finding new trends.

So just to say a little bit more about what we currently have in the database. At initial release, we have DFT-computed adsorption energies for intermediates on catalyst surfaces. So in other words, the energy of this reaction to take some adsorbate and put it onto a particular catalyst surface. And those all come from peer-reviewed journal articles. Our longer-term goals are to add new data types that will allow us to enable scaling relations and reactor—reactivity descriptor discovery, both experimental and computational results, as I mentioned. We want to incorporate details on how a particular catalyst was made or what its composition is. And ultimately, we’d really like for all of these additions to be guided by what your needs and preferences are, so please communicate those to us and work with us to help us guide the progress on the CPD.

Some of the possible applications of the CPD that we envision are first of all up here in the left, top left—we have benchmarking and validation. This is where researchers want to check the results against the literature, and it makes it easier through the CPD to find the right results and check them. For catalyst screening, you’re looking at taking reactivity data and correlating it to making structure-function relationships and ultimately to figure out how you might be able to predict new target compositions to make in the lab. You can do really a lot of things when you have data at your fingertips, coupling computational modeling, characterization, and activity, sort of those three data types I mentioned earlier. And then finally, by bringing together a lot of different data, you can uncover trends that you haven’t found before. So we’re excited about each of these and I’m going to tell you more about some opportunities there.

So I just told you why we built the database, and now I want to talk about how we built it and why you should care, because there are a lot of things that we had to decide and to implement that are really quite relevant to how it would be useful to perhaps you as a catalyst researcher or someone interested in biofuels. And I’m also going to give some examples and upcoming additions.

So talking about how we built it, I first want to highlight our excellent team for the Catalyst Property Database. All of these researchers are working under the ChemCatBio Data Hub project. And we have a mixture of experimental and computational expertise. That’s sort of on our chem team, what I’m calling it, and we focus more on finding data in the literature and thinking about how to structure the data from the perspective of the chemistry and chemical engineering. Then we have a great development team made up of software engineers and system architects and UI designers, and really have an awesome group of people working together on this. So big thanks to them. They have really worked hard over the past year and before that to deliver some awesome features for you, and we’re excited to show you what’s next as well. So that team has allowed us to develop some really cool stuff.

And without further ado I’ll get into that. The first thing that I want to highlight is that the CPD has really cool, fast, and simple searching. So if we look at this, the first thing you might notice is that there isn’t a search page and then you go to results page. The search updates live as you add criteria so you can get feedback as you’re going on, whether you’re getting the results you want or you’re not getting any results, and you maybe need to broaden your criteria. We designed the database to be simple to use for both computational and experimental researchers who may not be as familiar with the computational results that are currently contained there.

The way that we were able to set all this up is by sort of having a tiered interface, if you will. So the way that the CPD architecture is set up is that especially if you’re a casual user that’s just looking to get some values and you don’t want to get into the details too much, you would just be interacting with the CPD website at cpd.chemcatbio.org and that’s the web app, the front end. And it has this intuitive search-and-filter interface. You don’t have to have any coding experience, and that’s where casual users will be able to just look at the data and search and find what they’re needing. Then, we also have an application programming interface that is what the web app is communicating with. And that’s where power users—if they’re interested in uploading data or interacting with the website in order to download a lot of entries—they’re going to be interacting using scripts directly with the API. So the bottom line is that we have different access points for different types of users. And wherever users are interacting with the database, we’re using user training, data curation, and quality controls really just to make sure that everything stays high quality. In the back end we have the actual relational database where the data lives, and that’s designed for performance at scale as we continue to add data.

It’s sometimes harder than you might guess to figure out what data and metadata fields to include for a new database that you want to set up, and that was the challenge that we faced. We had to figure out which things are important to users, which of those things that we want to pull out of the journal article, so that people don’t have to go to the journal article to find them. So we thought about that and we developed this data structure for computed adsorption energies, and it contains all sorts of information from bulk material properties to—if it’s a nanoparticle—information about that, information about the methods and the adsorbate and the reference species, as well as a link to the actual paper via the DOI. So this is key. Providing all this data is key to our goal of helping users find the right data faster. As we continue to grow the database, we’re going to be working with users to define structures for new data types because of course it’s going to be different for experimental data or even perhaps a different type of computational output. But the nice thing is that with one under our belt it will get easier.

So we then converted—this image is showing up blurry for me, but the bottom line is that we took that data structure I just showed you and we actually turned it into a Postgres database. So this is the adsorption measurement, adsorption energy table, and it’s connected to the method and then to bulk surface properties. So the bottom line is that by breaking up the data in a way that makes sense for the structure, it helps the database to be ready to efficiently scale up. And as I mentioned, we can do this faster for new data types, whether it’s catalyst characterization or synthesis or reactor performance. I mean we’re at one now. Two will be a little easier, and as we get to n data types, it’s just going to get easier and easier and we’ll learn as we go. OK. So that’s a little bit about the structure of the database.
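
To make the table layout described above concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres. Every table name, column name, and value below is a hypothetical placeholder chosen for illustration; the actual CPD schema is not shown in this webinar and may differ.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the actual CPD tables.
SCHEMA = """
CREATE TABLE method (
    id INTEGER PRIMARY KEY,
    functional TEXT NOT NULL,      -- exchange-correlation functional
    software TEXT
);
CREATE TABLE bulk_surface (
    id INTEGER PRIMARY KEY,
    bulk_formula TEXT NOT NULL,
    facet TEXT
);
CREATE TABLE adsorption_energy (
    id INTEGER PRIMARY KEY,
    adsorbate TEXT NOT NULL,
    reference_species TEXT NOT NULL,
    energy_ev REAL NOT NULL,
    doi TEXT NOT NULL,
    method_id INTEGER NOT NULL REFERENCES method(id),
    surface_id INTEGER NOT NULL REFERENCES bulk_surface(id)
);
"""


def build_demo_db():
    """Create an in-memory database with a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO method VALUES (1, 'PBE', 'VASP')")
    conn.execute("INSERT INTO bulk_surface VALUES (1, 'Pt', '111')")
    # Placeholder energies and DOIs, not real literature data.
    conn.executemany(
        "INSERT INTO adsorption_energy VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "H", "H2(g)", -0.5, "10.0000/placeholder-1", 1, 1),
            (2, "CO", "CO(g)", -1.5, "10.0000/placeholder-2", 1, 1),
        ],
    )
    return conn


def search(conn, adsorbate):
    """Join the energy table to its method and surface rows for one adsorbate."""
    query = """
        SELECT a.adsorbate, s.bulk_formula, s.facet, m.functional, a.energy_ev
        FROM adsorption_energy AS a
        JOIN method AS m ON m.id = a.method_id
        JOIN bulk_surface AS s ON s.id = a.surface_id
        WHERE a.adsorbate = ?
    """
    return conn.execute(query, (adsorbate,)).fetchall()
```

Storing the method and surface once and pointing to them from each energy row is what keeps the energy table small as entries accumulate; a new data type would add its own table and reuse the shared ones.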

Now let’s talk about some of the features that help users. So the first thing that we’ve got that I’m really excited about is we have this parameter guide for nonexperts. So that explains the columns that you’re going to see when you’re searching the CPD. It gives details on what does the bulk formulae refer to or what would it mean if a unit cell were stretched or compressed. These are designed so that it’s really trying to make computational chemistry results a little bit more accessible to people that don’t work in them every day. And also in some cases, maybe we haven’t been quite clear enough, and we want to provide a detailed guide so that people can understand what we mean. So that’s sort of—that’s our parameter guide, which is just for specific columns that you’ll see in the CPD. But then we also have a detailed user guide that gives information on how you search, how you upload data, some history on why we made choices in the CPD. So this provides documentation for both casual users and power users.

So another step that we had to take that’s really important, I think, for users is that we standardized the data that we were putting into the CPD and created rules so that we can standardize future data coming in. The reason this is important is let’s just look at these first three entries here. These are all different ways to write oxygen, dioxygen. You can write the name as dioxygen or oxygen with a g behind it. There’s a lot of different ways to write really the same molecule. So what we did is we set up rules so that we don’t have these entries. We only have this one that follows our rules so that when you’re searching the database you can be sure that you’re finding all the entries that actually have oxygen as an adsorbate or a reference species. So this makes performance and search better and it ultimately, we hope, gives you more confidence in the data because you know that you’re getting what you hoped and you don’t have to kind of try and guess what other people might have written.
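
The standardization idea above can be sketched as a lookup that maps the many ways of writing a species onto one canonical label. The synonym table and the choice of "O2" as the canonical form are assumptions made for this illustration; the CPD's actual naming rules are described in its user guide.

```python
# Hypothetical synonym table; not the CPD's actual rule set.
CANONICAL_SPECIES = {
    "o2": "O2",
    "o2(g)": "O2",
    "dioxygen": "O2",
    "oxygen (g)": "O2",
    "oxygen(g)": "O2",
}


def normalize_species(name):
    """Map a free-form species label to one canonical label."""
    # Lowercase and collapse whitespace so trivial variations match.
    key = " ".join(name.strip().lower().split())
    try:
        return CANONICAL_SPECIES[key]
    except KeyError:
        # Unknown labels are rejected rather than guessed at.
        raise ValueError(f"Unrecognized species label: {name!r}") from None
```

With a rule like this applied at upload time, a search for the canonical label finds every entry, however the original paper wrote the species.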

So those are some of the features of the CPD from the perspective of accessing data. But as I mentioned last month, we opened the CPD to uploads from the entire catalyst community and we’re really excited about that. Previously, we had been adding all—we as the Data Hub and/or CPD team—have been adding all the data that’s present in there. So this is the process for data to be uploaded. The user prepares data in a CSV or JSON format—that’s comma-separated values or JavaScript object notation. And then they upload the data. Currently we’re just allowing uploads via script. In fact, we’re thinking that maybe this user interface-based upload may not be needed, so interested in hearing from you, but we figure that probably a lot of people that might be uploading data would be comfortable using Python scripts to call the API, and we do have that set up.
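
As an illustration of the data preparation step, the sketch below serializes the same records to both CSV and JSON using only the Python standard library. The field names and values are hypothetical placeholders, not the CPD's upload template, and the upload call itself is deliberately left out because the API is not described in this webinar.

```python
import csv
import io
import json

# Hypothetical field names; consult the CPD user guide for the real template.
FIELDS = ["adsorbate", "reference_species", "bulk_formula", "facet",
          "adsorption_energy_ev", "doi"]

# Placeholder record, not real literature data.
records = [
    {"adsorbate": "H", "reference_species": "H2", "bulk_formula": "Pt",
     "facet": "111", "adsorption_energy_ev": -0.5,
     "doi": "10.0000/placeholder"},
]


def to_csv(rows):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same records to JSON text."""
    return json.dumps(rows, indent=2)
```

JSON keeps numbers typed, while CSV turns every value into text, so a script that reads the CSV back has to convert the energy column to a number itself.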

So once the data is uploaded by the user, it goes into a staging area, and in that staging area it’s awaiting curation. So once the data shows up in the staging database, a curator gets a notification of data to be reviewed. They then log in and they take one of many different steps depending on what’s—how the data looks and whether it’s compliant with the rules that we’ve established and whether there’s any missing information. And then they will either accept the data with changes or contact the user or whatever is needed. Now this is still a pretty early process. We’re sort of working out the kinks here because we’re waiting on people to really start uploading lots of data, but it will get easier as we figure out what the most common problems are, and then we’ll put those into our user guide to help people to avoid them. In the long run we’d love to recruit a team of curators sort of like Wikipedia that can be contributing to the site and helping us ensure high quality.
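
The kind of check a curator might run on staged data can be sketched as follows. The required fields and the three rules here are assumptions chosen for illustration, not the CPD's actual curation criteria.

```python
# Hypothetical required fields; not the CPD's actual list.
REQUIRED_FIELDS = ("adsorbate", "reference_species", "bulk_formula",
                   "adsorption_energy_ev", "doi")


def review_record(record):
    """Return a list of issues for one staged record; empty means no flags."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            issues.append(f"missing field: {field}")
    energy = record.get("adsorption_energy_ev")
    if energy is not None and str(energy).strip() != "":
        try:
            float(energy)
        except (TypeError, ValueError):
            issues.append("adsorption_energy_ev is not a number")
    doi = str(record.get("doi") or "").strip()
    # DOIs begin with the directory indicator "10."
    if doi and not doi.startswith("10."):
        issues.append("doi does not look like a DOI")
    return issues


def triage(records):
    """Split staged records into accepted ones and ones needing follow-up."""
    accepted, flagged = [], []
    for record in records:
        issues = review_record(record)
        if issues:
            flagged.append((record, issues))
        else:
            accepted.append(record)
    return accepted, flagged
```

Automated checks like these would only cover the mechanical part of curation; the flagged list is where a curator would step in to correct the entry or contact the uploader.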

So if you’re interested in uploading data, you can get started by cloning this repository, which is where the Python 3 scripts—the library—the Python 3 library for batch upload lives. And there’s some usage examples in there including integration examples that go all the way from some VASP output files, computational output files to the database. So please check that out if you’re interested. And if you want any help, please reach out. We want to work with you, and especially these earlier uploads are going to be very much a collaboration, so please reach out. OK. Next, I want to give two use case examples and then I’m going to talk about what’s coming up next for the CPD.

So the first use case that I want to talk about is a computational chemistry graduate student. So they show up at the university. It’s their first year, and their advisor says, “Oh go and run these calculations. I want you to look at this set of adsorbates on this surface or this catalytic pathway and you should check that your results make sense.” And the graduate student says, “Oh well, I guess I should figure out whether they make sense somehow.” Well, they can, that grad student can then go to the literature and find papers that might match with what they’re doing in order to benchmark their work. But that will be made a lot easier if they can go to the Catalyst Property Database to find each reference that they’re interested in looking at. So it basically helps them benchmark faster and sort of get a leg up on their project as they’re getting going, so they can benchmark that way.
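
The benchmarking check in this use case can be sketched as a comparison of one newly computed value against a set of previously published values. The 0.2 eV default tolerance is an arbitrary assumption for illustration, and the function and key names are hypothetical.

```python
from statistics import mean


def benchmark(my_value_ev, literature_values_ev, tolerance_ev=0.2):
    """Compare one computed energy against a list of reference energies."""
    if not literature_values_ev:
        raise ValueError("need at least one literature value")
    center = mean(literature_values_ev)
    deviation = my_value_ev - center
    return {
        "literature_mean_ev": center,
        "deviation_ev": deviation,
        # Assumed acceptance rule: within a fixed window around the mean.
        "within_tolerance": abs(deviation) <= tolerance_ev,
    }
```

A comparison like this is only meaningful when the literature values share the same surface, method, and reference species as the new calculation, which is what the searchable metadata fields are meant to make easy to confirm.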

The second use case is also benchmarking, and that’s because benchmarking is very important, as I’m going to say a little bit more about in a second. This is the case of a journal article reviewer who wants to do a sanity check on a density functional theory paper that they’re reviewing, and they can go to the Catalyst Property Database and check whether the numbers are in the ballpark that they would expect from what’s previously been published and uploaded to the CPD. We also hope that on a sort of related note that editors and reviewers will encourage authors to upload their data to the CPD. That will aid in all of the applications that I’ve mentioned. So you notice that I made two examples that were both benchmarking, and that’s because I really think benchmarking is a very important challenge facing the field, and I’m certainly not the first to point this out.

There have been quite a few people commenting on this and papers published. I’ll just highlight two here. This “Toward Benchmarking in Catalysis Science: Best Practices, Challenges and Opportunities” grey paper. And also a more recent one “Towards Experimental Handbooks in Catalysis.” We’re really interested in thinking about how we can advance benchmarking overall in the field and in both computational and experimental results. We’re—I want to acknowledge some great discussions that we’ve been having with Northwestern researchers listed here. We’re talking about ways to improve benchmarking and possible collaborations on demonstration projects that would sort of guide benchmarking efforts. But we don’t want in any way for this to be exclusive. If you’re interested in working with us in the CPD on growing the database and benchmarking efforts, please reach out. It would be great to work with you on this important challenge.

So last thing I want to talk about is future updates to the CPD. The first one of those, of course, is we just want to get more data in there and grow the database. So the first thing we’re going to do is to continue adding data from the literature. It is a relatively slow and manual process, to be honest, and that’s part of the value we’re adding to users. At least we do it once and then many other people can benefit from it. But we’re looking for ways to speed this up, and so that would be things like we could really use your help. If you’ll work with us and upload your new results so that we don’t have to mine them from your papers, that of course benefits everyone. You can get more citations and we can help our users benefit from what’s in the CPD. In the long run, we are considering whether to relax the current requirement that all data in the CPD comes from a peer-reviewed journal article. We’re very interested in your thoughts on this if you want to weigh in.

One way that we can improve what I mentioned is still a relatively manual process for us to add data from the literature, is we’re piloting the use of machine learning and/or natural language processing. Please don’t criticize my use of the terms. I’m only barely grasping them myself. But there’s this really cool tool called ASReview that allows you to basically sift through a lot of papers and find those that are most likely to be worth reading so that you can then look in them and say, “Oh, this one contains lots of tables full of density functional theory results. This is a great paper for me to spend time uploading into the CPD.” So excited to work on that, and we’ll share our results with you in the future.

The next big update coming up is that we’re going to be doing some major upgrades to the user interface. Those are planned for release this winter and/or spring. For example, one thing that we want to add is prepopulated filters, because right now you have to add each criteria in that you might add to your search. We want to prepopulate those that kind of reflect the most common search preferences to save people time and also allow them to focus in on the data that’s most likely to be relevant to them. That’s one example of an improvement we’re going to add. Another really cool feature is this idea of reference species translation. OK.

So stepping back, computed adsorption energies are energies of a reaction, of course. And it could be it’s a reaction from some gas-phase reference that then adsorbs to the surface and produces this adsorbed species. The problem is—or well, just the reality is that for many adsorbates, well, for any adsorbate there are multiple reference species that are possible. So for this hydrogen atom, for example, it could be relative to one half of a dihydrogen molecule in the gas phase or it could be relative to an isolated hydrogen atom, and those have quite different energies. So the problem is if half of the data in the CPD is this reaction and half of the data is that reaction, that data cannot be directly compared, which limits the scope of comparisons that you can make. It’s almost like if you had a list of temperatures, some were in Celsius and some were in Fahrenheit, but you didn’t know the formula to convert between them.

Well, we want to solve this problem in the CPD, and our solution is to create a reference species translation feature that will—basically what it does is it will contain gas-phase reference energies for a variety of different species that will allow you to convert between these two sets of data. So then if you had 100 entries for each of 10 different reference species, without this tool you’d only be able to do a comparison of 100 entries at a time, but with this tool you could do 1,000. So it’s really cool and definitely a possible way to improve comparisons in catalysis. And as far as we’re aware, no other public database or resource currently allows you to do this.
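
The arithmetic behind reference species translation can be sketched as follows. Because an adsorption energy is the energy of the adsorbed state minus the clean surface minus the reference, switching references shifts the value by the difference between the old and new reference energies. The gas-phase energies below are illustrative placeholders, not CPD data, and the function names are hypothetical.

```python
# Illustrative gas-phase energies in eV; placeholder numbers, not CPD data.
# Only the difference between references matters for the conversion.
GAS_PHASE_ENERGY_EV = {
    "H2(g)": -6.8,
    "H(g)": -1.1,
}


def reference_energy(reference):
    """Energy of a reference given as {species: stoichiometric coefficient}."""
    return sum(coeff * GAS_PHASE_ENERGY_EV[species]
               for species, coeff in reference.items())


def translate(e_ads, old_reference, new_reference):
    """Re-express an adsorption energy against a different reference.

    E_ads = E(surface + adsorbate) - E(surface) - E(reference), so changing
    the reference adds E(old reference) - E(new reference).
    """
    return (e_ads
            + reference_energy(old_reference)
            - reference_energy(new_reference))


# Hydrogen adsorption relative to 1/2 H2(g), re-expressed relative to H(g).
half_h2 = {"H2(g)": 0.5}
h_atom = {"H(g)": 1.0}
converted = translate(-0.5, half_h2, h_atom)
```

With these placeholder numbers, an adsorption energy of -0.5 eV relative to half a dihydrogen molecule becomes about -2.8 eV relative to an isolated hydrogen atom. The conversion is only valid when the gas-phase energies come from the same level of theory as the adsorption calculation.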

Another thing that will be enabled by the inputs that we’re going to have to calculate for that reference species conversion feature, or translation feature, is that we’ll be able to do reaction networks. So we’ll look at all of the different species that are connected to each other, show the connections, and even be able to generate energy diagrams for those reactions. So one other nice thing about this is that this would allow us to cross-pollinate with databases that instead of focusing on adsorption energies focus on reaction energies such as the Catalysis Hub that’s been developed by SUNCAT. And we’ve been in talks with the developers of Catalysis Hub looking at possible collaborations. This is definitely not a winner-take-all mindset. We want to all work together to move the field forward. And we’re thinking about things like collaborating on input specification so that everyone can streamline and harmonize their data.

The last or second-to-last future update is a catalyst deactivation mitigation resource. The idea here is catalyst deactivation is one of the biggest problems in the field, and it’s hard to figure out how to predict it. And what we want to do is have a bunch of adsorption energies for common catalyst poisons on common surfaces so that then you have a data set that can be used to make systematic predictions about catalyst deactivation rates. OK. Now the last update is that, as I mentioned, we’ve focused here on computational catalysis to start. But we’re interested in expanding to catalyst characterization, reaction performance, and so forth. So we need your help with this. We need your help with prioritizing new areas to go into and also with figuring out how to structure the data.

So I’m going to close with the four different applications of the CPD that I mentioned from benchmarking and validation, catalyst screening and predicting new target compositions, interpreting experimental trends, and to discovery of scaling relationships and reactivity descriptors once we get the right type of data into the database. I want to again thank our great team that has worked on the database. It’s been really a pleasure and really cool working on this project. And I want to thank the Bioenergy Technologies Office, which funded all of this work and has been very supportive throughout. OK. Then I’m going to end here, so thanks so much for your attention. Really happy to take your questions and to further discuss with you. Thanks.

Erik:
Yeah. Thanks, Kurt. It was an interesting presentation on what is probably going to become a frequently used addition to the ChemCatBio toolbox. So we do have time for a few questions. As a reminder, you can use the Q&A box to submit your questions to Kurt. I’ll collect those and pass them on to him. But I do have a couple, Kurt, maybe to get us going here.

Kurt:
I guess, Erik, while we are waiting for people to input their questions, I could do the quick demo of the CPD that—

Erik:
Oh yeah. Why don’t you go ahead and do that?

Kurt:
All right. I’m just going to share my desktop again. Let’s see if this works. OK. Now on a different desktop, so hopefully this works OK. So what I’m going to do is I’m just going to go to cpd.chemcatbio.org and load the database. And now, so I’ve landed on the landing page here. I’m going to say here you can see the parameter guide that I mentioned, lots of cool information there. And then also there’s the user guide here. But now I’m going to go to actually search the database. So just waiting a second for that to load. And just going to pull the window up a little bit. OK. So the database is loaded and you can see there are a lot of results that are showing here. And if we want to just sort of explore what’s there, we can edit the columns that are displayed to show different aspects of the data that we might be interested in.

But what I’m going to do is just add a criteria in and do a little search real quick. So I’m just going to select an adsorbate. Since I was doing hydrogen earlier, I’ll do something else. Let’s do methanol. So if I select methanol then I can see that I have a few entries. I guess not too many for methanol that show up. And just to highlight a few features here, I can scroll across and see a lot of information about each entry. And I can also hover over certain areas where there’s more information like the adsorbate. So I can hover over here and I can get information like the formula, the molecular formula, the SMILES string, which describes the structure and some other information. Same for reference species. And let’s see, maybe I’ll switch to a different adsorbate that I think this one has multiple reference species. Oh yeah.

So here if I hover over this entry. Go over this way. If I hover over this entry, I can see some more details on the reference species. In this case it’s multiple gas-phase molecules that are used as the reference, and you can see all the information on them there. You can see the formula and the details. And for each entry in the database, you can also click this dropdown button to see more details on what’s included there. There’s quite a bit of metadata, and we think that will really help you to home in on the right data. OK. And with that I’m going to stop sharing again. We can do questions.

Erik:
Great. Thanks for that walkthrough, Kurt. We do have a few questions that came in and we have 10 minutes to get through some of them, so let’s just dive right in. First one, how will you choose the data types to add to the database?

Kurt:
Yeah. That’s a great question. We don’t want it to be too determined by us, I think is the answer. We certainly have some ideas, but we really want to talk to you and talk to the overall community to understand what data types you think are the most valuable and how you would structure them so that we make sure not to miss any of those key metadata fields that would be valuable. So we’re obviously doing some outreach right now. But we also hope to engage with you at conferences and other venues to talk more about how to prioritize that. So there isn’t—we don’t have like a firm plan on specifically which datatypes because we already have a long list of features, as I mentioned, that we’re working on. But definitely interested in your thoughts on that.

Erik:
Sounds like feedback is helpful. A couple other questions. So which parameters do you use for data curation before uploading?

Kurt:
That’s a great question. There isn’t really a single answer to that yet. What we’re doing at this moment is basically it’s just looking at every piece of information sort of row by row, which is very onerous, but we really feel that’s the only way to identify those things that might be more problematic. And then what that will allow us to do is say, “Oh, here are the 5 or 10 fields that are most—that seem to be the most problematic, where we didn’t explain clearly enough what we were hoping people would do or we just wanted to change the formatting.” It’s like, oh no, space here, things like that. And that will allow us to put extra attention on those areas, and then just kind of do a quick skim of the others. But to be honest with you, at this point we’re kind of just in the point of like looking at every entry, and we’re hoping that our methods level up fast enough to keep up with the data coming in.
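Illustration (not from the webinar): the row-by-row review described above, followed by a tally of which fields cause the most trouble, can be sketched in a few lines of Python. The field names, the sample records, and the three checks (missing value, stray whitespace, non-numeric energy) are all assumptions made for this example; they are not the Catalyst Property Database schema or the team's actual curation rules.

```python
from collections import Counter

# Hypothetical field that is expected to hold a number.
NUMERIC_FIELDS = {"adsorption_energy_eV"}

# Hypothetical submitted records; every value arrives as text.
records = [
    {"adsorbate": "methanol", "software": "VASP", "adsorption_energy_eV": "-0.45"},
    {"adsorbate": " hydrogen", "software": "", "adsorption_energy_eV": "-0.52"},
    {"adsorbate": "methanol ", "software": "", "adsorption_energy_eV": "n/a"},
]


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_issues(record):
    """Check one record field by field and list (field, issue) pairs."""
    issues = []
    for field, value in record.items():
        if value.strip() == "":
            issues.append((field, "missing"))
        elif value != value.strip():
            issues.append((field, "stray whitespace"))
        elif field in NUMERIC_FIELDS and not is_number(value):
            issues.append((field, "not numeric"))
    return issues


def rank_problem_fields(all_records):
    """Count issues per field, most problematic fields first."""
    counts = Counter(
        field for record in all_records for field, _ in find_issues(record)
    )
    return counts.most_common()


if __name__ == "__main__":
    for field, count in rank_problem_fields(records):
        print(field, count)
```

A ranking like this would only point reviewers at the fields that deserve the closest reading; it would not replace the manual review itself.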

Erik:
Great. Thanks for that. With regard to relaxing the peer-reviewed data requirement, have you thought about including a quality factor analogous to the ones associated with the Powder Diffraction Files? You wouldn’t have to use—just continuing to read here. You wouldn’t have to use the word “quality,” but perhaps you could select in a search to include or exclude non-peer-reviewed data.

Kurt:
Yes.

Erik:
Have you thought about that?

Kurt:
Thanks. That’s a great question. I am thinking along exactly the same lines as you. I had the same thought, and I think, yeah, maybe a toggle for if users want to exclude non-peer-reviewed publications or non-peer-reviewed results, that would be one thing. But absolutely. That’s what I was thinking about as well, some sort of quality or completeness score. We haven’t figured out exactly how we would determine that because it could be just based on like, does it have all the fields in the database. It could be on some other metric. It would be really cool in the very, very long run if we had enough data you could somehow vet it against existing data. But there’s a lot of open questions about how we might implement that. Bottom line is I completely agree. I think a quality score or something like it would be really valuable.
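Illustration (not from the webinar): one of the options mentioned here, a score based on whether an entry has all its fields, combined with a toggle for non-peer-reviewed results, might look like the sketch below. The list of expected fields, the `peer_reviewed` flag, and the `search` function are hypothetical names chosen for the example. The speaker notes that the actual scoring method has not been decided.

```python
# Hypothetical set of fields a complete entry would carry.
EXPECTED_FIELDS = (
    "adsorbate",
    "surface",
    "adsorption_energy_eV",
    "software",
    "functional",
)


def completeness_score(entry):
    """Fraction of expected fields that are present and non-empty."""
    filled = sum(
        1 for field in EXPECTED_FIELDS if entry.get(field) not in (None, "")
    )
    return filled / len(EXPECTED_FIELDS)


def search(entries, include_non_peer_reviewed=True, min_completeness=0.0):
    """Filter entries by peer-review status and completeness score."""
    results = []
    for entry in entries:
        if not include_non_peer_reviewed and not entry.get("peer_reviewed", False):
            continue
        if completeness_score(entry) < min_completeness:
            continue
        results.append(entry)
    return results


# Hypothetical entries for demonstration only.
entries = [
    {"id": 1, "peer_reviewed": True, "adsorbate": "methanol", "surface": "Pt(111)",
     "adsorption_energy_eV": -0.45, "software": "VASP", "functional": "PBE"},
    {"id": 2, "peer_reviewed": False, "adsorbate": "hydrogen", "surface": "Mo2C",
     "adsorption_energy_eV": -0.52, "software": "VASP", "functional": "PBE"},
    {"id": 3, "peer_reviewed": True, "adsorbate": "methanol", "surface": "Pt(111)",
     "adsorption_energy_eV": -0.40, "software": "", "functional": None},
]

if __name__ == "__main__":
    strict = search(entries, include_non_peer_reviewed=False, min_completeness=0.8)
    print([entry["id"] for entry in strict])
```

Keeping the score separate from the peer-review flag lets a user apply either filter on its own, which matches the idea of not labelling the flag itself as "quality."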

Erik:
Interesting. OK. We have a few more questions. If you have a starting material and a desired product, can you enter these and pull out possible catalysts? Can you apply temperature and pressure limits to the selection parameters? For example, isoprene to carbon dioxide temperatures less than 200°C.

Kurt:
So the comment about sort of catalyst synthesis and connecting raw materials to possible catalysts, that’s a really interesting idea and it kind of dovetails to something I didn’t talk about but that has been sort of a motivating vision for us in this project overall, and that’s the idea of sort of building a catalyst design engine where you would be able to evaluate materials considering things like performance and properties and also cost. So you could trade off the cost and performance and properties all together. That is a big effort. We really love the idea of doing that, but what we’re doing right now is just starting with this focus on computational chemistry results. To answer your question more directly, of connecting to synthesis information, that would require us to have sort of a synthesis part of the database like a synthesis table that says OK, it was made via some sort of hydrothermal method or it was made via wet impregnation, and what are the parameters for whatever method was used.

We really are excited to look at that, and that would dovetail very nicely with the work we’ve already done to describe chemical structures, so that if it’s—you have precursor one, and that’s—we capture that information the same way that we capture information about the surfaces. I mean maybe slightly different. So we want to go in that direction, but we don’t have that information yet. As far as the incorporation of specific experimental results for reactor performance, that also is an area we’re really interested in going into but we haven’t gotten there quite yet. I hope I answered your question, but feel free to follow up if I didn’t.

Erik:
Yeah. Feel free to put that question back in the chat. We can get to that. So I guess one regarding where the data comes from in the database. Does all the data in the database like material properties and others come from publications? So does data collection involve developed parsing of the papers or manually collecting the data values and filling them in CSV or JSON to then be uploaded? So kind of how does that process work currently?

Kurt:
Could you read the first part of the question again? I think it must have cut out.

Erik:
Yeah. So does all the data in the database come from publications?

Kurt:
Yes. So the answer to that is yes. All data that’s currently in there we have manually mined. So we’re basically doing for you the process that I described as slow and cumbersome. We have to pick the right papers, read them, and get all the data out and then get the—because the problem is there will be a table of adsorption energies but that’s not all the information you need. You’ve got to read the methods and figure out, oh OK, this was the potential used, and this was the software used, and this was some weird thing that they did that we need to put in the notes. So that’s the challenge right now. What we’re trying to do is speed up that process using some of the really exciting new tools that are coming out. Like I mentioned ASReview that might help us find the right papers so that we can be a little more focused in our reading. And there are also tools that are being developed to actually pull information out of papers.

Now, we’re not going to just accept the results of those, obviously. There’s a lot of risk of bad data coming out. But we don’t want to be closed off to them either. I think there could be a lot of opportunities to pilot them and see how they work, and then maybe do spot-checking and things like that. So we’re excited to try and move beyond that. It’s definitely hard right now. The last thing I’ll say is that better than mining existing literature would be to get the input files from researchers when they’re publishing the paper. Then it’s a lot easier. So that’s kind of the goal is to encourage people to upload as they’re publishing.

Erik:
A lot of good questions here and great answers. So maybe time for one more regarding interfacing with other BETO consortia. Are you interested in mapping schemas and building links between other online databases and BETO consortia like FCIC or Agile BioFoundry? What are your thoughts about that?

Kurt:
Absolutely. And I haven’t thought as much about FCIC, but that would fit very nicely with biomass conversion data, which of course the whole database is structured specifically to handle the complex adsorbates that are present in biomass conversion. And once we get some reactor performance data, then it would be really cool to link up to FCIC. Another very obvious point of connection is to the Materials Project so that OK, when we’re looking at a surface, it’s not the same as what they have in the Materials Project, the bulk structure. But you might still be interested in the bulk structure. So it would not be that hard for us to parse the string that says oh molycarbide and take you to the page for molycarbide, or for that specific phase. So we’re definitely interested in looking at things like that. It’s just every single thing we’ve discussed in the Q&A is like a new feature, and it is a lot, so wish us luck. We’re hoping to do as many of the things we’ve discussed as possible.

Erik:
That’s fantastic. But I think it looks like we are out of time. If we haven’t gotten to your question, obviously we’re going to collect those and we can kind of touch base with you, but I want to thank everyone who joined us today, and a special thanks to you, Kurt, for sharing these insights. That’s exciting stuff happening. As a reminder, a recording of this presentation will be posted on the webinar section of ChemCatBio as soon as it’s available. I’d also like to make another plug for the ChemCatBio newsletter called The Accelerator. This is a great resource to keep tabs on any further updates to the Catalyst Property Database or other ChemCatBio news, resources, or initiatives. I just posted a link to that in the chat. Feel free to check that out. And with that, I think we’ll take our leave. So have a great rest of your day and remember to stay tuned for future ChemCatBio webinars. Thanks, everyone.

Kurt:
Thanks, Erik. Thanks, everyone.

[End of Audio]