Monday, September 5, 2016: 9:00am – 4:00pm (7 hours)
Today is a holiday, but I still want to get my 20 hours in this week. I began my day by boiling water for coffee and catching up on a couple of emails. Homa and Patrick (our clients/collaborators in California) reviewed our progress on Friday for scoring sentences [output / code for generating].
Following is the series of emails:
- Alexander Hayes and Kaushik Roy:
Based on the updates for finding secondary shares, we worked on a new method for scoring sentences. So far we are finding almost all of the secondary shares (including some that the previous method missed); however, we have a fairly high false-positive rate alongside it. We’ll continue tweaking the algorithm a bit more when we return from the weekend; for now we are attaching a .csv file with the latest output. Thank you, Alexander L. Hayes, Kaushik Roy
Thank you! I am curious, were you able to incorporate all combinations we know so far? The ones Patrick suggested yesterday as well as what we had communicated via email and the basic rules used for Antler?
It seems to me that the patterns I specified would not have identified the ones identified as false positives. Am I wrong about that?
In other words, it seems like you have more patterns implemented than the two I stated. Is that right?
Thank you for this tremendous improvement.
- Alexander Hayes (answering the previous two questions):
Good morning, I hope you are both enjoying your Labor Day. Since it’s a holiday I’m spending most of my day reading, but I wanted to provide a couple of updates and at least answer your questions.
- Homa: “Were you able to incorporate all combinations we know so far?” The two we focused on here were: (1) [selling] [stockholder(s)/shareholder(s)/unitholder(s)] * [is/are] [selling/offering] <number> and (2) [ADS(s)/common unit/ordinary shares] [offered by the selling] [shareholder(s)/stockholder(s)/unitholder(s)] <number>
- Patrick: “It seems to me the patterns I specified would not have identified the ones identified as false positives. Am I wrong about that?” No, you are not wrong. A hard-line approach to pattern matching would look for exactly the patterns above, but some leniency will help us pick up examples that do not match perfectly. For example: “common units the selling unitholder is selling in this offering is 12,937,500” probably would not be picked up by a pass/fail implementation of these two rules, but we can pick this sentence up with our implementation.
- Patrick: “In other words, it seems like you have more patterns implemented than the two I stated. Is that right?” It’s partially correct. Our implementation of the two rules is lenient, but lenient rules can produce some strange results: “common stockholders by approximately 4.90%” was flagged as a positive example. But our weights are adjustable and new rules can be added. With some tweaking these strange results can be reduced.
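To make the strict-versus-lenient distinction from the email concrete, here is a minimal Python sketch. It is not our actual implementation: the weights, the regular expressions, and the function names are all assumptions made up for illustration. The strict check is a pass/fail version of the first rule only, while the lenient score adds up weights for whichever pieces of the patterns appear anywhere in the sentence.

```python
import re

# Hypothetical weights; the real values are not recorded in this journal.
WEIGHTS = {"selling": 2.0, "holder": 2.0, "verb": 1.0, "security": 1.0, "number": 1.0}

# Hypothetical regexes for the pieces of the two rules.
PIECES = {
    "selling": re.compile(r"\bselling\b", re.I),
    "holder": re.compile(r"\b(?:stock|share|unit)holders?\b", re.I),
    "verb": re.compile(r"\b(?:is|are)\s+(?:selling|offering)\b|\boffered by\b", re.I),
    "security": re.compile(r"\b(?:ADSs?|common units?|ordinary shares?)\b", re.I),
    "number": re.compile(r"\d[\d,]*(?:\.\d+)?"),
}

# Pass/fail version of rule (1): the number must directly follow the verb.
STRICT_RULE_1 = re.compile(
    r"\bselling\s+(?:stock|share|unit)holders?\b.*"
    r"\b(?:is|are)\s+(?:selling|offering)\s+\d[\d,]*",
    re.I,
)


def strict_match(sentence):
    """Return True only if the sentence matches rule (1) exactly."""
    return STRICT_RULE_1.search(sentence) is not None


def lenient_score(sentence):
    """Sum the weights of every pattern piece found anywhere in the sentence."""
    return sum(WEIGHTS[name] for name, rx in PIECES.items() if rx.search(sentence))


good = "common units the selling unitholder is selling in this offering is 12,937,500"
odd = "common stockholders by approximately 4.90%"

print(strict_match(good), lenient_score(good))
print(strict_match(odd), lenient_score(odd))
```

In this sketch the first sentence fails the strict rule but collects every weight, while the second collects only the "holder" and "number" weights. A low enough threshold would still flag the second sentence, which is the kind of strange result that adjusting the weights or adding rules is meant to reduce.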
I returned my focus to the ADE/Drug Interactions project. During the poster presentation, Savannah and I received some feedback about our project that sent us back to the drawing board on a few counts. Since we want to publish at ‘AI in Medicine’ there are certainly some things we need to wrap up. One bit of advice from a bioinformatics researcher was to dive into the literature on how drug interactions are already predicted (hint: SMILES strings). Another was a word of caution that DailyStrength’s terms and conditions likely forbid scraping their website.
About an hour of perusing Google Scholar and my own reverse citation trees yielded eight papers that would be a useful starting point for diving back into the literature. This time I wanted to focus more on the approaches in medicine rather than just focusing on the approaches taken in the field of machine learning.
- Drug-Drug Interactions Among Elderly Patients Hospitalized for Drug Toxicity
- SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules
- Designing better drugs: predicting cytochrome P450 metabolism
- Predicting adverse side effects of drugs
- PREDICT: a method for inferring novel drug indications with applications to personalized medicine
- ADMET in Silico Modeling: Towards Prediction Paradise?
- In silico target fishing: Predicting biological targets from chemical structure
- DrugBank: a knowledgebase for drugs, drug actions and drug targets
Reading was a bit slow. “Designing better drugs: predicting cytochrome P450 metabolism” in particular resulted in frequent breaks to look up auxiliary resources. The paper had some incredibly useful information though. Combined with a few of the FDA’s sites for drug interactions, it gave me better insight into what causes the interactions in the first place.
Implementation is something I’m going to need to continue to think about. SMILES was first implemented in 1988 but continues to be a useful method for storing the structure of molecules, while the “Designing better drugs” paper alludes to the use of more complicated 3D models to predict 3D overlaps and docking between two molecules. Though the two papers were published eighteen years apart, the latter also seems to suggest that their methods were (at least at the time) fairly computationally expensive. If I could achieve similar results with a less expensive method (perhaps our lab’s beloved cosine similarity method), it seems like it would be ideal. When I was working on pulling the PubMed data on drug combinations a few months ago, the ~5000 drug dataset from RxList resulted in over 11 million combinations that had to be queried. A similar method probably wouldn’t be appropriate for finding all the drugs that interact with another drug on-the-fly.
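The scale of that pairwise problem is easy to check with a little arithmetic. The sketch below assumes the combinations were unordered pairs (n choose 2). The exact size of the RxList dataset is not recorded here, so the drug counts are illustrative only.

```python
from math import comb


def pair_count(n_drugs):
    """Number of unordered drug pairs: n * (n - 1) / 2."""
    return comb(n_drugs, 2)


# Illustrative sizes only; the exact RxList count is an assumption.
for n in (4700, 5000):
    print(n, pair_count(n))
```

With exactly 5,000 drugs there would be 12,497,500 pairs, and a count nearer 4,700 gives about 11 million, which matches the figure above. Precomputing every pair grows quadratically, whereas answering a question about a single drug only needs n - 1 comparisons.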
Furthermore, I’m suspecting that I need to focus on some more specific terms when looking for reading material: “drug-design process” and “in silico” methods both seem to come up more in the medical literature than the machine learning literature. Once I have a base established hopefully the branching citations will be helpful for finding new material.
By the time I left I’d only read the first three papers. I might have to spend some more time reading this week, though looking through my Mendeley history, this pace doesn’t seem out of the ordinary.
Tuesday, September 6, 2016:
Professor was in London for the weekend, so Phillip lectured in his place. He led a fascinating discussion on reinforcement learning. I wasn’t very familiar with the field, but the examples he provided got me interested.
Wednesday, September 7, 2016: 10:30am – 5:06pm (6.6 hours)
I had the intention of reading for a bit, but I got wrapped up in working on the secondary shares and never returned. There were some pretty major problems in extractFinancialLines.py that needed to be resolved.
My update included a few main changes:
- Increased the sentence limit from 16 to 30. Occasionally there were secondary shares that had a clause in the middle, preventing the full sentence from being picked up.
- Penalized really short sentences (length < 5).
- Penalized sentences that contained false-positive words (may, up, we, us); these words were characteristic of primary shares and wrongly identified secondary shares.
- Penalized sentences that were weirdly long (if length > 10: subtract 2 from the score for each word after 10). Oftentimes there was a sub-sentence within the sentence that was the better answer, but longer sentences were favored.
Creating a wider “window” (moving from 16 to 30) helped to pick up some sentences, but made the program as a whole quite a bit slower.
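The penalties listed above can be sketched roughly as follows. This is an illustration rather than the code in extractFinancialLines.py: only the thresholds (length < 5, length > 10, subtract 2 per extra word) and the word list come from my notes, while the function name and the sizes of the short-sentence and false-positive penalties are assumptions.

```python
FALSE_POSITIVE_WORDS = {"may", "up", "we", "us"}


def adjust_score(score, sentence, short_penalty=5.0, word_penalty=1.0):
    """Apply the three penalties to a sentence's base score.

    short_penalty and word_penalty are hypothetical values.
    """
    words = [w.strip('.,;:()"') for w in sentence.lower().split()]
    length = len(words)

    # Penalize really short sentences.
    if length < 5:
        score -= short_penalty

    # Penalize each occurrence of a false-positive word.
    score -= word_penalty * sum(1 for w in words if w in FALSE_POSITIVE_WORDS)

    # Penalize long sentences: subtract 2 for each word after the tenth.
    if length > 10:
        score -= 2 * (length - 10)

    return score


print(adjust_score(10.0, "the selling stockholders are offering 500,000 shares"))
print(adjust_score(10.0, "we may sell"))
```

The first sentence has seven words and none of the flagged words, so its score is unchanged. The second is both too short and contains two flagged words, so it loses points on two counts.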
Thursday, September 8, 2016: 12:30pm – 1:24pm (0.9 hours)
We wrapped up our discussion on the C-FOVE paper. I still think it was a really challenging paper that I’ll have to return to at some point, but at least I have a small understanding of what lifted inference entails.
Friday, September 9, 2016: 12:00pm – 5:12pm (5.1 hours)
Since the lab missed hackathon on Monday, we made up for it today. Professor and I expanded on the contribution I would make, which was a user-friendly way to set modes in RDN-Boost. I began by fiddling with some Swing libraries in NetBeans, but figured I should get a rough idea of what my end product would look like before I started.
The mode documentation is pretty scarce. I thought “mode” was a more general term until (to my horror) Google found zero related results. Devendra pointed me toward a guide on Tushar’s website, and I spent a couple of hours thoroughly going through the examples and making sure I understood the theory.
I talked with Professor about my thoughts so far, and now that I had a small foundation I wanted to expound on the problem I would solve. I suspected there were three problems: setting them, translating them, and optimizing them. Professor explained that optimization is a point that is made in the documentation, but should already be handled by the existing code. We talked through a student/professor/advisedBy example, and he pointed me toward one of his lesser known papers: “Learning from Human Teachers: Issues and Challenges for ILP in Bootstrap Learning.”
It was getting fairly late in the day at this point so I could only focus on a few points from the paper, but one of the really interesting contributions was an idea of translating a teacher’s instruction into ILP, and three problems this could be applied to.
Before I headed out, Kaushik and I sent an email to Homa and Patrick. He used the posEx.txt we generated on Thursday and a new negEx.txt to run the full pipeline again while I was working on a few other things. This time we were achieving ~95% confidence for finding secondary shares, a huge improvement over the previous method that produced ~75% confidence in most cases. As a further bonus, we found 147 secondary shares, about 100 more than before.