It’s been a few weeks since the ProHealth REU site wrapped; a few notes since then:
- Professor Natarajan hired me for 20 hours a week during the fall semester.
- STARAI is moving to a house on 13th Street.
- I led a reading group section on 8/11 on the TILDE paper.
- I’ve joined Kaushik Roy on the extracting financial information project.
- Each Monday the lab will hold “Hackathons” to prepare consumer-ready code.
Monday, August 22, 2016: 11:45am – 4:20pm (4 hours, 35 minutes)
This week I’m easing in and developing my schedule (with six classes I expect to have a full and rewarding semester). So far I’ve reviewed the LVI code Kaushik added me to last Thursday.
The hackathon for this week had us start with the jar files for RDN-Boost, mostly so we could get used to writing rules. Professor Natarajan challenged us to write rules to distinguish two situations: 1) when a number was increasing, and 2) when a number was greater than 20. Our solution used a precompute in Prolog.
We immediately ran into an issue with the amount of data we had: two positive and three negative examples were not enough to build a robust model. I wrote a couple of scripts to generate training data for each rule.
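The generator scripts were short; a minimal sketch of what the one for the greater-than-20 rule might look like (the predicate names here are placeholders, not the ones from our actual background file):

```python
import random

def generate_gt20_examples(n=100, seed=0):
    """Generate labelled facts for the 'number greater than 20' exercise.

    Returns (facts, positives, negatives) as lists of Prolog-style strings.
    Predicate names are illustrative placeholders.
    """
    rng = random.Random(seed)
    facts, pos, neg = [], [], []
    for i in range(n):
        value = rng.randint(0, 40)
        facts.append("hasValue(ex%d, %d)." % (i, value))
        # The label is determined directly by the rule we want learned
        (pos if value > 20 else neg).append("greaterThan20(ex%d)." % i)
    return facts, pos, neg
```

Writing the three lists out to the train folder gives the boosting code far more than the original handful of examples to work with.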
Tuesday, August 23, 2016:
I sat in on Professor Natarajan’s Applied Machine Learning course, though I could only stay for the first hour because of how my schedule is set up. The first part of the lecture was mostly introductory material and a basic overview, both of which were fairly familiar to me from my involvement in the lab. The biggest thing I want to get out of attending the lectures is better familiarity with the vocabulary: understanding the differences between the algorithms we’re using and the terms the professor uses.
Wednesday, August 24, 2016: 10am-5pm (7hrs)
Much of my day was spent going through the code with Kaushik. Trudging through the files on my own was taking a huge amount of time and I wasn’t getting very far. A brief overview of some of the code and commands follows:
We started with Tool.jar, which takes a folder containing documents scraped from the web and parses positive and negative examples. Positive examples might be phrases such as “We are selling 14,000,000 shares”, while negative examples could be other patterns (such as dates, secondary shares, or underwriters).
java -jar Tool.jar docs500 primary positive > posEx.txt
java -jar Tool.jar docs500 primary negative > negEx.txt
Feed these two into makeTrainPredicates.py to get training data so we can begin training our model (note that posEx.txt and negEx.txt need to be in the same folder as makeTrainPredicates.py, because the file names are hardcoded into the latter).
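A toy sketch of what that predicate-generation step does (the real makeTrainPredicates.py is more involved; sentenceContainsWord is my own placeholder, though sentenceContainsTarget is the target the boosting step actually uses):

```python
def make_predicates(pos_lines, neg_lines):
    """Toy predicate generation: each example sentence becomes a set of
    word facts plus a positive/negative target label.

    sentenceContainsWord is an illustrative predicate name.
    """
    facts, pos, neg = [], [], []
    for label, lines, bucket in (("pos", pos_lines, pos),
                                 ("neg", neg_lines, neg)):
        for i, line in enumerate(lines):
            sid = "%s%d" % (label, i)
            for word in line.lower().split():
                facts.append("sentenceContainsWord(%s, %s)." % (sid, word))
            bucket.append("sentenceContainsTarget(%s)." % sid)
    return facts, pos, neg
```

The facts become the background knowledge and the two label lists become the positive and negative example files for training.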
I learned one of the great things about blogging today: when you’re stuck on a problem you’ve solved before, you can go back and check your notes. As Kaushik and I were going through the code it ground to a halt again; this happened once before when I was running the drug interactions code over the summer:
module load python/2.7 fixed this.
Now we could actually run the boosting algorithm on the training and testing sets:
java -jar RDNBoost-w2v-adv.jar -l train data/train -target sentenceContainsTarget -trees 10 -aucJarPath . > traindata.txt
java -jar RDNBoost-w2v-adv.jar -i -test data/test -model data/train/models -target sentenceContainsTarget -trees 10 -aucJarPath . > testoutput.txt 2>&1 &
This gave us some pretty good results (~88% confidence in most cases). A few other small tweaks gave us the output in a form that was easier to work with:
mv data/test/results_sentenceContainsTarget.db ../../nlpUnit/; cd ../../nlpUnit/
sed -i '/^$/d' results_sentenceContainsTarget.db
mv results_sentenceContainsTarget.db Results.db
There are a few things to work on now:
- Rewrite the scraper.
- Somehow words seem to be running together, which causes a loss in accuracy (e.g. ‘offered by us4,300,000shares’). Typically this example can still be picked up, but it is always tossed.
Thursday, August 25, 2016: 12:45pm – 2:15pm (1 hour, 30 minutes)
Reading group for this week covered Counting Belief Propagation. We got off to a bit of a late start; Kaushik led the discussion.
Friday, August 26, 2016: 8:00am – 3:00pm (7 hours)
I got a pretty early start to my day, picking up where I left off Wednesday evening. Kaushik and I spent some more time working on the pattern matching:
egrep "[0-9],[0-9]" docs800/AAAP-000157104915009106-t1502530-424b1.txt | egrep -v "^[0-9]" | egrep -v "\([0-9]" > smaller.txt
seemed like a reasonable place to start. We worked through quite a few examples, comparing our outputs to what we would expect to find. Most of the matches we’d focused on so far were ones parsed by our collaborators; now we tested files that were usually used as negative examples, or files not included in the initial results. Soon we found we could match new examples with up to ~95% confidence.
We continued with some slightly different tests: using a quick shell script we separated out lines containing certain phrases indicative of primary shares, namely “common” together with either “offer” or “sell”.
python getLines.py $1 | grep "common" | grep "offer\|sell"
At the moment we suspect that the cases where this fails are due to a problem with how the pages are scraped:
We are selling
12,000,000 shares of our common stock.
are offering 2,000,000 shares of our common stock.
These lines are separated in the HTML, and when BeautifulSoup scrapes the page it writes them to the output files on separate lines as well. This makes it difficult for the scripts to consider them as one line, and in most cases the example gets sorted as a negative one.
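The same flattening we later did in bash could be done in Python before sentence splitting. A sketch, assuming a naive period-based sentence split is good enough for these filings:

```python
def flatten_sentences(text):
    """Rejoin lines that the scraper emitted separately, then split the
    document back into sentences, so 'We are selling' and '12,000,000 shares
    of our common stock.' end up in one sentence."""
    one_line = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Naive sentence split on periods; fine for a sketch, lossy on abbreviations
    return [s.strip() + "." for s in one_line.split(".") if s.strip()]
```

With the document reduced to whole sentences, the “common”/“offer”/“sell” filters see the full phrase instead of a fragment.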
We are offering1,892,308 shares of our common stock in this offering.
A similar problem occurs when words get scrunched together. This example is only rated around 50% confidence: most of the words match those typically rated as positive examples, but “offering1,892,308” is a wildcard that fits neither a string nor a number (refer to the note at the end of Wednesday’s entry).
We worked on a solution for these, but it will require further evaluation (it may cause more harm than good in the long run). Kaushik and I combined the Python program for creating sentences with a bash script that reads an article into a variable, squeezing an entire document onto a single line and thus eliminating the line breaks. From here we ran the scripts on the docs800/ folder, pulled the most likely candidate from the top of each file, and output our results as a .csv like so:
for file in docs800/*; do
    bash Lines.sh "$file" | head -n 1 > TEMPORARY.txt
    echo "$file, \"$(head -n 1 TEMPORARY.txt)\"" >> OUTPUT.txt
    echo "$file"
done
Kaushik emailed the results to our collaborators for testing, hopefully they’ll be able to report on the accuracy.