Fall 2016 – Week 2: Secondary Shares

Monday (8/29/2016): 9:45am – 5:00pm (7 hours, 15 minutes)

Before the weekly hackathon, Kaushik and I did some work on finding secondary shares in the documents.  Adapting Lines.sh into strip.sh was an extremely simple way to remove linebreaks (\n) so we could run code over the resulting text.  Before I arrived, Kaushik had altered the Python code for pulling primary shares so that it could also pull secondary shares (getLinesSeco.py).  We experimented for a short time with removing dates, but after some preliminary testing it appeared to cause more problems than it solved (simple string matching created quite a few false positives, but the differences were usually too inconsistent for regular expressions).

Our results looked fairly consistent with the expectations of our collaborators, but it was tricky to be certain because secondary shares are less common: positive and negative examples are harder to pick out by hand, and assuming none exist still results in ~80% accuracy.  After a few more tests Kaushik realized that our collaborators’ code might be able to pick examples fairly effectively now that the linebreaks were removed, so he set up a script to run the code on a subset.
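The ~80% figure is what makes accuracy a weak signal here: a "classifier" that never reports a secondary share scores that well just from class imbalance.  A minimal sketch of that baseline, where the label list is a made-up toy example (not our data) chosen only to match the rough 80/20 split:

```python
def always_negative_accuracy(labels):
    """Accuracy of predicting 'no secondary share' for every document.

    labels: list of 0/1 values, where 1 means the document really
    contains a secondary share.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    correct = sum(1 for y in labels if y == 0)
    return correct / len(labels)


# Hypothetical toy labels: 8 documents without secondary shares, 2 with.
toy_labels = [0] * 8 + [1] * 2
baseline = always_negative_accuracy(toy_labels)
```

Any method we try has to beat this number before its accuracy means much, which is part of why hand-checked positive examples matter.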

The subset seemed to produce fairly accurate results, so after we tested a few he set up the full set.  Kaushik ran the Java/ANTLR program to detect the primary and secondary shares.  There were some issues with running out of memory, but we figured those could be dealt with after our team meeting.

We had Thai food for lunch and Professor collected project updates, including what papers we were hoping to write and which conferences we were submitting to.  Devendra, Savannah, and I still need to wrap up the drug interactions project/paper (the main thing is just improving on the RNN’s 71%); the financial NLP project Kaushik and I are on may end in a paper, but it’s still too preliminary to be certain.

Hackathon started around 1:30pm (Professor and Phillip returned to Info to get a coffee machine).  Getting started was a bit rough: most of the code was at least a year old and there were close to 500 Java files with almost no documentation.  About an hour in, Phillip discerned that the MLN-boost was really out of date.

Devendra, Prateek, and I will spend hackathon time “focusing on modes, relational schema, converting to mode file (automatic construction of modes), and making the interface easier for people to work with.  Create a gui for basic users, command line utility for the advanced users.”

Devendra’s quick rundown on modes: if you want to learn a relation, it will have entities within it (‘professor’ is a predicate, ‘sriraam’ is an entity).  If you want those to be constants or variables (+, -, #), learn them from a background file.
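To make the (+, -, #) symbols concrete, here is a rough Python sketch of what "automatic construction of modes" could look like.  It assumes the `mode: predicate(+type, -type).` line format I have seen in BoostSRL-style background files, with + marking an input variable, - an output variable, and # a constant; the helper name and the example predicate and types are hypothetical, not taken from our code base.

```python
# Assumed meaning of the mode symbols (from my notes, not verified against the code).
MODE_SYMBOLS = {
    "+": "input variable (must already be bound)",
    "-": "output variable (may be newly introduced)",
    "#": "constant",
}


def mode_line(predicate, args):
    """Build one mode line from a predicate name and (symbol, type) pairs."""
    for symbol, _type_name in args:
        if symbol not in MODE_SYMBOLS:
            raise ValueError("unknown mode symbol: %r" % symbol)
    inner = ", ".join(symbol + type_name for symbol, type_name in args)
    return "mode: %s(%s)." % (predicate, inner)


# Hypothetical schema entry: advisedby(student, professor).
example = mode_line("advisedby", [("+", "student"), ("-", "professor")])
```

A GUI or command line utility could call something like this once per relation in the schema and write the lines out to a background file.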

I perused some Swing libraries but kept running into issues with either Eclipse or my Java installation.  I didn’t feel like I had a super-productive hackathon day, but I came away with a clearer idea of what I would be working on.

My things were packed to leave at 5:00, but I hung around a little longer; Professor brought his baby by the lab for all of us to see.

Tuesday (8/30/2016):

I attended Professor’s lecture again.  Topics du jour were splitting training/test sets, evaluating machine learning algorithms, how to judge the accuracy of a classifier, and how to handle false positives and false negatives.

Wednesday (8/31/2016): 9:45am – 4:12pm (6 hours, 27 minutes)

After wrapping up my 8am class I began the research portion of the day by reading.  The paper for this week is quite dense (details in Thursday’s section).

I asked Professor about how my payment was going to work, and we took a walk over to Informatics to get that sorted out before his next meeting.  On the way over we talked a bit about getting my PhD application in for next year, including a few of the steps (letter writers, statement of purpose, etc.).

Kaushik suggested a new method for extracting the secondary shares: look through the words in a file, begin parsing when we find an appropriate ‘start word’, and finish when we reach a number.

extractFinancialLines.py (8/31/2016)

By the end of the day the algorithm was incomplete, but the idea was there to pick up next time we met.  There were multiple ‘out of bounds’ errors where we tried to read beyond the end of the document.
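For my own reference, here is a rough sketch of the start-word idea with the bounds check that would have avoided the ‘out of bounds’ errors.  This is not the contents of extractFinancialLines.py; the start words, the number pattern, and the `max_len` cutoff are all placeholder assumptions.

```python
import re

# Hypothetical start words; the real list is not recorded in these notes.
START_WORDS = {"secondary", "selling"}

# A loose guess at what counts as "a number": digits with optional $ and commas.
NUMBER = re.compile(r"^\$?\d[\d,]*(\.\d+)?$")


def clean(word):
    """Lowercase a token and strip trailing/leading punctuation."""
    return word.lower().strip(".,;:()")


def extract_spans(text, start_words=START_WORDS, max_len=40):
    """Return word spans that begin at a start word and end at the next number."""
    words = text.split()
    spans = []
    i = 0
    while i < len(words):
        if clean(words[i]) in start_words:
            j = i + 1
            # Checking j < len(words) first keeps us from reading past the
            # end of the document when no number follows the start word.
            while (j < len(words) and j - i < max_len
                   and not NUMBER.match(clean(words[j]))):
                j += 1
            if j < len(words) and NUMBER.match(clean(words[j])):
                spans.append(" ".join(words[i:j + 1]))
                i = j + 1
                continue
        i += 1
    return spans


# Made-up sentence for illustration only.
sample = "The selling stockholders are offering 1,500,000 shares of common stock."
found = extract_spans(sample)
```

A start word with no number after it is simply skipped rather than raising an error, which seems like the right default until we know how often that happens in the real filings.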

Thursday (9/1/2016): 12:48pm – 2:00pm (1 hour, 12 minutes)

We got through the first half of “Lifted Probabilistic Inference with Counting Formulas” (the C-FOVE paper).  I knew it was dense, and about halfway through it we concluded we would need to continue this paper next week.

Friday (9/2/2016): 8:00am – 5:12pm (9 hours, 12 minutes)

Kaushik and I received some updated definitions of secondary shares from the domain expert at LVI.  We spent the day implementing a method for finding them in the text files, then scoring the results for accuracy.

https://github.iu.edu/hayesall/STARAI-financial-nlp/blob/master/extractFinancialLines.py

Around 3pm, Professor started a pickup game of cricket (more akin to pitching and batting while everyone scrambled around in the outfield).  I had a lot of fun, got some decent hits in, and was awarded “player of the day” for being successful on my first time.

Kaushik and I returned to work.  After looking through the output, we believe we are hitting almost all of the true positive examples, but our output is also filled with a huge number of false positives.
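In other words, recall looks high and precision looks low.  A small sketch of how we could put numbers on that once we have hand-labeled examples; the identifiers below are invented placeholders, not real output.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall from two collections of example ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so we never divide by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ids of the form "document:line".
gold = {"doc1:3", "doc2:7"}
predicted = {"doc1:3", "doc2:7", "doc2:9", "doc3:1", "doc4:2", "doc5:5"}
precision, recall = precision_recall(predicted, gold)
```

In this toy case every gold example is found (recall of 1.0), but only a third of what we output is correct, which is roughly the situation the output suggests we are in.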

We finished out the day by emailing our collaborators our updates, and putting a plan together for where to pick up after the weekend.