| Week 0 | Wednesday

Homework:

Tracing the Potential Flow of Consumer Data: A Network Analysis of Prominent Health and Fitness Apps

Summary: This study analyzes the data collected by the most popular health apps in the US, Canada, and Australia and classifies whether these apps gather sensitive information. The apps were chosen both by surveying media coverage to see which apps are most publicized and by monitoring the app store in each of the countries listed with an algorithm that tracked which apps entered the top 100. The study groups these apps into families within which data could be shared, finding that a very significant percentage (>50%) of the apps were connected in a way that would allow data sharing. To find these relationships, the researchers used REDCap, a web-based data capture tool. The paper goes on to analyze the types of data collected by these apps, concluding that the information requested, if shared, could indeed lead to breaches of privacy, given the numerous permissions requested by some apps and the large number of connections between them. The paper argues for more transparency about the relationships between these apps and the information they collect.

1) What is the problem to be solved?

This study aims to find out what types of health data are collected by health apps and how that data is shared between groups of these apps.

2) Why is the problem important?

This problem is important because users of these health apps may not realize just how much information is being collected about them, or how these networks of shared information undermine the anonymity of their data.

3) How does the paper propose to address this problem?

The problem is addressed by analyzing developer-reported information, such as app store descriptions, to find commonalities between apps. For example, if two apps both say they integrate with Fitbit, they share an API and would be classified as members of the same family. The analysis was done with REDCap, a web-based tool built for securely collecting and storing data (originally for medical research). REDCap is also useful because its software generation cycle is fast enough to accommodate multiple concurrent projects without custom, project-specific programming. Using this functionality, researchers at the University of Sydney built an a priori coding instrument to capture data. The instrument looked specifically at four categories:

(1) app characteristics,

(2) partnerships and affiliations,

(3) developer and funding characteristics, and

(4) permissions.
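The family-grouping step described earlier can be sketched as a small graph problem: apps are nodes, a shared integration (e.g. both reporting Fitbit support) adds an edge, and families are the connected components. The app names and integrations below are hypothetical, invented for illustration, and are not from the study:

```python
from collections import defaultdict

# Hypothetical developer-reported integrations (not from the study)
integrations = {
    "StepTrack": {"Fitbit", "Google Fit"},
    "CalorieLog": {"Fitbit"},
    "SleepWell": {"Apple Health"},
    "RunMate": {"Google Fit"},
}

# Group apps by the platform they report integrating with
by_platform = defaultdict(set)
for app, platforms in integrations.items():
    for p in platforms:
        by_platform[p].add(app)

# Union-find: merge apps that share any integration into one family
parent = {app: app for app in integrations}

def find(a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path halving
        a = parent[a]
    return a

def union(a, b):
    parent[find(a)] = find(b)

for apps in by_platform.values():
    apps = sorted(apps)
    for other in apps[1:]:
        union(apps[0], other)

# Collect the connected components (the "families")
families = defaultdict(set)
for app in integrations:
    families[find(app)].add(app)

print(sorted(map(sorted, families.values())))
# → [['CalorieLog', 'RunMate', 'StepTrack'], ['SleepWell']]
```

Here StepTrack bridges two integrations, pulling CalorieLog and RunMate into one family, which is exactly the kind of transitive connectivity the study flags as a privacy concern.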

4) What are some other solutions to the problem and what are their limitations?

Limitations included that much of the data had to be self-reported because of how data collection works on Android. This led to skewed reporting and is something that would need to be addressed in future studies. (There is no related-works section here, but the aim of the paper wasn't to solve a problem so much as to help identify the potential existence of a big issue.)

5) How do the authors evaluate their approach, and do you believe/agree with the paper's findings?

The authors evaluate their approach through quantitative analysis of the data they collected about these apps. Because the data is self-reported and permission requests are used as a proxy for what apps actually collect, I'm not sure that I can be 100% on board with the paper's findings. However, underreporting is also quite likely, which makes me lean toward believing the validity of the approach with respect to user data collection.

Automated Analysis of Privacy Requirements for Mobile Apps

Summary: This paper takes a machine learning approach to the problem of inaccurate reporting of privacy practices in mobile applications. The approach was only possible because researchers annotated the privacy policies, so it should not be considered fully automated. In testing, the authors found that their algorithm may have over-reported instances of privacy policy violations. Even so, the findings of this paper are very solid and lay a good foundation for how such a system could be built with machine learning.

1) What is the problem to be solved?

This paper aims to introduce a system which is capable of analyzing the compliance of Android apps with their privacy policies. This will solve the issue of apps being noncompliant with their own privacy policies or with any other concrete standards. This has both personal privacy and legal implications.

2) Why is the problem important?

This is important because, as the paper states, an appreciable percentage of mobile apps that lack privacy policies ought to have one based on the information they collect, and a similarly significant percentage have privacy policies that are inaccurate about which data they actually collect. A big example is a previous issue with Snapchat's privacy policy: it stated that user location information was not shared with third parties, but the FTC found the company to be in breach of this. This has some pretty major implications, showing that users can never be sure they are safe in this space of mobile data sharing.

3) How does the paper propose to address this problem?

The paper uses supervised machine learning combined with static analysis to classify these apps. The machine learning library used was scikit-learn, in Python; support vector machines and logistic regression performed best for classification. The static analysis tool was TaintDroid. After training a model, analysis should be possible with just an annotation of a new app's privacy policy. With this approach, inconsistencies can be identified more easily and addressed later if appropriate. Ground truth came from systematic annotation of privacy policies by ten law students. To classify the features of data collected by apps, information gain and TF-IDF were used to identify meaningful keywords. Pseudocode is given in the article as Listing 1.
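The text-classification side of this (TF-IDF features feeding a logistic regression classifier in scikit-learn, the library the paper names) can be sketched in a few lines. The toy policy snippets and labels below are invented for illustration and are nothing like the law-student-annotated corpus the paper actually used:

```python
# Minimal sketch, assuming scikit-learn is installed. A TF-IDF vectorizer
# turns policy text into weighted keyword features, and a logistic
# regression classifier labels each segment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: does the policy segment disclose location collection?
segments = [
    "we collect your precise location to personalize content",
    "location data is shared with advertising partners",
    "we do not collect any location information",
    "your gps coordinates are never stored",
]
labels = [1, 1, 0, 0]  # 1 = discloses collection, 0 = denies it

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(segments, labels)

# Classify an unseen policy segment
print(clf.predict(["we may collect location data from your device"]))
```

The real system pairs predictions like these with static-analysis evidence of what the app's code actually does, and flags the inconsistencies between the two.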

4) What are some other solutions to the problem and what are their limitations?

The issue of classifying privacy policies is approached here with machine learning. While original in its own right, the approach was informed by the work of Costante et al. It is also based on a static analysis tool, Androguard, which is used to figure out which types of data apps are collecting. PScout was also used to determine whether an application had sufficient permissions to make certain API calls in its code.

A big limitation of this approach is obtaining accurate source code from the apps. Server-side code for Android apps is not accessible, so the code cannot be analyzed fully and accurately. In addition, the approach does not work on iOS, since those applications are very difficult to decompile (a prerequisite for static analysis). Also, the percentages of requests sent to ad servers were inferred rather than measured, which raises some questions about how exact the data is.

5) How do the authors evaluate their approach, and do you believe/agree with the paper's findings?

A test dataset was held out throughout development, and inconsistencies were identified against it up until the deployment mentioned below. The authors also evaluated their approach by implementing it for the California Attorney General (Cal AG), using various APIs to expose the system as an accessible web application. The paper found that the system produced a high number of false positives but argued that it was still unlikely to miss any true positives. I agree with these findings; however, the system might be inefficient to deploy as is, since evaluators would need to weed out the false positives after automated analysis.

Tableau:

You can notice some similarities between the step count graph we made in class and this graph: in the middle of the week values were lower, and toward the end values were higher. The disparities in this data were clearer, though, and there is an interesting trend of a much lower rate on Thursday and Friday.

Arduino:

Cyclic Groups Using the LilyPad

Lilypad Photo

Press Release:

What we can learn from the Bi-Cycle, an exploration of cyclic groups

Beginner cryptographers often struggle with the concept of cyclic groups, contributing to a general frustration with cryptography and turning away people who might have succeeded in the field given better demonstrations of its concepts. The LilyPad implements a cyclic group, using powers of two mod a four-digit prime to produce random-looking outputs among red, green, and blue on its tri-color light.

The intention of this demonstration is first to show students how the LilyPad works; the randomness should be relatively easy to grasp at a high level. The main point is then to explain how the result is achieved using powers of two (hence Bi-Cycle!).
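The color sequence can be reproduced in a few lines. The prime 1009 below is an assumption (the actual four-digit prime on the LilyPad isn't stated here); any four-digit prime works the same way:

```python
# Sketch of the Bi-Cycle idea: powers of two modulo a four-digit prime
# form a cyclic sequence, and mapping each residue to one of three
# colors gives a hard-to-predict (but fully deterministic) light pattern.
P = 1009  # assumed four-digit prime, for illustration
COLORS = ["red", "green", "blue"]

def color_sequence(n, p=P):
    """Return the first n colors from the cycle of powers of two mod p."""
    value, out = 1, []
    for _ in range(n):
        value = (value * 2) % p        # next element of the cyclic group <2>
        out.append(COLORS[value % 3])  # residue mod 3 picks the LED color
    return out

print(color_sequence(12))
```

Note how the pattern alternates predictably while 2^k is still smaller than the prime, then breaks that rhythm once modular reduction kicks in (at 2^10 = 1024 ≡ 15 mod 1009), which is a nice talking point for the demonstration. By Fermat's little theorem the whole sequence eventually cycles, since 2^1008 ≡ 1 (mod 1009).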

— insert video here

– insert enlightened quote from student here

It is important to understand that anyone can do cryptography! We should take down the barriers associated with unfamiliar mathematical calculations and focus on the applications, building to the concept gradually! Do you think you could do it?