Friday, September 1, 2017

Summary of GSoC'17 Project

    This blog post is going to be an "informal" summary of my work over the past 3-4 months, with excerpts from the official report. For the official, more "formal" and "technically articulated" report, refer to the following links:



  • The detailed final progress report of my work and contributions during GSoC'17 can be found here.
  • GSoC'17 Final results and challenges available here.
  • The List-Extractor can be found here.



So, let's begin!

    Finally, after 3 long and intense months (see: eternity) of coding, brainstorming and caffeine-ridden hysteria, my GSoC project reached its inevitable conclusion. It was a 12-week program, but it did seem like forever, considering the effort and time that went into it. No wonder developers are paid so much :P



    My project's main goal was to extract relevant data from Wikipedia lists (obviously, duh! :P) and form appropriate RDF triples from that data. These triples extend a knowledge graph that can be merged with DBpedia's datasets, which in turn can be used for various purposes, like a QA bot.

    So, my journey with this project started back in January. The idea was pretty fascinating: Wikipedia, being the world’s largest encyclopedia, has a humongous amount of information in the form of text. There’s also a lot of data present in the form of lists, which are syntactically unstructured and hence difficult to turn into semantic relationships. With more than 15 million articles across different languages, these lists could prove to be a goldmine of structured data. So, I looked into the idea and started working on it. As part of my warm-up task, I had to add another domain to the existing list-extractor. I added the `MusicalArtist` domain and had a discussion on the direction of the project. After consulting with my mentors, I wrote a proposal for the project and was selected!

There were 3 main goals as proposed in my GSoC proposal:
  1. Creation of new datasets. 
  2. Making the extractor more scalable, so that users can easily add their own rules and extract triples from different domains. 
  3. Removing the JSONpedia Live Service bottleneck by integrating the existing JSONpedia library with the list-extractor. 



    New datasets were created for domains like MusicalArtist, Actor, Band, University, Magazine, Newspaper, etc. All the sample datasets created with the list-extractor, taken together, came from processing about 1.3 million list elements and generated about 2.8 million triples. More triples can be created by running the extractor over other domains.

    The biggest challenge of the project was to make the list-extractor more scalable. The previous extractor had hand-written functions for each property and each domain. Even though the properties were managed in a separate file, the whole process was still cumbersome, as every new domain required the user to implement a new function that could use those properties. So, the main idea behind this goal was to automate the process, or, to put it simply, to write a function that writes a function!
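To make that idea a little more concrete, here's a minimal, hypothetical Python sketch of the pattern; the names (MAPPING_RULES, make_domain_mapper, the toy mappers) are mine for illustration and not the actual list-extractor code:

```python
import re

# Toy mapper functions: each one turns a raw list element into zero or
# more (property, value) pairs.
def quoted_title_mapper(element):
    """Treat text in double quotes as the title of a work."""
    match = re.search(r'"([^"]+)"', element)
    return [('dbo:title', match.group(1))] if match else []

def wikilink_mapper(element):
    """Treat every [[wiki link]] in the element as a related resource."""
    return [('dbo:related', name) for name in re.findall(r'\[\[([^\]|]+)', element)]

# Declarative rules: which mapper functions apply to which domain.
MAPPING_RULES = {
    'MusicalArtist': [quoted_title_mapper, wikilink_mapper],
    'Actor':         [quoted_title_mapper, wikilink_mapper],
}

def make_domain_mapper(domain):
    """'A function that writes a function': build the whole mapper for a
    domain from the declarative rules, with no hand-written per-domain code."""
    mappers = MAPPING_RULES[domain]
    def domain_mapper(list_element):
        triples = []
        for fn in mappers:
            triples.extend(fn(list_element))
        return triples
    return domain_mapper

# Usage:
extract = make_domain_mapper('Actor')
print(extract('"Pulp Fiction" (1994), directed by [[Quentin Tarantino]]'))
# -> [('dbo:title', 'Pulp Fiction'), ('dbo:related', 'Quentin Tarantino')]
```

Adding a new domain in this scheme only means adding an entry to the rules, which is exactly the kind of flexibility the project was after.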

"The extractor was also made more scalable, by adding several more common mapper functions that can be used, while also making the selection of the mapper functions more flexible for every domain, by shifting the MAPPING dict to settings.json and allowing multiple mapper functions for a single domain. But, a bigger impact on the scalability came from the creation of rulesGenerator, which would now allow the users to create their own mapping rules and mapper functions from a interactive console program, without having to write code for the same! A sample domain MusicGenre was tested for the working of rulesGenerator, and the results/datasets are also present. Although the domain did not have much information that could be extracted, this still showed the ability of the rulesGenerator, a tool that can be used by people who are not programmers or don't have much knowledge about the inner working of the extractor, to generate triples and produce decent results."

    The third goal of this year's project was to remove the dependency on the JSONpedia Live web service. JSONpedia is a framework designed to simplify access to MediaWiki content by transforming everything into JSON. The project uses it to fetch Wiki resources in a more convenient JSON form, rather than manually scraping and restructuring the data. The service was hosted on a small server that could go down if it received too many requests, and with the new extractor requesting tens of thousands of pages within minutes, the service would definitely go down at some point. So, instead of using the web service, I had to use the underlying library. Only a little problem..... the library is written in Java, so I couldn't use it directly in my extractor and had to figure out a way to integrate it. Turns out the System Software course was quite useful xD

"The dependency on JSONpedia Live Service was removed and JSONpedia Library is now being used for obtaining the JSON representation of the resource. This was achieved by writing a wrapper function (jsonpedia-wrapper.jar) on the actual JSONpedia library, so that it could be manipulated easily by the list-extractor. The JSONpedia wrapper is a command-line program that'll take some commandline parameters and output the retrieved JSON. The wrapper can be individually run using the following command:
java -jar jsonpedia_wrapper.jar -l [language] -r [resource_name] -p [processors] -f [filters] 
So, the list-extractor simply forks another process that runs the JSONpedia wrapper with the parameters provided by the list-extractor, and the output is piped back to the list-extractor's stdin, which is then converted to JSON using the json.loads() method, hence completely emulating the previous behavior and eliminating the bottleneck."
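For the curious, the call pattern looks roughly like this simplified Python sketch; the processor/filter values, the function name and the lack of error handling are placeholders of mine, not the extractor's real code:

```python
import json
import subprocess

def get_json_representation(resource, language='en',
                            processors='Extractors,Structure',
                            filters='@type:list'):
    """Fork the JSONpedia wrapper as a child process and parse its output."""
    cmd = ['java', '-jar', 'jsonpedia_wrapper.jar',
           '-l', language, '-r', resource,
           '-p', processors, '-f', filters]
    output = subprocess.check_output(cmd)      # the wrapper prints JSON to stdout
    return json.loads(output.decode('utf-8'))  # same shape of result as the old live service
```

The child process dies after each request, so there is no long-running service left to overload.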

    And with that, all the proposed goals were achieved. The results were pretty encouraging too. The extractor worked fine for both the previous and the new domains, and its efficiency also increased, generating more triples from the same list elements.

"For a comparative analysis, we take a look at the Actors dataset, results for which are available from previous year. The accuracy of the extractor has improved (accuracy is defined as the ratio of list elements that succesfully contributed to a triple generation to the total number of list-elements present). We also see, despite there being less resources than the previous year, the list extractor was able to generate about 22k more triples from the same domain.This can be due to many factors. On of them could be people adding new list entries in the wikipedia resources, causing the number to increase. This, of course, cannot be influenced by us and hence could have lead to an increase in that number. From a programmer's perspective, the major addition in this year's project was the new year_mapper(), which helped in extracting time periods from the list elements, as well as changing the select_mapping() method, which previously allowed only one mapper function per domain. The newer version of select_mapper() allows selecting several mapping functions to be used with a single domain, allowing more sections to be considered for extraction and consequently, creating more triples from the existing list elements."

    And with that, my project was complete, but it wasn't without its fair share of challenges. The main challenge remained the same as last year: the extreme variability of lists. Unfortunately, there is no real standard, structure or consistency in a resource's article; multiple formats are used, with different meanings depending on the user who edited the page. Also, the strong dependence on the topic, as well as the use of unrestricted natural language, makes it impossible to find a precise general rule to extract semantic information without knowing in advance the kind of list and the resource type. Hence, knowledge of the domain is also extremely important for writing a good set of mapping rules and mapper functions, which requires the user to go through hundreds of Wikipedia pages of the same domain to work out its finer structure and relationships, which is very time-consuming and exhausting. Apart from this heterogeneity, there are unfortunately several Wikipedia pages with bad or wrong formatting, which is obviously reflected in the impurity of the extracted data. These were the main challenges with the tool in general.

Ironically, the feature that makes Wikipedia great (i.e. being openly accessible and modifiable by anyone and everyone) is also the root cause of our biggest challenges, which reminds me of a popular phrase from the Holy Bible: "The Lord giveth, and the Lord taketh away."

    The past 3-4 months were enlightening, and it was an incredible experience. Exposure to massive code-bases, and keeping up with whole development cycles and commits, has given me a realistic glimpse into the software industry. The work was completely different from what we generally learn in universities, where more emphasis is placed on theoretical knowledge; GSoC provides a platform for a more practical experience. Being a part of such a large community also helped me as a developer, as I got to interact with many qualified, experienced people and fellow developers from all over the world and share ideas, knowledge and experiences. GSoC is hands down the best work I've done so far in my life & has definitely helped me grow as a software developer. Now, I wait for my results. Hopefully I'll pass. After all, every story deserves a happy ending :P


Keep Calm and Keep Coding!!!


