Tuesday, June 27, 2017

GSoC 2017 : Week 4

    Last Sunday marked the end of the 4th week of my 3-month long Summer of Code project. That also makes it the final week before the first evaluations, which take place this week. I've done what I could, and now my fate lies in the hands of my mentors...



    Continuing from last week, I kept adding new domains to the list-extractor. The main idea is to figure out domains that could *potentially* contain list elements and write mapper functions for them, which other domains can later reuse. This week I picked the `PeriodicalLiterature` domain, since it contains many lists that are common to many publications, and started writing mapping rules and mapper functions for it.

    While exploring these domains, I realized that most of the list elements carry a date entry, which is an important piece of information. The existing year extractor only matched years with this regex:

#old regex
year_regex = ur'\s(\d{4})\s'

which missed most of the information, since it didn't support months or periods of time. A major effort this week went into re-writing `year_mapper()` so that it attaches months (if present) to the dates and, where applicable, extracts the period of years covered by the element (start date - end date).

#regex to figure out the presence of months in elements
month_list = { r'(january\s?)\d{4}':'1^', r'\W(jan\s?)\d{4}':'1^', r'(february\s?)\d{4}':'2^', r'\W(feb\s?)\d{4}':'2^',
                    r'(march\s?)\d{4}':'3^', r'\W(mar\s?)\d{4}':'3^',r'(april\s?)\d{4}':'4^',r'\W(apr\s?)\d{4}':'4^', 
                    r'(may\s?)\d{4}':'5^', r'\W(may\s?)\d{4}':'5^',r'(june\s?)\d{4}':'6^',r'\W(jun\s?)\d{4}':'6^',
                    r'(july\s?)\d{4}':'7^',r'\W(jul\s?)\d{4}':'7^', r'(august\s?)\d{4}':'8^', r'\W(aug\s?)\d{4}':'8^', 
                    r'(september\s?)\d{4}':'9^', r'\W(sep\s?)\d{4}':'9^',r'\W(sept\s?)\d{4}':'9^', r'(october\s?)\d{4}':'10^',
                    r'\W(oct\s?)\d{4}':'10^',r'(november\s?)\d{4}':'11^', r'\W(nov\s?)\d{4}':'11^' ,
                    r'(december\s?)\d{4}':'12^', r'\W(dec\s?)\d{4}':'12^'}
    
#flags to check presence of months/period
month_present = False
period_dates = False

for mon in month_list:
    if re.search(mon, list_elem, re.IGNORECASE):
        rep = re.search(mon, list_elem, re.IGNORECASE).group(1)
        list_elem = re.sub(rep, month_list[mon], list_elem, flags=re.I)
        month_present = True

#new period regex (complex, isn't it :P)
period_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'  #checks whether the element contains a period (start year - end year)

if re.search(period_regex, list_elem, flags=re.IGNORECASE):
    period_dates = True
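
    To make this concrete, here's a minimal sketch (an illustrative helper, not the actual `year_mapper()` from the repo) of how the month placeholders and the period check can be turned into structured dates; the function name and the simplified regexes (hyphen only) are my own:

#illustrative sketch, not the actual year_mapper() implementation
import re

def extract_dates_sketch(list_elem):
    #after month substitution, "March 1998 - 2004" becomes "3^ 1998 - 2004"
    period_regex = r'(?:(\d{1,2})\^)?\s?(\d{4})\s?-\s?(?:\d{1,2}\^)?\s?(\d{4})'
    single_regex = r'(?:(\d{1,2})\^)?\s?(\d{4})'

    period = re.search(period_regex, list_elem)
    if period:  #start year - end year
        return {'start_year': int(period.group(2)), 'end_year': int(period.group(3))}

    single = re.search(single_regex, list_elem)
    if single:  #optional month placeholder followed by a single year
        month = int(single.group(1)) if single.group(1) else None
        return {'month': month, 'year': int(single.group(2))}

    return None

print(extract_dates_sketch("The Boston Globe (3^ 1998 - 2004)"))  #-> period 1998-2004
print(extract_dates_sketch("Rolling Stone, 11^ 1967"))            #-> month 11, year 1967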

    After re-writing `year_mapper`, I finished the mappers and rules for `PeriodicalLiterature` and tested them on resources belonging to the `Magazine`, `Newspaper` and `AcademicJournal` sub-domains. I also updated the awards/honors mapper, which can now differentiate between honorary degrees and awards. I finished the week by tidying up the code: removing redundant code, replacing the existing `year_mapper` with the new mapper in each module, and adding the newly written `quote_mapper` resource extractor to the URI extraction process. After that, I merged all progress into master, since this is the last stable running version before the evaluations begin.

    Should I pass the first evaluation (I really feel I will :P), my next task, as discussed with Luca, will be a module that creates a new settings file and lets the user select the mapping functions to use for a domain during extraction. This will increase support for domains that aren't matched by the existing rules.

    Let's hope for the best!! :)

    You can follow my project on github here.

Saturday, June 24, 2017

GSoC 2017 : Week 3

    Time is passing by ever so quickly and things are starting to get *real intense*. Although it has only been three weeks, it feels like I'm a veteran developer now (professional developers everywhere cringed :P). Anyways, here's the progress report from my third week.

    Over the next few weeks, my focus will mainly be on expanding the scope of the extractor: adding a few common domains and making it scalable enough to handle previously unseen lists with the existing rules. This week I started adding new domains. This time around, I took my mentor's suggestion and tried to implement a single mapper that can map multiple kinds of list items, instead of having a mapping function for every single type of element. Previously, the ontology properties were hard-coded in the mapping functions themselves, as in the example below:


# mapping bibliography for Writer, snippet from mapper.py
g.add((rdflib.URIRef(uri), dbo.author, res))
isbn = isbn_mapper(elem)
if isbn:
    g.add((rdflib.URIRef(uri), dbo.isbn, rdflib.Literal(isbn, datatype=rdflib.XSD.string)))
if year:
    add_years_to_graph(g, uri, year)
if lit_genre:
    g.add((rdflib.URIRef(uri), dbo.literaryGenre, dbo + rdflib.URIRef(lit_genre)))

    This led me to change the way Federica and I had been using the mapping rules. The ontology classes/properties are now stored in mapping_rules.py instead of in the mapping functions:

# (new) mapping contribution type for Person, snippet from mapper.py
contrib_type = None
feature = bracket_feature_mapper(elem)
for t in CONTRIBUTION_TYPE[lang]:
    if re.search(t, feature, re.IGNORECASE):
        contrib_type = CONTRIBUTION_TYPE[lang][t]

if contrib_type:
    g.add((rdflib.URIRef(uri), dbo[contrib_type], res))  #notice the property!
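
    For illustration, the rules dictionary that this snippet reads from could look roughly like the sketch below; these keys and properties are made-up examples, not the actual CONTRIBUTION_TYPE entries from mapping_rules.py:

# hypothetical excerpt, not the real CONTRIBUTION_TYPE from mapping_rules.py
CONTRIBUTION_TYPE = {
    'en': {
        'director': 'director',          # "(director)" in the element -> dbo:director
        'producer': 'producer',
        'screenwriter|writer': 'writer',
        'composer|music': 'musicComposer',
    },
    'it': {
        'regista': 'director',
        'produttore': 'producer',
    }
}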


    Among the new domains, I also analysed `EducationalInstitution` and completed the rules/mappers for it. The list-extractor can now extract triples from `EducationalInstitution` as well as its subdomains like `College`, `School` and `University`. After that, I looked at the different domains within `Person` in order to generalize the extractor to work on this superclass. Domains like `Painter`, `Architect`, `Astronaut`, `Ambassador`, `Athlete`, `BusinessPerson`, `Chef`, `Celebrity`, `Coach` etc. will now also work with the extractor, increasing its coverage, but I still have to work on the quality of extraction, as `Person` is one of the biggest domains on Wikipedia and has extreme variability. For that, I changed various functions (e.g. `year_mapper`, the role mapping etc.) to support the generalized domains; the awards/honors mapper, for instance, now extracts all the years in which a person won the same award.

    Finally, I had a meeting with Luca to discuss ways of merging the mapping_rules for both the list- and table-extractor projects; another meeting is scheduled for next week, after we discuss the idea with our mentors. Next week I'll keep adding domains to the extractor while writing the new rules/functions in a generalized way, and I hope to settle on the final structure of my extractor after the discussion with Luca.


    You can follow my project on github here.


Sunday, June 18, 2017

Cyberoam Auto-Login

    Most universities and institutions nowadays use Cyberoam to control and monitor the way their students/employees use the Internet connection. Using a VPN can sidestep the whole login process, but it often doesn't work. Being a CS student, I can't really live without the Internet for long, and it's pretty annoying when Cyberoam logs you out in the middle of a streaming football match or a really large download. Since it logs out after a fixed amount of time, how about a script that re-runs the login process before the connection times out? No more worrying about the Internet dropping.

    To achieve this, I used mechanize instead of the standard urllib2 module: the script is essentially a bot that logs in for us every few hours, and mechanize is easier to use and more powerful than urllib2, since it can easily simulate browser behaviour.

    The next step is building the payload with your login credentials and then POSTing it with the proper request mode, which should do the trick. A few lines of regex are then enough to scrape the required information out of the response.

import re
import urllib

from bs4 import BeautifulSoup

def post_request(username, password, value, mode):

    br = browser()  # mechanize Browser instance ('browser' is set up elsewhere in the script)
    #POST request values
    values = {
        "mode": mode,          # 191 = login, 193 = logout
        "username": username,
        "password": password,
        "btnSubmit": value
    }

    data = urllib.urlencode(values)
    page = br.open(url, data)  # 'url' is the Cyberoam portal URL, set at module level
    response = page.read()
    br._factory.is_html = True
    soup = BeautifulSoup(response, "lxml")  # parsed response, handy for debugging

    # the portal replies with XML; pull the status message out of the CDATA block
    regex = re.compile(r"<message><!\[CDATA\[(.*)\]\]><\/message>")

    x = re.search(regex, response)
    print username + ": " + x.group(1)
    return br

    The login and logout functions are now simply POST requests to the server. The login can be repeated at a pre-determined interval using an infinite loop with a sleep call, so the user stays logged in for as long as the program runs. A simple signal handler then posts the logout request whenever SIGINT is received, so terminating the script automatically logs the user out (a sketch of this loop follows the snippet below).

def login():
	br = post_request(username,password,"Login","191")
	return br

def logout():
	br = post_request(username,password,"Logout","193")
	return br
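
    Putting it together, the main loop could look something like this minimal sketch; RELOGIN_INTERVAL and handle_sigint are names I've made up for illustration, not necessarily what the script uses:

import signal
import sys
import time

RELOGIN_INTERVAL = 3 * 60 * 60  # re-login every 3 hours, before the session expires

def handle_sigint(signum, frame):
    logout()        # post the logout request on Ctrl+C / SIGINT
    sys.exit(0)     # then terminate the script

signal.signal(signal.SIGINT, handle_sigint)
while True:         # stay logged in as long as the script runs
    login()
    time.sleep(RELOGIN_INTERVAL)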

    A small problem we'll face is SSL verification, which will fail because the portal doesn't present a proper certificate. So, we'll have to bypass this validation for now; for more information, see PEP 476 - Enabling certificate verification by default for stdlib http clients. The following snippet bypasses the SSL check:


import ssl

def bypass_ssl():
    ### Bypassing SSL certificate verification ###
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:
        # Monkey-patch the default context so HTTPS requests skip certificate verification
        ssl._create_default_https_context = _create_unverified_https_context
    return


    And that's pretty much it. Set your credentials in the two fields in login.py, set your Cyberoam portal's URL and port, update the re-login time and you're done!

Forget about ever logging in again, but do remember to log out, or else you're FOREVER stuck :P

Link to the code: Github

Friday, June 16, 2017

GSoC 2017 : Week 2


    The second week has flown by, and now it's time for the second week's progress report. 

    The primary tasks for the second week were to add support for Spanish and German to the existing Writer and Actor domains, and to integrate the MusicalArtist domain that was part of my warm-up task.

    So, I started off with adding the MusicalArtist domain to my current codebase. It was fairly straightforward (for the most part) and it worked like a charm.


ref = reference_mapper(elem)  # look for resource references
if ref:  # current element contains a reference
    uri = wikidataAPI_call(ref, lang)  # try to reconcile resource with Wikidata API
    if uri:
        dbpedia_uri = find_DBpedia_uri(uri, lang)  # try to find equivalent DBpedia resource
        if dbpedia_uri:  # if you can find a DBpedia res, use it as the statement subject
            uri = dbpedia_uri
    else:  # Take the reference name anyway if you can't reconcile it
        ref = list_elem_clean(ref)
        elem = elem.replace(ref, "")  # subtract reference part from list element, to facilitate further parsing
        uri_name = ref.replace(' ', '_')
        uri_name = urllib2.quote(uri_name)
        uri = dbr + uri_name.decode('utf-8', errors='ignore')
    g.add((rdflib.URIRef(uri), rdf.type, dbo.Album))
    g.add((rdflib.URIRef(uri), dbo.musicalArtist, res))


    However, digging deeper into many musical artists, I noticed that the extractor wasn't working very well and constantly missed elements. That's when I realised that an actor could well have recorded a few songs, or a musician might have acted in a movie, while the extractor was only looking for one particular section per resource type. It's funny and astounding at the same time that one can completely miss such an intuitive thing. Anyway, a big overhaul was needed.
 
    After analyzing many articles from different domains, I realised that several domains have intersecting sections, so I had to change my approach. From now on, I'll focus on writing mapping functions that extract list elements from a given section; domains can then be added to mapping_rules.py by listing the various sections that might appear in their articles.

    For this, I had to completely restructure my current mapping_rules file. The rules now contain two levels of dictionaries: the first maps the domain of a resource to the sections it could be related to, and the second maps each of those sections to the language-specific section titles to look for.


MAPPING = { 
            'Person': ['FILMOGRAPHY', 'DISCOGRAPHY', 'BIBLIOGRAPHY', 'HONORS'],
            'Writer': ['BIBLIOGRAPHY', 'HONORS'], 
            'MusicalArtist': ['DISCOGRAPHY','FILMOGRAPHY', 'CONCERT_TOURS', 'HONORS'],
            'Band':['DISCOGRAPHY', 'CONCERT_TOURS', 'BAND_MEMBERS', 'HONORS'],
}

BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'books', 'publications'],
    'it': ['opere', 'romanzi', 'saggi', 'pubblicazioni', 'edizioni'],
    'de': ['bibliographie', 'werke','arbeiten', 'bücher', 'publikationen'],
    'es': ['Obras', 'Bibliografía']
}

    I also had to change the `select_mapping` function to handle multiple sections per domain:


domains = MAPPING[res_class]  # e.g. ['BIBLIOGRAPHY', 'FILMOGRAPHY']
domain_keys = []
resource_class = res_class

for domain in domains:
    if domain in mapped_domains:
        continue
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")

    mapped_domains.append(domain)  # this domain won't be used again for mapping

    for res_key in resDict.keys():  # iterate on resource dictionary keys
        mapped = False

        for dk in domain_keys:  # search for resource keys related to the selected domain
            # if the section hasn't been mapped yet and the title matches, apply the domain related mapping
            dk = dk.decode('utf-8')  # make sure utf-8 mismatches don't skip sections
            if not mapped and re.search(dk, res_key, re.IGNORECASE):
                mapper = "map_" + domain.lower() + "(resDict[res_key], res_key, db_res, lang, g, 0)"
                res_elems += eval(mapper)  # calls the proper mapping for that domain and counts extracted elements
                mapped = True  # prevents the same section from being mapped again

    This major change in the way mapper functions are selected greatly improved the extractor. It is now possible to attach multiple mappers to a domain, which increases the number of extracted elements and hence the measured accuracy.

    Then, I continued with adding German and Spanish support to all three initial domains (Actor, Writer, MusicalArtist), which concluded the work for my second week.

    This coming week, I'll be adding new domains to the extractor. Another task is to discuss an approach with Luca, a friend who is working on a similar DBpedia project, to potentially come up with a common template for the mapping rules and make the extractor more effective and scalable.

    You can follow my project on github here.


Monday, June 5, 2017

GSoC 2017 : Week 1

    With the first week now past us, it's time for the first week's progress report. 

    The first week was mainly about checking the existing code for potential improvements. So, this week, I went over the existing code and made slight tweaks to it, adding an __init__ module and docstrings. I also worked on improving the method that creates the resource dictionary and extracts and stores the triples, to get rid of the junk values observed during extraction. One of the added methods was remove_symbols().

def remove_symbols(listDict_key):
    ''' Removes symbols and garbage characters that pollute the values to be inserted.

    :param listDict_key: dictionary entry (list of values) obtained from parsing
    :return: the same list with the garbage characters stripped out
    '''
    for i in range(len(listDict_key)):
        value = listDict_key[i]
        if type(value) == list:
            value = remove_symbols(value)  # recurse into nested lists
        else:
            listDict_key[i] = value.replace('&nbsp;', '')

    return listDict_key
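
    As a quick illustration (with a made-up nested value), the recursion cleans both the outer list and any nested lists:

print(remove_symbols(['The Hobbit (1937)&nbsp;', ['Foreword&nbsp;', 'Maps']]))
# ['The Hobbit (1937)', ['Foreword', 'Maps']]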

    Another addition was a method that stores the statistical results of every extraction run in a csv file. It will be used to evaluate the extractor's performance and to log the statistics of the extractions performed in the meantime.

import csv

def evaluate(lang, source, tot_extracted_elems, tot_elems):
    ''' Evaluates the extraction process and stores the result in a csv file.

    :param source: resource type (dbpedia ontology type)
    :param tot_extracted_elems: number of list elements extracted from the resources.
    :param tot_elems: total number of list elements present in the resources.
    '''
    print "\nEvaluation:\n===========\n"
    print "Resource Type:", lang + ":" + source
    print "Total list elements found:", tot_elems
    print "Total elements extracted:", tot_extracted_elems
    accuracy = (1.0 * tot_extracted_elems) / tot_elems
    print "Accuracy:", accuracy

    with open('evaluation.csv', 'a') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        filewriter.writerow([lang, source, tot_extracted_elems, tot_elems, accuracy])
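
    As a usage example (numbers made up), a run over the English Writer resources would print the summary and append a row to evaluation.csv:

evaluate('en', 'Writer', 1800, 2000)
# appends the row: en,Writer,1800,2000,0.9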

    Lastly, I merged the MusicalArtist domain, which was already part of my GSoC warm-up task, into the existing code. This, however, requires finer extraction functions, which will be added later on. As discussed with my mentors, I'm currently looking at ways to make the list-extractor more scalable, and I'll also look for potential problems in the existing code and improve it wherever required.

    This week, I'll be adding more languages to the existing domains, and then, as discussed with my mentors, I'll look into the scalability potential of the list-extractor.

You can follow my project on github here.