High performance and scalability are two essential requirements for data analytics systems, as the amount of data being collected, stored, and processed continues to grow rapidly.
In this paper, we propose a new approach based on HadoopDB. Our main goal is to improve HadoopDB's performance by integrating additional components. To achieve this, we incorporate a fast and space-efficient data placement structure for MapReduce-based warehouse systems and an alternative SQL-to-MapReduce translator. We also replace the database initially used in HadoopDB with a column-oriented database.
In addition, we add a security mechanism to protect the integrity of MapReduce processing.
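As an illustration of the column-oriented data placement idea mentioned above, the following minimal Python sketch stores each row group column by column with per-column compression so that a scan can read one column without touching the others; the record layout, group size, and field names are assumptions for demonstration, not the paper's actual implementation.

```python
import json
import zlib

def write_row_groups(records, columns, group_size=1024):
    """Pack records into row groups; inside each group, values are stored
    (and compressed) column by column rather than row by row."""
    groups = []
    for start in range(0, len(records), group_size):
        chunk = records[start:start + group_size]
        col_store = {}
        for col in columns:
            values = [rec.get(col) for rec in chunk]
            # per-column compression exploits the homogeneity of a column
            col_store[col] = zlib.compress(json.dumps(values).encode())
        groups.append(col_store)
    return groups

def scan_column(groups, col):
    """Read back a single column without decompressing the other columns."""
    for group in groups:
        yield from json.loads(zlib.decompress(group[col]))

records = [{"id": i, "region": "north", "amount": i * 10} for i in range(5)]
groups = write_row_groups(records, ["id", "region", "amount"], group_size=2)
print(list(scan_column(groups, "amount")))  # -> [0, 10, 20, 30, 40]
```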
ADSI Research Group, National School of Applied Sciences ENSA, Ibn Tofail University, Kénitra, Morocco
Nowadays, there are many ongoing research efforts to construct knowledge bases from unstructured data. This process requires an ontology that includes enough properties to cover the various attributes of knowledge elements. As a huge encyclopedia, Wikipedia is a typical unstructured corpus of knowledge. DBpedia, a structured knowledge base constructed from Wikipedia, is based on the DBpedia ontology, which was created to represent knowledge in Wikipedia well. However, the DBpedia ontology is a Wikipedia-Infobox-driven ontology.
This means that although it is suitable for representing the essential knowledge of Wikipedia, it does not cover all of the knowledge in Wikipedia text. To overcome this problem, resources representing the semantics or relations of words, such as WordNet and FrameNet, are considered useful.
In this paper, we examined whether the DBpedia ontology covers a sufficient amount of the natural language knowledge written in Wikipedia. We focused mainly on the Korean Wikipedia, and calculated its coverage rate with two methods: by the DBpedia ontology and by FrameNet frames.
To do this, we extracted sentences containing extractable knowledge from Wikipedia text, and also extracted natural language predicates via part-of-speech tagging. We generated Korean lexicons for DBpedia ontology properties and frame indexes, and used these lexicons to measure the Korean Wikipedia coverage ratio of the DBpedia ontology and of the frames. By our measurements, FrameNet frames cover 73.85% of the Korean Wikipedia sentences, which is a sufficient portion of the Wikipedia text.
Finally, we briefly discuss the limitations of DBpedia and FrameNet, and, based on the experimental results, outline prospects for constructing knowledge bases.
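As a rough sketch of the coverage measurement described above (not the authors' code), the ratio can be computed as the fraction of sentences whose extracted predicate is found in a predicate-to-property or predicate-to-frame lexicon; the lexicon contents and predicate list below are hypothetical toy data standing in for the Korean lexicons and POS-tagged Wikipedia sentences.

```python
def coverage_ratio(sentence_predicates, lexicon):
    """Fraction of sentences whose extracted predicate maps to at least one
    entry (DBpedia ontology property or FrameNet frame) in the lexicon."""
    if not sentence_predicates:
        return 0.0
    covered = sum(1 for pred in sentence_predicates if lexicon.get(pred))
    return covered / len(sentence_predicates)

# Hypothetical toy lexicon and predicates for illustration only.
frame_lexicon = {
    "태어나다": ["Being_born"],            # "to be born"
    "설립하다": ["Intentionally_create"],  # "to found / establish"
}
predicates = ["태어나다", "설립하다", "좋아하다"]  # one predicate per sentence
print(f"frame coverage: {coverage_ratio(predicates, frame_lexicon):.2%}")  # 66.67%
```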
KAIST, Republic of Korea
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities.
Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content. We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities.
For the evaluation, we compare the runtime and the number of entities extracted from the full and the downscaled versions of a micropost set. We demonstrate that for datasets of more than 10 million tweets we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
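A minimal sketch of the kind of downscaling filters described above, assuming retweets can be detected by an "RT @" prefix, candidate entities by a simple capitalization/hashtag heuristic, and redundancy by normalized-text hashing; these heuristics are illustrative stand-ins, not the paper's actual pipeline or the IE services it uses.

```python
import re

def candidate_entities(text):
    """Very rough stand-in for 'pre-identifiable entities': capitalized
    tokens and hashtags that could be passed to an IE service."""
    return set(re.findall(r"#\w+|\b[A-Z][a-zA-Z]+\b", text))

def downscale(tweets):
    """Keep only tweets likely to yield new entities: drop retweets,
    tweets without candidate entities, and near-duplicate texts."""
    seen_texts = set()
    for text in tweets:
        if text.startswith("RT @"):           # discard retweets
            continue
        if not candidate_entities(text):       # no pre-identifiable entity
            continue
        key = re.sub(r"\W+", " ", text.lower()).strip()
        if key in seen_texts:                  # redundant / near-duplicate
            continue
        seen_texts.add(key)
        yield text

sample = list(downscale([
    "RT @bbc: Flooding in Southampton",
    "Flooding reported in Southampton today",
    "flooding reported in Southampton today!!",
    "so tired right now",
]))
print(sample)  # only the first non-duplicate tweet with an entity survives
```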
University of Southampton, UK