Arabic Named Entity Translation Lexicon Corpus


This dataset contains named entities (NEs) extracted from a newswire parallel corpus and from Wikipedia articles.
It is described in the paper:

Dudley North visits North London: Learning When to Transliterate to Arabic. Mahmoud Azab, Houda Bouamor,
Behrang Mohit and Kemal Oflazer. In Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), Atlanta, USA, June 2013.

and can be downloaded at 

 http://www.qatar.cmu.edu/~behrang/NETLexicon/

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (see LICENSE).


The dataset is divided into two categories (Wikipedia and newswire); each category has a training file and a testing file:

	- Training.wiki.txt: the NEs extracted from Wikipedia articles that we used for training the classifier.
	- Testing.wiki.txt: the NEs extracted from Wikipedia articles that we used for testing the classifier.
	- Training.news.txt: the NEs extracted from the newswire parallel corpus that we used for training the classifier.
	- Testing.news.txt: the NEs extracted from the newswire parallel corpus that we used for testing the classifier.


Each data file contains one line per NE; each line has the following format:


[ID]	[English NE]	[Arabic NE]	[NE label]	[Tokens decision]	[Freeman score]

	- [ID]: The ID of the NE collected from the corpus
	- [English NE]: The English NE tokens
	- [Arabic NE]: The corresponding Arabic NE tokens (UTF-8 encoding)
	- [NE label]: The NE tag generated by the NE tagger (PERSON, ORGANIZATION, LOCATION or MISC)
	- [Tokens decision]: The decision for each token in the NE: 0 (translate) or 1 (transliterate)
	- [Freeman score]: The Freeman score of each token in the NE against its corresponding Arabic token


For example:
	13296	  Nuker Team	فريق نوكر	ORGANIZATION	1 0	0.6 0.166666666666667	
	
	- [ID] --> 13296
	- [English NE] --> Nuker Team
	- [Arabic NE] --> فريق نوكر
	- [NE label] --> ORGANIZATION
	- [Tokens decision] --> 1 0   (1 = transliterate 'Nuker', 0 = translate 'Team')
	- [Freeman score] --> 0.6 0.166666666666667   (0.6 = Freeman score between 'Nuker/نوكر',
						0.166666666666667 = Freeman score between 'Team/فريق')
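
The line format above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset release: the names NEEntry and parse_ne_line are hypothetical, and it assumes that the six fields are separated by tabs (as in the example line) and that the decisions and scores are listed in the order of the English tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NEEntry:
    """One line of a data file (hypothetical container, not part of the dataset)."""
    ne_id: str
    english: str
    arabic: str
    label: str
    decisions: List[int]   # per English token: 0 = translate, 1 = transliterate
    scores: List[float]    # per English token: Freeman score against its Arabic token


def parse_ne_line(line: str) -> NEEntry:
    # Assumption: fields are tab-separated; whitespace around each field is dropped.
    fields = [field.strip() for field in line.strip().split("\t")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    ne_id, english, arabic, label, decisions, scores = fields
    return NEEntry(
        ne_id=ne_id,
        english=english,
        arabic=arabic,
        label=label,
        decisions=[int(d) for d in decisions.split()],
        scores=[float(s) for s in scores.split()],
    )


# The example line from this README.
example = "13296\t  Nuker Team\tفريق نوكر\tORGANIZATION\t1 0\t0.6 0.166666666666667\t"
entry = parse_ne_line(example)

# Pair each English token with its decision and its Freeman score.
for token, decision, score in zip(entry.english.split(), entry.decisions, entry.scores):
    action = "transliterate" if decision == 1 else "translate"
    print(token, action, score)
```

To read a whole file, open it with encoding="utf-8" and call parse_ne_line on each non-empty line.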


For more information about the Freeman scores, see the following publication:

A.T. Freeman, S.L. Condon, and C.M. Ackerman. 2006. Cross linguistic name matching in English and Arabic: 
a one to many mapping extension of the Levenshtein edit distance algorithm. In Proceedings of NAACL/HLT.