Arabic Named Entity Translation Lexicon Corpus

This dataset contains named entities (NEs) extracted from parallel corpora of newswire and Wikipedia articles. It is described in the paper:

Dudley North visits North London: Learning When to Transliterate to Arabic. Mahmoud Azab, Houda Bouamor, Behrang Mohit and Kemal Oflazer. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), Atlanta, USA, June 2013.

The dataset can be downloaded at http://www.qatar.cmu.edu/~behrang/NETLexicon/

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (see LICENSE).

The dataset is divided into two categories (Wikipedia and newswire), each with a training file and a testing file:

- Training.wiki.txt: the NEs extracted from Wikipedia articles that we used for training the classifier.
- Testing.wiki.txt: the NEs extracted from Wikipedia articles that we used for testing the classifier.
- Training.news.txt: the NEs extracted from the newswire parallel corpus that we used for training the classifier.
- Testing.news.txt: the NEs extracted from the newswire parallel corpus that we used for testing the classifier.

Each data file contains one line per NE. Each line has the following format:

[ID] [English NE] [Arabic NE] [NE label] [Tokens decision] [Freeman score]

- [ID]: The ID of the NE as collected from the corpus
- [English NE]: The English NE tokens
- [Arabic NE]: The corresponding Arabic NE tokens (UTF-8 encoding)
- [NE label]: The NE tag generated by the NE tagger (PERSON, ORGANIZATION, LOCATION or MISC)
- [Tokens decision]: The decision for every token in the NE, either 0 (translate) or 1 (transliterate)
- [Freeman score]: The Freeman score of every token in the NE against its corresponding Arabic token

For example:

13296 Nuker Team فريق نوكر ORGANIZATION 1 0 0.6 0.166666666666667

- [ID] --> 13296
- [English NE] --> Nuker Team
- [Arabic NE] --> فريق نوكر
- [NE label] --> ORGANIZATION
- [Tokens decision] --> 1 0 (1 = transliterate 'Nuker', 0 = translate 'Team')
- [Freeman score] --> 0.6 0.166666666666667 (0.6 = Freeman score between 'Nuker/نوكر', 0.166666666666667 = Freeman score between 'Team/فريق')

For more information about the scores, please see the following publication:

A.T. Freeman, S.L. Condon, and C.M. Ackerman. 2006. Cross linguistic name matching in English and Arabic: a one to many mapping extension of the Levenshtein edit distance algorithm. In Proceedings of NAACL/HLT.
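
The line format above can be read with a short script. The following is only a sketch, not a tool shipped with the dataset. It assumes that fields are separated by whitespace (this README does not state the delimiter), that the NE label is the last token matching one of the four tags, and that the number of decisions, the number of Freeman scores and the number of English tokens are all equal. The names NEEntry and parse_ne_line are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# NE tags listed in this README.
NE_LABELS = {"PERSON", "ORGANIZATION", "LOCATION", "MISC"}


@dataclass
class NEEntry:  # hypothetical container, not part of the dataset
    ne_id: str
    english: List[str]
    arabic: List[str]
    label: str
    decisions: List[int]   # 0 = translate, 1 = transliterate
    freeman: List[float]   # one score per English token


def parse_ne_line(line: str) -> NEEntry:
    """Parse one data line, assuming whitespace-separated tokens."""
    tokens = line.split()

    # Assumption: the NE label is the last token that matches a known tag.
    label_positions = [i for i, t in enumerate(tokens)
                       if i > 0 and t in NE_LABELS]
    if not label_positions:
        raise ValueError("no NE label found in line")
    label_idx = label_positions[-1]

    # Everything after the label: k decisions followed by k scores (assumed).
    tail = tokens[label_idx + 1:]
    if not tail or len(tail) % 2 != 0:
        raise ValueError("expected equally many decisions and scores")
    k = len(tail) // 2

    # Everything between the ID and the label: English tokens, then Arabic.
    names = tokens[1:label_idx]
    if len(names) <= k:
        raise ValueError("missing English or Arabic tokens")

    decisions = [int(t) for t in tail[:k]]
    if any(d not in (0, 1) for d in decisions):
        raise ValueError("decisions must be 0 or 1")

    return NEEntry(
        ne_id=tokens[0],
        english=names[:k],       # assumed: one decision per English token
        arabic=names[k:],
        label=tokens[label_idx],
        decisions=decisions,
        freeman=[float(t) for t in tail[k:]],
    )


# The example line from this README.
example = "13296 Nuker Team فريق نوكر ORGANIZATION 1 0 0.6 0.166666666666667"
entry = parse_ne_line(example)
print(entry.ne_id, entry.english, entry.label, entry.decisions)
```

For the example line, entry.english is ['Nuker', 'Team'] and entry.decisions is [1, 0]. If the actual files use a tab between fields, splitting on tabs first would be more robust than the token-counting heuristic used here, since it would not depend on the English token count matching the number of decisions.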