
PerSenT

A challenge dataset for Person SenTiment analysis in news domain.

What is PerSenT?

PerSenT (Person SenTiment) is a challenge dataset for predicting an author’s sentiment in the news domain.

The dataset is described in our paper, Author’s Sentiment Prediction:

Mohaddeseh Bastan, Mahnaz Koupaee, Youngseo Son, Richard Sicoli, Niranjan Balasubramanian. COLING 2020

We introduce PerSenT, a crowd-sourced dataset that captures the sentiment of an author towards the main entity in a news article. The dataset contains annotations for 5.3k documents and 38k paragraphs covering 3.2k unique entities.

Example

The following example shows a four-paragraph document about an entity (Donald Trump). Each paragraph is labeled separately, and the last row gives the author’s sentiment towards the entity over the whole document.

Image of PerSenT stats

Dataset Statistics

To split the dataset, we separated the entities into four mutually exclusive sets. Due to the nature of news collections, some entities tend to dominate the collection. In our collection, four entities were each the main entity in nearly 800 articles. To prevent these entities from dominating the train or test splits, we moved them to a separate test collection. We then split the remaining articles into training, dev, and test sets at random. Our collection therefore includes two test sets: a standard test set consisting of articles drawn at random (Test Standard), and a test set containing multiple articles about a small number of popular entities (Test Frequent).
Image of PerSenT stats
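
The splitting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build PerSenT: the function name `split_by_entity`, the `(doc_id, entity)` input format, the dev/test fractions, and the random seed are all assumptions made for the example. Only the overall idea (move the most frequent entities to their own test set, then split the remaining entities at random) comes from the description.

```python
import random
from collections import Counter, defaultdict


def split_by_entity(articles, num_frequent=4, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (doc_id, entity) pairs into four entity-disjoint sets.

    The fractions and seed are illustrative assumptions, not the values
    used for the released dataset.
    """
    counts = Counter(entity for _, entity in articles)

    # The most frequent entities are set aside for a separate test set.
    frequent = {entity for entity, _ in counts.most_common(num_frequent)}

    # Shuffle the remaining entities and assign each one to exactly one split,
    # so that no entity appears in more than one set.
    remaining = sorted(e for e in counts if e not in frequent)
    random.Random(seed).shuffle(remaining)
    n_dev = int(len(remaining) * dev_frac)
    n_test = int(len(remaining) * test_frac)

    assignment = {}
    for i, entity in enumerate(remaining):
        if i < n_dev:
            assignment[entity] = "dev"
        elif i < n_dev + n_test:
            assignment[entity] = "test_standard"
        else:
            assignment[entity] = "train"

    # Entities without an assignment are the frequent ones.
    splits = defaultdict(list)
    for doc_id, entity in articles:
        splits[assignment.get(entity, "test_frequent")].append(doc_id)
    return dict(splits)
```

Assigning whole entities rather than individual articles to a split is what keeps the four sets mutually exclusive at the entity level.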

Download the data

You can download the dataset URLs from here

The processed version of the dataset, which contains the paragraphs used along with document-level and paragraph-level labels, can be downloaded separately as train, dev, random test, and fixed test.
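
As a quick sanity check after downloading one of the processed files, something like the following could be used to count the labels in it. This is only a sketch: it assumes the file is a CSV with a header row, and the file name `train.csv` and column name `label` in the commented usage are hypothetical placeholders, so check the actual header of the downloaded file first.

```python
import csv
from collections import Counter


def label_distribution(rows, label_column):
    """Count how often each label appears.

    `rows` is any iterable of CSV lines (for example an open file) whose
    first line is a header; `label_column` is the name of the label column.
    """
    reader = csv.DictReader(rows)
    return Counter(row[label_column] for row in reader)


# Hypothetical usage; the file name and column name are assumptions:
# with open("train.csv", newline="", encoding="utf-8") as f:
#     print(label_distribution(f, "label"))
```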

To recreate the results from the paper, follow the instructions in the README file of the source code.

Liked us? Cite us!

Please use the following bibtex entry:

@inproceedings{bastan2020authors,
      title={Author's Sentiment Prediction}, 
      author={Mohaddeseh Bastan and Mahnaz Koupaee and Youngseo Son and Richard Sicoli and Niranjan Balasubramanian},
      year={2020},
      eprint={2011.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}