Text Style Transfer: A Survey of Datasets

Tasks

informal → formal
toxic → neutral
democratic → republican (under analysis)
impolite → polite (under analysis)
shakespeare → modern
positive → negative

Some datasets are not parallel, which raises the question of what exactly was used for training and evaluation. I am still looking into those.

informal → formal

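A task that rewrites sentences written in informal language into formal language.
The parallel GYAFC (Grammarly’s Yahoo Answers Formality Corpus) dataset is the one most commonly used.
- A total of 110,000 informal/formal sentence pairs
- Yahoo Answers is a question-answering forum and contains a large number of informal sentences
- Sentences shorter than 5 words or longer than 25 words were removed
- Covers multiple domains such as business, entertainment and music, travel, and food
- The authors worked on the two domains that contain the most informal sentences
- They found these to be Entertainment & Music and Family & Relationships, and used those two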
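
The length filter described in the list above can be sketched as follows. This is a minimal illustration, not the GYAFC authors' preprocessing code: the function name `keep_by_length`, the whitespace tokenization, and the sample sentences are all assumptions made for the example.

```python
def keep_by_length(sentences, min_words=5, max_words=25):
    """Keep sentences whose whitespace-token count lies in [min_words, max_words].

    Whitespace splitting is an assumption; the original corpus may have
    counted words with a different tokenizer.
    """
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]


# Made-up sample sentences, for illustration only.
samples = [
    "lol no",                                 # 2 words: dropped as too short
    "i dont think thats a good idea at all",  # 9 words: kept
    " ".join(["word"] * 26),                  # 26 words: dropped as too long
]

filtered = keep_by_length(samples)
print(filtered)
```

Keeping the bounds as parameters makes it easy to check how sensitive the corpus size is to the 5 and 25 word limits.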

Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer (https://arxiv.org/pdf/1803.06535v2)

To use GYAFC, you first need to obtain permission for the underlying Yahoo dataset, following the instructions in the GitHub repository (https://github.com/raosudha89/GYAFC-corpus).

toxic → neutral

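A text detoxification task that rewrites toxic language into neutral language.
The parallel ParaDetox dataset is the one most commonly used.
- Built on the Toloka.ai crowdsourcing platform
  1. Generation of Paraphrases: workers are asked to remove the toxicity from a given sentence while preserving its content
  2. Content Preservation Check: workers are shown the generated paraphrase and the original sentence and asked to judge whether the meaning is preserved
  3. Toxicity Check: finally, workers verify that the toxicity was actually removed
- Paraphrases were obtained for 11,939 toxic sentences (1.66 paraphrases per sentence on average), for a total of 19,766 paraphrases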
- https://github.com/s-nlp/paradetox
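
The reported statistics can be sanity-checked with a small sketch. It assumes the data is available as (toxic, neutral) pairs, which is a hypothetical in-memory layout and not necessarily how the repository stores it; `paraphrase_stats` and the toy pairs are made up for illustration.

```python
from collections import defaultdict


def paraphrase_stats(pairs):
    """Return (unique toxic sources, total paraphrases, average per source)."""
    by_source = defaultdict(list)
    for toxic, neutral in pairs:
        by_source[toxic].append(neutral)
    n_sources = len(by_source)
    n_paraphrases = sum(len(rewrites) for rewrites in by_source.values())
    return n_sources, n_paraphrases, n_paraphrases / n_sources


# Placeholder pairs standing in for real (toxic, neutral) rows.
toy_pairs = [
    ("toxic sentence A", "neutral rewrite A1"),
    ("toxic sentence A", "neutral rewrite A2"),
    ("toxic sentence B", "neutral rewrite B1"),
]
print(paraphrase_stats(toy_pairs))  # (2, 3, 1.5)

# The figures quoted above are mutually consistent: 19,766 / 11,939 rounds to 1.66.
reported_average = round(19766 / 11939, 2)
print(reported_average)  # 1.66
```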



shakespeare → modern

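A task that converts text from modern English into Shakespearean English. Because the corpus is parallel, the same pairs also serve the reverse direction named in the heading.
Example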

modern	shakespeare
A jumbled confession can only receive a jumbled absolution .	Riddling confession finds but riddling shrift .
I love rich Capulet's daughter .	Then plainly know my heart's dear love is set On the fair daughter of rich Capulet .
We're bound to each other in every possible way , except we need you to marry us .	As mine on hers , so hers is set on mine , And all combined , save what thou must combine By holy marriage .
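
Rows like the ones above can be loaded into sentence pairs with a short sketch. It assumes a tab-separated layout with a `modern` / `shakespeare` header, mirroring the table in this post; the repository's actual file layout may differ, and `read_pairs` is a hypothetical helper.

```python
import csv
import io

# Two rows copied from the example table, in an assumed TSV layout.
raw = (
    "modern\tshakespeare\n"
    "I love rich Capulet's daughter .\t"
    "Then plainly know my heart's dear love is set On the fair daughter of rich Capulet .\n"
    "A jumbled confession can only receive a jumbled absolution .\t"
    "Riddling confession finds but riddling shrift .\n"
)


def read_pairs(handle):
    """Parse tab-separated rows into (modern, shakespeare) tuples."""
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["modern"], row["shakespeare"]) for row in reader]


pairs = read_pairs(io.StringIO(raw))
for modern, shakespeare in pairs:
    print(modern, "->", shakespeare)
```

Swapping the tuple order gives the pairs for the opposite transfer direction.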

The paper and the dataset can be downloaded from the GitHub repository below.
- https://github.com/harsh19/Shakespearizing-Modern-English/tree/master


positive → negative

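The most widely used task: converting negative sentences into positive ones, or vice versa.
The Yelp and Amazon datasets are the best known.
Both are review datasets. The training data has only labels, such as positive (1) and negative (0), but a parallel dataset is provided for testing.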
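
The difference between the label-only training data and the parallel test data can be made concrete with a small sketch. The training rows are made up, the dictionary keys are an assumed representation, and `split_by_label` is a hypothetical helper; only the test pair is taken from the Yelp example in this post.

```python
from collections import defaultdict


def split_by_label(labeled_reviews):
    """Group label-only training sentences into one corpus per style label."""
    corpora = defaultdict(list)
    for text, label in labeled_reviews:
        corpora[label].append(text)
    return dict(corpora)


# Non-parallel training data: each sentence carries only a label
# (1 = positive, 0 = negative) and has no aligned counterpart.
train = [
    ("the staff was friendly and fast", 1),
    ("the food arrived cold", 0),
    ("great coffee and a cozy patio", 1),
]

# Parallel test data: each item holds the same content in both styles.
test = [
    {
        "positive": "Ever since joes has changed hands it's gotten better and better.",
        "negative": "ever since joes has changed hands it 's just gotten worse and worse",
    }
]

corpora = split_by_label(train)
print(len(corpora[1]), "positive /", len(corpora[0]), "negative training sentences")
print(len(test), "parallel test pair")
```
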
Yelp
- Parallel evaluation set of 1,000 sentences
- https://www.yelp.com/dataset

Positive	Negative
Ever since joes has changed hands it's gotten better and better.	ever since joes has changed hands it 's just gotten worse and worse


Amazon
- Parallel evaluation set of 1,000 sentences

Positive	Negative
this is honestly the only case i've kept for so long.	this is honestly the only case i ve thrown away in the garbage .

Both evaluation sets can be downloaded and inspected in the GitHub repository of the paper below.
- https://github.com/MANGA-UOFA/Prompt-Edit/tree/main/data


License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)


채민의 딥러닝 블로그 (Chaemin's Deep Learning Blog)
