PySpark for Natural Language Processing on Dataproc

About this codelab

Last updated Jun 25, 2021

Written by bmiro

1 Overview

Natural language processing (NLP) is the study of deriving insight from and conducting analytics on textual data. As the amount of writing generated on the internet continues to grow, organizations are seeking to leverage their text more than ever to gain information relevant to their businesses.

NLP can be used for everything from translating languages to analyzing sentiment to generating sentences from scratch, and much more. It is an active area of research that is transforming the way we work with text.

We will explore how to use NLP on large amounts of textual data at scale. This can certainly be a daunting task. Fortunately, we will take advantage of libraries such as Spark MLlib and spark-nlp to make it easier.

2 Our Use Case

The senior data scientist of our (fictional) organization, "FoodCorp", is interested in learning more about trends in the food industry. We have access to a corpus of text data in the form of posts from r/food, a Reddit subforum, which we will use to explore what people are talking about.

One approach to doing this is an NLP method known as "topic modeling". Topic modeling is a statistical method that can identify trends in the semantic meaning of a group of documents. In other words, we can build a topic model on our corpus of Reddit "posts", which will generate a list of "topics", or groups of words, that describe a trend.

To build our model, we will use an algorithm called Latent Dirichlet Allocation (LDA), which is often used to cluster text. An excellent introduction to LDA can be found here.

3 Creating a Project

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project.

Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than 200 baht, but it could be more if you decide to use more resources or if you leave them running. The PySpark-BigQuery and Spark-NLP codelabs each explain "Clean Up" at the end.

New users of Google Cloud Platform are eligible for a $300 free trial.

4 Setting Up Our Environment

First, we need to enable the Dataproc and Compute Engine APIs.

Click the menu icon at the top left of the screen.

Select API Manager from the drop-down menu.

Click Enable APIs and Services.

Search for "Compute Engine" in the search box. Click "Google Compute Engine API" in the results list that appears.

On the Google Compute Engine page, click Enable.

Once it is enabled, click the arrow pointing left to go back.

Now search for "Google Dataproc API" and enable it as well.

Next, open Cloud Shell by clicking the button in the top right-hand corner of the cloud console.

We're going to set some environment variables that we can reference as we proceed with the codelab. First, pick a name for the Dataproc cluster that we're going to create, such as "my-cluster", and set it in your environment. Feel free to use whatever name you like.

CLUSTER_NAME=my-cluster

Next, choose a region from the ones available here. An example might be us-east1.

REGION=us-east1

Finally, we need to set the source bucket that our job is going to read data from. We have sample data available in the bucket bm_reddit, but feel free to use the data you generated from the PySpark for Preprocessing BigQuery Data codelab if you completed it before this one.

BUCKET_NAME=bm_reddit

With our environment variables configured, let's run the following command to create our Dataproc cluster:

 gcloud beta dataproc clusters create ${CLUSTER_NAME} \
     --region ${REGION} \
     --metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.7.2' \
     --worker-machine-type n1-standard-8 \
     --num-workers 4 \
     --image-version 1.4-debian10 \
     --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
     --optional-components=JUPYTER,ANACONDA \
     --enable-component-gateway

Let's step through each part of this command:

gcloud beta dataproc clusters create ${CLUSTER_NAME}: initiates the creation of a Dataproc cluster with the name you provided earlier. We include beta here to enable beta features of Dataproc such as Component Gateway, which we discuss below.

--region ${REGION}: sets the location of the cluster.

--worker-machine-type n1-standard-8: the type of machine to use for our workers.

--num-workers 4: we will have four workers on our cluster.

--image-version 1.4-debian10: the image version of Dataproc that we'll use.

--initialization-actions ...: initialization actions are custom scripts that are executed when creating clusters and workers. They can either be created by the user and stored in a GCS bucket, or referenced from the public bucket dataproc-initialization-actions. The initialization action included here allows Python packages to be installed using Pip, as provided with the --metadata flag.

--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.7.2': a space-separated list of packages to install into Dataproc. In this case, we install the google-cloud-storage Python client library and spark-nlp, pinned to version 2.7.2.

--optional-components=JUPYTER,ANACONDA: optional components are common packages used with Dataproc that are installed automatically on Dataproc clusters during creation. Advantages of using optional components over initialization actions include faster startup times and being tested for specific Dataproc versions. Overall, they are more reliable.

--enable-component-gateway: this flag lets us take advantage of Dataproc's Component Gateway for viewing common UIs such as Zeppelin, Jupyter, or the Spark History server. Note: some of these require the associated optional component.

For a more in-depth introduction to Dataproc, please check out this codelab.

Next, run the following commands in your Cloud Shell to clone the repo with the sample code and cd into the correct directory:

cd
git clone https://github.com/GoogleCloudPlatform/cloud-dataproc
cd cloud-dataproc/codelabs/spark-nlp

5 Spark MLlib

Spark MLlib is a scalable machine learning library written in Apache Spark. By leveraging the efficiency of Spark with a suite of fine-tuned machine learning algorithms, MLlib can analyze large amounts of data. It has APIs in Java, Scala, Python and R. In this codelab, we will focus specifically on Python.

MLlib contains a large set of transformers and estimators. A transformer is a tool that can mutate or alter your data, typically with a transform() function, while an estimator is a pre-built algorithm that you can train on your data, typically with a fit() function.
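The difference between the two roles can be sketched in plain Python, without Spark. The class names below (LowercaseTransformer, VocabularyEstimator, VocabularyModel) are hypothetical and are not part of MLlib; they only mirror the transform() and fit() convention described above.

```python
from collections import Counter


class LowercaseTransformer:
    """Transformer: changes the data directly, no training required."""

    def transform(self, docs):
        return [doc.lower() for doc in docs]


class VocabularyModel:
    """Result of fitting an estimator; it is itself a transformer."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, docs):
        # One count per vocabulary term, in vocabulary order.
        return [[doc.split().count(term) for term in self.vocabulary]
                for doc in docs]


class VocabularyEstimator:
    """Estimator: learns from the data in fit() and returns a model."""

    def fit(self, docs):
        counts = Counter(word for doc in docs for word in doc.split())
        # Most frequent first, ties broken alphabetically for a stable order.
        vocabulary = sorted(counts, key=lambda w: (-counts[w], w))
        return VocabularyModel(vocabulary)


docs = LowercaseTransformer().transform(["Pizza pizza", "Taco pizza"])
model = VocabularyEstimator().fit(docs)
vectors = model.transform(docs)
print(model.vocabulary, vectors)
```

Note that fit() returns a new object that has its own transform(); this mirrors how a fitted estimator in MLlib yields a model that can then be applied to data.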

Examples of transformers include:

tokenization (splitting a string of words into individual tokens)
one-hot encoding (creating a sparse vector of numbers representing the words present in a string)
stopwords remover (removing words that do not add semantic value to a string)

Examples of estimators include:

classification (is this an apple or an orange?)
regression (how much should this apple cost?)
clustering (how similar are all of the apples to one another?)
decision trees (if color == orange, then it's an orange. Otherwise it's an apple)
dimensionality reduction (can we remove features from our dataset and still differentiate between an apple and an orange?)

MLlib also contains tools for other common methods in machine learning, such as hyperparameter tuning and selection, as well as cross-validation.

Additionally, MLlib contains the Pipelines API, which lets you build data transformation pipelines from different transformers that can be re-executed.
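To illustrate the idea of a re-executable pipeline, here is a minimal plain-Python sketch. SimplePipeline and all of the stage classes are hypothetical stand-ins rather than the MLlib Pipelines API; the sketch only assumes that stages with a fit() method are estimators and that every other stage is a transformer.

```python
from collections import Counter


class Lowercase:
    def transform(self, rows):
        return [row.lower() for row in rows]


class SplitOnSpaces:
    def transform(self, rows):
        return [row.split() for row in rows]


class FrequentWordFilter:
    """Estimator: learns which words occur at least min_count times."""

    def __init__(self, min_count=2):
        self.min_count = min_count

    def fit(self, rows):
        counts = Counter(word for row in rows for word in row)
        keep = {w for w, c in counts.items() if c >= self.min_count}
        return FrequentWordFilterModel(keep)


class FrequentWordFilterModel:
    def __init__(self, keep):
        self.keep = keep

    def transform(self, rows):
        return [[w for w in row if w in self.keep] for row in rows]


class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by earlier stages.
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return SimplePipelineModel(fitted)


class SimplePipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


pipeline = SimplePipeline([Lowercase(), SplitOnSpaces(), FrequentWordFilter(2)])
pipeline_model = pipeline.fit(["Pizza night", "pizza party", "Taco night"])
result = pipeline_model.transform(["PIZZA taco NIGHT"])
print(result)
```

Once fitted, the same pipeline_model can be applied to new data as many times as needed, which is the property the Pipelines API is meant to provide.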

6 Spark-NLP

Spark-nlp is a library created by John Snow Labs for performing efficient natural language processing tasks using Spark. It contains built-in tools called annotators for common tasks such as:

tokenization (splitting a string of words into individual tokens)
creating word embeddings (defining the relationship between words via vectors)
part-of-speech tagging (which words are nouns? which are verbs?)

While outside the scope of this codelab, spark-nlp also integrates nicely with TensorFlow.

Perhaps most significantly, Spark-NLP extends the capabilities of Spark MLlib by providing components that slot easily into MLlib pipelines.

7 Best Practices for Natural Language Processing

Before we can extract useful information from our data, we need to take care of some housekeeping. The preprocessing steps we will take are as follows:

Tokenization

The first thing we traditionally want to do is "tokenize" the data. This involves taking the data and splitting it up based on "tokens", or words. Generally, we remove punctuation and set all words to lowercase in this step. For instance, say we have the following string: What time is it? After tokenization, this sentence would consist of four tokens: "what", "time", "is", "it". We don't want the model to treat "What" and "what" as two different words just because their capitalization differs. Additionally, punctuation typically doesn't help us better learn inference from the words, so we strip it as well.
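As a rough illustration of this step, the following sketch tokenizes the example sentence with a regular expression. The tokenize function is a hypothetical helper, not the spark-nlp Tokenizer, which handles many more cases.

```python
import re


def tokenize(text):
    # Lowercase first so "What" and "what" become the same token, then keep
    # runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("What time is it?")
print(tokens)
```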

Normalization

We often want to "normalize" the data. This replaces words of similar meaning with the same thing. For example, if the words "fought", "battled" and "dueled" are found in the text, normalization may replace "battled" and "dueled" with the word "fought".
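One simple way to picture this kind of replacement is a lookup table. The sketch below assumes a hand-written mapping (SYNONYMS) and a hypothetical normalize helper; it is not how the spark-nlp Normalizer is implemented.

```python
# Hypothetical mapping from variant words to one canonical word.
SYNONYMS = {"battled": "fought", "dueled": "fought"}


def normalize(tokens, mapping=SYNONYMS):
    # Replace a token when the mapping knows it, otherwise keep it unchanged.
    return [mapping.get(token, token) for token in tokens]


normalized = normalize(["they", "battled", "and", "dueled", "and", "fought"])
print(normalized)
```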

Stemming

Stemming replaces words with their root meaning. For example, the words "cars", "cars'" and "car's" would all be replaced with the word "car", as these words all imply the same thing at their root.
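The sketch below shows a deliberately naive suffix-stripping rule that is just enough to handle the "car" example. naive_stem is a hypothetical helper; real stemmers, including the one in spark-nlp, apply far more rules.

```python
def naive_stem(token):
    # Drop a trailing possessive: "car's" -> "car", "cars'" -> "cars".
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("'"):
        token = token[:-1]
    # Drop a simple plural "s", leaving short words and "ss" endings alone.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


stems = [naive_stem(t) for t in ["car", "cars", "cars'", "car's"]]
print(stems)
```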

Removing Stopwords

Stopwords are words such as "and" and "the" that typically add no value to the semantic meaning of a sentence. We generally want to remove them to reduce the noise in our text datasets.
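Stopword removal amounts to filtering tokens against a list. The sketch below uses a tiny hand-picked set (STOPWORDS) as an assumption; the StopWordsRemover used later in the job ships with a much longer default list.

```python
# A tiny, hand-picked stopword set used only for illustration.
STOPWORDS = {"and", "the", "of", "a", "is", "it"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens that are not in the stopword set.
    return [token for token in tokens if token not in stopwords]


filtered = remove_stopwords(["the", "pizza", "and", "the", "pasta"])
print(filtered)
```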

8 Running through the Job

Let's take a look at the job we're going to run. The code is available at cloud-dataproc/codelabs/spark-nlp/topic_model.py. Spend at least a few minutes reading through it and its comments to understand what is happening. We will also highlight some of the sections below:

# Python imports
import sys

# spark-nlp components. Each one is incorporated into our pipeline.
from sparknlp.annotator import Lemmatizer, Stemmer, Tokenizer, Normalizer
from sparknlp.base import DocumentAssembler, Finisher

# A Spark Session is how we interact with Spark SQL to create Dataframes
from pyspark.sql import SparkSession

# These allow us to create a schema for our data
from pyspark.sql.types import StructField, StructType, StringType, LongType

# Spark Pipelines allow us to sequentially add components such as transformers
from pyspark.ml import Pipeline

# These are components we will incorporate into our pipeline.
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF

# LDA is our model of choice for topic modeling
from pyspark.ml.clustering import LDA

# Some transformers require the usage of other Spark ML functions. We import them here
from pyspark.sql.functions import col, lit, concat

# This will help catch some PySpark errors
from pyspark.sql.utils import AnalysisException

# Assign bucket where the data lives
try:
    bucket = sys.argv[1]
except IndexError:
    print("Please provide a bucket name")
    sys.exit(1)

# Create a SparkSession under the name "reddit topic model". Viewable via the Spark UI
spark = SparkSession.builder.appName("reddit topic model").getOrCreate()

# Create a three column schema consisting of two strings and a long integer
fields = [StructField("title", StringType(), True),
          StructField("body", StringType(), True),
          StructField("created_at", LongType(), True)]
schema = StructType(fields)

# We'll attempt to process every year / month combination below.
years = ['2016', '2017', '2018', '2019']
months = ['01', '02', '03', '04', '05', '06',
          '07', '08', '09', '10', '11', '12']

# This is the subreddit we're working with.
subreddit = "food"

# Create a base dataframe.
reddit_data = spark.createDataFrame([], schema)

# Keep a running list of all files that will be processed
files_read = []

for year in years:
    for month in months:

        # In the form of gs://<bucket>/reddit_posts/<year>/<month>/<subreddit>.csv.gz
        gs_uri = f"gs://{bucket}/reddit_posts/{year}/{month}/{subreddit}.csv.gz"

        # If the file doesn't exist we will simply continue and not
        # log it into our "files_read" list
        try:
            reddit_data = (
                spark.read.format('csv')
                .options(codec="org.apache.hadoop.io.compress.GzipCodec")
                .load(gs_uri, schema=schema)
                .union(reddit_data)
            )

            files_read.append(gs_uri)

        except AnalysisException:
            continue

if len(files_read) == 0:
    print('No files read')
    sys.exit(1)

# Replacing null values with their respective typed-equivalent is usually
# easier to work with. In this case, we'll replace nulls with empty strings.
# Since some of our data doesn't have a body, we can combine all of the text
# for the titles and bodies so that every row has useful data.

df_train = (
    reddit_data
    # Replace null values with an empty string
    .fillna("")
    .select(
         # Combine columns
        concat(
            # First column to concatenate. col() is used to specify that we're referencing a column
            col("title"),
            # Literal character that will be between the concatenated columns.
            lit(" "),
            # Second column to concatenate.
            col("body")
        # Change the name of the new column
        ).alias("text")
    )
)

# Now, we begin assembling our pipeline. Each component here applies some transformation to the data.
# The Document Assembler takes the raw text data and converts it into a format that can
# be tokenized. It becomes one of spark-nlp's native object types, the "Document".
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# The Tokenizer takes data that is of the "Document" type and tokenizes it.
# While slightly more involved than this, this is effectively taking a string and splitting
# it along the spaces, so each word is its own string. The data then becomes the
# spark-nlp native type "Token".
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# The Normalizer will group words together based on similar semantic meaning.
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalizer")

# The Stemmer takes objects of class "Token" and converts the words into their
# root meaning. For instance, the words "cars", "cars'" and "car's" would all be replaced
# with the word "car".
stemmer = Stemmer().setInputCols(["normalizer"]).setOutputCol("stem")

# The Finisher converts spark-nlp annotations back into plain columns so that we can access
# the data outside of spark-nlp components. For instance, we can now feed the data into
# components from Spark MLlib.
finisher = Finisher().setInputCols(["stem"]).setOutputCols(["to_spark"]).setValueSplitSymbol(" ")

# Stopwords are common words that generally don't add much detail to the meaning
# of a body of text. In English, these are mostly function words such as "the"
# and "of".
stopword_remover = StopWordsRemover(inputCol="to_spark", outputCol="filtered")

# Here we implement TF-IDF as an input to our LDA model. CountVectorizer (TF) keeps track
# of the vocabulary that's being created so we can map our topics back to their
# corresponding words.
# TF (term frequency) creates a matrix that counts how many times each word in the
# vocabulary appears in each body of text. This then gives each word a weight based
# on its frequency.
tf = CountVectorizer(inputCol="filtered", outputCol="raw_features")

# Here we implement the IDF portion. IDF (Inverse document frequency) reduces
# the weights of commonly-appearing words.
idf = IDF(inputCol="raw_features", outputCol="features")

# LDA creates a statistical representation of how frequently words appear
# together in order to create "topics" or groups of commonly appearing words.
lda = LDA(k=10, maxIter=10)

# We add all of the transformers into a Pipeline object. Each transformer
# will execute in the order provided to the "stages" parameter
pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        normalizer,
        stemmer,
        finisher,
        stopword_remover,
        tf,
        idf,
        lda
    ]
)

# We fit the data to the model.
model = pipeline.fit(df_train)

# Now that we have completed a pipeline, we want to output the topics as human-readable.
# To do this, we need to grab the vocabulary generated from our pipeline, grab the topic
# model and do the appropriate mapping.  The output from each individual component lives
# in the model object. We can access them by referring to them by their position in
# the pipeline via model.stages[<ind>]

# Let's create a reference to our vocabulary.
vocab = model.stages[-3].vocabulary

# Next, let's grab the topics generated by our LDA model via describeTopics(). Using collect(),
# we load the output into a Python array.
raw_topics = model.stages[-1].describeTopics().collect()

# Lastly, let's get the indices of the vocabulary terms from our topics
topic_inds = [ind.termIndices for ind in raw_topics]

# The indices we just grabbed directly map to the term at position <ind> from our vocabulary.
# Using the below code, we can generate the mappings from our topic indices to our vocabulary.
topics = []
for topic in topic_inds:
    _topic = []
    for ind in topic:
        _topic.append(vocab[ind])
    topics.append(_topic)

# Let's see our topics!
for i, topic in enumerate(topics, start=1):
    print(f"topic {i}: {topic}")
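The TF-IDF weighting described in the job's comments can be worked through by hand on a toy corpus. The sketch below is plain Python with made-up documents, not the Spark implementation. It assumes the smoothed IDF formula log((N + 1) / (df + 1)), which is the one documented for Spark MLlib, and it orders the vocabulary alphabetically for readability, whereas CountVectorizer orders terms by frequency.

```python
import math
from collections import Counter

# Toy corpus of already tokenized and filtered documents (made-up data).
docs = [["pizza", "cheese", "pizza"], ["pizza", "pasta"], ["cake", "cheese"]]

# Vocabulary: every distinct term, sorted so that positions are stable.
vocab = sorted({word for doc in docs for word in doc})

# TF: how many times each vocabulary term appears in each document.
tf = [[Counter(doc)[term] for term in vocab] for doc in docs]

# DF: in how many documents each term appears at least once.
n_docs = len(docs)
df = [sum(1 for doc in docs if term in doc) for term in vocab]

# IDF: terms that appear in many documents receive a smaller weight.
idf = [math.log((n_docs + 1) / (f + 1)) for f in df]

# TF-IDF: term counts scaled by the inverse document frequency.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

for term, weight in zip(vocab, idf):
    print(f"{term}: idf={weight:.3f}")
```

Because the vocabulary keeps a fixed order, an index into any of these vectors maps straight back to a word, which is the same property the job relies on when it turns termIndices into readable topics.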

Running the Job

Let's go ahead and run our job. Run the following command:

gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    --region ${REGION} \
    --properties=spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.2 \
    --driver-log-levels root=FATAL \
    topic_model.py \
    -- ${BUCKET_NAME}

This command lets us use the Dataproc Jobs API. Including the pyspark command tells the cluster that this is a PySpark job. We supply the cluster name, optional parameters from those available here, and the name of the file containing the job. In our case, we provide the --properties parameter, which lets us change various properties for Spark, Yarn, or Dataproc. We are changing the Spark property spark.jars.packages, which tells Spark that we want to include spark-nlp as packaged with our job. We also provide the parameter --driver-log-levels root=FATAL, which suppresses most of the log output from PySpark except for errors. In general, Spark logs tend to be noisy.

Lastly, -- ${BUCKET_NAME} is a command-line argument for the Python script itself that provides the bucket name. Note the space between -- and ${BUCKET_NAME}.
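To make the hand-off concrete, the sketch below shows how the bucket name arrives in sys.argv and how the job turns it into input paths. bucket_from_argv and input_uris are hypothetical helpers written for this illustration; the path layout is copied from topic_model.py.

```python
def bucket_from_argv(argv):
    # argv[0] is the script name; the value given after "--" follows it.
    if len(argv) < 2:
        raise SystemExit("Please provide a bucket name")
    return argv[1]


def input_uris(bucket, years, months, subreddit="food"):
    # One gzipped CSV per year and month, in the layout the job reads.
    return [
        f"gs://{bucket}/reddit_posts/{year}/{month}/{subreddit}.csv.gz"
        for year in years
        for month in months
    ]


# Simulates what the script sees when the job is submitted with "-- bm_reddit".
bucket = bucket_from_argv(["topic_model.py", "bm_reddit"])
uris = input_uris(bucket, ["2016", "2017"], ["01", "02"])
print(uris[0])
```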

After a few minutes of running the job, we should see output containing our model.

Awesome! Can you infer trends by looking at the output from your model? How about ours?

From our output, one might infer a trend from topic 8 pertaining to breakfast food, and desserts from topic 9.

9 Cleaning Up

To avoid incurring unnecessary charges to your GCP account after completing this quickstart:

Delete the Cloud Storage bucket for the environment that you created
Delete the Dataproc environment

If you created a project just for this codelab, you can also optionally delete the project:

In the GCP Console, go to the Projects page.
In the project list, select the project you want to delete and click Delete.
In the box, type the project ID, and then click Shut down to delete the project.

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for this tutorial, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

License

This work is licensed under a Creative Commons Attribution 3.0 Generic License, and the Apache 2.0 license.
