Building a large, high-quality corpus for natural language processing (NLP) is no easy task. Textual data can be large, cumbersome, and unwieldy, and unlike pure numbers or categorical data in rows and columns, differences between documents can be difficult to detect. In organizations where documents are shared, changed, and shared again before being archived, the problem of duplication can become overwhelming.
To find exact duplicates, comparing all pairs of strings is the simplest approach, but it is neither efficient nor sufficient. Hashing algorithms such as MD5 or SHA-1 can get the correct result much faster, but near-duplicates will still stay off the radar. Text similarity measures are useful for finding similar files. There are different approaches, and each has its own way of defining which documents count as duplicates. In turn, the definition of duplicate documents affects the type of processing and the results obtained. A quick sketch of the hashing idea is shown below, and the sections that follow walk through two options in SAS.
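As a quick aside, here is a minimal sketch of the hashing approach, assuming the corpus is simply a directory of plain text files (such as the /home/anc2 directory used later in this article). Files with identical contents land in the same MD5 bucket; near-duplicates, of course, do not.

# A minimal sketch of exact-duplicate detection with MD5 hashing.
# Assumes the corpus is a directory of plain text files, e.g. /home/anc2.
import hashlib
from pathlib import Path
from collections import defaultdict

def find_exact_duplicates(corpus_dir):
    """Group files by the MD5 hash of their raw bytes."""
    groups = defaultdict(list)
    for path in Path(corpus_dir).glob("*.txt"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only hashes shared by two or more files
    return [files for files in groups.values() if len(files) > 1]

for group in find_exact_duplicates("/home/anc2"):
    print(group)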
Using SAS Visual Text Analytics, you can configure and perform this task during your corpus analysis journey with the Python SWAT package or PROC SQL in SAS.
Working with Python SWAT
The Python SWAT package provides a Python interface to SAS Cloud Analytic Services (CAS). In this article, we call the profileText action, download the output tables, and perform duplicate detection in Python.
Prepare the data
The corpus we are going to study is the second release of the American National Corpus (ANC2). It is also one of the reference corpora for the profileText action. The corpus contains over 22,000,000 words of written and spoken texts and includes both annotated data and plain text.
We put all 13295 plain text files under /home/anc2. After connecting to the CAS server, we create a table TESTDATA with ANC2 data.
# Import libraries
import swat
from collections import Counter
import pandas as pd
import itertools
import random

# Connect to CAS server
s = swat.CAS("cloud.example.com", 5570)

# Add the caslib mycas with the path to the corpus directory
s.addCaslib(caslib='mycas',
            datasource={"srcType":"path"},
            session=False,
            path="/home",
            subdirectories="yes")

# Load txt files under anc2/ to the CASTable testdata
s.loadTable(casout={"name":"TESTDATA", "replace":True},
            caslib="mycas",
            importOptions={"fileType":"Document"},
            path="anc2")
Output:
We can easily check the table using, for example, columnInfo() or head().
# View column summary for testdata
anc2 = s.CASTable("TESTDATA", replace=True)
anc2.columninfo()
Output:
# Check the first five rows
anc2.head()
Output:
Profile the data
We load the textManagement action set and call the profileText action to profile the ANC2 data. The casOut parameter is required to run the action; its output table contains information complexity, information density, and vocabulary diversity statistics. For duplicate identification we need the results from two other output tables, documentOut and intermediateOut. A CASTable can be converted to a SASDataFrame with the CASTable.to_frame() method, which brings all the data down to the client for further investigation.
# Load the action set textManagement
s.loadactionset('textManagement')

# Call the action profileText
results = s.profileText(table=dict(caslib="mycas", name="testdata"),
                        documentid="fileName",
                        text="content",
                        language="english",
                        casOut=dict(name="casOut", replace=True),
                        documentOut=dict(name="docOut", replace=True),
                        intermediateOut=dict(name="interOut", replace=True))
documentOut contains document-level information complexity statistics. For each file, we know its total number of sentences and the maximum number of tokens among those sentences.
# Convert the CASTable docOut to a SASDataFrame
df_docout = s.CASTable('docOut').to_frame()
df_docout.head()
Output:
The other output, intermediateOut, contains the token count of each sentence in each document.
# Convert the CASTable interOut to a SASDataFrame
df_interout = s.CASTable('interOut').to_frame()
df_interout.head()
Output:
Filter the data
Our goal is to find both identical documents and documents that are not identical but similar in content. To narrow the search down to good candidates, we assume that if two files have the same number of sentences and the same maximum number of tokens per sentence, they have a higher chance of being duplicates or near-duplicates.
Under this assumption, we keep the documents whose value pair of _NUM_SENTENCES_ and _MAX_TOKENS_SENTENCE_ occurs more than once, leaving 8972 of the 13295 files.
# Filter out docs whose column values appear more than once
df_docout_selected = df_docout[df_docout.groupby(['_NUM_SENTENCES_','_MAX_TOKENS_SENTENCE_'])
                               ['_NUM_SENTENCES_'].transform('size') > 1]
print(f"Number of rows after selection: {len(df_docout_selected)}")
df_docout_selected.head()
Output:
You can narrow the results even further by specifying conditions, such as selecting only documents with more than 200 sentences, or with a maximum of more than 80 tokens per sentence.
# (Optional) Reduce search results by filtering out docs by condition
df_docout_selected = df_docout_selected[df_docout_selected._NUM_SENTENCES_ > 200]
df_docout_selected = df_docout_selected[df_docout_selected._MAX_TOKENS_SENTENCE_ > 80]
Next, we prepare the pairs of file combinations that share the same _NUM_SENTENCES_ and _MAX_TOKENS_SENTENCE_ values. Note that sometimes more than two files share the same values. The total number of unique pairs is 14617.
# Keep only the interOut data for files that are selected
search_dict = df_docout_selected.set_index('fileName').T.to_dict('list')
df_interout_selected = df_interout[df_interout['fileName'].isin(search_dict.keys())]

# Get all unique combinations of every two docs
check_tmp_dict = Counter([tuple(s) for s in search_dict.values()])
file_pair_lst = []
for c in check_tmp_dict:
    file_pair = [k for k, v in search_dict.items() if tuple(v) == c]
    if len(file_pair) == 2:
        file_pair_lst.append(tuple(file_pair))
    else:
        pair_lst = list(itertools.combinations(file_pair, 2))
        file_pair_lst += pair_lst

print(f"Number of unique pairs is: {len(file_pair_lst)}\n")
print(f"The first five pairs are: {file_pair_lst[:5]}")
Output:
Compare the data
Finding near-duplicates is more difficult than finding exact duplicates, and there is no gold standard for the similarity threshold between two near-duplicates. Based on the _NUM_TOKENS_ per _SENTENCE_ID_ from the interOut table above, we add the assumption that two documents have a very high chance of being near-duplicates if they share the same number of tokens for sentences whose indices are randomly selected at a defined ratio of the total number of sentences.
For example, suppose fileA and fileB have 20 sentences each and the defined ratio is 0.5. We use pandas.Series.sample to randomly select 10 sentences from each of the two files. The random_state value is required to ensure that the sentences are drawn from the two files in parallel, at the same positions. If every selected pair of sentences has the same number of tokens, fileA and fileB are considered near-duplicates.
Now we are ready for the comparison.
# Define the ratio of sentences to sample (0.5 in the example above)
ratio_tocheck = 0.5

# Compare doc pairs
possibleDuplicate = []
for (a, b) in file_pair_lst:
    # Keep only the column _NUM_TOKENS_
    tmp_a = df_interout_selected[df_interout_selected['fileName']==a].loc[:,"_NUM_TOKENS_"]
    tmp_b = df_interout_selected[df_interout_selected['fileName']==b].loc[:,"_NUM_TOKENS_"]
    # Reset the index to use pandas.Series.compare
    tmp_a.reset_index(drop=True, inplace=True)
    tmp_b.reset_index(drop=True, inplace=True)
    # Select sentences with pandas.Series.sample at the defined ratio
    num_sent, num_sent_tocheck = len(tmp_a), round(ratio_tocheck*len(tmp_a))
    tmp_a = tmp_a.sample(num_sent_tocheck, random_state=1)
    tmp_b = tmp_b.sample(num_sent_tocheck, random_state=1)
    # Detect duplicates by checking whether the comparison is an empty dataframe (shape (0,2))
    if tmp_a.compare(tmp_b).shape != (0,2):
        pass
    else:
        possibleDuplicate.append([a, b])
The possibleDuplicate list contains 188 pairs of file names.
# View the result
view = '======\n' + '\n'.join([" ".join(p) for p in possibleDuplicate]) + '\n======'
print(f"NOTE: [ {len(possibleDuplicate)} ] possible duplicate pairs -> \n{view}")
Output:
Check the results
Now it’s time to see how far we’ve come in our duplicate search. By examining the contents of each pair, it is not difficult to identify 133 exact duplicates and 55 near-duplicates. Let’s look at two of the near-duplicate pairs we found: these documents have about 50 sentences each, and the differences lie in only 2 sentences.
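If you want to verify the pairs programmatically rather than by eye, a small sketch like the one below can split the candidates into exact and near-duplicates. It assumes the plain text files are still available on disk under /home/anc2 and that the fileName values match the file names on disk.

# A sketch of the verification step (assumptions: files are still under /home/anc2
# and the names in possibleDuplicate match the file names on disk).
from pathlib import Path

corpus_dir = Path("/home/anc2")
exact, near = [], []
for a, b in possibleDuplicate:
    text_a = (corpus_dir / a).read_text(errors="ignore")
    text_b = (corpus_dir / b).read_text(errors="ignore")
    (exact if text_a == text_b else near).append((a, b))

print(f"{len(exact)} exact duplicates, {len(near)} near-duplicates")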
Working with PROC SQL in SAS
SQL is one of the many languages built into the SAS system. Using PROC SQL, you have access to a powerful tool for manipulating and querying data.
Prepare the data
We load all the plain text files under /home/anc2 into the table TESTDATA.
libname mycas cas;

proc cas;
   table.addcaslib /
      caslib = "mycas"
      datasource = {srctype="path"}
      session = False
      path = "/home"
      subdirectories = "yes";
   run;

   table.loadTable /
      casout = {name="testdata", replace=True}
      caslib = "mycas"
      importOptions = {fileType="DOCUMENT"}
      path = "anc2";
   run;
quit;
You can also load the data directly if you have already saved it as a .sashdat file.
proc cas;
   table.save /
      table = {name="testdata"}
      caslib = "mycas"
      name = "ANC2.sashdat";
   run;

   table.loadtable /
      casout = {name="testdata", replace=true}
      path = "ANC2.sashdat"
      caslib = "mycas";
   run;
quit;
Profile the data
We call the profileText action in the textManagement action set to profile the data.
proc cas;
   textManagement.profiletext /
      table = {name="testdata"}
      documentid = "fileName"
      text = "content"
      language = "english"
      casOut = {name="casOut", replace=True}
      documentOut = {name="docOut", replace=True}
      intermediateOut = {name="interOut", replace=True};
   run;

   table.fetch / table = {name="docOut"};
   run;

   table.fetch / table = {name="interOut"};
   run;
quit;
Filter the data
We keep the documents whose value pairs occur more than once.
proc sql;
   create table search1 as
   select *
   from mycas.docout
   group by _NUM_SENTENCES_, _MAX_TOKENS_SENTENCE_
   having count(*) > 1;
quit;
We generate all pairs of file combinations that share the same values.
proc sql;
   create table search2 as
   select a.fileName as fileA, b.fileName as fileB
   from (select * from search1) a
   cross join (select * from search1) b
   where a._NUM_SENTENCES_ = b._NUM_SENTENCES_
     and a._MAX_TOKENS_SENTENCE_ = b._MAX_TOKENS_SENTENCE_
     and a.fileName <> b.fileName;
quit;

proc print data=search2(obs=5);
run;
Looking at table search2, we notice that it would be better to keep only unique pairs to avoid comparing the same pair of file names twice.
proc sql;
   create table search3 as
   select distinct fileA, fileB
   from search2
   where search2.fileA < search2.fileB;
quit;

proc print data=search3(obs=5);
run;
Compare the data
We keep the assumption that two documents have a very high chance of being near-duplicates if they share the same number of tokens for sentences whose indices are randomly selected at a defined ratio of the total number of sentences. Here we use the rand('uniform') function, which by default generates observations from a continuous uniform distribution on the interval (0,1). Setting the condition to "between .2 and .7" selects about 50% of the sentences at random. The similarity threshold can be adjusted by changing the range; for example, "where rand('uniform') between .2 and .9" means that about 70% of the sentences in the documents are considered.
proc sql;
   create table search4 as
   select fileA as f1, fileB as f2
   from search3
   where not exists (
      select * from (
         select tmp1A, tmp2A from (
            select tmp1._NUM_TOKENS_ as tmp1A, tmp1._SENTENCE_ID_ as tmp1B,
                   tmp2._NUM_TOKENS_ as tmp2A, tmp2._SENTENCE_ID_ as tmp2B
            from (select * from sasout1.interout interout1 where interout1.fileName = f1) tmp1,
                 (select * from sasout1.interout interout2 where interout2.fileName = f2) tmp2
            where tmp1B = tmp2B)
         where rand('uniform') between .2 and .7)
      where tmp1A <> tmp2A);
quit;
Check the results
We use the table TESTDATA to verify the results. Out of 172 pairs, 133 are exact duplicates and 39 are near-duplicates.
proc sql;
   create table Duplicates as
   select f1, f2
   from search4
   where not exists (
      (select content from mycas.testdata tmp where tmp.fileName = f1)
      except
      (select content from mycas.testdata tmp where tmp.fileName = f2)
   );
quit;

proc sql;
   create table nearDuplicates as
   select f1, f2
   from search4
   where exists (
      (select content from mycas.testdata tmp where tmp.fileName = f1)
      except
      (select content from mycas.testdata tmp where tmp.fileName = f2)
   );
quit;
Conclusions
Examining the statistics produced by the profileText action provides a practical way to gain insight not only through comparison with a reference corpus, but also at the token, sentence, and document levels within the corpus itself. Because the sentences to compare are selected at random, this duplicate identification method may return different results from run to run. The smaller the ratio, the more duplicate pairs we get. You might be surprised that setting the ratio to 0.1 still yields only about 207 pairs, just slightly more than the 172 pairs obtained when the ratio is set to 0.5. The method does not appear to be overly permissive, because two files must already have the same number of sentences and the same maximum number of tokens per sentence before we pair them. This requirement gives us a safer place to start our search.
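To reproduce this observation in the Python workflow, you could wrap the earlier comparison loop in a function and sweep the ratio. The following is a rough sketch that reuses df_interout_selected and file_pair_lst from the Python SWAT section; the exact counts may differ from those quoted above, which come from the PROC SQL run.

# A rough sketch: count candidate pairs for several sampling ratios,
# reusing df_interout_selected and file_pair_lst from the earlier steps.
def count_possible_duplicates(ratio):
    hits = 0
    for a, b in file_pair_lst:
        tmp_a = df_interout_selected[df_interout_selected['fileName'] == a].loc[:, "_NUM_TOKENS_"]
        tmp_b = df_interout_selected[df_interout_selected['fileName'] == b].loc[:, "_NUM_TOKENS_"]
        tmp_a.reset_index(drop=True, inplace=True)
        tmp_b.reset_index(drop=True, inplace=True)
        n = round(ratio * len(tmp_a))
        tmp_a = tmp_a.sample(n, random_state=1)
        tmp_b = tmp_b.sample(n, random_state=1)
        if tmp_a.compare(tmp_b).empty:   # no differing token counts in the sample
            hits += 1
    return hits

for ratio in (0.1, 0.3, 0.5):
    print(f"ratio={ratio}: {count_possible_duplicates(ratio)} candidate pairs")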
Identifying textual near-duplicates is easy to understand, but it is not so easy to develop standards that cover all types of duplicates. In this article, we propose one way of defining near-duplicates, in which the differences are limited to a few sentences or words while the sentence order is preserved. It does not cover cases such as two documents whose sentences are not arranged in the same order, or sentences that are split or merged so that the sentence indexing no longer lines up. These cases are fun to think about, and they can turn into next-level discoveries.
How would you define the similarity of near-duplicates?