Using the Filter Field

The Crawler locates API data on a documentation page through CSS selectors. Sometimes, however, the raw data it acquires needs further processing. One way to transform acquired data is the Filter field, which rewrites values through text replacement.

A filter requires a regular expression pattern and a replacement text. When a field is configured to use a filter, all obtained values for that field will be checked for regular expression matches; matching substrings will be replaced with the replacement text. The Filter field can be configured to hold multiple filters.
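The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the Crawler's implementation: the function name `apply_filters`, the sample filters, and the assumption that multiple filters run one after another in the order listed are all hypothetical, and Python's `re` syntax may differ from the regular expression flavour the Crawler uses.

```python
import re


def apply_filters(value, filters):
    """Apply each (pattern, replacement) pair to value, in list order.

    Assumption: filters run sequentially, each on the previous result.
    """
    for pattern, replacement in filters:
        # Every substring matching the pattern is replaced.
        value = re.sub(pattern, replacement, value)
    return value


# Hypothetical filters: drop trailing slashes, then collapse repeated slashes.
filters = [
    (r"/+$", ""),
    (r"/{2,}", "/"),
]

print(apply_filters("/board//thread/", filters))
```

A value with no match for any pattern passes through unchanged, which is why a filter can safely be attached to a field whose values only sometimes need fixing.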

Example

In this example, we'll use the Filter field to modify the obtained path value so that it follows the OpenAPI specification. Assume the documentation page we need to crawl is structured as shown below.

<div class="main">
    <div class="operations">
        <div class="operation"> <!-- Operation wrapper -->
            <div class="header">
                <span class="method">GET</span> <!-- HTTP method -->
                <span class="url">/board</span> <!-- Path -->
            </div>
            <!-- ... -->
        </div>
        <div class="operation"> <!-- Operation wrapper -->
            <div class="header">
                <span class="method">DELETE</span> <!-- HTTP method -->
                <span class="url">/board/[boardName]</span> <!-- Path -->
            </div>
            <!-- ... -->
        </div>
        <div class="operation"> <!-- Operation wrapper -->
            <div class="header">
                <span class="method">POST</span> <!-- HTTP method -->
                <span class="url">/board/[boardName]/thread</span> <!-- Path -->
            </div>
            <!-- ... -->
        </div>
    </div>
</div>

First, let's specify the location of the operation path using the Path field:

Label | Selector
Path | div.header > span.url

In our example HTML document, path parameters are enclosed in square brackets [] instead of the curly braces {} that the OpenAPI specification uses. To fix this, we'll add a filter to the Path field. Click on the Path field to show its Filter field, and add the entry below:

Regular Expression | Replacement Text
\[(\w*)\] | {$1}

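This tells the Crawler to replace the square brackets around each path parameter with curly braces; $1 stands for the parameter name captured by the group (\w*). For our example HTML snippet, the Crawler is expected to perform the following transformations: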

Raw Value | Transformed Value
/board | /board
/board/[boardName] | /board/{boardName}
/board/[boardName]/thread | /board/{boardName}/thread
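To try the filter outside the Crawler, the same replacement can be reproduced with Python's `re` module. This is a rough sketch under stated assumptions: the helper name `to_openapi_path` is hypothetical, and Python writes the first captured group as `\1` where the Filter field uses `$1`.

```python
import re

# Pattern from the Filter field entry: a bracketed run of word characters.
PATTERN = r"\[(\w*)\]"
# Python's spelling of the replacement text {$1}.
REPLACEMENT = r"{\1}"


def to_openapi_path(raw_path):
    """Swap [param] for {param} in a raw path value (hypothetical helper)."""
    return re.sub(PATTERN, REPLACEMENT, raw_path)


raw_values = ["/board", "/board/[boardName]", "/board/[boardName]/thread"]
for raw in raw_values:
    print(raw, "->", to_openapi_path(raw))
```

Because the pattern uses `\w*`, it only matches parameter names made of letters, digits, and underscores; a bracketed name containing a hyphen, for instance, would be left unchanged.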