Capture content
Capture content nodes live inside "scrape page" nodes; they specify which content will be captured and which column it will be saved into. A capture content node tells the program that you wish to extract specific data elements from the page.
Target
See select target. It assigns which DOM element(s) on the page will be captured.
Extract Type
Use the extract type selector to specify which attribute of the selected DOM element(s) is to be extracted. The default is text, but there are several other options: the element's HTML, a DOM attribute, a page attribute, downloaded elements, a regular expression, or static data.
1. Text
Capture the text content of the target(s).
2. HTML
Capture the HTML code of the target(s).
3. DOM Attribute
Capture a DOM attribute of the target(s); enter the attribute name here (e.g. href, class, ...).
For example, to scrape the URL of a link, set it to "href"; to scrape the URL of an image, set it to "src".
4. Page attribute
Capture an attribute of the page:
- page title
- page metadata
- page URL
- parent URL
5. Download
Assign a folder here to hold the downloaded files.
- Link: Download the file from the target's link.
- Image: Download the image of the selected target.
- Wait download: For special situations; the program waits for a download request (e.g. a page has a button, and clicking the button triggers a download).
- Screenshot: Create a screenshot of the page and save it to the folder.
6. Regular Expression
Extract data from the target's HTML code with a regular expression. For example:
(\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)
will scrape an email address. The expression must contain parentheses; FMiner extracts the string matched inside them.
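For reference, the capture-group behavior can be sketched in plain JavaScript (an illustration only, not FMiner's internal code; the function name extractEmail is made up):

```javascript
// Illustrative sketch (not FMiner's API): the parenthesized group is
// what gets extracted; text outside the brackets is discarded.
const emailPattern = /\b([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i;

function extractEmail(html) {
  const match = html.match(emailPattern);
  // match[1] holds the substring matched by the bracketed group
  return match ? match[1] : null;
}
```

Without the parentheses there would be no group to return, which is why the expression must contain them.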
7. Static data
Save static data here. You can also use inputted data (formatted as [$table.column$] or [%variable%]); see input data. There are 3 special values:
[!current_time!] - The time stamp when the data is saved.
[!start_time!] - The time stamp when the project started.
[!project_name!] - Project name.
Save to Database Column
If you assigned a data table in the ancestor "scrape page" node, you can select a column field to store this value.
Adjust Data with JavaScript (for Pro/Mac version)
When checked, you can modify the captured data with JavaScript.
The scraped data is available in the variable "data", and the value returned by the last line of the JavaScript code becomes the result. For example, if you captured "mailto:support@fminer.com" and used JavaScript code like this:
data.substring(7)
You will get "support@fminer.com". Another example: we scraped "price:100$" and only need "100", so we can write JavaScript code like this:
i = data.indexOf(':')
data.slice(i + 1, -1)
Some useful JavaScript functions:
data.indexOf // find the position of a substring
data.lastIndexOf // find the position of the last occurrence of a substring
data.split // split the string; for example, data.split('\n')[0] returns the first line
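As a quick standalone illustration of these helpers (plain JavaScript, with a made-up sample string):

```javascript
// Made-up sample data to exercise the helpers listed above.
const sample = "price:100$\nsecond line";

const firstColon = sample.indexOf(':');   // index of the first ':'
const lastN = sample.lastIndexOf('n');    // index of the last 'n'
const firstLine = sample.split('\n')[0];  // everything before the first newline
```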
And you can also use regular expressions to adjust the data.
For example, we scraped organization information in one block like this:
XXXX Center, LLC
xxx First Street North
Alabaster, AL 1xxx
205-6xx-8xxx
And we need to split it into Name, Address, City, State, Zipcode, and PhoneNumber, so we can write scripts like this:
Name:
data.split('\n')[0]
Address:
data.split('\n')[1]
City:
line = data.split('\n')[2]
line.split(',')[0]
State:
line = data.split('\n')[2]
sz = line.split(',')[1].trim()
i = sz.indexOf(' ')
sz.slice(0, i)
Zipcode:
line = data.split('\n')[2]
sz = line.split(',')[1].trim()
i = sz.indexOf(' ')
sz.slice(i + 1)
Phone:
data.split('\n')[3]
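Putting the pieces together, the per-field scripts above amount to the following (a plain-JavaScript sketch; the function name parseOrganization is illustrative, and the input is assumed to be the four-line block shown in the example):

```javascript
// Illustrative sketch combining the per-field scripts above.
function parseOrganization(data) {
  const lines = data.split('\n');
  const cityStateZip = lines[2];                 // e.g. "Alabaster, AL 1xxx"
  const sz = cityStateZip.split(',')[1].trim();  // e.g. "AL 1xxx"
  const i = sz.indexOf(' ');
  return {
    name: lines[0],
    address: lines[1],
    city: cityStateZip.split(',')[0],
    state: sz.slice(0, i),
    zipcode: sz.slice(i + 1),
    phone: lines[3],
  };
}
```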