C1TextParser

Overview

This view shows basic features of TemplateBasedExtractor.

Features

  • Template Based Extractor

    The Template-Based extractor is the most generic extractor: it parses data structures described by a declarative XML template. Because the template can be supplied as a separate file, users provide both the template and the plain-text source to parse. The source can contain many instances of the defined structure, and any text that does not match the template specification is ignored.
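
    The workflow described above can be sketched as a small console program. This is a minimal sketch, not the sample's own code: it uses only the calls that appear in this sample's controller (the TemplateBasedExtractor constructor, Extract, and ToJsonString) and assumes those calls accept any Stream, since the sample passes FileStream instances. The two-field template, the source string, the UTF-8 encoding, and the class name are hypothetical and chosen only for illustration. The template attributes mirror those used in this sample's template file.

    ```csharp
    using System;
    using System.IO;
    using System.Text;
    using C1.TextParser;

    class TemplateExtractorSketch
    {
        static void Main()
        {
            // Hypothetical template: one structure with a quoted name and an integer age.
            const string template = @"<template rootElement=""person"">
      <element name=""person"" childrenSeparatorRegex="","">
        <element name=""name"" startingRegex=""name\s*="" extractFormat=""quotedString""/>
        <element name=""age"" startingRegex=""age\s*="" extractFormat=""int""/>
      </element>
    </template>";

            // Hypothetical source: one instance of the structure surrounded by unrelated text.
            const string source = @"HELLO... name = ""Owen Hans"", age = 24, XXXX";

            // Assumption: the extractor reads from any Stream, not only from files.
            using (var templateStream = new MemoryStream(Encoding.UTF8.GetBytes(template)))
            using (var sourceStream = new MemoryStream(Encoding.UTF8.GetBytes(source)))
            {
                var extractor = new TemplateBasedExtractor(templateStream);
                IExtractionResult result = extractor.Extract(sourceStream);

                // Print the extraction result as JSON, as the controller does for the view.
                Console.WriteLine(result.ToJsonString());
            }
        }
    }
    ```

    Running it requires a reference to the C1.TextParser assembly. Keeping the template and the source as separate streams reflects the point above that the two inputs are supplied independently.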

    Input file

    
    HELLO GRAPECITY... :D
    
    name = "Hans Nguyen",
    
    
      name = "Owen Hans", 
      age = 24, 
      residency address {country: "Portugal", zip code: 4700747, street: "Rua da Veiga", door number: 64},
      work address {country: "France", zip code: 61200, street: "Rue aux chats", door number: 38}, alex64palmeira@hotmail.com ,
      phone numbers: 916908384 023343812 933873123  ,  ,  Foxes are small-to-medium-sized, omnivorous mammals belonging to several genera of the family Canidae.
    
    
    Foxes have a flattened skull, upright triangular ears, a pointed, slightly upturned snout, and a long bushy tail (or brush).
    
    Twelve species belong to the monophyletic "true foxes" group of genus Vulpes. 
    
    Approximately another 25 current or extinct species are always or sometimes called foxes; 
    
    
    these foxes are either part of the paraphyletic group of the South American foxes, 
    or of the outlying group, which consists of bat-eared fox, gray fox, and island fox.[1] 
    Foxes live on every continent except Antarctica. 
    
    By far the most common and widespread species of fox is the red fox (Vulpes vulpes) 
    with about 47 recognized subspecies.[2] The global distribution of foxes     name="Alexandre Silva",residency address{country:
    
    "Spain",   zip code:56700,   street:"Calle del Rio",door number:64},work address{country:"Republica de Indochina\"/787$",zip code:87000
    ,street:"Calle de las madres",door number:13},phone numbers:253269784,age=32, 
    
    XXXXXXXXXXXXXXXXXfgfgfgXXXXXXXXXXXXXXXXXXXXX
    
    

    Template file

    <template rootElement="myPersonStruct">
      
      <element name ="myAddressStruct" startingRegex="{" endingRegex="}" childrenSeparatorRegex=",">
        <element name="country" startingRegex="country\s*:" extractFormat="quotedString"/>
        <element name="zipcode" startingRegex="zip code\s*:" extractFormat="regex:[0-9]+"/>
        <element name="street" startingRegex="street\s*:" extractFormat="quotedString"/>
        <element name="door number" startingRegex="door number\s*:" extractFormat="int"/>
      </element>
       
      <element name="myPersonStruct" childrenSeparatorRegex=",">
        <element name="name" startingRegex="name\s*=" extractFormat="quotedString"/>
        <element name="age" startingRegex="age\s*=" extractFormat="int"/>
        <element name="residency address" startingRegex="residency address">
          <element template="myAddressStruct"/>
        </element>
        <element name="work address" startingRegex="work address">
          <element template="myAddressStruct"/>
        </element>
        <element name="phone numbers" startingRegex="phone numbers\s*:">
          <element name="phone number" extractFormat="regex:[0-9]+" occurs="1-*"/>
        </element>
        <element name="email" extractFormat="email" occurs="0-1"/>
      </element>
      
    </template>

    Extracted result

    {
      "Extractor": "XMLTemplateBased",
      "Result": {
      "myPersonStruct": [
        {
          "name": "Owen Hans",
          "age": 24,
          "residency address": {
            "myAddressStruct": {
              "country": "Portugal",
              "zipcode": "4700747",
              "street": "Rua da Veiga",
              "door number": 64
            }
          },
          "work address": {
            "myAddressStruct": {
              "country": "France",
              "zipcode": "61200",
              "street": "Rue aux chats",
              "door number": 38
            }
          },
          "email": "alex64palmeira@hotmail.com",
          "phone numbers": {
            "phone number": [
              "916908384",
              "023343812",
              "933873123"
            ]
          }
        },
        {
          "name": "Alexandre Silva",
          "residency address": {
            "myAddressStruct": {
              "country": "Spain",
              "zipcode": "56700",
              "street": "Calle del Rio",
              "door number": 64
            }
          },
          "work address": {
            "myAddressStruct": {
              "country": "Republica de Indochina\\\"/787$",
              "zipcode": "87000",
              "street": "Calle de las madres",
              "door number": 13
            }
          },
          "phone numbers": {
            "phone number": "253269784"
          },
          "age": 32
        }
      ]
    }
    }
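
    In the result above, "phone number" is an array for the first person and a single string for the second, so code that consumes the JSON may need to accept both shapes. The following is a minimal sketch of one way to normalize them. It assumes the System.Text.Json package is available to the project; the PhoneNumbers helper and its Read method are hypothetical names, not part of C1TextParser.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.Json;

    static class PhoneNumbers
    {
        // Returns the "phone number" values as a list, whether the JSON object
        // holds a single string or an array of strings.
        public static List<string> Read(JsonElement phoneNumbers)
        {
            var numbers = new List<string>();
            if (!phoneNumbers.TryGetProperty("phone number", out JsonElement value))
                return numbers;

            if (value.ValueKind == JsonValueKind.Array)
            {
                foreach (JsonElement item in value.EnumerateArray())
                    numbers.Add(item.GetString());
            }
            else if (value.ValueKind == JsonValueKind.String)
            {
                numbers.Add(value.GetString());
            }
            return numbers;
        }
    }

    class Program
    {
        static void Main()
        {
            // The single-value shape, as in the second extracted person.
            string json = @"{ ""phone number"": ""253269784"" }";
            using (JsonDocument doc = JsonDocument.Parse(json))
            {
                List<string> numbers = PhoneNumbers.Read(doc.RootElement);
                Console.WriteLine(string.Join(", ", numbers));
            }
        }
    }
    ```

    The values are kept as strings because the template extracts them with a regex format, which preserves leading zeros such as in "023343812".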
    Controller code

    using System.Collections;
    using System.Globalization;
    using System.Linq;
    using System.Web.Mvc;
    using C1.Web.Mvc;
    using SamplesExplorer.Models;
    using System.Collections.Generic;
    using System;
    using C1.TextParser;
    using System.IO;
    using System.Text;
    
    namespace SamplesExplorer.Controllers
    {
        public partial class C1TextParserController : Controller
        {
            public ActionResult TemplateBasedExtractor(FormCollection collection)
            {
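                // Open both sample files read-only. File.Open with only FileMode.Open
                // requests read/write access with no sharing, which a view-only sample does not need.
                using (var fst = System.IO.File.OpenRead(Server.MapPath("~/Content/sampleFiles/template.xml")))
                using (var fss = System.IO.File.OpenRead(Server.MapPath("~/Content/sampleFiles/source.txt")))
                {
                    // Build the extractor from the XML template, then run it over the source text.
                    TemplateBasedExtractor templateBasedExtractor = new TemplateBasedExtractor(fst);
                    IExtractionResult extractionResult = templateBasedExtractor.Extract(fss);

                    // Expose the result to the view as a JSON string.
                    ViewBag.ExtractionResult = extractionResult.ToJsonString();
                }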
    
                return View();
            }
        }
    }
    
    View code

    @section Summary{
        <p>@Html.Raw(Resources.C1TextParser.TemplateExtractor_Text0)</p>
    }
    
        <div>
            <div>
                <h3>@Html.Raw(Resources.C1TextParser.TemplateExtractor_Title)</h3>
    
                <p>@Html.Raw(Resources.C1TextParser.TemplateExtractor_Text1)</p>
            </div>
            <div>
                <h3>Input file</h3>
                <pre class="scrollable-pre">@Html.Raw(ControlPages.GetSampleFileContent("source.txt"))</pre>
            </div>
            <div>
                <h3>Template file</h3>
                <pre class="scrollable-pre">@Html.Raw(ControlPages.GetSampleFileContent("template.xml"))</pre>
            </div>
            <div>
                <h3>Extracted result</h3>
                <pre class="scrollable-pre">@Html.Raw(ViewBag.ExtractionResult)</pre>
            </div>
        </div>