Stephane - Here is another option that uses GetHTMLElement without any
ExecuteScript processor. It uses a CSS selector to pull the elements and
then the NiFi Expression Language to split and add the values. It isn't
much different from what you had. You were very close.
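
In case it helps to see what the expression in the ReplaceText processor of
the template is doing, here is a rough JavaScript sketch of the same idea.
The helper names (getDelimitedField, sumRow) are made up for illustration,
the helper only approximates the Expression Language function of the same
name (no quote or escape handling), and it assumes the row text arrives as
two space-separated numbers such as "11 12":

```javascript
// Hypothetical stand-in for the EL getDelimitedField function:
// returns the 1-based field of a delimited string, or "" if out of range.
function getDelimitedField(text, index, delimiter) {
  var fields = text.split(delimiter);
  return index >= 1 && index <= fields.length ? fields[index - 1] : "";
}

// Add the first two space-delimited fields of one row's text, roughly what
// field1:trim():plus(field2:trim()) does in the template's expression.
function sumRow(rowText) {
  var first = parseInt(getDelimitedField(rowText, 1, " ").trim(), 10);
  var second = parseInt(getDelimitedField(rowText, 2, " ").trim(), 10);
  return first + second;
}

console.log(sumRow("11 12")); // 23
console.log(sumRow("21 22")); // 43
```

Note that this only handles two columns per row, same as the expression in
the template. For a variable number of columns the ExecuteScript approach
quoted below is probably the easier route.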

On Wed, Aug 31, 2016 at 10:06 AM, Yolanda Davis <[email protected]>
wrote:

> Hi Stephane,
>
> Here's something I hope can help. In GetHTMLElement, instead of using the
> selector "table td", try "table tr" with an output type of "Text" and a
> destination of flowfile-content. This should create a flow file for each
> row and extract the numeric text from the td elements in that row. From
> there you can use the ExecuteScript processor to trim the whitespace,
> convert the text values into numbers and sum them. I was able to get this
> to work with the JavaScript (ECMAScript) below, using the example HTML you
> provided:
>
> var flowFile = session.get();
> if (flowFile != null) {
>
>   var StreamCallback = Java.type("org.apache.nifi.processor.io.StreamCallback")
>   var IOUtils = Java.type("org.apache.commons.io.IOUtils")
>   var StandardCharsets = Java.type("java.nio.charset.StandardCharsets")
>
>   flowFile = session.write(flowFile,
>     new StreamCallback(function(inputStream, outputStream) {
>         // Read the row text and split it on spaces
>         var text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>         var res = text.split(" ");
>         var count = 0;
>         for (var i in res) {
>           var value = parseInt(res[i], 10);
>           // NaN never equals itself, so test with isNaN()
>           if (!isNaN(value)) {
>             count += value;
>           }
>         }
>         outputStream.write(count.toString().getBytes(StandardCharsets.UTF_8))
>     }))
>   flowFile = session.putAttribute(flowFile, "filename",
>     flowFile.getId() + '_count.txt');
>   session.transfer(flowFile, REL_SUCCESS)
> }
>
> I've attached the template I used to do this which hopefully can help as
> well.  Please let me know if you have any questions.
>
> Yolanda
>
>
> On Wed, Aug 31, 2016 at 3:52 AM, <[email protected]>
> wrote:
>
>> Hi All,
>>
>>
>>
>> I'm trying to extract values from an HTML table and perform a
>> calculation on them with NiFi.
>>
>> The purpose of the test is to add up the TD values in each TR and
>> output the result to a file.
>>
>> For this sample the results should be 23 and 43.
>>
>>
>>
>> My table looks like
>>
>>
>>
>> <table>
>>      <tr>
>>           <td>11</td>
>>           <td>12</td>
>>      </tr>
>>      <tr>
>>           <td>21</td>
>>           <td>22</td>
>>      </tr>
>> </table>
>>
>> My NIFI workflow is
>>
>>
>>
>> InvokeHTTP > Response > GetHTMLElement > Success > PutFile
>>
>>
>>
>> The CSS Selector for GetHTMLElement is table td.
>>
>> I know that GetHTMLElement produces 0-N elements, but I don't know how
>> I can perform a calculation on them.
>>
>>
>>
>> Any help would be appreciated.
>>
>>
>>
>> Thanks
>>
>> Regards
>>
>> Stephane
>>
>>
>>
>> *Stephane Tinseau*
>>
>> *Thomson Reuters*
>> [email protected]
>> thomsonreuters.com
>>
>>
>>
>>
>
>
>
> --
> [email protected]
> @YolandaMDavis
>
>
<?xml version="1.0" ?>
<template encoding-version="1.0">
  <description></description>
  <groupId>e0c40fa6-0156-1000-810b-3101113a21c6</groupId>
  <name>HTML_TableValuesAddition_Example_V1</name>
  <snippet>
    <connections>
      <id>e0ebb49e-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
      <backPressureObjectThreshold>10000</backPressureObjectThreshold>
      <destination>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e0eba495-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </destination>
      <flowFileExpiration>0 sec</flowFileExpiration>
      <labelIndex>1</labelIndex>
      <name></name>
      <selectedRelationships>success</selectedRelationships>
      <source>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e0e83060-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </source>
      <zIndex>0</zIndex>
    </connections>
    <connections>
      <id>e1207626-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
      <backPressureObjectThreshold>10000</backPressureObjectThreshold>
      <destination>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e1229b4c-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </destination>
      <flowFileExpiration>0 sec</flowFileExpiration>
      <labelIndex>1</labelIndex>
      <name></name>
      <selectedRelationships>success</selectedRelationships>
      <source>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e0eba495-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </source>
      <zIndex>0</zIndex>
    </connections>
    <connections>
      <id>e123f2a7-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
      <backPressureObjectThreshold>10000</backPressureObjectThreshold>
      <destination>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e12634c3-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </destination>
      <flowFileExpiration>0 sec</flowFileExpiration>
      <labelIndex>1</labelIndex>
      <name></name>
      <selectedRelationships>matched</selectedRelationships>
      <source>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e1229b4c-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </source>
      <zIndex>0</zIndex>
    </connections>
    <connections>
      <id>e128cf5c-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
      <backPressureObjectThreshold>10000</backPressureObjectThreshold>
      <destination>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e128b88e-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </destination>
      <flowFileExpiration>0 sec</flowFileExpiration>
      <labelIndex>1</labelIndex>
      <name></name>
      <selectedRelationships>success</selectedRelationships>
      <source>
        <groupId>e0c40fa6-0156-1000-0000-000000000000</groupId>
        <id>e12634c3-0156-1000-0000-000000000000</id>
        <type>PROCESSOR</type>
      </source>
      <zIndex>0</zIndex>
    </connections>
    <processors>
      <id>e0e83060-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <position>
        <x>7.686058065287341</x>
        <y>7.4081621170043945</y>
      </position>
      <config>
        <bulletinLevel>WARN</bulletinLevel>
        <comments></comments>
        <concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
        <descriptors>
          <entry>
            <key>Input Directory</key>
            <value>
              <name>Input Directory</name>
            </value>
          </entry>
          <entry>
            <key>File Filter</key>
            <value>
              <name>File Filter</name>
            </value>
          </entry>
          <entry>
            <key>Path Filter</key>
            <value>
              <name>Path Filter</name>
            </value>
          </entry>
          <entry>
            <key>Batch Size</key>
            <value>
              <name>Batch Size</name>
            </value>
          </entry>
          <entry>
            <key>Keep Source File</key>
            <value>
              <name>Keep Source File</name>
            </value>
          </entry>
          <entry>
            <key>Recurse Subdirectories</key>
            <value>
              <name>Recurse Subdirectories</name>
            </value>
          </entry>
          <entry>
            <key>Polling Interval</key>
            <value>
              <name>Polling Interval</name>
            </value>
          </entry>
          <entry>
            <key>Ignore Hidden Files</key>
            <value>
              <name>Ignore Hidden Files</name>
            </value>
          </entry>
          <entry>
            <key>Minimum File Age</key>
            <value>
              <name>Minimum File Age</name>
            </value>
          </entry>
          <entry>
            <key>Maximum File Age</key>
            <value>
              <name>Maximum File Age</name>
            </value>
          </entry>
          <entry>
            <key>Minimum File Size</key>
            <value>
              <name>Minimum File Size</name>
            </value>
          </entry>
          <entry>
            <key>Maximum File Size</key>
            <value>
              <name>Maximum File Size</name>
            </value>
          </entry>
        </descriptors>
        <lossTolerant>false</lossTolerant>
        <penaltyDuration>30 sec</penaltyDuration>
        <properties>
          <entry>
            <key>Input Directory</key>
            <value>/opt/nifi</value>
          </entry>
          <entry>
            <key>File Filter</key>
            <value>TestHTML.html</value>
          </entry>
          <entry>
            <key>Path Filter</key>
          </entry>
          <entry>
            <key>Batch Size</key>
            <value>10</value>
          </entry>
          <entry>
            <key>Keep Source File</key>
            <value>true</value>
          </entry>
          <entry>
            <key>Recurse Subdirectories</key>
            <value>true</value>
          </entry>
          <entry>
            <key>Polling Interval</key>
            <value>0 sec</value>
          </entry>
          <entry>
            <key>Ignore Hidden Files</key>
            <value>true</value>
          </entry>
          <entry>
            <key>Minimum File Age</key>
            <value>0 sec</value>
          </entry>
          <entry>
            <key>Maximum File Age</key>
          </entry>
          <entry>
            <key>Minimum File Size</key>
            <value>0 B</value>
          </entry>
          <entry>
            <key>Maximum File Size</key>
          </entry>
        </properties>
        <runDurationMillis>0</runDurationMillis>
        <schedulingPeriod>5 sec</schedulingPeriod>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <yieldDuration>1 sec</yieldDuration>
      </config>
      <name>Mock Response From InvokeHTTP</name>
      <relationships>
        <autoTerminate>false</autoTerminate>
        <name>success</name>
      </relationships>
      <style></style>
      <type>org.apache.nifi.processors.standard.GetFile</type>
    </processors>
    <processors>
      <id>e0eba495-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <position>
        <x>605.0244980066936</x>
        <y>0.0</y>
      </position>
      <config>
        <bulletinLevel>WARN</bulletinLevel>
        <comments></comments>
        <concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
        <descriptors>
          <entry>
            <key>URL</key>
            <value>
              <name>URL</name>
            </value>
          </entry>
          <entry>
            <key>CSS Selector</key>
            <value>
              <name>CSS Selector</name>
            </value>
          </entry>
          <entry>
            <key>HTML Character Encoding</key>
            <value>
              <name>HTML Character Encoding</name>
            </value>
          </entry>
          <entry>
            <key>Output Type</key>
            <value>
              <name>Output Type</name>
            </value>
          </entry>
          <entry>
            <key>Destination</key>
            <value>
              <name>Destination</name>
            </value>
          </entry>
          <entry>
            <key>Prepend Element Value</key>
            <value>
              <name>Prepend Element Value</name>
            </value>
          </entry>
          <entry>
            <key>Append Element Value</key>
            <value>
              <name>Append Element Value</name>
            </value>
          </entry>
          <entry>
            <key>Attribute Name</key>
            <value>
              <name>Attribute Name</name>
            </value>
          </entry>
        </descriptors>
        <lossTolerant>false</lossTolerant>
        <penaltyDuration>30 sec</penaltyDuration>
        <properties>
          <entry>
            <key>URL</key>
            <value>http://www.example.com</value>
          </entry>
          <entry>
            <key>CSS Selector</key>
            <value>tr</value>
          </entry>
          <entry>
            <key>HTML Character Encoding</key>
            <value>UTF-8</value>
          </entry>
          <entry>
            <key>Output Type</key>
            <value>Text</value>
          </entry>
          <entry>
            <key>Destination</key>
            <value>flowfile-content</value>
          </entry>
          <entry>
            <key>Prepend Element Value</key>
          </entry>
          <entry>
            <key>Append Element Value</key>
          </entry>
          <entry>
            <key>Attribute Name</key>
          </entry>
        </properties>
        <runDurationMillis>0</runDurationMillis>
        <schedulingPeriod>0 sec</schedulingPeriod>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <yieldDuration>1 sec</yieldDuration>
      </config>
      <name>Parse HTML TD Element Values To Text</name>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>element not found</name>
      </relationships>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>invalid html</name>
      </relationships>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>original</name>
      </relationships>
      <relationships>
        <autoTerminate>false</autoTerminate>
        <name>success</name>
      </relationships>
      <style></style>
      <type>org.apache.nifi.GetHTMLElement</type>
    </processors>
    <processors>
      <id>e1229b4c-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <position>
        <x>609.2385190946666</x>
        <y>210.00751100067964</y>
      </position>
      <config>
        <bulletinLevel>WARN</bulletinLevel>
        <comments></comments>
        <concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
        <descriptors>
          <entry>
            <key>Character Set</key>
            <value>
              <name>Character Set</name>
            </value>
          </entry>
          <entry>
            <key>Maximum Buffer Size</key>
            <value>
              <name>Maximum Buffer Size</name>
            </value>
          </entry>
          <entry>
            <key>Maximum Capture Group Length</key>
            <value>
              <name>Maximum Capture Group Length</name>
            </value>
          </entry>
          <entry>
            <key>Enable Canonical Equivalence</key>
            <value>
              <name>Enable Canonical Equivalence</name>
            </value>
          </entry>
          <entry>
            <key>Enable Case-insensitive Matching</key>
            <value>
              <name>Enable Case-insensitive Matching</name>
            </value>
          </entry>
          <entry>
            <key>Permit Whitespace and Comments in Pattern</key>
            <value>
              <name>Permit Whitespace and Comments in Pattern</name>
            </value>
          </entry>
          <entry>
            <key>Enable DOTALL Mode</key>
            <value>
              <name>Enable DOTALL Mode</name>
            </value>
          </entry>
          <entry>
            <key>Enable Literal Parsing of the Pattern</key>
            <value>
              <name>Enable Literal Parsing of the Pattern</name>
            </value>
          </entry>
          <entry>
            <key>Enable Multiline Mode</key>
            <value>
              <name>Enable Multiline Mode</name>
            </value>
          </entry>
          <entry>
            <key>Enable Unicode-aware Case Folding</key>
            <value>
              <name>Enable Unicode-aware Case Folding</name>
            </value>
          </entry>
          <entry>
            <key>Enable Unicode Predefined Character Classes</key>
            <value>
              <name>Enable Unicode Predefined Character Classes</name>
            </value>
          </entry>
          <entry>
            <key>Enable Unix Lines Mode</key>
            <value>
              <name>Enable Unix Lines Mode</name>
            </value>
          </entry>
          <entry>
            <key>Include Capture Group 0</key>
            <value>
              <name>Include Capture Group 0</name>
            </value>
          </entry>
          <entry>
            <key>values</key>
            <value>
              <name>values</name>
            </value>
          </entry>
        </descriptors>
        <lossTolerant>false</lossTolerant>
        <penaltyDuration>30 sec</penaltyDuration>
        <properties>
          <entry>
            <key>Character Set</key>
            <value>UTF-8</value>
          </entry>
          <entry>
            <key>Maximum Buffer Size</key>
            <value>1 MB</value>
          </entry>
          <entry>
            <key>Maximum Capture Group Length</key>
            <value>1024</value>
          </entry>
          <entry>
            <key>Enable Canonical Equivalence</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Case-insensitive Matching</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Permit Whitespace and Comments in Pattern</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable DOTALL Mode</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Literal Parsing of the Pattern</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Multiline Mode</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Unicode-aware Case Folding</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Unicode Predefined Character Classes</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Enable Unix Lines Mode</key>
            <value>false</value>
          </entry>
          <entry>
            <key>Include Capture Group 0</key>
            <value>true</value>
          </entry>
          <entry>
            <key>values</key>
            <value>(.)*</value>
          </entry>
        </properties>
        <runDurationMillis>0</runDurationMillis>
        <schedulingPeriod>0 sec</schedulingPeriod>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <yieldDuration>1 sec</yieldDuration>
      </config>
      <name>Pull TD Values from Content</name>
      <relationships>
        <autoTerminate>false</autoTerminate>
        <name>matched</name>
      </relationships>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>unmatched</name>
      </relationships>
      <style></style>
      <type>org.apache.nifi.processors.standard.ExtractText</type>
    </processors>
    <processors>
      <id>e12634c3-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <position>
        <x>0.0</x>
        <y>219.1472866404398</y>
      </position>
      <config>
        <bulletinLevel>WARN</bulletinLevel>
        <comments></comments>
        <concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
        <descriptors>
          <entry>
            <key>Regular Expression</key>
            <value>
              <name>Regular Expression</name>
            </value>
          </entry>
          <entry>
            <key>Replacement Value</key>
            <value>
              <name>Replacement Value</name>
            </value>
          </entry>
          <entry>
            <key>Character Set</key>
            <value>
              <name>Character Set</name>
            </value>
          </entry>
          <entry>
            <key>Maximum Buffer Size</key>
            <value>
              <name>Maximum Buffer Size</name>
            </value>
          </entry>
          <entry>
            <key>Replacement Strategy</key>
            <value>
              <name>Replacement Strategy</name>
            </value>
          </entry>
          <entry>
            <key>Evaluation Mode</key>
            <value>
              <name>Evaluation Mode</name>
            </value>
          </entry>
        </descriptors>
        <lossTolerant>false</lossTolerant>
        <penaltyDuration>30 sec</penaltyDuration>
        <properties>
          <entry>
            <key>Regular Expression</key>
            <value>(?s:^.*$)</value>
          </entry>
          <entry>
            <key>Replacement Value</key>
            <value>${values.0:getDelimitedField(1, " "):trim():plus(${values.0:getDelimitedField(2, " "):trim()})}</value>
          </entry>
          <entry>
            <key>Character Set</key>
            <value>UTF-8</value>
          </entry>
          <entry>
            <key>Maximum Buffer Size</key>
            <value>1 MB</value>
          </entry>
          <entry>
            <key>Replacement Strategy</key>
            <value>Always Replace</value>
          </entry>
          <entry>
            <key>Evaluation Mode</key>
            <value>Entire text</value>
          </entry>
        </properties>
        <runDurationMillis>0</runDurationMillis>
        <schedulingPeriod>0 sec</schedulingPeriod>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <yieldDuration>1 sec</yieldDuration>
      </config>
      <name>NiFi EL to "Plus" values</name>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>failure</name>
      </relationships>
      <relationships>
        <autoTerminate>false</autoTerminate>
        <name>success</name>
      </relationships>
      <style></style>
      <type>org.apache.nifi.processors.standard.ReplaceText</type>
    </processors>
    <processors>
      <id>e128b88e-0156-1000-0000-000000000000</id>
      <parentGroupId>e0c40fa6-0156-1000-0000-000000000000</parentGroupId>
      <position>
        <x>1.3621738041279627</x>
        <y>440.812084966166</y>
      </position>
      <config>
        <bulletinLevel>INFO</bulletinLevel>
        <comments></comments>
        <concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
        <descriptors>
          <entry>
            <key>Log Level</key>
            <value>
              <name>Log Level</name>
            </value>
          </entry>
          <entry>
            <key>Log Payload</key>
            <value>
              <name>Log Payload</name>
            </value>
          </entry>
          <entry>
            <key>Attributes to Log</key>
            <value>
              <name>Attributes to Log</name>
            </value>
          </entry>
          <entry>
            <key>Attributes to Ignore</key>
            <value>
              <name>Attributes to Ignore</name>
            </value>
          </entry>
          <entry>
            <key>Log prefix</key>
            <value>
              <name>Log prefix</name>
            </value>
          </entry>
        </descriptors>
        <lossTolerant>false</lossTolerant>
        <penaltyDuration>30 sec</penaltyDuration>
        <properties>
          <entry>
            <key>Log Level</key>
            <value>info</value>
          </entry>
          <entry>
            <key>Log Payload</key>
            <value>true</value>
          </entry>
          <entry>
            <key>Attributes to Log</key>
          </entry>
          <entry>
            <key>Attributes to Ignore</key>
          </entry>
          <entry>
            <key>Log prefix</key>
          </entry>
        </properties>
        <runDurationMillis>0</runDurationMillis>
        <schedulingPeriod>0 sec</schedulingPeriod>
        <schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
        <yieldDuration>1 sec</yieldDuration>
      </config>
      <name>Log Output Values</name>
      <relationships>
        <autoTerminate>true</autoTerminate>
        <name>success</name>
      </relationships>
      <style></style>
      <type>org.apache.nifi.processors.standard.LogAttribute</type>
    </processors>
  </snippet>
  <timestamp>08/31/2016 15:27:41 UTC</timestamp>
</template>
