[Solr Wiki] Update of "DataImportHandler" by NoblePaul

Apache Wiki Thu, 24 Apr 2008 22:00:40 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.


The following page has been changed by NoblePaul:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Documentation on Interactive development mode

------------------------------------------------------------------------------
   * '''delta-import''' :  For incremental imports and change detection run the 
command `http://<host>:<port>/solr/dataimport?command=delta-import`
   * '''status''' : To check the status of the current command, hit the URL 
`http://<host>:<port>/solr/dataimport`. It gives detailed statistics on the 
number of docs created and deleted, queries run, rows fetched, status, etc.
   * '''reload-config''' : If the data-config is changed and you wish to 
reload the file without restarting Solr, run the command 
`http://<host>:<port>/solr/dataimport?command=reload-config` 
-  * '''abort''' : Abort an ongoing opertaion by hitting the url 
`http://<host>:<port>/solr/dataimport?command=abort`
+  * '''abort''' : Abort an ongoing operation by hitting the url 
`http://<host>:<port>/solr/dataimport?command=abort`
+  * '''status''' : See the current status by hitting the URL 
`http://<host>:<port>/solr/dataimport`
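
The command URLs listed above all follow one pattern, which the following minimal Python sketch makes explicit. The helper name `dataimport_url` and the host and port used in the example are assumptions for illustration only; the command names are the ones listed on this page.

```python
from urllib.parse import urlencode

# Commands named in the list above; "status" is the bare handler URL.
COMMANDS = {"delta-import", "reload-config", "abort", "status"}


def dataimport_url(host, port, command="status"):
    """Build the DataImportHandler URL for one command (hypothetical helper)."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    base = f"http://{host}:{port}/solr/dataimport"
    if command == "status":
        # Status takes no 'command' parameter: it is the plain handler URL.
        return base
    return base + "?" + urlencode({"command": command})


# Example with an assumed host and port.
url = dataimport_url("localhost", 8983, "delta-import")
```

Fetching the resulting URL (for example with a browser or any HTTP client) triggers the command.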
  
  == Full Import Example ==
  
@@ -363, +364 @@

  What about this ''transformer=!DateFormatTransformer'' attribute in the 
entity? See the [#DateFormatTransformer DateFormatTransformer] section for 
details.
  
  You can use this feature for indexing from REST APIs such as RSS/Atom feeds, 
XML data feeds, other Solr servers or even well-formed XHTML documents. Our 
XPath support has its limitations (no wildcards, only full paths, etc.) but we 
have tried to make sure that common use cases are covered. Since it is based 
on a streaming parser, it is extremely fast and consumes a constant amount of 
memory even for large XML files. It does not support namespaces, but it can 
handle XML with namespaces. When you provide the xpath, just drop the 
namespace prefix and give the rest (e.g. if the tag is `'<dc:subject>'` the 
mapping should just contain `'subject'`). Easy, isn't it? And you didn't need 
to write one line of code! Enjoy :)
+ 
+ note: Unlike with a database, it is not possible to omit the field 
declarations if you are using X!PathEntityProcessor. It relies on the xpaths 
declared in the fields to identify what to extract from the XML. 
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to have 
all user needs met by an XML configuration alone, so we expose a few interfaces 
which can be implemented by the user to enhance the functionality.
  
@@ -483, +486 @@

  eg:
  {{{
  <entity name="e" transformer="TemplateTransformer" ..>
- <field column="price" template="hello${e.name},${eparent.surname}" />
+ <field column="namedesc" template="hello${e.name},${eparent.surname}" />
  ...
  </entity>
  }}}
@@ -555, +558 @@

   * Each row that comes out of C is fed into 'f' and 'g' sequentially 
(transformers are chained). Each transformer can change the input. Note that 
the transformer 'g' produces 2 output rows for an input row `f(C.1)`
   * The end output of each entity is combined together to construct a document
     * Note that the intermediate rows from C i.e `C.1, C.2, f(C.1) , f(C1)` 
are ignored
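
The chaining described above can be sketched in Python as follows. This is only an illustration of the idea, not the tool's actual Transformer interface: the functions `f`, `g` and `apply_chain` and the field names are hypothetical, and a transformer is assumed to return either one row or a list of rows.

```python
def f(row):
    # Hypothetical transformer: returns one modified copy of the row.
    out = dict(row)
    out["name"] = out["name"].upper()
    return out


def g(row):
    # Hypothetical transformer that emits two rows for one input row.
    return [dict(row, part=1), dict(row, part=2)]


def apply_chain(row, transformers):
    """Feed a row through each transformer in order.

    Only the rows coming out of the last transformer are returned;
    intermediate rows are discarded.
    """
    rows = [row]
    for transform in transformers:
        next_rows = []
        for r in rows:
            result = transform(r)
            if isinstance(result, list):
                next_rows.extend(result)
            else:
                next_rows.append(result)
        rows = next_rows
    return rows


# One row from entity C, passed through f and then g.
final_rows = apply_chain({"name": "a"}, [f, g])
```

Here `final_rows` holds the two rows produced by 'g' from the output of 'f'; the original row and the intermediate output of 'f' are not kept.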
- 
+ == Field declarations ==
+ Fields declared in the <entity> tags help us provide extra information which 
cannot be derived automatically. The tool relies on the 'column' values to 
fetch values from the results. The fields you explicitly add in the 
configuration are equivalent to the fields which are present in the Solr 
schema.xml (implicit fields). Each one automatically inherits all the 
attributes present in the schema.xml; you just cannot add extra configuration. 
Add the field entries when:
+  * The fields emitted from the !EntityProcessor have a different name than the 
field in schema.xml
+  * Using built-in transformers. They expect extra information to decide which 
fields to process and how to process them
+  * Using X!PathEntityProcessor or any other processor which explicitly 
demands extra information in each field
  == What is a row? ==
- A row in !DataImportHandler is a Map (Map<String, Object). In the map , the 
key is the name of the field and the value can be anything which is a valid 
Solr type. The value can also be a Collection of the valid Solr types (this may 
get mapped to a multi-valued field). If the DataSource is RDBMS a query cannot 
emit a multivalued field. But it is possible to create a multivalued field by 
joining an entity with another.i.e if the sub-entity returns multiple rows for 
one row from parent entity it can go into a multivalued field. If the 
datadource is xml it is possible to return a multivalued field.
+ A row in !DataImportHandler is a Map (Map<String, Object>). In the map, the 
key is the name of the field and the value can be anything which is a valid 
Solr type. The value can also be a Collection of the valid Solr types (this may 
get mapped to a multi-valued field). If the DataSource is an RDBMS, a query 
cannot emit a multivalued field. But it is possible to create a multivalued 
field by joining an entity with another, i.e. if the sub-entity returns 
multiple rows for one row from the parent entity, they can go into a 
multivalued field. If the datasource is XML, it is possible to return a 
multivalued field.
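
As a rough illustration of how several sub-entity rows can end up in one multivalued field, here is a small Python sketch using dicts in place of Map<String, Object>. The function `merge_child_rows` and the field names are hypothetical, and the merge rule shown is an assumption for illustration, not necessarily the logic the tool uses.

```python
def merge_child_rows(parent_row, child_rows):
    """Combine one parent row with the rows its sub-entity returned.

    A field seen once stays a single value; a field seen more than once
    becomes a list (a candidate for a multivalued field).
    """
    doc = dict(parent_row)
    for child in child_rows:
        for key, value in child.items():
            if key not in doc:
                doc[key] = value
            elif isinstance(doc[key], list):
                doc[key].append(value)
            else:
                doc[key] = [doc[key], value]
    return doc


# One parent row joined with two sub-entity rows.
doc = merge_child_rows({"id": 1}, [{"tag": "a"}, {"tag": "b"}])
```

With two sub-entity rows, 'tag' becomes a list of two values; with a single sub-entity row it would remain a plain value.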
  
  == A VariableResolver ==
  A !VariableResolver is the component which replaces all those placeholders 
such as `${<name>}`. It is a multilevel Map. Each namespace is a Map and 
namespaces are separated by periods (.), e.g. if there is a placeholder 
${item.ID}, 'item' is a namespace (which is a map) and 'ID' is a value in 
that namespace. It is possible to nest namespaces like ${item.x.ID} where x 
could be another Map. A reference to the current !VariableResolver can be 
obtained from the Context. Or the object can be directly consumed by using 
${<name>} in 'query' for RDBMS queries or 'url' in Http.
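
The nested-namespace lookup described above can be sketched in Python with plain dicts. This is a minimal illustration, not the tool's Java implementation: the names `resolve` and `replace_tokens` are hypothetical, and replacing an unresolved placeholder with an empty string is an assumption made here for simplicity.

```python
import re

# Matches placeholders of the form ${a.b.c}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(name, namespaces):
    """Walk the nested dicts one period-separated part at a time."""
    value = namespaces
    for part in name.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def replace_tokens(template, namespaces):
    """Replace every ${...} placeholder in the template."""
    def substitute(match):
        value = resolve(match.group(1), namespaces)
        # Assumption: unresolved placeholders become an empty string.
        return "" if value is None else str(value)

    return PLACEHOLDER.sub(substitute, template)


# 'item' is a namespace; 'x' is a nested namespace inside it.
variables = {"item": {"ID": 42, "x": {"ID": 7}}}
query = replace_tokens("select * from t where id='${item.ID}'", variables)
```

The same lookup handles the nested form, so `${item.x.ID}` resolves through 'item' and then 'x'.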
@@ -567, +574 @@

   * ''escapeSql'' : Use this to escape special SQL characters, e.g. 
`'${dataimporter.functions.escapeSql(item.ID)}'`. Takes only one argument, 
which must be a valid value in the !VariableResolver.
   * ''encodeUrl'' : Use this to encode URLs, e.g. 
`'${dataimporter.functions.encodeUrl(item.ID)}'`. Takes only one argument, 
which must be a valid value in the !VariableResolver.
  
- 
+ = Interactive Development Mode =
+ This is a cool and powerful new feature in the tool. It helps you build a 
dataconfig xml with the UI. It can be accessed from 
http://host:port/solr/admin/dataimport.jsp . The features are:
+  * A UI with two panels. The RHS takes in the input and the LHS shows the 
output
+  * When you hit the button 'debug now' it runs the configuration and shows 
the documents created
+  * You can configure the start and rows parameters to debug documents, say 115 
to 118. 
+  * Choose the 'verbose' option to get detailed information about the 
intermediate steps: what was emitted by the query, what went into the 
Transformer and what the output was. 
+  * If an exception occurred during the run, the stacktrace is shown right 
there
+  * The fields produced by the Entities, Transformers may not be visible in 
documents if the fields are either not present in the schema.xml or there is an 
explicit <field> declaration
  
  = Where to find it? =
  DataImportHandler is not in SOLR right now. You can either:
