Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-10-01 Thread Andreas Owen
i'm already using URLDataSource

On 30. Sep 2013, at 5:41 PM, P Williams wrote:

> Hi Andreas,
> 
> When using XPathEntityProcessor, your DataSource must be of type
> DataSource<Reader>.  You shouldn't be using BinURLDataSource; it's giving
> you the cast exception.  Use URLDataSource or FileDataSource instead.
> 
> I don't think you need to specify namespaces; at least you didn't use to.
> The other thing that I've noticed is that the anywhere xpath expression //
> doesn't always work in DIH.  You might have to be more specific.
> 
> Cheers,
> Tricia
> 
> 
> 
> 
> 
> On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen  wrote:
> 
>> how dumb can you get. obviously quite dumb... i would have to analyze the
>> html pages with a nested instance like this:
>> 
>> > url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
>> forEach="/docs/doc" dataSource="main">
>> 
>>> url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
>>
>>
>>
>>
>>
>> 
>> 
>> but i'm pretty sure the forEach and the xpath expressions are wrong. at the
>> moment i'm getting the following error:
>> 
>>Caused by: java.lang.RuntimeException:
>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.ClassCastException:
>> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast
>> to java.io.Reader
>> 
>> 
>> 
>> 
>> 
>> On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:
>> 
>>> ok i see what you're getting at, but why doesn't the following work:
>>> 
>>>  
>>>  
>>> 
>>> i removed the Tika processor. what am i missing? i haven't found
>>> anything in the wiki.
>>> 
>>> 
>>> On 28. Sep 2013, at 12:28 AM, P Williams wrote:
>>> 
>>>> I spent some more time thinking about this.  Do you really need to use the
>>>> TikaEntityProcessor?  It doesn't offer anything new to the document you are
>>>> building that couldn't be accomplished by the XPathEntityProcessor alone
>>>> from what I can tell.
>>>>
>>>> I also tried to get the Advanced Parsing example to
>>>> work without success.  There are some obvious typos (
>>>> instead of ) and an odd order to the pieces ( is
>>>> enclosed by ).  It also looks like FieldStreamDataSource
>>>> <http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html>
>>>> is the one that is meant to work in this context. If Koji is still around
>>>> maybe he could offer some help?  Otherwise this bit of erroneous
>>>> instruction should probably be removed from the wiki.
>>>>
>>>> Cheers,
>>>> Tricia
>>>>
 $ svn diff
 Index:
 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 ===
 ---
 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
   (revision 1526990)
 +++
 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
   (working copy)
 @@ -99,13 +99,13 @@
   runFullImport(getConfigHTML("identity"));
   assertQ(req("*:*"), testsHTMLIdentity);
 }
 -
 +
 private String getConfigHTML(String htmlMapper) {
   return
   "" +
   "  " +
   "  " +
 -">>> processor='TikaEntityProcessor' " +
 +">>> processor='TikaEntityProcessor' " +
   "   url='" +
 getFile("dihextras/structured.html").getAbsolutePath() + "' " +
   ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper +
 "'")) + ">" +
   "  " +
 @@ -114,4 +114,36 @@
   "";
 
 }
 +  private String[] testsHTMLH1 = {
 +  "//*[@numFound='1']"
 +  , "//str[@name='h1'][contains(.,'H1 Header')]"
 +  };
 +
 +  @Test
 +  public void testTikaHTMLMapperSubEntity() throws Exception {
 +runFullImport(getConfigSubEntity("identity"));
 +assertQ(req("*:*"), testsHTMLH1);
 +  }
 +
 +  private String getConfigSubEntity(String htmlMapper) {
 +return
 +"" +
 +"" +
 +"" +
 +"" +
 +">>> dataSource='bin' format='html' rootEntity='false'>" +
 +"" +
 +"" +
 +"" +
 +"" +
 +"" +
 +"> forEa

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-30 Thread P Williams
Hi Andreas,

When using XPathEntityProcessor, your DataSource must be of type
DataSource<Reader>.  You shouldn't be using BinURLDataSource; it's giving
you the cast exception.  Use URLDataSource or FileDataSource instead.

I don't think you need to specify namespaces; at least you didn't use to.
The other thing that I've noticed is that the anywhere xpath expression //
doesn't always work in DIH.  You might have to be more specific.
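
To illustrate the cast exception described above, here is a minimal plain-JDK sketch. It does not use any Solr classes; the class and method names are made up for illustration. It assumes, as the stack trace suggests, that a binary data source hands back an InputStream while XPathEntityProcessor expects a Reader, and it shows that the two types cannot be cast to each other and that only an explicit wrapper bridges them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr or DIH.
class StreamVersusReaderSketch {

    // Stand-in for what a binary data source returns: raw bytes as an InputStream.
    static Object binaryPayload(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Mimics a consumer that blindly casts its input to Reader.
    // Returns true when the cast throws ClassCastException.
    static boolean castToReaderFails(Object payload) {
        try {
            Reader ignored = (Reader) payload;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // The conversion a character-oriented source performs: decode bytes into characters.
    static String readAll(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object payload = binaryPayload("<docs><doc/></docs>");
        System.out.println("cast fails: " + castToReaderFails(payload));
        System.out.println(readAll((InputStream) binaryPayload("<docs><doc/></docs>")));
    }
}
```

In DIH terms, switching the dataSource type to a Reader-based one is what does this conversion for you; the entity config itself does not need to change for that.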

Cheers,
Tricia





On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen  wrote:

> how dumb can you get. obviously quite dumb... i would have to analyze the
> html pages with a nested instance like this:
>
>  url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
> forEach="/docs/doc" dataSource="main">
>
>  url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
> 
> 
> 
> 
> 
> 
>
> but i'm pretty sure the forEach and the xpath expressions are wrong. at the
> moment i'm getting the following error:
>
> Caused by: java.lang.RuntimeException:
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.ClassCastException:
> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast
> to java.io.Reader
>
>
>
>
>
> On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:
>
> > ok i see what you're getting at, but why doesn't the following work:
> >
> >   
> >   
> >
> > i removed the Tika processor. what am i missing? i haven't found
> > anything in the wiki.
> >
> >
> > On 28. Sep 2013, at 12:28 AM, P Williams wrote:
> >
> >> I spent some more time thinking about this.  Do you really need to use the
> >> TikaEntityProcessor?  It doesn't offer anything new to the document you are
> >> building that couldn't be accomplished by the XPathEntityProcessor alone
> >> from what I can tell.
> >>
> >> I also tried to get the Advanced Parsing example to
> >> work without success.  There are some obvious typos (
> >> instead of ) and an odd order to the pieces ( is
> >> enclosed by ).  It also looks like FieldStreamDataSource
> >> <http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html>
> >> is the one that is meant to work in this context. If Koji is still around
> >> maybe he could offer some help?  Otherwise this bit of erroneous
> >> instruction should probably be removed from the wiki.
> >>
> >> Cheers,
> >> Tricia
> >>
> >> $ svn diff
> >> Index:
> >>
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> >> ===
> >> ---
> >>
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> >>(revision 1526990)
> >> +++
> >>
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> >>(working copy)
> >> @@ -99,13 +99,13 @@
> >>runFullImport(getConfigHTML("identity"));
> >>assertQ(req("*:*"), testsHTMLIdentity);
> >>  }
> >> -
> >> +
> >>  private String getConfigHTML(String htmlMapper) {
> >>return
> >>"" +
> >>"  " +
> >>"  " +
> >> -" >> processor='TikaEntityProcessor' " +
> >> +" >> processor='TikaEntityProcessor' " +
> >>"   url='" +
> >> getFile("dihextras/structured.html").getAbsolutePath() + "' " +
> >>((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper +
> >> "'")) + ">" +
> >>"  " +
> >> @@ -114,4 +114,36 @@
> >>"";
> >>
> >>  }
> >> +  private String[] testsHTMLH1 = {
> >> +  "//*[@numFound='1']"
> >> +  , "//str[@name='h1'][contains(.,'H1 Header')]"
> >> +  };
> >> +
> >> +  @Test
> >> +  public void testTikaHTMLMapperSubEntity() throws Exception {
> >> +runFullImport(getConfigSubEntity("identity"));
> >> +assertQ(req("*:*"), testsHTMLH1);
> >> +  }
> >> +
> >> +  private String getConfigSubEntity(String htmlMapper) {
> >> +return
> >> +"" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +" >> dataSource='bin' format='html' rootEntity='false'>" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +" forEach='/html'
> >> dataSource='fld' dataField='tika.text' rootEntity='true' >" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +"" +
> >> +  

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-29 Thread Andreas Owen
how dumb can you get. obviously quite dumb... i would have to analyze the 
html pages with a nested instance like this:

 









but i'm pretty sure the forEach and the xpath expressions are wrong. at the 
moment i'm getting the following error:

Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: 
sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to 
java.io.Reader





On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:

> ok i see what you're getting at, but why doesn't the following work:
>   
>   
>   
> 
> i removed the Tika processor. what am i missing? i haven't found anything in 
> the wiki.
> 
> 
> On 28. Sep 2013, at 12:28 AM, P Williams wrote:
> 
>> I spent some more time thinking about this.  Do you really need to use the
>> TikaEntityProcessor?  It doesn't offer anything new to the document you are
>> building that couldn't be accomplished by the XPathEntityProcessor alone
>> from what I can tell.
>> 
>> I also tried to get the Advanced Parsing example to
>> work without success.  There are some obvious typos (
>> instead of ) and an odd order to the pieces ( is
>> enclosed by ).  It also looks like FieldStreamDataSource is
>> the one that is meant to work in this context. If Koji is still around
>> maybe he could offer some help?  Otherwise this bit of erroneous
>> instruction should probably be removed from the wiki.
>> 
>> Cheers,
>> Tricia
>> 
>> $ svn diff
>> Index:
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>> ===
>> ---
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>(revision 1526990)
>> +++
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>(working copy)
>> @@ -99,13 +99,13 @@
>>runFullImport(getConfigHTML("identity"));
>>assertQ(req("*:*"), testsHTMLIdentity);
>>  }
>> -
>> +
>>  private String getConfigHTML(String htmlMapper) {
>>return
>>"" +
>>"  " +
>>"  " +
>> -"> processor='TikaEntityProcessor' " +
>> +"> processor='TikaEntityProcessor' " +
>>"   url='" +
>> getFile("dihextras/structured.html").getAbsolutePath() + "' " +
>>((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper +
>> "'")) + ">" +
>>"  " +
>> @@ -114,4 +114,36 @@
>>"";
>> 
>>  }
>> +  private String[] testsHTMLH1 = {
>> +  "//*[@numFound='1']"
>> +  , "//str[@name='h1'][contains(.,'H1 Header')]"
>> +  };
>> +
>> +  @Test
>> +  public void testTikaHTMLMapperSubEntity() throws Exception {
>> +runFullImport(getConfigSubEntity("identity"));
>> +assertQ(req("*:*"), testsHTMLH1);
>> +  }
>> +
>> +  private String getConfigSubEntity(String htmlMapper) {
>> +return
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"> dataSource='bin' format='html' rootEntity='false'>" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"> dataSource='fld' dataField='tika.text' rootEntity='true' >" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"";
>> +  }
>> +
>> }
>> Index:
>> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
>> ===
>> ---
>> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
>>  (revision 1526990)
>> +++
>> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
>>  (working copy)
>> @@ -194,6 +194,8 @@
>>   
>>   
>>   
>> +   
>> +   
>> 
>> 
>> 
>> 
>> 
>> I find the SqlEntityProcessor part particularly odd.  That's the default
>> right?:
>> 2405 T12 C1 oashd.SqlEntityProcessor.initQuery ERROR The query failed
>> 'null' java.lang.RuntimeException: unsupported type : class java.lang.String
>> at
>> org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
>> at
>> org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:1)
>> at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
>> at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEnti

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-28 Thread Andreas Owen
thanks, but the first suggestion is already implemented and the second didn't work. 
i have also tried htmlMapper="identity" but nothing worked.

i also tried this but the html was stripped in both fields





but in the end i think it's best to cut tika out because i'm not getting any 
benefits from it. i would just need to get this to work:




the fields are empty and i'm not getting any errors in the logs.


On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

> This is a rather complicated example to chew through, but try the following
> two things:
> *) dataField="${tika.text}"  => dataField="text" (or less likely htmlMapper
> tika.text)
> You might be trying to read content of the field rather than passing
> reference to the field that seems to be expected. This might explain the
> exception.
> 
> *) It may help to be aware of
> https://issues.apache.org/jira/browse/SOLR-4530 . There is a new
> htmlMapper="identity" flag on Tika entries to ensure more of HTML structure
> passing through. By default, Tika strips out most of the HTML tags.
> 
> Regards,
>   Alex.
> 
> On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen  wrote:
> 
>>> url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
>>
>> 
>>> forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true"
>> onError="skip">
>>
>>
>>
>> 
> 
> 
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread Alexandre Rafalovitch
This is a rather complicated example to chew through, but try the following
two things:
*) dataField="${tika.text}"  => dataField="text" (or less likely htmlMapper
tika.text)
You might be trying to read content of the field rather than passing
reference to the field that seems to be expected. This might explain the
exception.

*) It may help to be aware of
https://issues.apache.org/jira/browse/SOLR-4530 . There is a new
htmlMapper="identity" flag on Tika entries to ensure more of HTML structure
passing through. By default, Tika strips out most of the HTML tags.
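
The first point above (passing a reference to the field versus passing its content) can be sketched in a few lines of plain Java. This is only an illustration of the idea, not DIH's actual code: `resolve` and `lookup` are hypothetical stand-ins for variable substitution and for a field-based data source, and the row contents are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these methods exist in Solr.
class DataFieldSketch {

    // Stand-in for variable substitution: "${name}" is replaced by the row value.
    static String resolve(String attr, Map<String, String> row) {
        if (attr.startsWith("${") && attr.endsWith("}")) {
            String key = attr.substring(2, attr.length() - 1);
            String value = row.get(key);
            return value == null ? "" : value;
        }
        return attr;
    }

    // Stand-in for a field-based data source: the argument is treated as a field NAME.
    static String lookup(String fieldName, Map<String, String> row) {
        return row.get(fieldName);
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("tika.text", "<html><body><h1>H1 Header</h1></body></html>");

        // "${tika.text}" is expanded to the HTML itself, which is then used as a field name.
        System.out.println(lookup(resolve("${tika.text}", row), row)); // prints "null"

        // "tika.text" stays a name, so the lookup finds the field content.
        System.out.println(lookup(resolve("tika.text", row), row));
    }
}
```

Whether the right name is `text` or `tika.text` depends on how the parent entity exposes its fields, which is exactly the uncertainty noted in the first bullet.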

Regards,
   Alex.

On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen  wrote:

>  url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
> 
>
>  forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true"
> onError="skip">
> 
> 
> 
>



Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread Andreas Owen
ok i see what you're getting at, but why doesn't the following work:




i removed the Tika processor. what am i missing? i haven't found anything in 
the wiki.


On 28. Sep 2013, at 12:28 AM, P Williams wrote:

> I spent some more time thinking about this.  Do you really need to use the
> TikaEntityProcessor?  It doesn't offer anything new to the document you are
> building that couldn't be accomplished by the XPathEntityProcessor alone
> from what I can tell.
> 
> I also tried to get the Advanced Parsing example to
> work without success.  There are some obvious typos (
> instead of ) and an odd order to the pieces ( is
> enclosed by ).  It also looks like FieldStreamDataSource is
> the one that is meant to work in this context. If Koji is still around
> maybe he could offer some help?  Otherwise this bit of erroneous
> instruction should probably be removed from the wiki.
> 
> Cheers,
> Tricia
> 
> $ svn diff
> Index:
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> ===
> ---
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> (revision 1526990)
> +++
> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
> (working copy)
> @@ -99,13 +99,13 @@
> runFullImport(getConfigHTML("identity"));
> assertQ(req("*:*"), testsHTMLIdentity);
>   }
> -
> +
>   private String getConfigHTML(String htmlMapper) {
> return
> "" +
> "  " +
> "  " +
> -" processor='TikaEntityProcessor' " +
> +" processor='TikaEntityProcessor' " +
> "   url='" +
> getFile("dihextras/structured.html").getAbsolutePath() + "' " +
> ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper +
> "'")) + ">" +
> "  " +
> @@ -114,4 +114,36 @@
> "";
> 
>   }
> +  private String[] testsHTMLH1 = {
> +  "//*[@numFound='1']"
> +  , "//str[@name='h1'][contains(.,'H1 Header')]"
> +  };
> +
> +  @Test
> +  public void testTikaHTMLMapperSubEntity() throws Exception {
> +runFullImport(getConfigSubEntity("identity"));
> +assertQ(req("*:*"), testsHTMLH1);
> +  }
> +
> +  private String getConfigSubEntity(String htmlMapper) {
> +return
> +"" +
> +"" +
> +"" +
> +"" +
> +" dataSource='bin' format='html' rootEntity='false'>" +
> +"" +
> +"" +
> +"" +
> +"" +
> +"" +
> +" dataSource='fld' dataField='tika.text' rootEntity='true' >" +
> +"" +
> +"" +
> +"" +
> +"" +
> +"" +
> +"";
> +  }
> +
> }
> Index:
> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
> ===
> ---
> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
>   (revision 1526990)
> +++
> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
>   (working copy)
> @@ -194,6 +194,8 @@
>
>
>
> +   
> +   
> 
>  
>  
> 
> 
> I find the SqlEntityProcessor part particularly odd.  That's the default
> right?:
> 2405 T12 C1 oashd.SqlEntityProcessor.initQuery ERROR The query failed
> 'null' java.lang.RuntimeException: unsupported type : class java.lang.String
> at
> org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
> at
> org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:1)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
> at
> org.apache.solr.handler.dataimport.DataImportHandler.

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread P Williams
I spent some more time thinking about this.  Do you really need to use the
TikaEntityProcessor?  It doesn't offer anything new to the document you are
building that couldn't be accomplished by the XPathEntityProcessor alone
from what I can tell.
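
As a rough sketch of what an XPath-only setup might look like, the following builds a nested config as a Java string, in the same style as the test helper in the diff below, and checks only that it is well-formed XML. The entity names, field columns and xpath values are assumptions for illustration; the url, forEach and dataSource values are borrowed from the configs quoted elsewhere in this thread. Nothing here runs an import.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; field names and xpaths are placeholders.
class XPathOnlyConfigSketch {

    // Outer entity reads the list of URLs, inner entity parses each page.
    static String buildConfig(String listUrl) {
        return
            "<dataConfig>" +
            "<dataSource type='URLDataSource' name='main'/>" +
            "<dataSource type='URLDataSource' name='dataUrl'/>" +
            "<document>" +
            "<entity name='rec' processor='XPathEntityProcessor' url='" + listUrl + "'" +
            " forEach='/docs/doc' dataSource='main'>" +
            "<field column='urlParse' xpath='/docs/doc/urlParse'/>" +
            "<entity name='page' processor='XPathEntityProcessor' url='${rec.urlParse}'" +
            " forEach='/html' dataSource='dataUrl'>" +
            "<field column='title' xpath='/html/head/title'/>" +
            "</entity>" +
            "</entity>" +
            "</document>" +
            "</dataConfig>";
    }

    // Parses the string and counts elements with the given tag name;
    // throws IllegalStateException if the XML is not well-formed.
    static int countElements(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("config is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String config = buildConfig("file:///path/to/docImportUrl.xml");
        System.out.println("entities: " + countElements(config, "entity"));
        System.out.println("dataSources: " + countElements(config, "dataSource"));
    }
}
```

One caveat with this approach: XPathEntityProcessor parses its input as XML, so pages that are not well-formed XHTML may still fail even with the right data source.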

I also tried to get the Advanced Parsing example to
work without success.  There are some obvious typos (
instead of ) and an odd order to the pieces ( is
enclosed by ).  It also looks like FieldStreamDataSource is
the one that is meant to work in this context. If Koji is still around
maybe he could offer some help?  Otherwise this bit of erroneous
instruction should probably be removed from the wiki.

Cheers,
Tricia

$ svn diff
Index:
solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
===
---
solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 (revision 1526990)
+++
solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 (working copy)
@@ -99,13 +99,13 @@
 runFullImport(getConfigHTML("identity"));
 assertQ(req("*:*"), testsHTMLIdentity);
   }
-
+
   private String getConfigHTML(String htmlMapper) {
 return
 "" +
 "  " +
 "  " +
-"" +
 "  " +
@@ -114,4 +114,36 @@
 "";

   }
+  private String[] testsHTMLH1 = {
+  "//*[@numFound='1']"
+  , "//str[@name='h1'][contains(.,'H1 Header')]"
+  };
+
+  @Test
+  public void testTikaHTMLMapperSubEntity() throws Exception {
+runFullImport(getConfigSubEntity("identity"));
+assertQ(req("*:*"), testsHTMLH1);
+  }
+
+  private String getConfigSubEntity(String htmlMapper) {
+return
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"" +
+"";
+  }
+
 }
Index:
solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
===
---
solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
   (revision 1526990)
+++
solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimport-schema-no-unique-key.xml
   (working copy)
@@ -194,6 +194,8 @@



+   
+   

  
  


I find the SqlEntityProcessor part particularly odd.  That's the default
right?:
2405 T12 C1 oashd.SqlEntityProcessor.initQuery ERROR The query failed
'null' java.lang.RuntimeException: unsupported type : class java.lang.String
at
org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:89)
 at
org.apache.solr.handler.dataimport.FieldStreamDataSource.getData(FieldStreamDataSource.java:1)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
 at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
 at org.apache.solr.util.TestHarness.query(TestHarness.java:291)
at
org.apache.solr.handler.dataimport.AbstractDataImportHandlerTestCase.runFullImport(AbstractDataImportHandlerTestCase.java:96)
 at
org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperSubEntity(TestTikaEntityProcessor.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
at
com.carrotsearch.randomizedtesting.Randomize

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread Andreas Owen
i removed the FieldReaderDataSource and dataSource="fld" but it didn't help. i 
get the following for each document:
DataImportHandlerException: Exception in invoking url null Processing 
Document # 9
NullPointerException


On 26. Sep 2013, at 8:39 PM, P Williams wrote:

> Hi,
> 
> Haven't tried this myself but maybe try leaving out the
> FieldReaderDataSource entirely.  From my quick searching looks like it's
> tied to SQL.  Did you try copying the
> http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example
> exactly?  What happens when you leave out FieldReaderDataSource?
> 
> Cheers,
> Tricia
> 
> 
> On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen  wrote:
> 
>> i'm using solr 4.3.1 and the dataimporter. i am trying to use
>> XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages
>> but i'm getting this error for each document. i have also tried
>> dataField="tika.text" and dataField="text" to no avail. the nested
>> XPathEntityProcessor "detail" creates the error, the rest works fine. what
>> am i doing wrong?
>> 
>> error:
>> 
>> ERROR - 2013-09-26 12:08:49.006;
>> org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed
>> 'null'
>> java.lang.ClassCastException: java.io.StringReader cannot be cast to
>> java.util.Iterator
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
>>at
>> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>>at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>>at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>>at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>>at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>>at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>>at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>>at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>>at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>>at org.eclipse.jetty.server.Server.handle(Server.java:365)
>>at
>> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>>at
>> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>>at
>> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>>at
>> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>>at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>>at
>> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>>at
>> org.eclipse.jetty.server.BlockingHttpConnection

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-26 Thread P Williams
Hi,

Haven't tried this myself but maybe try leaving out the
FieldReaderDataSource entirely.  From my quick searching looks like it's
tied to SQL.  Did you try copying the
http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example
exactly?  What happens when you leave out FieldReaderDataSource?

Cheers,
Tricia


On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen  wrote:

> i'm using solr 4.3.1 and the dataimporter. i am trying to use
> XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages
> but i'm getting this error for each document. i have also tried
> dataField="tika.text" and dataField="text" to no avail. the nested
> XPathEntityProcessor "detail" creates the error, the rest works fine. what
> am i doing wrong?
>
> error:
>
> ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
> java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
> at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
> at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
> at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:365)
> at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
> at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
> at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Threa

XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-26 Thread Andreas Owen
I'm using Solr 4.3.1 and the DataImportHandler. I'm trying to use an
XPathEntityProcessor nested inside a TikaEntityProcessor to index HTML pages,
but I get this error for each document. I have also tried
dataField="tika.text" and dataField="text", to no avail. The nested
XPathEntityProcessor entity "detail" causes the error; the rest works fine.
What am I doing wrong?
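
For reference, the setup described above can be sketched as a DIH data-config.xml fragment. This is a hedged illustration only, not the poster's actual configuration (which is not preserved in this message): the data source names "bin" and "fld", the example.com URL, the forEach value and the title xpath are all assumptions. It relies on FieldReaderDataSource reading a parent entity's field through dataField, and it sets processor explicitly on the nested entity, since DIH falls back to SqlEntityProcessor when an entity has no processor attribute.

```xml
<dataConfig>
  <!-- Binary stream for Tika to parse; the name "bin" is an assumption. -->
  <dataSource name="bin" type="BinURLDataSource" />
  <!-- Exposes a parent entity's field as a Reader; the name "fld" is an assumption. -->
  <dataSource name="fld" type="FieldReaderDataSource" />
  <document>
    <!-- format="xml" asks Tika for XHTML output instead of plain text.
         The url is a placeholder. -->
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            url="http://example.com/page.html" format="xml">
      <field column="text" name="text" />
      <!-- Nested entity: processor is given explicitly, and dataField points
           at the "text" column of the parent entity "tika".
           forEach and xpath are placeholders to adapt to the real markup. -->
      <entity name="detail" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html">
        <field column="title" xpath="/html/head/title" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The explicit processor attribute is worth checking because the trace below shows SqlEntityProcessor handling the "detail" entity, which suggests (but does not prove) that the XPathEntityProcessor setting was not being picked up.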

error:

ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
ERROR - 2013-09-26 12:08:49.022; org.apache.solr.common.SolrException; Exception in entity : detail:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataim