Unknown query parser 'terms' with TermsComponent defined

2015-08-25 Thread P Williams
Hi,

We've encountered a strange situation, I'm hoping someone might be able to
shed some light. We're using Solr 4.9 deployed in Tomcat 7.

We build a query that has these params:

'params'=>{
  'fl'=>'id',
  'sort'=>'system_create_dtsi asc',
  'indent'=>'true',
  'start'=>'0',
  'q'=>'_query_:{!raw f=has_model_ssim}Batch AND ({!terms
f=id}ft849m81z)',
  'qt'=>'standard',
  'wt'=>'ruby',
  'rows'=>['1',
'1000']}},

And it responds with an error message
'error'=>{

'msg'=>'Unknown query parser \'terms\'',
'code'=>400}}

The terms component is defined in solrconfig.xml:

  <searchComponent name="termsComponent" class="solr.TermsComponent" />

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

And the Standard Response Handler is defined:
  <requestHandler name="standard" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="defType">lucene</str>
    </lst>
  </requestHandler>

In case it's useful, we have
<luceneMatchVersion>4.9</luceneMatchVersion>

Why would we be getting the Unknown query parser \'terms\' error?

Thanks,
Tricia


Re: Unknown query parser 'terms' with TermsComponent defined

2015-08-25 Thread P Williams
Thanks Hoss! It's obvious what the problem(s) are when you lay it all out
that way.

On Tue, Aug 25, 2015 at 12:14 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 1) The terms Query Parser (TermsQParser) has nothing to do with the
 TermsComponent (the first is for querying many distinct terms, the
 latter is for requesting info about low level terms in your index)


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
 https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

 2) TermsQParser (which is what you are trying to use with the {!terms...
 query syntax) was not added to Solr until 4.10

 3) based on your example query, i'm pretty sure what you want is the
 TermQParser: term (singular, no s) ...


 https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

 {!term f=id}ft849m81z


 : We've encountered a strange situation, I'm hoping someone might be able
 to
 : shed some light. We're using Solr 4.9 deployed in Tomcat 7.
 ...
 :   'q'='_query_:{!raw f=has_model_ssim}Batch AND ({!terms
 f=id}ft849m81z)',
 ...
 : 'msg'='Unknown query parser \'terms\'',
 : 'code'=400}}

 ...

 : The terms component is defined in solrconfig.xml:
 :
 :   <searchComponent name="termsComponent" class="solr.TermsComponent" />

 -Hoss
 http://www.lucidworks.com/
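
For completeness, a small SolrJ sketch of the corrected query using the
singular term parser; the core URL is a placeholder and the SolrJ mapping of
the original request is an assumption, not a tested fix:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermParserQuery {
  public static void main(String[] args) throws Exception {
    // Placeholder core URL -- adjust for your deployment.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/collection1");

    // Same intent as the original q, but with {!term} (available in 4.9)
    // instead of {!terms} (which only arrived in 4.10).
    SolrQuery q = new SolrQuery();
    q.setQuery("_query_:\"{!raw f=has_model_ssim}Batch\" AND _query_:\"{!term f=id}ft849m81z\"");
    q.setFields("id");
    q.set("sort", "system_create_dtsi asc");

    QueryResponse rsp = solr.query(q);
    System.out.println("matches: " + rsp.getResults().getNumFound());
  }
}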



Re: Advice on highlighting

2014-09-12 Thread P Williams
Hi Craig,

Have you seen SOLR-4722 (https://issues.apache.org/jira/browse/SOLR-4722)?
 This was my attempt at something similar.

Regards,
Tricia

On Fri, Sep 12, 2014 at 2:23 PM, Craig Longman clong...@iconect.com wrote:

 In order to take our Solr usage to the next step, we really need to
 improve its highlighting abilities.  What I'm trying to do is to be able
 to write a new component that can return the fields that matched the
 search (including numeric fields) and the start/end positions for the
 alphanumeric matches.



 I see three different approaches to take; any of them will require making
 some modifications to the lucene/solr parts, as it just does not appear
 to be doable as a completely stand-alone component.



 1) At initial search time.

 This seemed like a good approach.  I can follow IndexSearcher creating
 the TermContext that parses through AtomicReaderContexts to see if it
 contains a match and then adds it to the contexts available for later.
 However, at this point, inside SegmentTermsEnum.seekExact() it seems
 like Solr is not really looking for matching terms as such, it's just
 scanning what looks like the raw index.  So, I don't think I can easily
 extract term positions at this point.



 2) Write a modified HighlighterComponent.  We have managed to get phrases
 to highlight properly, but getting the full field matches seems like it
 would be more difficult in this module.  Also, because it does its
 highlighting oblivious to any other criteria, we can't use it as is.
 For example, this search:



   (body:large+AND+user_id:7)+OR+user_id:346



 Will highlight "large" in records that have user_id = 346 when
 technically (for our purposes at least) it should not be considered a
 hit, because the "large" was accompanied by the user_id = 7 criteria.
 It's not immediately clear to me how difficult it would be to change
 this.



 3) Make a modified DebugComponent and enhance the existing explain()
 methods (in the query types we require it for, at least) to include more
 information, such as the start/end positions of the term that was hit.
 I'm exploring this now, but I don't easily see how I can figure out what
 those positions might be from the explain() information.  Any pointers
 on how, at the point that TermQuery.explain() is being called, I can
 figure out which indexed token was the actual hit?





 Craig Longman

 C++ Developer

 iCONECT Development, LLC
 519-645-1663





 This message and any attachments are intended only for the use of the
 addressee and may contain information that is privileged and confidential.
 If the reader of the message is not the intended recipient or an authorized
 representative of the intended recipient, you are hereby notified that any
 dissemination of this communication is strictly prohibited. If you have
 received this communication in error, notify the sender immediately by
 return email and delete the message and any attachments from your system.




How to sync lib directory in SolrCloud?

2014-07-31 Thread P Williams
Hi,

I have an existing collection that I'm trying to add to a new SolrCloud.
 This collection has all the normal files in conf but also has a lib
directory to support the filters schema.xml uses.

wget
https://github.com/projectblacklight/blacklight-jetty/archive/v4.9.0.zip
unzip v4.9.0.zip

I add the configuration to Zookeeper

cd /solr-4.9.0/example/scripts
cloud-scripts/zkcli.sh -cmd upconfig -confname blacklight -zkhost
zk1:2181,zk2:2181,zk3:2181 -confdir
~/blacklight-jetty-4.9.0/solr/blacklight-core/conf/

I try to create the collection
curl "http://solr1:8080/solr/admin/collections?action=CREATE&name=blacklight&numShards=3&collection.configName=blacklight&replicationFactor=2&maxShardsPerNode=2"


but it looks like the jars in the lib directory aren't available and this
is what is causing my collection creation to fail.  I guess this makes
sense because it's not one of the files that I added to Zookeeper to share.
 How do I share the lib directory via Zookeeper?

Thanks,
Tricia

[pjenkins@solr1 scripts]$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost
zk1:2181,zk2:2181,zk3:2181 -confdir
~/blacklight-jetty-4.9.0/solr/blacklight-core/conf/ -confname blacklight
INFO  - 2014-07-31 09:28:06.289; org.apache.zookeeper.Environment; Client
environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
INFO  - 2014-07-31 09:28:06.292; org.apache.zookeeper.Environment; Client
environment:host.name=solr1.library.ualberta.ca
INFO  - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client
environment:java.version=1.7.0_65
INFO  - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client
environment:java.vendor=Oracle Corporation
INFO  - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client
environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre
INFO  - 2014-07-31 09:28:06.295; org.apache.zookeeper.Environment; Client

Re: Changing Cache Properties after Indexing

2014-01-17 Thread P Williams
You're both completely right.  There isn't any issue with indexing with
large cache settings.

I ran the same indexing job five times, twice with large cache and twice
with the default values. I threw out the first job because no matter if
it's cached or uncached it runs ~2x slower. This must have been the
observation I based my incorrect caching notion on.

I unloaded with delete of the data directory and reloaded the core each
time.  I'm using DIH with the FileEntityProcessor and
PlainTextEntityProcessor to index ~11000 fulltext books.

w/ cache
0:13:14.823
0:12:33.910

w/o cache
0:12:13.186
0:15:56.566

There is variation, but not anything that could be explained by the cache
settings. Doh!

Thanks,
Tricia


On Mon, Jan 13, 2014 at 6:08 PM, Shawn Heisey s...@elyograg.org wrote:

 On 1/13/2014 4:44 PM, Erick Erickson wrote:

 On the face of it, it's somewhat unusual to have the cache settings
 affect indexing performance. What are you seeing and how are you indexing?


 I think this is probably an indirect problem.  Cache settings don't
 directly affect indexing speed, but when autoWarm values are high and NRT
 indexing is happening, new searchers are requested frequently and the
 autoWarm makes that happen slowly with a lot of resources consumed.

 Thanks,
 Shawn




Changing Cache Properties after Indexing

2014-01-13 Thread P Williams
Hi,

I've gone through steps for tuning my cache sizes and I'm very happy with
the results of load testing.  Unfortunately the cache settings for querying
are not optimal for indexing - and in fact slow it down quite a bit.

I've made the caches small by default for the indexing stage and then want
to override the values using properties when used for querying.  That's
easy enough to do and is described in
SolrConfigXml (http://wiki.apache.org/solr/SolrConfigXml).

I store these properties in a solrcore-querying.properties file.  When
indexing is complete I could unload the Solr core, move (mv) this file to
conf/solrcore.properties and then load the Solr core and it would pick up
the new properties.  The only problem with that is in production I won't
have access to the machine to make changes to the file system.  I need to
be able to do this using the Core Admin API.

I can see that I can specify individual properties with the CREATE command,
for instance property.solr.filterCache.size=2003232.  Great!  So this is
possible but I still have two questions:

   1. Is there a way to specify a conf/solrcore-querying.properties file to
   the admin/cores handler instead of each property individually?
   2. The same functionality doesn't seem to be available when I call the
   RELOAD command.  Is this expected behaviour?  Should it be?

Is there a better way?

Thanks,
Tricia
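
A rough SolrJ sketch of the per-property CREATE call described above; the
host, core name, instanceDir and cache size are placeholders, and the raw
CoreAdmin parameters are used because I'm not certain CoreAdminRequest
exposes setters for arbitrary property.* values:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCoreWithProperties {
  public static void main(String[] args) throws Exception {
    // Base URL (no core name) because we are talking to the CoreAdmin handler.
    HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "CREATE");
    params.set("name", "querying-core");              // placeholder core name
    params.set("instanceDir", "querying-core");       // placeholder instanceDir
    // Core property picked up by ${solr.filterCache.size:...} in solrconfig.xml.
    params.set("property.solr.filterCache.size", "2003232");

    QueryRequest create = new QueryRequest(params);
    create.setPath("/admin/cores");                   // route to the CoreAdmin handler
    System.out.println(admin.request(create));
  }
}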


Re: DataImport Handler, writing a new EntityProcessor

2013-12-18 Thread P Williams
Hi Mathias,

I'd recommend testing one thing at a time.  See if you can get it to work
for one image before you try a directory of images.  Also try testing with
the solr-testframework in your IDE (I use Eclipse) to debug, rather than
relying on your browser/print statements.  Hopefully that will give you some more
specific knowledge of what's happening around your plugin.

I also wrote an EntityProcessor plugin to read from a properties
file (https://issues.apache.org/jira/browse/SOLR-3928).
 Hopefully that'll give you some insight into this kind of Solr plugin and
how to test it.

Cheers,
Tricia
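
A skeletal sketch of the solr-test-framework route; the class name, solr home
path and assertion are made-up placeholders, not code from this thread:

import org.apache.solr.SolrTestCaseJ4;
import org.junit.BeforeClass;
import org.junit.Test;

public class LireEntityProcessorTest extends SolrTestCaseJ4 {

  @BeforeClass
  public static void beforeClass() throws Exception {
    // Point at a test solr home containing solrconfig.xml, schema.xml and the DIH config.
    initCore("solrconfig.xml", "schema.xml", "src/test/resources/solr");
  }

  @Test
  public void testOneImageIsIndexed() throws Exception {
    // Run the import (e.g. by hitting the /dataimport handler) here, then assert on the index.
    assertQ(req("*:*"), "//*[@numFound='1']");
  }
}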




On Wed, Dec 18, 2013 at 3:03 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote:

 Hi all!

 I've got a question regarding writing a new EntityProcessor, in the
 same sense as the Tika one. My EntityProcessor should analyze jpg
 images and create document fields to be used with the LIRE Solr plugin
 (https://bitbucket.org/dermotte/liresolr). Basically I've taken the
 same approach as the TikaEntityProcessor, but my setup just indexes
 the first of 1000 images. I'm using a FileListEntityProcessor to get
 all JPEGs from a directory and then I'm handing them over (see [2]).
 My code for the EntityProcessor is at [1]. I've tried to use the
 DataSource as well as the filePath attribute, but it ends up all the
 same. However, the FileListEntityProcessor is able to read all the
 files according to the debug output, but I'm missing the link from the
 FileListEntityProcessor to the LireEntityProcessor.

 I'd appreciate any pointer or help :)

 cheers,
   Mathias

 [1] LireEntityProcessor http://pastebin.com/JFajkNtf
 [2] dataConfig http://pastebin.com/vSHucatJ

 --
 Dr. Mathias Lux
 Klagenfurt University, Austria
 http://tinyurl.com/mlux-itec



Re: Using data-config.xml from DIH in SolrJ

2013-11-14 Thread P Williams
Hi,

I just discovered
UpdateProcessorFactory (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/package-summary.html)
in a big way.  How did this completely slip by me?

Working on two ideas.
1. I have used the DIH in a local EmbeddedSolrServer previously.  I could
write a ForwardingUpdateProcessorFactory to take that local update and send
it to a HttpSolrServer.
2. I have code which walks the file-system to compose rough documents but
haven't yet written the part that handles the templated fields and
cross-walking of the source(s) to the schema.  I could configure the update
handler on the Solr server side to do this with the
RegexReplace (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html)
and DefaultValue (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html)
UpdateProcessorFactory(-ies).

Any thoughts on the advantages/disadvantages of these approaches?

Thanks,
Tricia
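
A bare-bones sketch of what idea 1's ForwardingUpdateProcessorFactory could
look like; the class name, hard-coded target URL and error handling are all
assumptions for illustration, not working code:

import java.io.IOException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ForwardingUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  // In a real plugin the target URL would come from init args in solrconfig.xml.
  private final HttpSolrServer remote =
      new HttpSolrServer("http://remote-host:8983/solr/collection1");

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        try {
          remote.add(cmd.getSolrInputDocument());  // forward the doc built by the local DIH run
        } catch (Exception e) {
          throw new IOException(e);
        }
        super.processAdd(cmd);                     // still pass the update down the local chain
      }
    };
  }
}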



On Thu, Nov 14, 2013 at 7:49 AM, Erick Erickson erickerick...@gmail.com wrote:

 There's nothing that I know of that takes a DIH configuration and
 uses it through SolrJ. You can use Tika directly in SolrJ if you
 need to parse structured documents though, see:
 http://searchhub.org/2012/02/14/indexing-with-solrj/

 Yep, you're going to be kind of reinventing the wheel a bit I'm
 afraid.

 Best,
 Erick


 On Wed, Nov 13, 2013 at 1:55 PM, P Williams
 williams.tricia.l...@gmail.comwrote:

  Hi All,
 
  I'm building a utility (Java jar) to create SolrInputDocuments and send
  them to a HttpSolrServer using the SolrJ API.  The intention is to find
 an
  efficient way to create documents from a large directory of files (where
  multiple files make one Solr document) and be sent to a remote Solr
  instance for update and commit.
 
  I've already solved the problem using the DataImportHandler (DIH) so I
 have
  a data-config.xml that describes the templated fields and cross-walking
 of
  the source(s) to the schema.  The original data won't always be able to
 be
  co-located with the Solr server which is why I'm looking for another
  option.
 
  I've also already solved the problem using ant and xslt to create a
  temporary (and unfortunately a potentially large) document which the
  UpdateHandler will accept.  I couldn't think of a solution that took
  advantage of the XSLT support in the UpdateHandler because each document
 is
  created from multiple files.  Our current dated Java based solution
  significantly outperforms this solution in terms of disk and time.  I've
  rejected it based on that and gone back to the drawing board.
 
  Does anyone have any suggestions on how I might be able to reuse my DIH
  configuration in the SolrJ context without re-inventing the wheel (or DIH
  in this case)?  If I'm doing something ridiculous I hope you'll point
 that
  out too.
 
  Thanks,
  Tricia
 



Using data-config.xml from DIH in SolrJ

2013-11-13 Thread P Williams
Hi All,

I'm building a utility (Java jar) to create SolrInputDocuments and send
them to a HttpSolrServer using the SolrJ API.  The intention is to find an
efficient way to create documents from a large directory of files (where
multiple files make one Solr document) and be sent to a remote Solr
instance for update and commit.

I've already solved the problem using the DataImportHandler (DIH) so I have
a data-config.xml that describes the templated fields and cross-walking of
the source(s) to the schema.  The original data won't always be able to be
co-located with the Solr server which is why I'm looking for another option.

I've also already solved the problem using ant and xslt to create a
temporary (and unfortunately a potentially large) document which the
UpdateHandler will accept.  I couldn't think of a solution that took
advantage of the XSLT support in the UpdateHandler because each document is
created from multiple files.  Our current dated Java based solution
significantly outperforms this solution in terms of disk and time.  I've
rejected it based on that and gone back to the drawing board.

Does anyone have any suggestions on how I might be able to reuse my DIH
configuration in the SolrJ context without re-inventing the wheel (or DIH
in this case)?  If I'm doing something ridiculous I hope you'll point that
out too.

Thanks,
Tricia
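
For readers skimming the archive, a minimal SolrJ sketch of the plain
"build documents client-side and push them" path described above; the
directory layout and the buildDocument(...) cross-walk are hypothetical
placeholders:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectoryIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://remote-host:8983/solr/collection1");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

    // One Solr document per sub-directory; the files inside each one are combined.
    for (File docDir : new File("/data/source").listFiles()) {
      if (!docDir.isDirectory()) continue;
      batch.add(buildDocument(docDir));       // hypothetical cross-walking step
      if (batch.size() == 100) {              // send in modest batches
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();
    solr.shutdown();
  }

  // Placeholder for the templated-field / cross-walking logic from data-config.xml.
  static SolrInputDocument buildDocument(File docDir) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", docDir.getName());
    return doc;
  }
}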


Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-30 Thread P Williams
Hi Andreas,

When using
XPathEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor) your
DataSource
must be of type DataSource<Reader>.  You shouldn't be using
BinURLDataSource, it's giving you the cast exception.  Use
URLDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html)
or
FileDataSource (https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html) instead.

I don't think you need to specify namespaces, at least you didn't used to.
 The other thing that I've noticed is that the anywhere xpath expression //
doesn't always work in DIH.  You might have to be more specific.

Cheers,
Tricia





On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen a...@conx.ch wrote:

 how dum can you get. obviously quite dum... i would have to analyze the
 html-pages with a nested instance like this:

 <entity name="rec" processor="XPathEntityProcessor"
 url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
 forEach="/docs/doc" dataSource="main">

 <entity name="htm" processor="XPathEntityProcessor"
 url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
 <field column="text" xpath="//content" />
 <field column="h_2" xpath="//body" />
 <field column="text_nohtml" xpath="//text" />
 <field column="h_1" xpath="//h:h1" />
 </entity>
 </entity>

 but i'm pretty sure the foreach is wrong and the xpath expressions. in the
 moment i getting the following error:

 Caused by: java.lang.RuntimeException:
 org.apache.solr.handler.dataimport.DataImportHandlerException:
 java.lang.ClassCastException:
 sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast
 to java.io.Reader





 On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:

  ok i see what you're getting at but why doesn't the following work:
 
    <field xpath="//h:h1" column="h_1" />
    <field column="text" xpath="/xhtml:html/xhtml:body" />
 
  i removed the tiki-processor. what am i missing, i haven't found
 anything in the wiki?
 
 
  On 28. Sep 2013, at 12:28 AM, P Williams wrote:
 
  I spent some more time thinking about this.  Do you really need to use
 the
  TikaEntityProcessor?  It doesn't offer anything new to the document you
 are
  building that couldn't be accomplished by the XPathEntityProcessor alone
  from what I can tell.
 
  I also tried to get the Advanced
  Parsing (http://wiki.apache.org/solr/TikaEntityProcessor) example to
  work without success.  There are some obvious typos (<document>
  instead of </document>) and an odd order to the pieces (the dataSources are
  enclosed by document).  It also looks like
  FieldStreamDataSource
  (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html)
  is the one that is meant to work in this context. If Koji is still around
  maybe he could offer some help?  Otherwise this bit of erroneous
  instruction should probably be removed from the wiki.
 
  Cheers,
  Tricia
 
  $ svn diff
  Index:
 
 solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
  ===
  ---
 
 solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 (revision 1526990)
  +++
 
 solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 (working copy)
  @@ -99,13 +99,13 @@
 runFullImport(getConfigHTML(identity));
 assertQ(req(*:*), testsHTMLIdentity);
   }
  -
  +
   private String getConfigHTML(String htmlMapper) {
 return
 dataConfig +
   dataSource type='BinFileDataSource'/ +
   document +
  -entity name='Tika' format='xml'
  processor='TikaEntityProcessor'  +
  +entity name='Tika' format='html'
  processor='TikaEntityProcessor'  +
url=' +
  getFile(dihextras/structured.html).getAbsolutePath() + '  +
 ((htmlMapper == null) ?  : ( htmlMapper=' + htmlMapper +
  ')) +  +
   field column='text'/ +
  @@ -114,4 +114,36 @@
 /dataConfig;
 
   }
  +  private String[] testsHTMLH1 = {
  +  //*[@numFound='1']
  +  , //str[@name='h1'][contains(.,'H1 Header')]
  +  };
  +
  +  @Test
  +  public void testTikaHTMLMapperSubEntity() throws Exception {
  +runFullImport(getConfigSubEntity(identity));
  +assertQ(req(*:*), testsHTMLH1);
  +  }
  +
  +  private String getConfigSubEntity(String htmlMapper) {
  +return
  +dataConfig +
  +dataSource type='BinFileDataSource' name='bin'/ +
  +dataSource type='FieldStreamDataSource' name='fld'/ +
  +document +
  +entity name='tika' processor

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread P Williams
$StatementRunner.run(ThreadLeakControl.java:358)
 at java.lang.Thread.run(Thread.java:722)



On Fri, Sep 27, 2013 at 3:55 AM, Andreas Owen a...@conx.ch wrote:

 i removed the FieldReaderDataSource and dataSource=fld but it didn't
 help. i get the following for each document:
 DataImportHandlerException: Exception in invoking url null
 Processing Document # 9
 nullpointerexception


 On 26. Sep 2013, at 8:39 PM, P Williams wrote:

  Hi,
 
  Haven't tried this myself but maybe try leaving out the
  FieldReaderDataSource entirely.  From my quick searching looks like it's
  tied to SQL.  Did you try copying the
  http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example
  exactly?  What happens when you leave out FieldReaderDataSource?
 
  Cheers,
  Tricia
 
 
  On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:
 
  i'm using solr 4.3.1 and the dataimporter. i am trying to use
  XPathEntityProcessor within the TikaEntityProcessor for indexing
 html-pages
  but i'm getting this error for each document. i have also tried
  dataField=tika.text and dataField=text to no avail. the nested
  XPathEntityProcessor detail creates the error, the rest works fine.
 what
  am i doing wrong?
 
  error:
 
  ERROR - 2013-09-26 12:08:49.006;
  org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed
  'null'
  java.lang.ClassCastException: java.io.StringReader cannot be cast to
  java.util.Iterator
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
 at
 
 org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
 at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
 at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at
 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
 at
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 at
 org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-26 Thread P Williams
Hi,

Haven't tried this myself but maybe try leaving out the
FieldReaderDataSource entirely.  From my quick searching looks like it's
tied to SQL.  Did you try copying the
http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example
exactly?  What happens when you leave out FieldReaderDataSource?

Cheers,
Tricia


On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen a...@conx.ch wrote:

 i'm using solr 4.3.1 and the dataimporter. i am trying to use
 XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages
 but i'm getting this error for each document. i have also tried
 dataField=tika.text and dataField=text to no avail. the nested
 XPathEntityProcessor detail creates the error, the rest works fine. what
 am i doing wrong?

 error:

 ERROR - 2013-09-26 12:08:49.006;
 org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed
 'null'
 java.lang.ClassCastException: java.io.StringReader cannot be cast to
 java.util.Iterator
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
 at
 org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Unknown Source)
 ERROR - 2013-09-26 12:08:49.022; 

Re: DIH field defaults or re-assigning field values

2013-09-24 Thread P Williams
I discovered how to use the
ScriptTransformer (http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer),
which worked to solve my problem.  I had to make use
of context.setSessionAttribute(...,...,'global') to store a flag for the
value in the file because the script is only called if there are rows to
transform and I needed to know when the default was appropriate to set in
the root entity.

Thanks for your suggestions Alex.

Cheers,
Tricia
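
The same trick written as a custom Java Transformer instead of a script,
sketched only to show the session-attribute idea; the field and flag names
are made up:

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class DefaultFlagTransformer extends Transformer {

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    // Remember globally that the sub-entity produced a value for this field,
    // so the root entity knows not to apply its default.
    if (row.get("my_field") != null) {
      context.setSessionAttribute("my_field.found", Boolean.TRUE, Context.SCOPE_GLOBAL);
    }
    return row;
  }
}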


On Wed, Sep 18, 2013 at 1:19 PM, P Williams
williams.tricia.l...@gmail.com wrote:

 Hi All,

 I'm using the DataImportHandler to import documents to my index.  I assign
 one of my document's fields by using a sub-entity from the root to look for
 a value in a file.  I've got this part working.  If the value isn't in the
 file or the file doesn't exist I'd like the field to be assigned a default
 value.  Is there a way to do this?

 I think I'm looking for a way to re-assign the value of a field.  If this
 is possible then I can assign the default value in the root entity and
 overwrite it if the value is found in the sub-entity. Ideas?

 Thanks,
 Tricia



DIH field defaults or re-assigning field values

2013-09-18 Thread P Williams
Hi All,

I'm using the DataImportHandler to import documents to my index.  I assign
one of my document's fields by using a sub-entity from the root to look for
a value in a file.  I've got this part working.  If the value isn't in the
file or the file doesn't exist I'd like the field to be assigned a default
value.  Is there a way to do this?

I think I'm looking for a way to re-assign the value of a field.  If this
is possible then I can assign the default value in the root entity and
overwrite it if the value is found in the sub-entity. Ideas?

Thanks,
Tricia


Re: How to Manage RAM Usage at Heavy Indexing

2013-09-09 Thread P Williams
Hi,

I've been seeing the same thing on CentOS with high physical memory use
with low JVM-Memory use.  I came to the conclusion that this was expected
behaviour. Using top I noticed that my solr user's java process has
virtual memory allocated of about twice the size of the index; the actual (resident)
memory is within the limits I set when jetty starts.  I infer from this that 98% of
Physical Memory is being used to cache the index.  Walter, Erick and others
are constantly reminding people on list to have RAM the size of the index
available -- I think 98% physical memory use is exactly why.  Here is an
excerpt from Uwe Schindler's well-written
piece (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html) which
explains in greater detail:

*Basically mmap does the same like handling the Lucene index as a swap
file. The mmap() syscall tells the O/S kernel to virtually map our whole
index files into the previously described virtual address space, and make
them look like RAM available to our Lucene process. We can then access our
index file on disk just like it would be a large byte[] array (in Java this
is encapsulated by a ByteBuffer interface to make it safe for use by Java
code). If we access this virtual address space from the Lucene code we
don’t need to do any syscalls, the processor’s MMU and TLB handles all the
mapping for us. If the data is only on disk, the MMU will cause an
interrupt and the O/S kernel will load the data into file system cache. If
it is already in cache, MMU/TLB map it directly to the physical memory in
file system cache. It is now just a native memory access, nothing more! We
don’t have to take care of paging in/out of buffers, all this is managed by
the O/S kernel. Furthermore, we have no concurrency issue, the only
overhead over a standard byte[] array is some wrapping caused by
Java’s ByteBuffer
interface (it is still slower than a real byte[] array, but that is the
only way to use mmap from Java and is much faster than all other directory
implementations shipped with Lucene). We also waste no physical memory, as
we operate directly on the O/S cache, avoiding all Java GC issues described
before.*

Is it odd that my index is ~16GB but top shows 30GB in virtual memory?
 Would the extra be for the field and filter caches I've increased in size?

I went through a few Java tuning steps relating to OutOfMemoryErrors when
using DataImportHandler with Solr.  The first thing is that when using the
FileEntityProcessor for each file in the file system to be indexed an entry
is made and stored in heap before any indexing actually occurs.  When I
started pointing this at very large directories I started running out of
heap.  One work-around is to divide the job up into smaller batches, but I
was able to allocate more memory so that everything fit.  The next thing is
that with more memory allocated the limiting factor was too many open
files.  After allowing the solr user to open more files I was able to get
past this as well.  There was a sweet spot where indexing with just enough
memory was slow enough that I didn't experience the too many open files
error but why go slow?  Now I'm able to index ~4M documents (newspaper
articles and fulltext monographs) in about 7 hours.

I hope someone will correct me if I'm wrong about anything I've said here
and especially if there is a better way to do things.

Best of luck,
Tricia



On Wed, Aug 28, 2013 at 12:12 PM, Dan Davis dansm...@gmail.com wrote:

 This could be an operating systems problem rather than a Solr problem.
 CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing
 and I would read-up up on that.
 The VM parameters can be tuned in /etc/sysctl.conf


 On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.com
 wrote:

  Hi Erick;
 
  I wanted to get a quick answer that's why I asked my question as that
 way.
 
  Error is as follows:
 
  INFO  - 2013-08-21 22:01:30.978;
  org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
  webapp=/solr path=/update params={wt=javabinversion=2}
  {add=[com.deviantart.reachmeh
  ere:http/gallery/, com.deviantart.reachstereo:http/,
  com.deviantart.reachstereo:http/art/SE-mods-313298903,
  com.deviantart.reachtheclouds:http/,
 com.deviantart.reachthegoddess:http/,
  co
  m.deviantart.reachthegoddess:http/art/retouched-160219962,
  com.deviantart.reachthegoddess:http/badges/,
  com.deviantart.reachthegoddess:http/favourites/,
  com.deviantart.reachthetop:http/
  art/Blue-Jean-Baby-82204657 (1444006227844530177),
  com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
  ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException;
  java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
  early EOF
  at
 
 
 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at
 
 
 

Re: Total Term Frequency per ResultSet in Solr 4.3 ?

2013-07-04 Thread P Williams
Hi Tony,

Have you seen the
TermVectorComponent (http://wiki.apache.org/solr/TermVectorComponent)?
 It will return the TermVectors for the documents in your result set (note
that the rows parameter matters if you want results for the whole set; the
default is 10).  TermVectors also must be stored for each field that you
want term frequency returned for.  Suppose you have the query
http://localhost:8983/solr/collection1/tvrh?q=cable&fl=includes&tv.tf=true on
the example that comes packaged with Solr.  Then part of the response is:

<lst name="termVectors">
  <str name="uniqueKeyFieldName">id</str>
  <lst name="IW-02">
    <str name="uniqueKey">IW-02</str>
  </lst>
  <lst name="9885A004">
    <str name="uniqueKey">9885A004</str>
    <lst name="includes">
      <lst name="32mb">
        <int name="tf">1</int>
      </lst>
      <lst name="av">
        <int name="tf">1</int>
      </lst>
      <lst name="battery">
        <int name="tf">1</int>
      </lst>
      <lst name="cable">
        <int name="tf">2</int>
      </lst>
      <lst name="card">
        <int name="tf">1</int>
      </lst>
      <lst name="sd">
        <int name="tf">1</int>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
      </lst>
    </lst>
  </lst>
  <lst name="3007WFP">
    <str name="uniqueKey">3007WFP</str>
    <lst name="includes">
      <lst name="cable">
        <int name="tf">1</int>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
      </lst>
    </lst>
  </lst>
  <lst name="MA147LL/A">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cable">
        <int name="tf">1</int>
      </lst>
      <lst name="earbud">
        <int name="tf">1</int>
      </lst>
      <lst name="headphones">
        <int name="tf">1</int>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
      </lst>
    </lst>
  </lst>
</lst>

Then you can use an XPath query like
sum(//lst[@name='cable']/int[@name='tf']) where 'cable' was the term, to
calculate the term frequency in the 'includes' field for the whole result
set.  You could extend this to get the term frequency across all fields for
your result set with some alterations to the query and schema.xml
configuration.  Alternately you could get the response as json (wt=json)
and use javascript to sum. I know this is not terribly efficient but, if
I'm understanding your request correctly, it's possible.
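
A rough SolrJ sketch of the same summation done client-side, assuming the
/tvrh handler, field and term shown above (all of which are placeholders for
your own setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class TermFreqSum {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("cable");
    q.setRequestHandler("/tvrh");        // handler with the TermVectorComponent enabled
    q.set("fl", "includes");
    q.set("tv.tf", "true");
    q.setRows(Integer.MAX_VALUE);        // rows must cover the whole result set

    QueryResponse rsp = server.query(q);
    NamedList<?> termVectors = (NamedList<?>) rsp.getResponse().get("termVectors");

    long total = 0;
    for (int i = 0; i < termVectors.size(); i++) {         // one entry per document
      Object docEntry = termVectors.getVal(i);
      if (!(docEntry instanceof NamedList)) continue;      // skips uniqueKeyFieldName etc.
      Object field = ((NamedList<?>) docEntry).get("includes");
      if (!(field instanceof NamedList)) continue;
      Object term = ((NamedList<?>) field).get("cable");   // the term being counted
      if (!(term instanceof NamedList)) continue;
      Number tf = (Number) ((NamedList<?>) term).get("tf");
      if (tf != null) total += tf.longValue();
    }
    System.out.println("total tf for 'cable' in result set: " + total);
  }
}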

Cheers,
Tricia


On Thu, Jul 4, 2013 at 10:24 AM, Tony Mullins tonymullins...@gmail.com wrote:

 So what is the workaround for this problem ?
 Can it be done without changing any source code ?

 Thanks,
 Tony


 On Thu, Jul 4, 2013 at 8:01 PM, Yonik Seeley yo...@lucidworks.com wrote:

  Ah, sorry - I thought you were after docfreq, not termfreq.
  -Yonik
  http://lucidworks.com
 
  On Thu, Jul 4, 2013 at 10:57 AM, Tony Mullins tonymullins...@gmail.com
  wrote:
   Hi Yonik,
  
   With facet it didn't work.
  
   Please see the result set doc below
  
  
 
  http://localhost:8080/solr/collection2/select?fl=*,amazing_freq:termfreq%28product,%27amazing%27%29,spider_freq:termfreq%28product,%27spider%27%29&fq=id%3A27&q=spider&fl=*&df=product&wt=xml&indent=true&facet=true&facet.query=product:spider&facet.query=product:amazing&rows=20
  
    <doc>
      <str name="id">27</str>
      <str name="type">Movies</str>
      <str name="format">dvd</str>
      <str name="product">The amazing spider man is amazing spider the
    spider</str>
      <int name="popularity">1</int>
      <long name="_version_">1439641369145507840</long>

      <int name="amazing_freq">2</int>
      <int name="spider_freq">3</int>
    </doc>
    </result><lst name="facet_counts"><lst name="facet_queries">
      <int name="product:spider">1</int>
      <int name="product:amazing">1</int>
    </lst>
  
   As you can see facet is actually just returning the no. of docs found
   against those keywrods not the actual frequency.
   Actual frequency is returned by the field 'amazing_freq' 
 'spider_freq'
  !
  
   So is there any workaround for this to get the total of term-frequency
 in
   resultset without any modification to Solr source code ?
  
  
   Thanks,
   Tony
  
  
   On Thu, Jul 4, 2013 at 7:05 PM, Yonik Seeley yo...@lucidworks.com
  wrote:
  
   If you just want to retrieve those counts, this seems like simple
  faceting.
  
   q=something
   facet=true
   facet.query=product:hunger
   facet.query=product:games
  
   -Yonik
   http://lucidworks.com
  
   On Thu, Jul 4, 2013 at 9:45 AM, Tony Mullins 
 tonymullins...@gmail.com
   wrote:
Hi ,
   
I have lots of crawled data, indexed in my Solr (4.3.0) and lets say
  user
creates a search criteria 'X1' and he/she wants to know the
 occurrence
   of a
specific term in the result set of that 'X1' search criteria.
And then again he/she creates another search criteria 'X2' and
 he/she
   wants
to know the occurrence of that same term in the result set of that
  'X2'
search criteria.
   
At the moment if I give termfreq(field,term) then it gives me the
 term
frequency per document and if I use totaltermfreq(field,term), it
  gives
   me
the total term frequency in entire index not in the result set of my
   search
criteria.
   
So what I need is your help to find how to how to get total
 occurrence
   of a
term in query's result set.
   
If this is my result set
   
    <doc>
    <str name="type">Movies</str>
    <str name="format">dvd</str>
    <str name="product">The Hunger Games</str></doc>

    <doc>
    <str name="type">Books</str>
    <str name="format">paperback</str>
    <str name="product">The 

SolrEntityProcessor doesn't grok responseHeader tag in Ancient Solr 1.2 source

2013-04-23 Thread P Williams
Hi,

I'd like to use the SolrEntityProcessor to partially migrate an old index
to Solr 4.1.  The source is pretty old (dated 2006-06-10 16:05:12Z)...
maybe Solr 1.2?  My data-config.xml is based on the SolrEntityProcessor
example http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
and wt=xml.
 I'm getting an error from SolrJ complaining about
<responseHeader>
<status>0</status>
<QTime>1</QTime>
</responseHeader>
in the response.  Does anyone know of a work-around?

Thanks,
Tricia

1734 T12 C0 oasc.SolrException.log SEVERE Exception while processing: sep
document :
SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException:
org.apache.solr.common.SolrException: parsing error
Caused by: org.apache.solr.common.SolrException: parsing error
Caused by: java.lang.RuntimeException: this must be known type! not:
responseHeader
at
org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:222)
 at
org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:128)
... 43 more


Re: SolrEntityProcessor doesn't grok responseHeader tag in Ancient Solr 1.2 source

2013-04-23 Thread P Williams
Thanks Erik.  I remember Solr Flare :)


On Tue, Apr 23, 2013 at 11:56 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

 You might be out of luck with the SolrEntityProcessor.  I'd recommend
 writing a simple little script that pages through /select?q=*:* from the
 source Solr and write to the destination Solr.   Back in the day there was
 this fun little beast 
 https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/lib/solr/importer/solr_source.rb
 where you could do something like this:

Solr::Indexer.new(SolrSource.new(...), mapping).index

 Erik


 On Apr 23, 2013, at 13:41 , P Williams wrote:

  Hi,
 
  I'd like to use the SolrEntityProcessor to partially migrate an old index
  to Solr 4.1.  The source is pretty old (dated 2006-06-10 16:05:12Z)...
  maybe Solr 1.2?  My data-config.xml is based on the SolrEntityProcessor
  example 
 http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
  and wt=xml.
  I'm getting an error from SolrJ complaining about
  <responseHeader>
  <status>0</status>
  <QTime>1</QTime>
  </responseHeader>
  in the response.  Does anyone know of a work-around?
 
  Thanks,
  Tricia
 
  1734 T12 C0 oasc.SolrException.log SEVERE Exception while processing: sep
  document :
 
 SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException:
  org.apache.solr.common.SolrException: parsing error
  Caused by: org.apache.solr.common.SolrException: parsing error
  Caused by: java.lang.RuntimeException: this must be known type! not:
  responseHeader
  at
 
 org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:222)
  at
 
 org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:128)
  ... 43 more
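
A rough sketch of the paging script Erik describes, done in Java; it assumes
the legacy core still answers /select with the old-style XML
(<doc><str name="...">...</str></doc>), flattens every field to a string, and
uses placeholder URLs throughout:

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LegacyIndexCopier {
  public static void main(String[] args) throws Exception {
    String source = "http://old-host:8983/solr/select";    // legacy Solr 1.x core (placeholder)
    HttpSolrServer dest = new HttpSolrServer("http://new-host:8983/solr/collection1");
    int rows = 500;
    for (int start = 0; ; start += rows) {
      Document xml = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new URL(source + "?q=*:*&start=" + start + "&rows=" + rows).openStream());
      NodeList docs = xml.getElementsByTagName("doc");
      if (docs.getLength() == 0) break;                    // no more pages
      for (int i = 0; i < docs.getLength(); i++) {
        SolrInputDocument sid = new SolrInputDocument();
        NodeList fields = docs.item(i).getChildNodes();
        for (int j = 0; j < fields.getLength(); j++) {
          Node f = fields.item(j);
          if (f.getNodeType() != Node.ELEMENT_NODE) continue;
          String name = ((Element) f).getAttribute("name");
          if ("arr".equals(f.getNodeName())) {             // multi-valued field
            NodeList vals = f.getChildNodes();
            for (int k = 0; k < vals.getLength(); k++) {
              if (vals.item(k).getNodeType() == Node.ELEMENT_NODE) {
                sid.addField(name, vals.item(k).getTextContent());
              }
            }
          } else {
            sid.addField(name, f.getTextContent());
          }
        }
        dest.add(sid);
      }
    }
    dest.commit();
    dest.shutdown();
  }
}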




Re: How do I recover the position and offset a highlight for solr (4.1/4.2)?

2013-04-16 Thread P Williams
Hi,

It doesn't have the offset information, but check out my patch
(https://issues.apache.org/jira/browse/SOLR-4722), which outputs the position
of each term that's been matched.  I'm eager to get some feedback on this
approach and any improvements that might be suggested.

Cheers,
Tricia


On Wed, Mar 27, 2013 at 8:28 AM, Skealler Nametic bchaillou...@gmail.com wrote:

 Hi,

 I would like to retrieve the position and offset of each highlighting
 found.
 I searched on the internet, but I have not found the exact solution to my
 problem...



Results Order When Performing Wildcard Query

2013-04-09 Thread P Williams
Hi,

I wrote a test of my application which revealed a Solr oddity (I think).
 The test, which I wrote on Windows 7 and which makes use of the
solr-test-framework (http://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html),
fails
under Ubuntu 12.04 because the Solr results I expected for a wildcard query
of the test data are ordered differently under Ubuntu than Windows.  On
both Windows and Ubuntu all items in the result set have a score of 1.0 and
appear to be ordered by docid (which looks like it corresponds to
alphabetical unique id on Windows but not Ubuntu).  I'm guessing that the
root of my issue is that a different docid was assigned to the same
document on each operating system.

The data was imported using a DataImportHandler configuration during a
@BeforeClass step in my JUnit test on both systems.

Any suggestions on how to ensure a consistently ordered wildcard query
result set for testing?

Thanks,
Tricia


Re: Results Order When Performing Wildcard Query

2013-04-09 Thread P Williams
Hey Shawn,

My gut says the difference in assignment of docids has to do with how the
FileListEntityProcessor (http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor)
works
on the two operating systems. The documents are updated/imported in a
different order is my guess, but I haven't tested that theory. I still
think it's kind of odd that there would be a difference.

Indexes are created from scratch in my test, so it's not that.  java
-version reports the same values on both machines:
java version 1.7.0_17
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) Client VM (build 23.7-b01, mixed mode)

The explicit (arbitrary non-score) sort parameter will work as a
work-around to get my test to pass in both environments while I think about
this some more. Thanks!
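
A tiny sketch of that workaround in a SolrJ-based test; the field name is a
placeholder for any stored, sortable field:

import org.apache.solr.client.solrj.SolrQuery;

public class DeterministicWildcardQuery {
  // Builds the wildcard query with an explicit, arbitrary non-score sort so the
  // result order no longer depends on internal Lucene docid assignment.
  public static SolrQuery build() {
    SolrQuery query = new SolrQuery("id:*");
    query.set("sort", "id asc");   // deterministic tie-break across operating systems
    return query;
  }
}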

Cheers,
Tricia


On Tue, Apr 9, 2013 at 2:13 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/9/2013 12:08 PM, P Williams wrote:

 I wrote a test of my application which revealed a Solr oddity (I think).
   The test which I wrote on Windows 7 and makes use of the
 solr-test-framework (http://lucene.apache.org/solr/4_1_0/solr-test-framework/index.html),
 fails
 under Ubuntu 12.04 because the Solr results I expected for a wildcard
 query
 of the test data are ordered differently under Ubuntu than Windows.  On
 both Windows and Ubuntu all items in the result set have a score of 1.0
 and
 appear to be ordered by docid (which looks like in corresponds to
 alphabetical unique id on Windows but not Ubuntu).  I'm guessing that the
 root of my issue is that a different docid was assigned to the same
 document on each operating system.


 It might be due to differences in how Java works on the two platforms, or
 even something as simple as different Java versions.  I don't know a lot
 about the underlying Lucene stuff, so this next sentence may not be
 correct: If you have are not starting from an index where the actual index
 directory was deleted before the test started (rather than deleting all
 documents), that might produce different internal Lucene document ids.


  The data was imported using a DataImportHandler configuration during a
 @BeforeClass step in my JUnit test on both systems.

 Any suggestions on how to ensure a consistently ordered wildcard query
 result set for testing?


 Include an explicit sort parameter.  That way it will depend on the data,
 not the internal Lucene representation.

 Thanks,
 Shawn




Re: Highlighting data stored outside of Solr

2012-12-17 Thread P Williams
Your problem seems really similar to "It should be possible to highlight
external text" (https://issues.apache.org/jira/browse/SOLR-1397) in JIRA.

Tricia
[https://issues.apache.org/jira/browse/SOLR-1397]

On Tue, Dec 11, 2012 at 12:48 PM, Michael Ryan mr...@moreover.com wrote:

 Has anyone ever attempted to highlight a field that is not stored in Solr?
  We have been considering not storing fields in Solr, but still would like
 to use Solr's built-in highlighting.  On first glance, it looks like it
 would be fairly simply to modify DefaultSolrHighlighter to get the stored
 fields from an external source.  We already do not use term vectors, so no
 concerns there.  Any gotchas that I am not seeing?

 -Michael



Re: Using

2012-10-16 Thread P Williams
Hi,

Just wanted to update with a workaround.

<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0"
conf="test->default">
<exclude type="orbit"/>
</dependency>

Works for me to test my configs and project code with SolrTestCaseJ4 using
IVY as a dependency manager.

Does anyone else think it's odd that the directory structure
solr.home/collection1 is hard coded into the test-framework?

Regards,
Tricia

On Mon, Oct 15, 2012 at 11:19 AM, P Williams williams.tricia.l...@gmail.com
 wrote:

 Hi,

 Thanks for the suggestions.  Didn't work for me :(

 I'm calling
 <dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0"
 conf="test->default"/>

 which depends on org.eclipse.jetty:jetty-server
 which depends on org.eclipse.jetty.orbit:jettty-servlet

 I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493.

 The pom file for
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom
  contains <packaging>orbit</packaging>, so ivy looks for
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
  rather
 than
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar
  hence
 my troubles.

 I'm an IVY newbie so maybe there is something I'm missing here?  Is there
 another 'conf' value other than 'default' I can use?

 Thanks,
 Tricia



 On Fri, Oct 12, 2012 at 4:32 PM, P Williams 
 williams.tricia.l...@gmail.com wrote:

 Hi,

 Has anyone tried using <dependency org="org.apache.solr"
 name="solr-test-framework" rev="4.0.0" conf="test->default"/> with
 Apache IVY in their project?

 rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve]   [FAILED ]
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit:
  (0ms)
 [ivy:resolve]    shared: tried
 [ivy:resolve]
 C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
 [ivy:resolve]    public: tried
 [ivy:resolve]
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
 [ivy:resolve]   ::
 [ivy:resolve]   ::  FAILED DOWNLOADS::
 [ivy:resolve]   :: ^ see resolution messages for details  ^ ::
 [ivy:resolve]   ::
 [ivy:resolve]   ::
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
 [ivy:resolve]   ::
 [ivy:resolve]
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 Can anybody point me to the source of this error or a workaround?

 Thanks,
 Tricia





Re: How does Solr know which relative paths to use?

2012-10-16 Thread P Williams
Hi Dotan,

It seems that the examples now use Multiple
Cores (http://wiki.apache.org/solr/CoreAdmin) by default.  If your test
server is based on the stock example, you should
see a solr.xml file in your CWD path which is how Solr knows about the
relative paths.  There should also be a README.txt file that will tell you
more about how the directory is expected to be organized.

Cheers,
Tricia

On Tue, Oct 16, 2012 at 3:50 PM, Dotan Cohen dotanco...@gmail.com wrote:

 I have just installed Solr 4.0 on a test server. I start it like so:
 $ pwd
 /some/dir
 $ java -jar start.jar

 The Solr Instance now looks like this:
 CWD
 /some/dir
 Instance
 /some/dir/solr/collection1
 Data
 /some/dir/solr/collection1/data
 Index
 /some/dir/solr/collection1/data/index

 From where did the additional relative paths 'collection1',
 'collection1/data', and 'collection1/data/index' come from? I know
 that I can change the value of CWD with the -Dsolr.solr.home flag, but
 what affects the relative paths mentioned?

 Thanks.


 --
 Dotan Cohen

 http://gibberish.co.il
 http://what-is-what.com



Re: Using

2012-10-15 Thread P Williams
Hi,

Thanks for the suggestions.  Didn't work for me :(

I'm calling
<dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0"
conf="test->default"/>

which depends on org.eclipse.jetty:jetty-server
which depends on org.eclipse.jetty.orbit:jettty-servlet

I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493.

The pom file for
http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom
 contains <packaging>orbit</packaging>, so ivy looks for
http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
rather
than
http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar
hence
my troubles.

I'm an IVY newbie so maybe there is something I'm missing here?  Is there
another 'conf' value other than 'default' I can use?

Thanks,
Tricia



On Fri, Oct 12, 2012 at 4:32 PM, P Williams
williams.tricia.l...@gmail.com wrote:

 Hi,

 Has anyone tried using <dependency org="org.apache.solr"
 name="solr-test-framework" rev="4.0.0" conf="test->default"/> with Apache
 IVY in their project?

 rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve]   [FAILED ]
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit:
  (0ms)
 [ivy:resolve]    shared: tried
 [ivy:resolve]
 C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
 [ivy:resolve]    public: tried
 [ivy:resolve]
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
 [ivy:resolve]   ::
 [ivy:resolve]   ::  FAILED DOWNLOADS::
 [ivy:resolve]   :: ^ see resolution messages for details  ^ ::
 [ivy:resolve]   ::
 [ivy:resolve]   ::
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
 [ivy:resolve]   ::
 [ivy:resolve]
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 Can anybody point me to the source of this error or a workaround?

 Thanks,
 Tricia



Re: Using

2012-10-15 Thread P Williams
Apologies, there was a typo in my last message.

org.eclipse.jetty.orbit:jettty-servlet  should have been
org.eclipse.jetty.orbit:javax.servlet


On Mon, Oct 15, 2012 at 11:19 AM, P Williams williams.tricia.l...@gmail.com
 wrote:

 Hi,

 Thanks for the suggestions.  Didn't work for me :(

 I'm calling
 <dependency org="org.apache.solr" name="solr-test-framework" rev="4.0.0"
 conf="test->default"/>

 which depends on org.eclipse.jetty:jetty-server
 which depends on org.eclipse.jetty.orbit:jettty-servlet

 I think I'm experiencing https://jira.codehaus.org/browse/JETTY-1493.

 The pom file for
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.pom
 contains <packaging>orbit</packaging>, so Ivy looks for
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
 rather than
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.jar
 hence my troubles.

 I'm an IVY newbie so maybe there is something I'm missing here?  Is there
 another 'conf' value other than 'default' I can use?

 Thanks,
 Tricia



 On Fri, Oct 12, 2012 at 4:32 PM, P Williams 
 williams.tricia.l...@gmail.com wrote:

 Hi,

 Has anyone tried using <dependency org="org.apache.solr"
 name="solr-test-framework" rev="4.0.0" conf="test->default"/> with
 Apache IVY in their project?

 rev 3.6.1 works but any of the 4.0.0 ALPHA, BETA and release result in:
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve]   [FAILED ]
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit:
  (0ms)
 [ivy:resolve]    shared: tried
 [ivy:resolve]
 C:\Users\pjenkins\.ant/shared/org.eclipse.jetty.orbit/javax.servlet/3.0.0.v201112011016/orbits/javax.servlet.orbit
 [ivy:resolve]    public: tried
 [ivy:resolve]
 http://repo1.maven.org/maven2/org/eclipse/jetty/orbit/javax.servlet/3.0.0.v201112011016/javax.servlet-3.0.0.v201112011016.orbit
 [ivy:resolve]   ::
 [ivy:resolve]   ::  FAILED DOWNLOADS::
 [ivy:resolve]   :: ^ see resolution messages for details  ^ ::
 [ivy:resolve]   ::
 [ivy:resolve]   ::
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
 [ivy:resolve]   ::
 [ivy:resolve]
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 Can anybody point me to the source of this error or a workaround?

 Thanks,
 Tricia





Re: Solr - Tika(?) memory leak

2012-01-16 Thread P Williams
Hi,

I'm not sure which version of Solr/Tika you're using but I had a similar
experience which turned out to be the result of a design change to PDFBox.

https://issues.apache.org/jira/browse/SOLR-2886

Tricia

On Sat, Jan 14, 2012 at 12:53 AM, Wayne W waynemailingli...@gmail.com wrote:

 Hi,

 we're using Solr running on Tomcat with 1GB in production, and of late
 we've been having a huge number of OutOfMemory issues. From what I can
 tell, this is coming from the Tika extraction of the content. I've
 processed the Java heap dump with a memory analyzer and it's pretty clear,
 at least, which classes are involved. It seems like a leak to me, as we
 don't parse any files larger than 20M, yet these objects are taking up
 ~700M.

 I've attached 2 screen shots from the tool (not sure if you receive
 attachments).

 But to summarize (class, number of objects, Used heap size, Retained Heap
 Size):


  org.apache.xmlbeans.impl.store.Xob$ElementXObj   838,993   80,533,728   604,606,040
  org.apache.poi.openxml4j.opc.ZipPackage                2          112    87,009,848
  char[]                                               587   32,216,960    38,216,950


 We're really desperate to find a solution to this - any ideas or help
 is greatly appreciated.
 Wayne



Re: avoid overwrite in DataImportHandler

2011-12-08 Thread P Williams
Ah.  Thanks Erick.

I see now that my question is different from sabman's.

Is there a way to use the DataImportHandler's full-import command so that
it does not delete the existing material before it begins?
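Concretely, I was hoping for something along these lines (assuming the
handler is registered at /dataimport and that clean=false really does skip
the delete step):

  http://localhost:8983/solr/dataimport?command=full-import&clean=false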

Thanks,
Tricia

On Thu, Dec 8, 2011 at 6:35 AM, Erick Erickson erickerick...@gmail.com wrote:

 This is all controlled by Solr via the uniqueKey field in your schema.
 Just
 remove that entry.

 But then it's all up to you to handle the fact that there will be multiple
 documents with the same ID all returned as a result of querying. And
 it won't matter what program adds data, *nothing* will be overwritten,
 DIH has no part in that decision.

 Deduplication is about defining some fields in your record and avoiding
 adding another document if the contents are close, where close is a
 slippery concept. I don't think it's related to your problem at all.

 Best
 Erick

 On Wed, Dec 7, 2011 at 3:27 PM, P Williams
 williams.tricia.l...@gmail.com wrote:
  Hi,
 
  I've wondered the same thing myself.  I feel like the clean parameter
 has
  something to do with it but it doesn't work as I'd expect either.  Thanks
  in advance to anyone who can answer this question.
 
  *clean* : (default 'true'). Tells whether to clean up the index before
 the
  indexing is started.
 
  Tricia
 
  On Wed, Dec 7, 2011 at 12:49 PM, sabman sab...@gmail.com wrote:
 
  I have a unique ID defined for the documents I am indexing. I want to
 avoid
  overwriting the documents that have already been indexed. I am using
  XPathEntityProcessor and TikaEntityProcessor to process the documents.
 
  The DataImportHandler does not seem to have the option to set
  overwrite=false. I have read some other forums to use deduplication
 instead
  but I don't see how it is related to my problem.
 
  Any help on this (or explanation on how deduplication would apply to my
  problem) would be great. Thanks!
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: avoid overwrite in DataImportHandler

2011-12-07 Thread P Williams
Hi,

I've wondered the same thing myself.  I feel like the clean parameter has
something to do with it but it doesn't work as I'd expect either.  Thanks
in advance to anyone who can answer this question.

*clean* : (default 'true'). Tells whether to clean up the index before the
indexing is started.

Tricia

On Wed, Dec 7, 2011 at 12:49 PM, sabman sab...@gmail.com wrote:

 I have a unique ID defined for the documents I am indexing. I want to avoid
 overwriting the documents that have already been indexed. I am using
 XPathEntityProcessor and TikaEntityProcessor to process the documents.

 The DataImportHandler does not seem to have the option to set
 overwrite=false. I have read some other forums to use deduplication instead
 but I don't see how it is related to my problem.

 Any help on this (or explanation on how deduplication would apply to my
 problem) would be great. Thanks!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH doesn't handle bound namespaces?

2011-11-03 Thread P Williams
Hi Gary,

From
http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource

*It does not support namespaces, but it can handle xmls with namespaces.
When you provide the xpath, just drop the namespace and give the rest (eg
if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy,
isn't it? And you didn't need to write one line of code! Enjoy*
You should be able to use xpath="//titleInfo/title" without making any
modifications (i.e. without removing the namespaces) to your XML.
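For example, a DIH entity roughly like the following (the entity name, the
url and the field mapping are just placeholders for whatever your
data-config already uses) should pick the title out of your sample record:

  <entity name="mods" processor="XPathEntityProcessor"
          url="mods.xml" forEach="/mods">
    <field column="title" xpath="//titleInfo/title" />
  </entity>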

I hope that answers your question.

Regards,
Tricia

On Mon, Oct 31, 2011 at 9:24 AM, Moore, Gary gary.mo...@ars.usda.gov wrote:

 I'm trying to import some MODS XML using DIH.  The XML uses bound
 namespacing:

 <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:mods="http://www.loc.gov/mods/v3"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns="http://www.loc.gov/mods/v3"
       xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd"
       version="3.4">
   <mods:titleInfo>
     <mods:title>Malus domestica: Arnold</mods:title>
   </mods:titleInfo>
 </mods>

 However, XPathEntityProcessor doesn't seem to handle xpaths of the type
 xpath="//mods:titleInfo/mods:title".

 If I remove the namespaces from the source XML:

 <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:mods="http://www.loc.gov/mods/v3"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns="http://www.loc.gov/mods/v3"
       xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/mods/v3/mods-3-4.xsd"
       version="3.4">
   <titleInfo>
     <title>Malus domestica: Arnold</title>
   </titleInfo>
 </mods>

 then xpath="//titleInfo/title" works just fine.  Can anyone confirm that
 this is the case and, if so, recommend a solution?
 Thanks
 Gary


 Gary Moore
 Technical Lead
 LCA Digital Commons Project
 NAL/ARS/USDA




Re: Stream still in memory after tika exception? Possible memoryleak?

2011-11-03 Thread P Williams
Hi All,

I'm experiencing a similar problem to the others in this thread.

I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
apache-solr-4.0-2011-10-14_08-56-59.war and then
apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various
sizes, using the TikaEntityProcessor.  My indexing would run to completion
and was completely successful under the June build.  The only problem was
the readability of the full text in highlighting, which was fixed in Tika
0.10 (TIKA-611).  I chose to use the October 14 build of Solr because Tika 0.10
had recently been included (SOLR-2372).

On the same machine, without changing any memory settings, my initial
problem is a PermGen error.  Fine, I increase the PermGen space.
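(For the record, that was just a larger value passed to the JVM, something
along the lines of

  JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=256m"

where the exact number is arbitrary and depends on your container setup.)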

I've set the onError parameter to skip for the TikaEntityProcessor.
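In data-config terms that's roughly the following, abbreviated (the entity
and field names are stand-ins for my actual config, and the url assumes an
outer FileListEntityProcessor entity named "f"):

  <entity name="tika" processor="TikaEntityProcessor"
          url="${f.fileAbsolutePath}" format="text" onError="skip">
    <field column="text" name="fulltext" />
  </entity>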
 Now I get several (6)

  SEVERE: Exception thrown while getting data
  java.net.SocketTimeoutException: Read timed out
  SEVERE: Exception in entity :
  tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
  Exception in invoking url url removed # 2975

pairs.  And after ~3881 documents, with auto commit set unreasonably
frequently, I consistently get an OutOfMemoryError:

  SEVERE: Exception while processing: f document :
  null:org.apache.solr.handler.dataimport.DataImportHandlerException:
  java.lang.OutOfMemoryError: Java heap space

The stack trace points to
org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).

The October 30 build performs identically.

Funny thing is that monitoring via JConsole doesn't reveal any memory
issues.

The OutOfMemoryError did not occur with the June build, which leads me to
believe that a bug has been introduced to the code since then.  Should I
open an issue in JIRA?

Thanks,
Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs jacob...@gmail.com wrote:

 Hi Erick,

 I am using Solr 3.3.0, but with 1.4.1 the same problems.
 The connector is a homemade program in the C# programming language and is
 posting via http remote streaming (i.e.

 http://localhost:8080/solr/update/extract?stream.file=/path/to/file.docliteral.id=1
 )
 I'm using Tika to extract the content (comes with the Solr Cell).

 A possible problem is that the file stream needs to be closed by the client
 application after extracting, but it seems that something goes wrong when a
 Tika exception is thrown: the stream never leaves memory. At least that is
 my assumption.

 What is the common way to extract content from office files (pdf, doc, rtf,
 xls etc.) and index it? To write a content extractor / validator yourself?
 Or is it possible to do this with Solr Cell without huge memory
 consumption? Please let me know. Thanks in advance.

 Marc

 2011/8/30 Erick Erickson erickerick...@gmail.com

  What version of Solr are you using, and how are you indexing?
  DIH? SolrJ?
 
  I'm guessing you're using Tika, but how?
 
  Best
  Erick
 
  On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote:
   Hi all,
  
   Currently I'm testing Solr's indexing performance, but unfortunately I'm
   running into memory problems.
   It looks like Solr is not closing the file stream after an exception, but
   I'm not really sure.

   The current system I'm using has 150GB of memory, and while I'm indexing
   the memory consumption is growing and growing (eventually to more than
   50GB).
   In the attached graph I indexed about 70k office documents (pdf, doc, xls
   etc.) and between 1 and 2 percent throw an exception.
   The commits are after 64MB, 60 seconds or after a job (there are 6 evenly
   divided jobs).

   After indexing, the memory consumption isn't dropping. Even after an
   optimize command it's still there.
   What am I doing wrong? I can't imagine I'm the only one with this
   problem.
   Thanks in advance!
  
   Kind regards,
  
   Marc
  
 



JSON and DataImportHandler

2010-07-16 Thread P Williams

Hi All,

Has anyone gotten the DataImportHandler to work with JSON as
input?  Is there an even easier alternative to DIH?  Could you show me 
an example?


Many thanks,
Tricia