Re: Search with punctuations

2013-07-14 Thread kobe.free.wo...@gmail.com
Hi Erick,

Thanks for your reply!

I have tried both of the suggestions that you have mentioned i.e.,

1. Using WhitespaceTokenizerFactory
2. Using WordDelimiterFilterFactory with catenateWords="1"
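
For reference, the combination looks roughly like this in my schema (a sketch
from memory; the type name is made up):

  <fieldType name="text_punct" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>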

But I still face the same issue. Should the tokenizers/filter factories used
be the same for both the "query" and "index" analyzers?

As per my scenario, when I search for "INTL", I want SOLR to return the
records containing either "INTL" or "INT'L".

Please suggest other alternatives to achieve this.

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510p4077973.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Norms

2013-07-14 Thread Mark Miller

On Jul 10, 2013, at 4:39 AM, Daniel Collins  wrote:

> QueryNorm is what I'm still trying to get to the bottom of exactly :) 

If you have not seen it, some reading from the past here…

https://issues.apache.org/jira/browse/LUCENE-1896

- Mark

Re: How to from solr facet exclude specific “Tag”!

2013-07-14 Thread Upayavira
Make your two fq clauses separate fq params. That would be better for your
caches, and it would mean each tag is easily associated with its whole fq
query string.
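
Something like this (untested sketch):

  q=*:*&facet=true
    &fq={!tag=city}CityId:729
    &fq={!tag=company}CompanyId:16122
    &facet.field={!ex=city}CityId
    &facet.field={!ex=company}CompanyId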

Upayavira

On Sun, Jul 14, 2013, at 03:14 AM, 张智 wrote:
> solr 4.3
> 
> this is my query request params:
> 
> status=0, QTime=15, facet=true, indent=true, q=*:*, _=1373713374569,
> facet.field={!ex=city}CityId, facet.field={!ex=company}CompanyId, wt=xml,
> fq={!tag=city}CityId:729 AND {!tag=company}CompanyId:16122
> 
> This is the query response "Facet" content:
> 
> [facet_fields output garbled in the archive: per-bucket counts for CityId and
> CompanyId omitted]
> 
> You can see that the CityId facet is correct: it excludes the
> {!tag=city}CityId:729 clause. But the CompanyId facet is not correct: it does
> not exclude the {!tag=company}CompanyId:16122 clause. How can I solve this?


Re: Solr caching clarifications

2013-07-14 Thread Manuel Le Normand
Alright, thanks Erick. For the question about memory usage of merges, this is
taken from Mike McCandless' blog:

The big thing that stays in RAM is a logical int[] mapping old docIDs to
new docIDs, but in more recent versions of Lucene (4.x) we use a much more
efficient structure than a simple int[] ... see
https://issues.apache.org/jira/browse/LUCENE-2357

How much RAM is required is mostly a function of how many documents (lots
of tiny docs use more RAM than fewer huge docs).


A related clarification:
As my users are not aware of the fq possibility, I was wondering how to make
the best use of this cache. Would it be efficient to implicitly transform their
query into a filter query for clauses that are boolean searches (date ranges
etc. that do not affect the score of a document)? Is this good practice? Is
there any plugin for a query parser that does this?
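
For example, I mean rewriting something like (just a sketch of the idea, field
names invented):

  q=content:report AND date:[2013-01-01T00:00:00Z TO 2013-07-14T00:00:00Z]

into

  q=content:report&fq=date:[2013-01-01T00:00:00Z TO 2013-07-14T00:00:00Z]

so the date clause ends up in the filterCache and stays out of scoring.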



>
> Inline
>
> On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand
>  wrote:
> > Hello,
> > As a result of frequent java OOM exceptions, I try to investigate more into
> > the solr jvm memory heap usage.
> > Please correct me if I am mistaking, this is my understanding of usages for
> > the heap (per replica on a solr instance):
> > 1. Buffers for indexing - bounded by ramBufferSize
> > 2. Solr caches
> > 3. Segment merge
> > 4. Miscellaneous- buffers for Tlogs, servlet overhead etc.
> >
> > Particularly I'm concerned by Solr caches and segment merges.
> > 1. How much memory consuming (bytes per doc) are FilterCaches (bitDocSet)
> > and queryResultCaches (DocList)? I understand it is related to the skip
> > spaces between doc id's that match (so it's not saved as a bitmap). But
> > basically, is every id saved as a java int?
>
> Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you
> can get the maxDoc number from your Solr admin page). Plus some overhead
> for storing the fq text, but that's usually not much. This is for each
> entry up to "Size".
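
(As a rough worked example of those numbers, if I read them right: with maxDoc =
100,000,000, each filterCache entry costs about 100,000,000 / 8 bytes = 12.5 MB,
so a filterCache with size=512 could in the worst case hold roughly
512 * 12.5 MB = 6.4 GB.)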



>
> queryResultCache is usually trivial unless you've configured it extravagantly.
> It's the query string length + queryResultWindowSize integers per entry
> (queryResultWindowSize is from solrconfig.xml).
>
> > 2. QueryResultMaxDocsCached - (for example = 100) means that any query
> > resulting in more than 100 docs will not be cached (at all) in the
> > queryResultCache? Or does it have to do with the documentCache?
> It's just a limit on the queryResultCache entry size as far as I can
> tell. But again
> this cache is relatively small, I'd be surprised if it used
> significant resources.
>
> > 3. DocumentCache - written on the wiki it should be greater than
> > max_results*concurrent_queries. Max result is just the num of rows
> > displayed (rows-start) param, right? Not the queryResultWindow.
>
> Yes. This is a cache (I think) for the _contents_ of the documents you'll
> be returning, to be manipulated by various components during the life
> of the query.
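
(So with rows=20 and, say, 50 concurrent queries, the wiki formula would suggest
a documentCache size of at least 20 * 50 = 1000 entries, if I understand it
correctly.)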
>
> > 4. LazyFieldLoading=true - when querying for id's only (fl=id) will this
> > cache be used? (at the expense of eviction of docs that were already loaded
> > with stored fields)
>
> Not sure, but I don't think this will contribute much to memory pressure. This
> is about how many fields are loaded to get a single value from a doc in the
> results list, and since one is usually working with 20 or so docs this is
> usually a small amount of memory.
>
> > 5. How large is the heap used by merges? Assuming we have a merge of 10
> > segments of 500MB each (half inverted files - *.pos *.doc etc, half non
> > inverted files - *.fdt, *.tvd), how much heap should be left unused for
> > this merge?
>
> Again, I don't think this is much of a memory consumer, although I
> confess I don't
> know the internals. Merging is mostly about I/O.
>
> >
> > Thanks in advance,
> > Manu
>
> But take a look at the admin page, you can see how much memory various
> caches are using by looking at the plugins/stats section.
>
> Best
> Erick


Re: Apache Solr 4 - after 1st commit the index does not grow

2013-07-14 Thread Erick Erickson
Well, that's one. OutOfMemoryErrors will stop things from happening
for sure, the cure is to give the JVM more memory.

Additionally, multiple updates of a doc with the same uniqueKey
will replace the old copy with a new one, which might be what you're
seeing.

But get rid of the OOM first.
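
For example (the size is only a placeholder, pick what fits your machine): with
the example Jetty start that would be something like

  java -Xmx2g -jar start.jar

and under Tomcat you would add -Xmx2g to JAVA_OPTS or CATALINA_OPTS before
starting it.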

Best
Erick

On Sun, Jul 14, 2013 at 2:40 PM, glumet  wrote:
> When I look into the log, there is:
>
> SEVERE: auto commit error...:java.lang.IllegalStateException: this writer
> hit an OutOfMemoryError; cannot commit
> at
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
> at
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
> at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
> at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4077924.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr autodetectparser tikaconfig dataimporter error

2013-07-14 Thread Jack Krupansky

"Caused by: java.lang.NoSuchMethodError:"

That means you have some out of date jars or some newer jars mixed in with 
the old ones.
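
One quick sanity check (just a sketch, adjust the paths to your setup): list
every Tika jar visible to Solr and the DataImportHandler, e.g.

  find /path/to/tomcat /path/to/solr/lib -name "tika-*.jar"

and make sure only one consistent version set remains on the classpath.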


-- Jack Krupansky

-Original Message- 
From: Andreas Owen

Sent: Sunday, July 14, 2013 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: solr autodetectparser tikaconfig dataimporter error

hi

Is there no one with an idea what this error is, or who can give me a pointer
where to look? If not, is there an alternative way to import documents from an
xml-file with meta-data and the filename to parse?


thanks for any help.






Re: solr autodetectparser tikaconfig dataimporter error

2013-07-14 Thread Andreas Owen
hi

Is there no one with an idea what this error is, or who can give me a pointer where
to look? If not, is there an alternative way to import documents from an xml-file
with meta-data and the filename to parse?

thanks for any help.


On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

> I am using solr 3.5, tika-app-1.4 and tagcloud 1.2.1. When I try to import a
> file via xml I get this error; it doesn't matter what file format I try to
> index (txt, cfm, pdf), all give the same error:
> 
> SEVERE: Exception while processing: rec document :
> SolrInputDocument[{id=id(1.0)={myTest.txt},
> title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie
> kommuniziert man}, author=author(1.0)={Peter Z.},
> path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.NoSuchMethodError:
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>   at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
>   at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
>   at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
>   at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
>   at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
> Caused by: java.lang.NoSuchMethodError:
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>   at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
>   at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
>   ... 6 more
> 
> Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
> SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.NoSuchMethodError:
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>   at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
>   at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
>   at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
>   at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
>   at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
> Caused by: java.lang.NoSuchMethodError:
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>   at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
>   at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
>   ... 6 more
> 
> Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> 
> data-config.xml: [element tags lost in the archive; attribute fragments below]
> 
>    baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
> 
>    url="docImport.xml" forEach="/albums/album" dataSource="main">
>    ...
>    url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}"
>    dataSource="data" onerror="skip">
> 
> 
> the libs are included and declared in the logs. I have also tried tika-app
> 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't
> know where to start looking for the error.



Re: HTTP Status 503 - Server is shutting down

2013-07-14 Thread PeterKerk
Ok, still getting the same error "HTTP Status 503 - Server is shutting down",
so here's what I did now:

- reinstalled tomcat
- deployed solr-4.3.1.war in C:\Program Files\Apache Software
Foundation\Tomcat 6.0\webapps
- copied log4j-1.2.16.jar,slf4j-api-1.6.6.jar,slf4j-log4j12-1.6.6.jar to
C:\Program Files\Apache Software Foundation\Tomcat
6.0\webapps\solr-4.3.1\WEB-INF\lib
- copied log4j.properties from
C:\Dropbox\Databases\solr-4.3.1\example\resources to
C:\Dropbox\Databases\solr-4.3.1\example\lib
- restarted tomcat


Now this shows in my Tomcat console:

14-jul-2013 20:54:38 org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal
performanc
e in production environments was not found on the java.library.path:
C:\Program
Files\Apache Software Foundation\Tomcat
6.0\bin;C:\Windows\Sun\Java\bin;C:\Windo
ws\system32;C:\Windows;C:\Program Files\Common Files\Microsoft
Shared\Windows Li
ve;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows
Live;C:\Windows\
system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShe
ll\v1.0\;C:\Program Files\TortoiseSVN\bin;c:\msxsl;C:\Program Files
(x86)\Window
s Live\Shared;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program
File
s (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files
(x86)\Windows
 Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL
Server\110
\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL
Server\110\Tools\Binn\;C:\Prog
ram Files\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files
(x86)\Microsoft SQ
L Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft
SQL S
erver\110\DTS\Binn\;C:\Program Files (x86)\Java\jre6\bin;C:\Program
Files\Java\j
re631\bin;.
14-jul-2013 20:54:39 org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
14-jul-2013 20:54:39 org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 287 ms
14-jul-2013 20:54:39 org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
14-jul-2013 20:54:39 org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.37
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor manager.xml
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive solr-4.3.1.war
log4j:WARN No appenders could be found for logger
(org.apache.solr.servlet.SolrD
ispatchFilter).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more in
fo.
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
14-jul-2013 20:54:39 org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
14-jul-2013 20:54:39 org.apache.jk.common.ChannelSocket init
INFO: JK: ajp13 listening on /0.0.0.0:8009
14-jul-2013 20:54:39 org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=0/55  config=null
14-jul-2013 20:54:39 org.apache.catalina.startup.Catalina start
INFO: Server startup in 732 ms

And this in the catalina.log:

14-jul-2013 20:54:38 org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the
java.library.path: C:\Program Files\Apache Software Foundation\Tomcat
6.0\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program
Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files
(x86)\Common Files\Microsoft Shared\Windows
Live;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
Files\TortoiseSVN\bin;c:\msxsl;C:\Program Files (x86)\Windows
Live\Shared;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program
Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files
(x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program
Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft
SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL
Server\110\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL
Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL
Server\110\DTS\Binn\;C:\Program Files (x86)\Java\jre6\bin;C:\Program
Files\Java\jre631\bin;.
14-jul-2013 20:54:39 org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
14-jul-2013 20:54:39 org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 287 ms
14-jul-2013 20:54:39 org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
14-jul-2013 20:54:39 org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.37
14-jul-2013 20:54:39 org.apache.catalina.startup.HostConfig deployDescriptor
INFO:

Re: Apache Solr 4 - after 1st commit the index does not grow

2013-07-14 Thread glumet
When I look into the log, there is:

SEVERE: auto commit error...:java.lang.IllegalStateException: this writer
hit an OutOfMemoryError; cannot commit
at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2668)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2834)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2814)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:529)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913p4077924.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: external file field and fl parameter

2013-07-14 Thread Chris Collins
Yes that worked, thanks Alan.  The consistency of this api is "challenging".

C
On Jul 14, 2013, at 11:03 AM, Alan Woodward  wrote:

> Hi Chris,
> 
> Try wrapping the field name in a field() function in your fl parameter list, 
> like so:
> fl=field(eff_field_name)
> 
> Alan Woodward
> www.flax.co.uk
> 
> 
> On 14 Jul 2013, at 18:41, Chris Collins wrote:
> 
>> Why would I be re-indexing an external file field? The whole purpose is that 
>> its brought in at runtime and not part of the index?
>> 
>> C
>> On Jul 14, 2013, at 10:13 AM, Shawn Heisey  wrote:
>> 
>>> On 7/14/2013 7:05 AM, Chris Collins wrote:
 Yep I did switch on stored=true in the field type.  I was able to confirm 
 a few ways that there are values for the eff by two methods:
 
 1) changing desc to asc produced drastically different results.
 
 2) debugging FileFloatSource the following was getting triggered filling 
 the vals array:
while ((doc = docsEnum.nextDoc()) != 
 DocIdSetIterator.NO_MORE_DOCS)
  {
  vals[doc] = fval;
  }
 
 At least by you asking these questions I guess it should work.  I will 
 continue dissecting. 
>>> 
>>> Did you reindex when you changed the schema?  Sorting uses indexed
>>> values, not stored values.  The fl parameter requires the stored values.
>>> These are separate within the index, and one cannot substitute for the
>>> other.  If you didn't reindex, then you won't have the stored values for
>>> existing documents.
>>> 
>>> http://wiki.apache.org/solr/HowToReindex
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>> 
> 



Re: external file field and fl parameter

2013-07-14 Thread Alan Woodward
Hi Chris,

Try wrapping the field name in a field() function in your fl parameter list, 
like so:
fl=field(eff_field_name)

Alan Woodward
www.flax.co.uk


On 14 Jul 2013, at 18:41, Chris Collins wrote:

> Why would I be re-indexing an external file field? The whole purpose is that 
> its brought in at runtime and not part of the index?
> 
> C
> On Jul 14, 2013, at 10:13 AM, Shawn Heisey  wrote:
> 
>> On 7/14/2013 7:05 AM, Chris Collins wrote:
>>> Yep I did switch on stored=true in the field type.  I was able to confirm a 
>>> few ways that there are values for the eff by two methods:
>>> 
>>> 1) changing desc to asc produced drastically different results.
>>> 
>>> 2) debugging FileFloatSource the following was getting triggered filling 
>>> the vals array:
>>> while ((doc = docsEnum.nextDoc()) != 
>>> DocIdSetIterator.NO_MORE_DOCS)
>>>   {
>>>   vals[doc] = fval;
>>>   }
>>> 
>>> At least by you asking these questions I guess it should work.  I will 
>>> continue dissecting. 
>> 
>> Did you reindex when you changed the schema?  Sorting uses indexed
>> values, not stored values.  The fl parameter requires the stored values.
>> These are separate within the index, and one cannot substitute for the
>> other.  If you didn't reindex, then you won't have the stored values for
>> existing documents.
>> 
>> http://wiki.apache.org/solr/HowToReindex
>> 
>> Thanks,
>> Shawn
>> 
>> 
> 



Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-14 Thread Oleg Burlaca
Hello Erick,

> Join performance is most sensitive to the number of values
> in the field being joined on. So if you have lots and lots of
> distinct values in the corpus, join performance will be affected.
Yep, we have a list of unique Id's that we get by first searching for records
where loggedInUser IS IN (userIDs).
This corpus is stored in memory, I suppose (not a problem), and then the
bottleneck is to match this huge set against the core where I'm searching?

Somewhere in maillist archive people were talking about "external list of
Solr unique IDs"
but didn't find if there is a solution.
Back in 2010 Yonik posted a comment:
http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd


> bq: I suppose the delete/reindex approach will not change soon
> There is ongoing work (search the JIRA for "Stacked Segments")
Ah, ok, I had a feeling it affects the architecture; so now the only hope is
Pseudo-Joins ))

> One way to deal with this is to implement a "post filter", sometimes
called
> a "no cache" filter.
thanks, will have a look, but as you describe it, it's not the best option.

The approach
"too many documents, man. Please refine your query. Partial results below"
means faceting will not work correctly?

... I have in mind a hybrid approach, comments welcome:
Most of the time users are not searching, but browsing content, so our
"virtual filesystem" stored in SOLR will use only the index with the Id of
the file and the list of users that have access to it. i.e. not touching
the fulltext index at all.

Files may have metadata (EXIF info for images, for example) that we'd like to
filter by and calculate facets on.
The metadata will be stored in both indexes.

In case of a fulltext query:
1. search FT index (the fulltext index), get only the number of search
results, let it be Rf
2. search DAC index (the index with permissions), get number of search
results, let it be Rd

let maxR be the maximum size of the corpus for the pseudo-join.
*That was actually my question: what is a reasonable number? 10, 100, 1000 ?
*

if (Rf < maxR) or (Rd < maxR), then use the smaller corpus to join onto the
second one. This happens when (only a few documents contain the search query)
OR (the user has access to a small number of files).
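
Roughly, if I understand the join syntax right (core and field names as in my
example above, untested):

  /solr/FullTextIndex/select?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
  /solr/ACLIndex/select?q=userId:999&fq={!join fromIndex=FullTextIndex from=Id to=Id}content:foo

i.e. the smaller result set goes inside the {!join} filter (the first form
returns fulltext docs, the second returns docs from the DAC index).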

In case none of these happens, we can use the
"too many documents, man. Please refine your query. Partial results below"
but first searching the FT index, because we want relevant results first.

What do you think?

Regards,
Oleg




On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson wrote:

> Join performance is most sensitive to the number of values
> in the field being joined on. So if you have lots and lots of
> distinct values in the corpus, join performance will be affected.
>
> bq: I suppose the delete/reindex approach will not change soon
>
> There is ongoing work (search the JIRA for "Stacked Segments")
> on actually doing something about this, but it's been "under consideration"
> for at least 3 years so your guess is as good as mine.
>
> bq: notice that the worst situation is when everyone has access to all the
> files, it means the first filter will be the full index.
>
> One way to deal with this is to implement a "post filter", sometimes called
> a "no cache" filter. The distinction here is that
> 1> it is not cached (duh!)
> 2> it is only called for documents that have made it through all the
>  other "lower cost" filters (and the main query of course).
> 3> "lower cost" means the filter is either a standard, cached filters
> and any "no cache" filters with a cost (explicitly stated in the query)
> lower than this one's.
>
> Critically, and unlike "normal" filter queries, the result set is NOT
> calculated for all documents ahead of time
>
> You _still_ have to deal with the sysadmin doing a *:* query as you
> are well aware. But one can mitigate that by having the post-filter
> fail all documents after some arbitrary N, and display a message in the
> app like "too many documents, man. Please refine your query. Partial
> results below". Of course this may not be acceptable, but
>
> HTH
> Erick
>
> On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
>  wrote:
> > Take a look at LucidWorks Search and its access control:
> >
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> >
> > Role-based security is an easier nut to crack.
> >
> > Karl Wright of ManifoldCF had a Solr patch for document access control at
> > one point:
> > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> > security at search time
> > https://issues.apache.org/jira/browse/SOLR-1895
> >
> >
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> >
> > For some other thoughts:
> > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> >
> > I'm not sure if external file fields will be of any value in this
> situation.
> >
> > There is also a proposal for bitwise operations:
> > SOLR-1913 - QParserPlugin plugin for Search Results Fi

Re: external file field and fl parameter

2013-07-14 Thread Chris Collins
Why would I be re-indexing an external file field? The whole purpose is that
it's brought in at runtime and not part of the index?

C
On Jul 14, 2013, at 10:13 AM, Shawn Heisey  wrote:

> On 7/14/2013 7:05 AM, Chris Collins wrote:
>> Yep I did switch on stored=true in the field type.  I was able to confirm a 
>> few ways that there are values for the eff by two methods:
>> 
>> 1) changing desc to asc produced drastically different results.
>> 
>> 2) debugging FileFloatSource the following was getting triggered filling the 
>> vals array:
>>  while ((doc = docsEnum.nextDoc()) != 
>> DocIdSetIterator.NO_MORE_DOCS)
>>{
>>vals[doc] = fval;
>>}
>> 
>> At least by you asking these questions I guess it should work.  I will 
>> continue dissecting. 
> 
> Did you reindex when you changed the schema?  Sorting uses indexed
> values, not stored values.  The fl parameter requires the stored values.
> These are separate within the index, and one cannot substitute for the
> other.  If you didn't reindex, then you won't have the stored values for
> existing documents.
> 
> http://wiki.apache.org/solr/HowToReindex
> 
> Thanks,
> Shawn
> 
> 



Apache Solr 4 - after 1st commit the index does not grow

2013-07-14 Thread glumet
I have written my own plugin for Apache Nutch 2.2.1 to crawl images, videos
and podcasts from selected sites (I have 180 urls in my seed). I put this
metadata to a hBase store and now I want to save it to the index (Solr). I
have a lot of metadatas to save (webpages + images + videos + podcast).

I am using Nutch script bin/crawl for the whole process (inject, generate,
fetch, parse... and finally solrindex and dedup) but I have one problem.
When I run this script for the first time, approximately 6000 documents are
stored in the index (let's say 3700 docs for images, 1700 for webpages, and the
rest for videos and podcasts). That is ok...

but...

When I run the script for a second time, third time and so on... the index
does not increase the number of documents (there are still 6000 documents),
but the count of rows stored in the hBase table grows (there are 97383 rows now)...

Do you know where the problem is, please? I have been fighting this problem for
a really long time and I don't know... If it could be helpful, this is my
configuration of solrconfig.xml http://pastebin.com/uxMW2nuq and this is my
nutch-site.xml http://pastebin.com/4bj1wdmT



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-after-1st-commit-the-index-does-not-grow-tp4077913.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: external file field and fl parameter

2013-07-14 Thread Shawn Heisey
On 7/14/2013 7:05 AM, Chris Collins wrote:
> Yep I did switch on stored=true in the field type.  I was able to confirm a 
> few ways that there are values for the eff by two methods:
> 
> 1) changing desc to asc produced drastically different results.
> 
> 2) debugging FileFloatSource the following was getting triggered filling the 
> vals array:
>   while ((doc = docsEnum.nextDoc()) != 
> DocIdSetIterator.NO_MORE_DOCS)
> {
> vals[doc] = fval;
> }
> 
> At least by you asking these questions I guess it should work.  I will 
> continue dissecting. 

Did you reindex when you changed the schema?  Sorting uses indexed
values, not stored values.  The fl parameter requires the stored values.
 These are separate within the index, and one cannot substitute for the
other.  If you didn't reindex, then you won't have the stored values for
existing documents.

http://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn



Re: SolrCloud leader

2013-07-14 Thread Shawn Heisey
On 7/14/2013 6:42 AM, kowish.adamosh wrote:
> The problem is that I don't want to invoke data import on 8 server nodes but
> to choose only one for scheduling. Of course if this server will shut down
> then another one needs to take the scheduler role. I can see that there is
> a task for scheduling https://issues.apache.org/jira/browse/SOLR-2305 . I hope
> they will take into account SolrCloud. And that's why I wanted to know if
> current node is *currently* elected as the leader. The leader would be the
> scheduler.
> 
> In the meanwhile, any ideas of how to solve data import scheduling on
> SolrCloud architecture?

As Jack already replied, this is outside the scope of Solr.

SOLR-2305 has been around for a VERY long time.  Adding scheduling
capability to the dataimport handler is not very hard, but nobody has
done so because we do not believe this is something Solr should be
handling.  Also, it's easy to get something wrong, so users can run into
bugs that would break their scheduling.

Every operating system has scheduling capability.  Windows has the task
scheduler.  On virtually all other operating systems, you'll find cron.
 These systems have had years of operation for their authors to work out
the bugs, and they are VERY solid.

We would not be able to make the same robustness guarantee if we
included scheduling in Solr.  Additionally, we really want to be sure
that Solr never does anything on its own that has not been specifically
requested by a user or program, or through certain external events such
as a hardware or software failure.

For my own multi-server Linux Solr installation, which doesn't use
SolrCloud even though it's got two complete copies of the index and uses
shards, I have worked out how to do clustered scheduling.  I have a
corosync/pacemaker cluster set up on my servers, which ensures that only
one copy of my cronjobs is running on the cluster.  If a server dies, it
will start up the cronjobs on another server.
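
As an illustration only (URL, core name and schedule are hypothetical), a cron
entry that triggers a nightly DIH full-import on whichever node owns the
cronjob could look like:

  0 2 * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=full-import&clean=false" > /dev/null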

Thanks,
Shawn



Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-14 Thread Oleg Burlaca
Hello Jack,

Thanks for so many links; my comments are below. I'll try to rephrase all my
questions in one:
How to implement a DAC (Discretionary Access Control) similar to Windows OS
using SOLR?

What we have: a hierarchical filesystem, user and groups, permissions
applied at the level of a file/folder.
What we need: full-text search & restricting access based on ACL.
How to deal with a change in permissions for a big folder?
How to check if the user can delete a folder?  (it means he should have
write access to all files/sub-folders)


> Role-based security is an easier nut to crack
yep, but we need DAC :(

> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
The documentation doesn't reveal what happens when content should be
reindexed, although the last chapter "Document-based Authorization" shows
the same approach: user list specified at the level of the document.

> Karl Wright of ManifoldCF had a Solr patch for document access control at
one point:
> SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
security at search time
> https://issues.apache.org/jira/browse/SOLR-1895
It states "LCF SearchComponent which filters returned results based on
access tokens provided by LCF's authority service"
That means filtering is applied on the results only.
Issues: faceting doesn't work correctly (i.e. counting), because the filter
isn't applied yet.
Even worse: you have to scroll through the result set until you find
records accessible by the user (what if the user has access to only 10 out of
100,000 files?).

> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
Page 9 says "docs and access tokens".
"Separate bins for "allow" tokens, "deny" tokens for "file" "
It's similar to the approach we use: each record in SOLR has two fields,
"readAccess" and "writeAccess"; both are multi-valued fields with userId's.
It allows us, for example, to quickly delete a bunch of items the user has
access to (or to check a hierarchical delete).

> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security 
> 
"It works by adding security tokens from the source repositories as
metadata on the indexed documents"
Again, the permission info is stored within the record itself, and if we
change access for big folder, it means reindexing.

> https://issues.apache.org/jira/browse/SOLR-1913
Thanks for the link, need to meditate if I can find a way to use it.

> But the bottom line is that clearly updating all documents in the index is a
> non-starter.
I have scratched my head and monitored SOLR features for a long time, trying to
find something I can use. Today I watched Yonik Seeley's video:
http://vimeopro.com/user11514798/apache-lucene-eurocon-2012/video/55387447
and found PSEUDO-JOINS, nice. This seems a perfect solution: I can have
two indexes, one with full-text and another one with objId and userId's; the
second one should be fast to update, I hope.

But the question is: what about performance?

Regards




On Sun, Jul 14, 2013 at 7:05 PM, Jack Krupansky wrote:

> Take a look at LucidWorks Search and its access control:
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
>
> Role-based security is an easier nut to crack.
>
> Karl Wright of ManifoldCF had a Solr patch for document access control at
> one point:
> SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> security at search time
> https://issues.apache.org/jira/browse/SOLR-1895
>
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>
> For some other thoughts:
> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>
> I'm not sure if external file fields will be of any value in this
> situation.
>
> There is also a proposal for bitwise operations:
> SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> Bitwise Operations on Integer Fields
> https://issues.apache.org/jira/browse/SOLR-1913
>
> But the bottom line is that clearly updating all documents in the index is
> a non-starter.
>
> -- Jack Krupansky
>
> -Original Message- From: Oleg Burlaca
> Sent: Sunday, July 14, 2013 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: ACL implementation: Pseudo-join performance & Atomic Updates
>
>
> Hello all,
>
> Situation:
> We have a collection of files in SOLR with ACL applied: each file has a
> multi-valued field

Re: HTTP Status 503 - Server is shutting down

2013-07-14 Thread Shawn Heisey
On 7/14/2013 6:43 AM, PeterKerk wrote:
> Hi Shawn,
> 
> I'm also getting the HTTP Status 503 - Server is shutting down error when
> navigating to http://localhost:8080/solr-4.3.1/



> INFO: Deploying web application archive solr-4.3.1.war
> log4j:WARN No appenders could be found for logger
> (org.apache.solr.servlet.SolrD
> ispatchFilter).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more in
> fo.

THe logging.properties file is used for JDK logging, which was the
default in Solr prior to version 4.3.0.  In older versions, jarfiles
were embedded in the .war file that set up slf4j to use
java.util.logging, also known as JDK logging because this logging
framework comes with Java.

Solr 4.3.0 and later does not have ANY slf4j jarfiles in the .war file,
so you need to put them in your classpath.  Jarfiles are included in the
example, in example/lib/ext, and those jarfiles set up logging to use
log4j, a much more flexible logging framework than JDK logging.

JDK logging is typically set up with a file called logging.properties,
which I think you must use a system property to configure.  You aren't
using JDK logging, you are using log4j, which uses a file called
log4j.properties.

http://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty

It appears that you have followed part of the instructions above and
copied jars from example/lib/ext to a lib directory on your classpath.
Now if you copy example/resources/log4j.properties to the same place,
logging should work.  It will not log to the tomcat log, it will log to
the location specified in log4j.properties, which by default is
logs/solr.log relative to the current working directory.
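
For reference, a minimal log4j.properties along these lines is enough to get
file logging going (this is only an illustration, not the exact file that ships
with Solr):

  log4j.rootLogger=INFO, file
  log4j.appender.file=org.apache.log4j.RollingFileAppender
  log4j.appender.file.File=logs/solr.log
  log4j.appender.file.MaxFileSize=4MB
  log4j.appender.file.MaxBackupIndex=9
  log4j.appender.file.layout=org.apache.log4j.PatternLayout
  log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c; %m%n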

As I already said on this thread, if you want Tomcat to be in control of
the logging, you must switch back to java.util.logging as described in
the wiki:

http://wiki.apache.org/solr/SolrLogging#Switching_from_Log4J_back_to_JUL_.28java.util.logging.29

Thanks,
Shawn



Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-14 Thread Erick Erickson
Join performance is most sensitive to the number of values
in the field being joined on. So if you have lots and lots of
distinct values in the corpus, join performance will be affected.

bq: I suppose the delete/reindex approach will not change soon

There is ongoing work (search the JIRA for "Stacked Segments")
on actually doing something about this, but it's been "under consideration"
for at least 3 years so your guess is as good as mine.

bq: notice that the worst situation is when everyone has access to all the
files, it means the first filter will be the full index.

One way to deal with this is to implement a "post filter", sometimes called
a "no cache" filter. The distinction here is that
1> it is not cached (duh!)
2> it is only called for documents that have made it through all the
 other "lower cost" filters (and the main query of course).
3> "lower cost" means the filter is either a standard, cached filters
and any "no cache" filters with a cost (explicitly stated in the query)
lower than this one's.

Critically, and unlike "normal" filter queries, the result set is NOT
calculated for all documents ahead of time
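
For anyone who wants the syntax: a filter is marked as a non-cached post filter
with local params, something like (a sketch; the query itself is whatever parser
you use, and that parser must implement PostFilter):

  fq={!cache=false cost=200}yourAclFilterQuery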

You _still_ have to deal with the sysadmin doing a *:* query as you
are well aware. But one can mitigate that by having the post-filter
fail all documents after some arbitrary N, and display a message in the
app like "too many documents, man. Please refine your query. Partial
results below". Of course this may not be acceptable, but

HTH
Erick

On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
 wrote:
> Take a look at LucidWorks Search and its access control:
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
>
> Role-based security is an easier nut to crack.
>
> Karl Wright of ManifoldCF had a Solr patch for document access control at
> one point:
> SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> security at search time
> https://issues.apache.org/jira/browse/SOLR-1895
>
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>
> For some other thoughts:
> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>
> I'm not sure if external file fields will be of any value in this situation.
>
> There is also a proposal for bitwise operations:
> SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> Bitwise Operations on Integer Fields
> https://issues.apache.org/jira/browse/SOLR-1913
>
> But the bottom line is that clearly updating all documents in the index is a
> non-starter.
>
> -- Jack Krupansky
>
> -Original Message- From: Oleg Burlaca
> Sent: Sunday, July 14, 2013 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: ACL implementation: Pseudo-join performance & Atomic Updates
>
>
> Hello all,
>
> Situation:
> We have a collection of files in SOLR with ACL applied: each file has a
> multi-valued field that contains the list of userID's that can read it:
>
> here is sample data:
> Id | content  | userId
> 1  | text text | 4,5,6,2
> 2  | text text | 4,5,9
> 3  | text text | 4,2
>
> Problem:
> when ACL is changed for a big folder, we compute the ACL for all child
> items and reindex in SOLR using atomic updates (updating only 'userIds'
> column), but because it deletes/reindexes the record, the performance is
> very poor.
>
> Question:
> I suppose the delete/reindex approach will not change soon (probably it's
> due to actual SOLR architecture), ?
>
> Possible solution: assuming atomic updates will be super fast on an index
> without fulltext, keep a separate ACLIndex and FullTextIndex and use
> Pseudo-Joins:
>
> Example: searching 'foo' as user '999'
> /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
>
> Question: what about performance here? what if the index is 100,000
> records?
> notice that the worst situation is when everyone has access to all the
> files, it means the first filter will be the full index.
>
> Would be happy to get any links that deal with the issue of Pseudo-join
> performance for large datasets (i.e. initial filtered set of IDs).
>
> Regards,
> Oleg
>
> P.S. we found that having the list of all users that have access for each
> record is better overall, because there are much more read requests (people
> accessing the library) then write requests (a new user is added/removed).


Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
I'm completely ignorant of all things PHP, including the
state of any Solr client code, so I'm afraid I can't
help with that...

Best
Erick

On Sun, Jul 14, 2013 at 11:03 AM, xan  wrote:
> Thanks for the link. Also, having gone quite far with my work using the PHP
> Solr client, isn't there anything that could be done using the PHP Solr
> client only?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077893.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-14 Thread Jack Krupansky

Take a look at LucidWorks Search and its access control:
http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control

Role-based security is an easier nut to crack.

Karl Wright of ManifoldCF had a Solr patch for document access control at 
one point:
SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF 
security at search time

https://issues.apache.org/jira/browse/SOLR-1895

http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011

For some other thoughts:
http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security

I'm not sure if external file fields will be of any value in this situation.

There is also a proposal for bitwise operations:
SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on 
Bitwise Operations on Integer Fields

https://issues.apache.org/jira/browse/SOLR-1913

But the bottom line is that clearly updating all documents in the index is a 
non-starter.


-- Jack Krupansky

-Original Message- 
From: Oleg Burlaca

Sent: Sunday, July 14, 2013 11:02 AM
To: solr-user@lucene.apache.org
Subject: ACL implementation: Pseudo-join performance & Atomic Updates

Hello all,

Situation:
We have a collection of files in SOLR with ACL applied: each file has a
multi-valued field that contains the list of userID's that can read it:

here is sample data:
Id | content  | userId
1  | text text | 4,5,6,2
2  | text text | 4,5,9
3  | text text | 4,2

Problem:
when ACL is changed for a big folder, we compute the ACL for all child
items and reindex in SOLR using atomic updates (updating only 'userIds'
column), but because it deletes/reindexes the record, the performance is
very poor.

Question:
I suppose the delete/reindex approach will not change soon (probably it's
due to actual SOLR architecture), ?

Possible solution: assuming atomic updates will be super fast on an index
without fulltext, keep a separate ACLIndex and FullTextIndex and use
Pseudo-Joins:

Example: searching 'foo' as user '999'
/solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999

Question: what about performance here? what if the index is 100,000
records?
notice that the worst situation is when everyone has access to all the
files, it means the first filter will be the full index.

Would be happy to get any links that deal with the issue of Pseudo-join
performance for large datasets (i.e. initial filtered set of IDs).

Regards,
Oleg

P.S. we found that having the list of all users that have access for each
record is better overall, because there are much more read requests (people
accessing the library) then write requests (a new user is added/removed). 



Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
Thanks for the link. Also, having gone quite far with my work using the PHP
Solr client, isn't there anything that could be done using the PHP Solr
client only?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077893.html
Sent from the Solr - User mailing list archive at Nabble.com.


ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-14 Thread Oleg Burlaca
Hello all,

Situation:
We have a collection of files in SOLR with ACL applied: each file has a
multi-valued field that contains the list of userID's that can read it:

here is sample data:
Id | content  | userId
1  | text text | 4,5,6,2
2  | text text | 4,5,9
3  | text text | 4,2

Problem:
when ACL is changed for a big folder, we compute the ACL for all child
items and reindex in SOLR using atomic updates (updating only 'userIds'
column), but because it deletes/reindexes the record, the performance is
very poor.
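
For reference, the atomic update we send per document looks roughly like this
(JSON update handler; the id value is just an example):

  curl "http://localhost:8983/solr/update?commit=true" -H "Content-type: application/json" \
    -d '[{"Id":"1", "userId":{"set":[4,5,6]}}]'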

Question:
I suppose the delete/reindex approach will not change soon (probably it's
due to the actual SOLR architecture)?

Possible solution: assuming atomic updates will be super fast on an index
without fulltext, keep a separate ACLIndex and FullTextIndex and use
Pseudo-Joins:

Example: searching 'foo' as user '999'
/solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999

Question: what about performance here? what if the index is 100,000
records?
notice that the worst situation is when everyone has access to all the
files, it means the first filter will be the full index.

Would be happy to get any links that deal with the issue of Pseudo-join
performance for large datasets (i.e. initial filtered set of IDs).

Regards,
Oleg

P.S. We found that having the list of all users that have access for each
record is better overall, because there are many more read requests (people
accessing the library) than write requests (a new user is added/removed).


Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
Right, sorry...
http://searchhub.org/dev/2012/02/14/indexing-with-solrj/


On Sun, Jul 14, 2013 at 8:31 AM, xan  wrote:
> Sorry, but did you forget to send me the example's link?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077877.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud leader

2013-07-14 Thread Jack Krupansky
In theory, each of the nodes uses the same configuration, right? So, in 
theory, ANY of the nodes can do a DIH import. It is only way down low in the 
update processing chain that an individual Solr input document needs to have 
its key hashed and then the request is routed to the leader of the 
appropriate shard.


In short, YOU decide whatever node that YOU want the DIH import to run on, 
and Solr will automatically take care of actual distribution of individual 
document update requests.


If you want to pick a leader node, fine, but there is no requirement or need 
that you do so.


Scheduling is currently outside of the scope of Solr and SolrCloud.

-- Jack Krupansky

-Original Message- 
From: kowish.adamosh

Sent: Sunday, July 14, 2013 8:42 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud leader

The problem is that I don't want to invoke data import on 8 server nodes but
to choose only one for scheduling. Of course if this server will shut down
then another one needs to take the scheduler role. I can see that there is
a task for scheduling https://issues.apache.org/jira/browse/SOLR-2305 . I hope
they will take into account SolrCloud. And that's why I wanted to know if
current node is *currently* elected as the leader. The leader would be the
scheduler.

In the meanwhile, any ideas of how to solve data import scheduling on
SolrCloud architecture?

Kowish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-leader-tp4077759p4077878.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: external file field and fl parameter

2013-07-14 Thread Chris Collins
Yep, I did switch on stored=true in the field type.  I was able to confirm that
there are values for the eff by two methods:

1) changing desc to asc produced drastically different results.

2) debugging FileFloatSource the following was getting triggered filling the 
vals array:
while ((doc = docsEnum.nextDoc()) != 
DocIdSetIterator.NO_MORE_DOCS)
{
vals[doc] = fval;
}

At least by you asking these questions I guess it should work.  I will continue 
dissecting. 

Thanks Erick.

C
On Jul 14, 2013, at 5:16 AM, Erick Erickson  wrote:

> Did you store the field? I.e. set stored="true"? And does the EFF contain
> values for the docs you're returning?
> 
> Best
> Erick
> 
> On Sun, Jul 14, 2013 at 3:32 AM, Chris Collins  wrote:
>> I am playing with external file field for sorting.  I created a dynamic 
>> field using the ExternalFileField type.
>> 
>> I naively assumed that the "fl" argument would allow me to return the value 
>> the external field but doesnt seem to do so.
>> 
>> For instance I have a defined a dynamic field:
>> 
>> *_efloat
>> 
>> then I used:
>> 
>> sort=foo_efloat desc
>> fl=foo_efloat, score, description
>> 
>> I get the score and description but the foo_efloat seems to be missing in 
>> action.
>> 
>> 
>> Thoughts?
>> 
>> C
>> 
> 



Re: HTTP Status 503 - Server is shutting down

2013-07-14 Thread PeterKerk
Hi Shawn,

I'm also getting the HTTP Status 503 - Server is shutting down error when
navigating to http://localhost:8080/solr-4.3.1/

I already copied the logging.properties file from
C:\Dropbox\Databases\solr-4.3.1\example\etc to
C:\Dropbox\Databases\solr-4.3.1\example\lib

Here's my Tomcat console log:

14-jul-2013 14:21:57 org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\Program Files\Apache Software Foundation\Tomcat 6.0\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\TortoiseSVN\bin;c:\msxsl;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio\;C:\Program Files (x86)\Microsoft SQL Server\110\DTS\Binn\;C:\Program Files (x86)\Java\jre6\bin;C:\Program Files\Java\jre631\bin;.
14-jul-2013 14:21:57 org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
14-jul-2013 14:21:57 org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 283 ms
14-jul-2013 14:21:57 org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
14-jul-2013 14:21:57 org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.37
14-jul-2013 14:21:57 org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor manager.xml
14-jul-2013 14:21:57 org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive solr-4.3.1.war
log4j:WARN No appenders could be found for logger (org.apache.solr.servlet.SolrDispatchFilter).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14-jul-2013 14:21:58 org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
14-jul-2013 14:21:58 org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
14-jul-2013 14:21:58 org.apache.jk.common.ChannelSocket init
INFO: JK: ajp13 listening on /0.0.0.0:8009
14-jul-2013 14:21:58 org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=0/55  config=null
14-jul-2013 14:21:58 org.apache.catalina.startup.Catalina start
INFO: Server startup in 719 ms



--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTTP-Status-503-Server-is-shutting-down-tp4065958p4077879.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud leader

2013-07-14 Thread kowish.adamosh
The problem is that I don't want to invoke data import on 8 server nodes but
to choose only one for scheduling. Of course, if this server shuts down
then another one needs to take the scheduler role. I can see that there is a
task for scheduling, https://issues.apache.org/jira/browse/SOLR-2305 . I hope
they will take SolrCloud into account. And that's why I wanted to know whether
the current node is *currently* elected as the leader. The leader would be the
scheduler.

In the meanwhile, any ideas of how to solve data import scheduling on
SolrCloud architecture?
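One possible sketch of checking, from a client, which node currently leads a shard, using the SolrJ/ZooKeeper APIs available in 4.x (getLeaderRetry and the property constants are assumptions about that API; the ZooKeeper address, collection and shard names are illustrative):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.ZkStateReader;

public class LeaderCheckSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative ZooKeeper address, collection and shard names.
        CloudSolrServer server = new CloudSolrServer("zk1:2181");
        server.connect();

        ZkStateReader reader = server.getZkStateReader();
        Replica leader = reader.getLeaderRetry("collection1", "shard1");

        // Compare these against the local node to decide whether this
        // instance should run the scheduled import.
        System.out.println("leader node: " + leader.getStr(ZkStateReader.NODE_NAME_PROP));
        System.out.println("leader url:  " + leader.getStr(ZkStateReader.BASE_URL_PROP));

        server.shutdown();
    }
}

A scheduler component could run such a check periodically and only fire the import when its own node matches the leader.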

Kowish



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-leader-tp4077759p4077878.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
Sorry, but did you forget to send me the example's link?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856p4077877.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem using Term Component in solr

2013-07-14 Thread Erick Erickson
by "regularizing the title" I meant either indexing and
searching exactly:
Medical Engineering and Physics
or
Medical Eng. and Phys.

Or you could remove the stopwords yourself at both
index and query time, which would fix your "Physics
of Fluids" example.

The problem here is that you'll be forever fiddling with
this and getting it _almost_ right, then the next
anomaly will happen. Siiigh...

You might actually be much better off with an ngram
or edgeNgram approach. You'd probably want to
tokenize the titles, and perhaps auto-generate phrase
queries...

Best
Erick


On Sun, Jul 14, 2013 at 7:30 AM, Parul Gupta(Knimbus)
 wrote:
> Hi,
>
> The vocabulary is not known, that's the main issue; otherwise I would implement
> synonyms instead.
> What do you mean by 'regularizing the title'?
>
> So please let me know some solution...
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077865.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread Erick Erickson
Well, cURL is generally not what people use for production. What I'd consider
is using SolrJ (from which you can access Tika) and then storing the raw PDF
(or whatever) document as a binary data type in Solr.

Here's an example (with DB indexing mixed in, but you should be able
to pull that part out).
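In the meantime, a minimal sketch of that approach, assuming Tika's AutoDetectParser and illustrative paths, ids and field names (the text is extracted client-side and sent as an ordinary field next to the database fields):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class IndexPdfSketch {
    public static void main(String[] args) throws Exception {
        File pdf = new File("/path/to/file.pdf");   // illustrative path

        // Extract plain text on the client with Tika (-1 = no write limit).
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        InputStream in = new FileInputStream(pdf);
        try {
            new AutoDetectParser().parse(in, handler, metadata);
        } finally {
            in.close();
        }

        // Put the extracted text into an ordinary field next to the DB fields.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");                    // illustrative values
        doc.addField("name", "some name");
        doc.addField("filecontent", handler.toString());

        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}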

Best
Erick

On Sun, Jul 14, 2013 at 4:05 AM, xan  wrote:
> Hi,
>
> I'm using the PHP Solr client (ver: 1.0.2).
>
> I'm indexing the contents through my database.
> Suppose $data is a stdClass object having id, name, title, etc. from a
> database entry.
>
> Next, I declare a solr Document and assign fields to it.:
>
> $doc = new SolrInputDocument();
> $doc->addField ('id' , $data->id);
> $doc->addField ('name' , $data->name);
> 
> 
>
> I wanted to know how can I store the contents of a pdf file (whose path I've
> stored in $data->filepath), in the same solr document, say in a field
> ('filecontent').
>
> Referring to the wiki, I was unable to figure out the proper cURL request
> for achieving this. I was able to create a completely new solr document but
> how do I get the contents of the pdf file in the same solr document so that
> I can store that in a field?
>
>
> $doc = new SolrInputDocument();
> $doc->addField ('id' , $data->id);
> $doc->addField ('name' , $data->name);
> 
> 
> //fire the curl request here referring to the file at $data->filepath
> $doc->addField ('filecontent' , //content of the pdf file);
>
> Also, instead of firing the raw cURL request, is there a better way? I don't
> know if the current PECL SOLR Client 1.0.2 has the feature of indexing pdf
> files.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: external file field and fl parameter

2013-07-14 Thread Erick Erickson
Did you store the field? I.e. set stored="true"? And does the EFF contain
values for the docs you're returning?

Best
Erick

On Sun, Jul 14, 2013 at 3:32 AM, Chris Collins  wrote:
> I am playing with external file field for sorting.  I created a dynamic field 
> using the ExternalFileField type.
>
> I naively assumed that the "fl" argument would allow me to return the value
> of the external field, but it doesn't seem to do so.
>
> For instance I have a defined a dynamic field:
>
> *_efloat
>
> then I used:
>
> sort=foo_efloat desc
> fl=foo_efloat, score, description
>
> I get the score and description but the foo_efloat seems to be missing in 
> action.
>
>
> Thoughts?
>
> C
>


Re: Custom processing in Solr Request Handler plugin and its debugging ?

2013-07-14 Thread Erick Erickson
Not sure how to do the "pass to another request handler" thing, but
the debugging part is pretty straightforward. I use IntelliJ, but as far
as I know Eclipse has very similar capabilities.

First, I cheat and path to the jar that's the output from my IDE, that
saves copying the jar around. So my solrconfig.xml file has  a lib
directive like
../../../../../eoe/project/out/artifact/jardir
where this is wherever your IDE wants to put it. It can sometimes be
tricky to get enough ../../../ in there.

Second, "edit config", select "remote" and a form comes up. Fill
in host and port, something like "localhost" and "5900" (this latter
is whatever you want". In IntelliJ that'll give you the specific command
to use to start Solr so you can attach. This looks like the following
for my setup:
java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5900
-jar start.jar

Now just fire up Solr as above. Fire up your remote debugging
session in IntelliJ. Set breakpoints as you wish. NOTE: the "suspend=y"
bit above means that Solr will do _nothing_ until you attach the
debugger and hit "go".

HTH
Erick
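On the "pass to another request handler" part of the original question, one possible sketch, assuming the Solr 4.x plugin API (SolrCore.getRequestHandler plus a LocalSolrQueryRequest); the handler name, params and response key are illustrative:

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.SolrQueryResponse;

public class MyRequestPlugin extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        SolrCore core = req.getCore();

        // Build the query we want the stock /select handler to run.
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("q", req.getParams().get("q", "*:*"));
        params.set("rows", 10);

        SolrRequestHandler select = core.getRequestHandler("/select");
        SolrQueryResponse selectRsp = new SolrQueryResponse();
        SolrQueryRequest selectReq = new LocalSolrQueryRequest(core, params);
        try {
            select.handleRequest(selectReq, selectRsp);
        } finally {
            selectReq.close();
        }

        // Post-process the /select results here, then expose them.
        NamedList<?> values = selectRsp.getValues();
        rsp.add("wrapped", values);
    }

    @Override
    public String getDescription() {
        return "custom wrapper around /select (sketch)";
    }

    @Override
    public String getSource() {
        return "";
    }
}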

On Sat, Jul 13, 2013 at 6:57 AM, Tony Mullins  wrote:
> Please any help on how to pass the search request to different
> RequestHandler from within the custom RequestHandler and how to debug the
> custom RequestHandler plugin ?
>
> Thanks,
> Tony
>
>
> On Fri, Jul 12, 2013 at 4:41 PM, Tony Mullins wrote:
>
>> Hi,
>>
>> I have defined my new Solr RequestHandler plugin like this in
>> SolrConfig.xml
>>
>> 
>> 
>>
>> And its working fine.
>>
>> Now I want to do some custom processing from my this plugin by making a
>> search query to regular '/select' handler.
>>  
>>  
>> 
>>
>> And then receive the results back from '/select' handler and perform some
>> custom processing on those results and send the response back to my custom
>> "/myendpoint" handler.
>>
>> And for this I need help on how to make a call to '/select' handler from
>> within the .MyRequestPlugin class and perform some calculation on the
>> results.
>>
>> I also need some help on how to debug my plugin ? As its .jar is been
>> deployed to solr_hom/lib ... how can I attach my plugin's code in eclipse
>> to Solr process so I could debug it when user will send request to my
>> plugin.
>>
>> Thanks,
>> Tony
>>


Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-14 Thread Erick Erickson
Done, sorry it took so long, hadn't looked at the list in a couple of days.


Erick

On Fri, Jul 12, 2013 at 5:46 PM, Ali, Saqib  wrote:
> username: saqib
>
>
> On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib  wrote:
>
>> Hello,
>>
>> Can you please add me to the ContributorsGroup? I would like to add
>> instructions for setting up SolrCloud using Jboss.
>>
>> thanks.
>>
>>


Re: Multiple queries or Filtering Queries in Solr

2013-07-14 Thread Erick Erickson
Isn't this just a filter query? (fq=)?

Something like
q=query2&fq=query1

Although I don't quite understand the 500 > 50, but you can always
tack on additional fq clauses, it's basically set intersection.

As for limiting the results a user sees, that's what the &rows parameter is for.

So another way of looking at this is "can you form a query that expresses
the use-case and just show the top N (in this case 50)?" Does that work?
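A small SolrJ sketch of that shape, with illustrative field names: query1 and any further narrowing become filter queries (set intersection, no effect on score), query2 stays the scored query, and rows caps what the end user sees:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ChainedFilterSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery();
        q.setQuery("title:fluids");            // "query2": scored
        q.addFilterQuery("body:physics");      // "query1": narrows only, no score effect
        q.addFilterQuery("year:[2000 TO *]");  // further narrowing, also cached
        q.setRows(50);                         // all the end user ever sees

        QueryResponse rsp = server.query(q);
        System.out.println("matches: " + rsp.getResults().getNumFound());

        server.shutdown();
    }
}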

Best
Erick

On Fri, Jul 12, 2013 at 10:44 AM, dcode  wrote:
>
>
> My problem is I have n fields (say around 10) in Solr that are searchable,
> they all are indexed and stored. I would like to run a query first on my
> whole index of say 5000 docs which will hit around an average of 500 docs.
> Next I would like to query using a different set of keywords on these 500
> docs and NOT on the whole index.
>
> So the first time I send a query a score will be generated, the second time
> I run a query the new score generated should be based on the 500 documents
> of the previous query, or in other words Solr should consider only these 500
> docs as the whole index.
>
> To summarise this, the index of 5000 will be filtered to 500 and then 50
> (5000>500>50). It's basically filtering, but I would like to do this in Solr.
>
> I have reasonable basic knowledge and still learning.
>
> Update: If represented mathematically it would look like this:
> results1=f(query1)
> results2=f(query2, results1)
> final_results=f(query3, results2)
>
> I would like this to be accomplished using a program and the end-user will only
> see 50 results. So faceting is not an option.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multiple-queries-or-Filtering-Queries-in-Solr-tp4077574.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does Solrj Batch Processing Querying May Confuse?

2013-07-14 Thread Erick Erickson
Well, if you can find one of the docs, or you know one of the IDs
that's missing, try explainOther, see:
http://wiki.apache.org/solr/CommonQueryParameters#explainOther
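For reference, a small SolrJ sketch of turning that on, with an illustrative document query; explainOther takes a query identifying the document to explain and needs debugQuery enabled:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplainOtherSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("lang:tr");
        q.set("debugQuery", "true");
        // Ask Solr to explain this specific document against the main query.
        q.set("explainOther", "url:\"http://example.com/missing-page\"");

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getDebugMap());

        server.shutdown();
    }
}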

Best
Erick

On Fri, Jul 12, 2013 at 8:29 AM, Furkan KAMACI  wrote:
> I've crawled some webpages and indexed them at Solr. I've queried data at
> Solr via Solrj. url is my unique field and I've defined my query like
> this:
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("q", "lang:tr");
> params.set("fl", "url");
> params.set("sort", "url desc");
>
> I've run my program to query 1000 rows per request and wrote them to a
> file. However, I realized that there are some documents that are indexed at
> Solr (I can query them from the admin page, not from Solrj as a 1000 row batch
> process) but are not in my file. What may be the problem?
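For reference, a sketch of that batch loop in SolrJ with the same query and an illustrative URL. One thing worth ruling out: if documents are added or changed between pages, plain start/rows paging over a sorted result set can skip or repeat documents.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class PagedExportSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("lang:tr");
        q.setFields("url");
        q.set("sort", "url desc");
        q.setRows(1000);

        int start = 0;
        while (true) {
            q.setStart(start);
            SolrDocumentList page = server.query(q).getResults();
            for (SolrDocument doc : page) {
                System.out.println(doc.getFieldValue("url"));   // write to file instead
            }
            if (start + page.size() >= page.getNumFound()) {
                break;   // last page reached
            }
            start += 1000;
        }

        server.shutdown();
    }
}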


How to from solr facet exclude specific “Tag”!

2013-07-14 Thread 张智
solr 4.3

this is my query request params:

(response header: status=0, QTime=15)
q=*:*
fq={!tag=city}CityId:729 AND {!tag=company}CompanyId:16122
facet=true
facet.field={!ex=city}CityId
facet.field={!ex=company}CompanyId
wt=xml
indent=true
_=1373713374569

This is the query response "Facet" content:

[facet_fields counts for CityId and CompanyId omitted: the XML markup was lost in the archive]

You can see that the CityId facet is correct: it excludes the {!tag=city}CityId:729
filter. But the CompanyId facet is not correct; it does not exclude the
{!tag=company}CompanyId:16122 filter. How can I solve this?

Re: Problem using Term Component in solr

2013-07-14 Thread Parul Gupta(Knimbus)
Hi,

The vocabulary is not known, that's the main issue; otherwise I would implement
synonyms instead.
What do you mean by 'regularizing the title'?

So please let me know some solution...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077865.html
Sent from the Solr - User mailing list archive at Nabble.com.


Getting indexed content of files using ExtractingRequestHandler

2013-07-14 Thread xan
Hi,

I'm using the PHP Solr client (ver: 1.0.2).

I'm indexing the contents through my database. 
Suppose $data is a stdClass object having id, name, title, etc. from a
database entry.

Next, I declare a solr Document and assign fields to it.:

$doc = new SolrInputDocument();
$doc->addField ('id' , $data->id);
$doc->addField ('name' , $data->name);



I wanted to know how can I store the contents of a pdf file (whose path I've
stored in $data->filepath), in the same solr document, say in a field
('filecontent').

Referring to the wiki, I was unable to figure out the proper cURL request
for achieving this. I was able to create a completely new solr document but
how do I get the contents of the pdf file in the same solr document so that
I can store that in a field?


$doc = new SolrInputDocument();
$doc->addField ('id' , $data->id);
$doc->addField ('name' , $data->name);


//fire the curl request here referring to the file at $data->filepath
$doc->addField ('filecontent' , //content of the pdf file);

Also, instead of firing the raw cURL request, is there a better way? I don't
know if the current PECL SOLR Client 1.0.2 has the feature of indexing pdf
files.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-indexed-content-of-files-using-ExtractingRequestHandler-tp4077856.html
Sent from the Solr - User mailing list archive at Nabble.com.


external file field and fl parameter

2013-07-14 Thread Chris Collins
I am playing with external file field for sorting.  I created a dynamic field 
using the ExternalFileField type.  

I naively assumed that the "fl" argument would allow me to return the value of the 
external field, but it doesn't seem to do so.

For instance I have a defined a dynamic field:

*_efloat

then I used:

sort=foo_efloat desc
fl=foo_efloat, score, description

I get the score and description but the foo_efloat seems to be missing in 
action.


Thoughts?

C