Re: ExtractingRequestHandler and Solr 3.1

2011-04-14 Thread Liam O'Boyle
Hi Grant,

After comparing the differences between my solrconfig.xml and that used by
the example, the key difference is that I didn't have <str name="captureAttr">true</str> in the defaults for the ERH.  Commenting out
this line in the example configuration causes the example to display the
same behaviour as I'm seeing.
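
For reference, the line in question in the example solrconfig.xml looks
roughly like this (quoting from memory, so treat it as a sketch; surrounding
entries elided):

  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      ...
      <str name="captureAttr">true</str>
      ...
    </lst>
  </requestHandler>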

I've added the option back in and it all works as expected, but it seems to
be a change in the configuration.  I didn't have captureAttr enabled because I
don't have it enabled in my 1.4 production environment (I'm just checking
the upgrade process at the moment) and this problem doesn't happen for me
there.  Is the change deliberate?

Thanks,
Liam

On 13 April 2011 23:25, Grant Ingersoll  wrote:

>
> On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote:
>
> > Afternoon,
> >
> > After an upgrade to Solr 3.1 which has largely been very smooth and
> > painless, I'm having a minor issue with the ExtractingRequestHandler.
> >
> > The problem is that it's inserting metadata into the extracted
> > content, as well as mapping it to a dynamic field.  Previously the
> > same configuration only mapped it to a dynamic field and I'm not sure
> > how it's managing to add it into my content as well.
> >
> > The requestHandler configuration is as follows:
> >
> >   <requestHandler name="..." startup="lazy"
> >                   class="solr.extraction.ExtractingRequestHandler">
> >     <lst name="defaults">
> >       <str name="...">attr_source_content_type</str>
> >       <str name="...">true</str>
> >       <str name="uprefix">ignored_</str>
> >     </lst>
> >   </requestHandler>
> >
> > The schema has a dynamic field for attr_*: <dynamicField name="attr_*"
> > type="textgen" indexed="true" stored="true" multiValued="true" />.
> >
> > The request being submitted is (reformatted for readability, extracted
> > from the catalina log)
> >
> > literal.ib_extension=blarg
> > literal.ib_date=2010-09-09T21:41:30Z
> > literal.ib_custom2=custom2
> > resource.name=test.txt
> > literal.ib_custom3=custom3
> > literal.ib_authorid=1
> > literal.ib_custom1=custom1
> > literal.ib_custom6=custom6
> > literal.ib_custom7=custom7
> > literal.ib_custom4=custom4
> > literal.ib_linkid=1
> > literal.ib_custom5=custom5
> > literal.ib_tags=foo
> > literal.ib_tags=bar
> > literal.ib_tags=blarg
> > commit=true
> > literal.ib_permissionid=1
> > literal.ib_filters=1
> > literal.ib_filters=2
> > literal.ib_filters=3
> > literal.ib_description=My+Description
> > literal.ib_title=My+Title
> > json.nl=map
> > wt=json
> > literal.ib_realid=1
> > literal.ib_custom9=custom9
> > literal.ib_id=fb1
> > fmap.content=ib_content
> > literal.ib_custom8=custom8
> > literal.ib_type=foobar
> > uprefix=attr_
> > literal.ib_clientid=1
> >
> > After indexing, the ib_content field contains the contents of the
> > file, prefixed with "stream_content_type application/octet-stream
> > stream_size 971 Content-Encoding UTF-8 Content-Type text/plain
> > resourceName test.txt".  These have all been mapped to the dynamic
> > field, so I have attr_content_encoding, attr_source_content_type,
> > attr_stream_content_type and attr_stream_size all with their correct
> > values as well.
> >
> > There are no copyField parameters to add content from attr_* fields
> > into anything else and I've had no luck tracking down where this is
> > coming from.  Has there been some option added which controls this
> > behaviour?
>
>
>
> I'm not aware of anything changing here, other than we upgraded Tika.  Can
> you isolate the problem and share the test?  I tried it on trunk (I can get
> 3.1.0 if needed, but they should be the same in regards to the ERH) using
> the examples on the http://wiki.apache.org/solr/ExtractingRequestHandler page
> and I don't see the behavior.
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Liam O'Boyle

IntelligenceBank Pty Ltd
Level 1, 31 Coventry Street Southbank, Victoria 3006, Australia
P:   +613 8618 7810   F:   +613 8618 7899   M: +61 403 88 66 44



ExtractingRequestHandler and Solr 3.1

2011-04-13 Thread Liam O'Boyle
Afternoon,

After an upgrade to Solr 3.1 which has largely been very smooth and
painless, I'm having a minor issue with the ExtractingRequestHandler.

The problem is that it's inserting metadata into the extracted
content, as well as mapping it to a dynamic field.  Previously the
same configuration only mapped it to a dynamic field and I'm not sure
how it's managing to add it into my content as well.

The requestHandler configuration is as follows:

  <requestHandler name="..." startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="...">attr_source_content_type</str>
      <str name="...">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

The schema has a dynamic field for attr_*: <dynamicField name="attr_*"
type="textgen" indexed="true" stored="true" multiValued="true" />.

The request being submitted is (reformatted for readability, extracted
from the catalina log)

literal.ib_extension=blarg
literal.ib_date=2010-09-09T21:41:30Z
literal.ib_custom2=custom2
resource.name=test.txt
literal.ib_custom3=custom3
literal.ib_authorid=1
literal.ib_custom1=custom1
literal.ib_custom6=custom6
literal.ib_custom7=custom7
literal.ib_custom4=custom4
literal.ib_linkid=1
literal.ib_custom5=custom5
literal.ib_tags=foo
literal.ib_tags=bar
literal.ib_tags=blarg
commit=true
literal.ib_permissionid=1
literal.ib_filters=1
literal.ib_filters=2
literal.ib_filters=3
literal.ib_description=My+Description
literal.ib_title=My+Title
json.nl=map
wt=json
literal.ib_realid=1
literal.ib_custom9=custom9
literal.ib_id=fb1
fmap.content=ib_content
literal.ib_custom8=custom8
literal.ib_type=foobar
uprefix=attr_
literal.ib_clientid=1
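
For anyone following along, here is a minimal SolrJ sketch of an equivalent
request (written from memory against the 3.x SolrJ API, with most of the
literal.* fields trimmed, so double-check before reusing):

  import java.io.File;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

  public class ExtractExample {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      ContentStreamUpdateRequest req =
          new ContentStreamUpdateRequest("/update/extract");
      req.addFile(new File("test.txt"));          // stream to run through Tika
      req.setParam("literal.ib_id", "fb1");       // literal field values
      req.setParam("literal.ib_title", "My Title");
      req.setParam("fmap.content", "ib_content"); // map extracted body text
      req.setParam("uprefix", "attr_");           // unknown metadata -> attr_*
      req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
      server.request(req);
    }
  }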

After indexing, the ib_content field contains the contents of the
file, prefixed with "stream_content_type application/octet-stream
stream_size 971 Content-Encoding UTF-8 Content-Type text/plain
resourceName test.txt".  These have all been mapped to the dynamic
field, so I have attr_content_encoding, attr_source_content_type,
attr_stream_content_type and attr_stream_size all with their correct
values as well.

There are no copyField parameters to add content from attr_* fields
into anything else and I've had no luck tracking down where this is
coming from.  Has there been some option added which controls this
behaviour?

Cheers,
Liam


Re: Solr and Permissions

2011-04-12 Thread Liam O'Boyle
ManifoldCF sounds like it might be the right solution, so long as it's
not secretly building a filter query in the back end; otherwise it
will hit the same limits.

In the meantime, I have made a minor improvement to my filter query;
it now scans the permitted IDs and attempts to build a filter query
using ranges (e.g. instead of 1 OR 2 OR 3 it will filter using
[1 TO 3]), which will hopefully keep me going for now.
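
By way of illustration, here is a rough sketch of that collapsing step (a
hypothetical helper, not our production code):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public class RangeFilterBuilder {
    /**
     * Collapse sorted numeric IDs into Solr range clauses,
     * e.g. [1, 2, 3, 7] becomes id:([1 TO 3] OR 7).
     * Assumes the id field sorts numerically (e.g. an integer field);
     * for string ids, lexicographic ranges would not match.
     */
    public static String buildFilter(List<Integer> ids) {
      List<Integer> sorted = new ArrayList<Integer>(ids);
      Collections.sort(sorted);
      List<String> clauses = new ArrayList<String>();
      int i = 0;
      while (i < sorted.size()) {
        int start = sorted.get(i);
        int end = start;
        while (i + 1 < sorted.size() && sorted.get(i + 1) == end + 1) {
          end = sorted.get(++i);                 // extend the consecutive run
        }
        clauses.add(start == end
            ? String.valueOf(start)
            : "[" + start + " TO " + end + "]"); // run becomes a range clause
        i++;
      }
      StringBuilder fq = new StringBuilder("id:(");
      for (int c = 0; c < clauses.size(); c++) {
        if (c > 0) fq.append(" OR ");
        fq.append(clauses.get(c));
      }
      return fq.append(')').toString();
    }
  }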

Liam

On 12 March 2011 01:46, go canal  wrote:
> Thank you Jan, I will take a look at ManifoldCF.
> So it seems that the solution is basically to implement something outside of
> Solr for permission control.
> thanks,
> canal
>
>
>
>
> 
> From: Jan Høydahl 
> To: solr-user@lucene.apache.org
> Sent: Fri, March 11, 2011 4:17:22 PM
> Subject: Re: Solr and Permissions
>
> Hi,
>
> Talk to the ManifoldCF guys - they have successfully implemented support for
> document level security for many repositories, including CMS/ECMs, and may
> have some hints for you on writing your own Authority connector against your
> system, which will fetch the ACL for the document and index it with the
> document itself. This eliminates long query-time filters.
>
> Re-indexing content for which ACLs have changed is a very common way of doing
> this, and you should not worry too much about performance implications before
> there is a real issue. In the real world, you don't change folder permissions
> very often, and that will be a cost you'll have to live with. If you worry
> that this lag between repository state and index state may cause people to
> see content they are not entitled to, it is possible to do late binding
> filtering of the result set as well, but I would avoid that if possible.
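>
> In outline, the index-the-ACL pattern looks something like this (a rough
> sketch; field and token names hypothetical):
>
>   <field name="acl" type="string" indexed="true" stored="false"
>          multiValued="true"/>
>
>   index time:  acl = [group_12, group_47, user_liam]
>   query time:  fq=acl:(group_12 OR group_47 OR user_liam)
>
> The filter then grows with the number of groups the user is in, not with
> the number of documents the user may see.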
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 11 March 2011, at 06:48, go canal wrote:
>
>> To be fair, I think there is a slight difference between a Content Management
>> System and a Search Engine.
>>
>> Access control at a per-document level, a per-type level, supporting dynamic
>> role changes, etc. are more like content management use cases, whereas a
>> search solution like Solr focuses on a different set of use cases.
>>
>> But in the real world, any content management system needs full text search,
>> so the question is how to support search with permission control.
>>
>> JackRabbit integrates with Lucene/Tika; this could be one solution, but I do
>> not know its performance and scalability. CouchDB also integrates with
>> Lucene/Tika - another option?
>>
>> I have yet to see a Search Engine that provides the sort of Content
>> Management features we are discussing here (Solr, Elastic Search?).
>>
>> Then the last option is probably to build an application that works with a
>> document repository with all the necessary content management features, plus
>> Solr to provide search capability, and handle the permissions outside Solr?
>> thanks,
>> canal
>>
>>
>>
>>
>> 
>> From: Liam O'Boyle 
>> To: solr-user@lucene.apache.org
>> Cc: go canal 
>> Sent: Fri, March 11, 2011 2:28:19 PM
>> Subject: Re: Solr and Permissions
>>
>> As Canal points out,  grouping into types is not always possible.
>>
>> In our case, permissions are not on a per-type level, but either on a per
>> "folder" (of which there can be hundreds) or per item in some cases (of
>> which there can be... any number at all).
>>
>> Reindexing is also too slow to really be an option; some of the items use
>> Tika to extract content, which means that we need to re-extract the content
>> (variable length of time; the average is about half a second, but on some
>> documents it will sit there until the connection times out).  Querying it,
>> modifying, then resubmitting without rerunning content extraction is still
>> faster, but involves sending even more data over the network; either way is
>> relatively slow.
>>
>> Liam
>>
>> On 11 March 2011 16:24, go canal  wrote:
>>
>>> I have similar requirements.
>>>
>>> Content type is one solution; but there are also other use cases where this
>>> is not enough.
>>>
>>> Another requirement is, when the access permission is changed, we need to
>>> update the field - my understanding is that we cannot, unless we re-index
>>> the whole document again. Am I correct?
>>> thanks,
>>> canal

Re: Solr and Permissions

2011-03-10 Thread Liam O'Boyle
As Canal points out,  grouping into types is not always possible.

In our case, permissions are not on a per-type level, but either on a per
"folder" (of which there can be hundreds) or per item in some cases (of
which there can be... any number at all).

Reindexing is also too slow to really be an option; some of the items use
Tika to extract content, which means that we need to re-extract the content
(variable length of time; the average is about half a second, but on some
documents it will sit there until the connection times out).  Querying it,
modifying, then resubmitting without rerunning content extraction is still
faster, but involves sending even more data over the network; either way is
relatively slow.

Liam

On 11 March 2011 16:24, go canal  wrote:

> I have similar requirements.
>
> Content type is one solution; but there are also other use cases where this
> is not enough.
>
> Another requirement is, when the access permission is changed, we need to
> update the field - my understanding is that we cannot, unless we re-index
> the whole document again. Am I correct?
>  thanks,
> canal
>
>
>
>
> 
> From: Sujit Pal 
> To: solr-user@lucene.apache.org
> Sent: Fri, March 11, 2011 10:39:27 AM
> Subject: Re: Solr and Permissions
>
> How about assigning content types to documents in the index, and mapping
> users to a set of content types they are allowed to access? That way you
> will pass in fewer parameters in the fq.
>
> -sujit
>
> On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote:
> > Morning,
> >
> > We use solr to index a range of content to which, within our application,
> > access is restricted by a system of user groups and permissions.  In order
> > to ensure that search results don't reveal information about items which
> > the user doesn't have access to, we need to somehow filter the results;
> > this needs to be done within Solr itself, rather than after retrieval, so
> > that the facet and result counts are correct.
> >
> > Currently we do this by creating a filter query which specifies all of the
> > items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR
> > ...)), but this has definite scalability issues - we're starting to run
> > into issues, as this can be a set of ORs of potentially unlimited size
> > (and practically, we're hitting the low thousands sometimes).  While we
> > can adjust maxBooleanClauses upwards, I understand that this has
> > performance implications...
> >
> > So, has anyone had to implement something similar in the past?  Any
> > suggestions for a more scalable approach?  Any advice on safe and sensible
> > limits on how far I can push maxBooleanClauses?
> >
> > Thanks for your advice,
> >
> > Liam
>
>
>
>



-- 
Liam O'Boyle



Solr and Permissions

2011-03-10 Thread Liam O'Boyle
Morning,

We use solr to index a range of content to which, within our application,
access is restricted by a system of user groups and permissions.  In order
to ensure that search results don't reveal information about items which the
user doesn't have access to, we need to somehow filter the results; this
needs to be done within Solr itself, rather than after retrieval, so that
the facet and result counts are correct.

Currently we do this by creating a filter query which specifies all of the
items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR ...)),
but this has definite scalability issues - we're starting to run into
issues, as this can be a set of ORs of potentially unlimited size (and
practically, we're hitting the low thousands sometimes).  While we can
adjust maxBooleanClauses upwards, I understand that this has performance
implications...
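
For reference, the knob in question lives in solrconfig.xml; a sketch (the
value is arbitrary, and I believe it sits inside the <query> section):

  <maxBooleanClauses>4096</maxBooleanClauses>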

So, has anyone had to implement something similar in the past?  Any
suggestions for a more scalable approach?  Any advice on safe and sensible
limits on how far I can push maxBooleanClauses?

Thanks for your advice,

Liam


Re: New PHP API for Solr (Logic Solr API)

2011-03-10 Thread Liam O'Boyle
How about the Solr PHP Client (http://code.google.com/p/solr-php-client/)?
 We use this and have been quite happy with it, and it seems that it
addresses all of the concerns you expressed.

What advantages does yours offer?

Liam

On 8 March 2011 17:02, Burak  wrote:

> On 03/07/2011 12:43 AM, Stefan Matheis wrote:
>
>> Burak,
>>
>> what's wrong with the existing PHP-Extension
>> (http://php.net/manual/en/book.solr.php)?
>>
> I think "wrong" is not the appropriate word here. But if I had to summarize
> why I wrote this API:
>
> * Not everybody is enthusiastic about adding another item to an already
> long list of server dependencies. I just wanted a pure PHP option.
> * I am not a C programmer either so the ability to understand the source
> code and modify it according to my needs is another advantage.
> * Yes, a PECL package would be faster. However, in 99% of the cases, after
> everything is said, coded, and byte-code cached, my biggest bottlenecks end
> up being the database and network.
> * Last of all, choice is what open source means to me.
>
> Burak


-- 
Liam O'Boyle



Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Liam O'Boyle
Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.
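
A rough SolrJ sketch of that read-modify-rewrite cycle (assumes every field
you care about is stored, and the field names are hypothetical):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrInputDocument;

  public class ReindexOneField {
    public static void updateTitle(SolrServer server, String id,
                                   String newTitle) throws Exception {
      // fetch the stored copy of the document by its unique key
      SolrDocument old = server.query(new SolrQuery("id:" + id))
          .getResults().get(0);
      SolrInputDocument doc = new SolrInputDocument();
      for (String name : old.getFieldNames()) {
        doc.addField(name, old.getFieldValue(name)); // copy stored fields
      }
      doc.setField("ib_title", newTitle);  // change just the one field
      server.add(doc);                     // re-adding replaces by uniqueKey
      server.commit();
    }
  }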

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
> the document was already indexed, and I have only the document id.  How do I
> do that?
>
> Thanks.
>
>
>


Re: Date faceting +1MONTH problem

2010-09-09 Thread Liam O'Boyle
Hi Chris,

Yes, I saw the facet.range.include feature and briefly tried to implement it
before realising that it was Solr 3.1 only :)  I agree that it seems like
the best solution to the problem.
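
For anyone who can use it, a sketch of what I believe the 3.1 range-faceting
equivalent looks like (untested on my side, so double-check the parameter
names):

  facet=true
  facet.range=ib_date
  facet.range.start=2009-01-01T00:00:00Z
  facet.range.end=2010-01-01T00:00:00Z
  facet.range.gap=+1MONTH
  facet.range.include=lower

With facet.range.include=lower, each bucket includes its lower bound and
excludes its upper, so boundary values are only counted once.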

Reindexing with a +1MILLI hack had occurred to me and I guess that's what
I'll do in the meantime; it just seemed like something that people must have
run into before!  I suppose it depends on the granularity of your
timestamps; all of my values are actually just dates, so I've been putting
them in as the date with T00:00:00.000Z, which makes the overlap problem
very obvious.

If anyone else has come across a solution for this, feel free to suggest
another approach, otherwise it's reindexing time.

Cheers,
Liam


On Fri, Sep 10, 2010 at 8:38 AM, Chris Hostetter wrote:

> : I'm trying to break down the data over a year into facets by month; to
> : avoid overlap, I'm using -1MILLI on the start and end dates and using a
> : gap of +1MONTH.
> :
> : However, it seems like February completely breaks my monthly cycles,
> : leading
>
> Yep.
>
> Everything you posted makes sense to me in how DateMath works - "Jan 31 @
> 23:59.999" + "1 MONTH" results in "Feb 28 @ 23:59.999" ... at which point
> adding "1 MONTH" to that results in "Mar 28 @ ..." because there is no
> context of what the initial starting point was.
>
> It's not a situation I've ever personally run into ... one workaround
> would be to use a "+1MILLI" fudge factor at indexing time, instead of a
> "-1MILLI" fudge factor at query time ... that shouldn't have this problem.
>
> If you'd like to open a bug to track this, I think it might be possible to
> fix this behavior (there are some features in the Java calendaring code
> that make things like "Jan 31 + 2 Months" do the right thing) but
> personally I think working on SOLR-1896 (combined with the new
> facet.range.include param) is a more effective use of time so
> we can eliminate the need for this type of hack completely in future Solr
> releases.
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss  ...  Stump The Chump!
>
>


Date faceting +1MONTH problem

2010-09-09 Thread Liam O'Boyle
Evening,

I'm trying to break down the data over a year into facets by month; to avoid
overlap, I'm using -1MILLI on the start and end dates and using a gap of
+1MONTH.

However, it seems like February completely breaks my monthly cycles, leading
to incorrect counts further down the line; facets that are after February
only go to the 28th of the month, and items in the other two or three days
get pushed into the next facet.  What's the correct way to do this?

An example is shown below; the facet periods go 2008-12-31, 2009-01-31,
2009-02-28 and from then on only hit the 28th of each month.

[2008-12-31T23:59:59.999Z] => 0
[2009-01-31T23:59:59.999Z] => 0
[2009-02-28T23:59:59.999Z] => 0
[2009-03-28T23:59:59.999Z] => 0
[2009-04-28T23:59:59.999Z] => 0
[2009-05-28T23:59:59.999Z] => 0
[2009-06-28T23:59:59.999Z] => 0
[2009-07-28T23:59:59.999Z] => 0
[2009-08-28T23:59:59.999Z] => 13
[2009-09-28T23:59:59.999Z] => 6
[2009-10-28T23:59:59.999Z] => 2
[2009-11-28T23:59:59.999Z] => 7
[gap] => +1MONTH
[end] => 2009-12-28T23:59:59.999Z

Thanks for your help,

Liam


Re: Date Facets

2010-02-24 Thread Liam O'Boyle
In response to myself,

The problem occurs because the date ranges are inclusive.  I can fix
this by making facet.date.gap = +1MONTH-1SECOND, but is there a way to
specify that the upper bound is exclusive, rather than inclusive?
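
For the record, the parameters that behave as intended for me (a sketch; note
that the gap is applied cumulatively, so the missing second nudges each later
bucket boundary back slightly):

  facet=true
  facet.date=ib_date
  facet.date.start=2000-01-01T00:00:00Z
  facet.date.end=2000-12-31T23:59:59Z
  facet.date.gap=+1MONTH-1SECOND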

Liam

On Wed, 2010-02-24 at 16:54 +1100, Liam O'Boyle wrote:
> Afternoon,
> 
> I have a strange problem occurring with my date faceting.  I seem to
> have more results in my facets than in my actual result set.
> 
> The query filters by date to show results for one year, i.e.
> ib_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z], then uses date
> faceting to break up the dates by month, using the following
> parameters
> 
> facet=true
> facet.date=ib_date
> facet.date.start=2000-01-01T00:00:00Z
> facet.date.end=2000-12-31T23:59:59Z
> facet.date.gap=+1MONTH
> 
> However, I end up with more numbers in the facets than there are
> documents in the response, including facets for dates that aren't
> matched. See below for a summary of the results pulled out
> through /solr/select.
> 
> 
> Document dates in the result set:
> 
>   2000-12-01T00:00:00Z
>   2000-08-01T00:00:00Z
>   2000-06-01T00:00:00Z
>   2000-11-01T00:00:00Z
> 
> Date facet counts for ib_date (buckets keyed by their start date):
> 
>   2000-01-01T00:00:00Z: 0
>   2000-02-01T00:00:00Z: 0
>   2000-03-01T00:00:00Z: 0
>   2000-04-01T00:00:00Z: 0
>   2000-05-01T00:00:00Z: 1
>   2000-06-01T00:00:00Z: 1
>   2000-07-01T00:00:00Z: 1
>   2000-08-01T00:00:00Z: 1
>   2000-09-01T00:00:00Z: 0
>   2000-10-01T00:00:00Z: 1
>   2000-11-01T00:00:00Z: 2
>   2000-12-01T00:00:00Z: 1
>   gap: +1MONTH
>   end: 2001-01-01T00:00:00Z
> 
> Is there something I'm missing here?
> 
> Thanks,
> Liam




Date Facets

2010-02-23 Thread Liam O'Boyle
Afternoon,

I have a strange problem occurring with my date faceting.  I seem to
have more results in my facets than in my actual result set.

The query filters by date to show results for one year, i.e.
ib_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z], then uses date
faceting to break up the dates by month, using the following parameters

facet=true
facet.date=ib_date
facet.date.start=2000-01-01T00:00:00Z
facet.date.end=2000-12-31T23:59:59Z
facet.date.gap=+1MONTH

However, I end up with more numbers in the facets than there are
documents in the response, including facets for dates that aren't
matched. See below for a summary of the results pulled out
through /solr/select.


Document dates in the result set:

  2000-12-01T00:00:00Z
  2000-08-01T00:00:00Z
  2000-06-01T00:00:00Z
  2000-11-01T00:00:00Z

Date facet counts for ib_date (buckets keyed by their start date):

  2000-01-01T00:00:00Z: 0
  2000-02-01T00:00:00Z: 0
  2000-03-01T00:00:00Z: 0
  2000-04-01T00:00:00Z: 0
  2000-05-01T00:00:00Z: 1
  2000-06-01T00:00:00Z: 1
  2000-07-01T00:00:00Z: 1
  2000-08-01T00:00:00Z: 1
  2000-09-01T00:00:00Z: 0
  2000-10-01T00:00:00Z: 1
  2000-11-01T00:00:00Z: 2
  2000-12-01T00:00:00Z: 1
  gap: +1MONTH
  end: 2001-01-01T00:00:00Z

Is there something I'm missing here?

Thanks,
Liam




Re: Upgrading Tika in Solr

2010-02-17 Thread Liam O'Boyle
I just copied in the newer .jars and got rid of the old ones and
everything seemed to work smoothly enough.

Liam

On Tue, 2010-02-16 at 13:11 -0500, Grant Ingersoll wrote:
> I've got a task open to upgrade to 0.6.  Will try to get to it this week.  
> Upgrading is usually pretty trivial.
> 
> 
> On Feb 14, 2010, at 12:37 AM, Liam O'Boyle wrote:
> 
> > Afternoon,
> > 
> > I've got a large collection of documents which I'm attempting to add to
> > a Solr index using Tika via the ExtractingRequestHandler, but there are
> > a large number that it has problems with (PDFs, PPTX and XLS documents
> > mainly).  
> > 
> > I've tried them with the most recent stand alone version of Tika and it
> > handles most of the failing documents correctly.  I tried using a recent
> > nightly build of Solr, but the same problems seem to occur.
> > 
> > Are there instructions somewhere on installing a more recent Tika build
> > into Solr?
> > 
> > Thanks,
> > Liam
> > 
> > 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
> 




Upgrading Tika in Solr

2010-02-13 Thread Liam O'Boyle
Afternoon,

I've got a large collection of documents which I'm attempting to add to
a Solr index using Tika via the ExtractingRequestHandler, but there are
a large number that it has problems with (PDFs, PPTX and XLS documents
mainly).  

I've tried them with the most recent stand alone version of Tika and it
handles most of the failing documents correctly.  I tried using a recent
nightly build of Solr, but the same problems seem to occur.

Are there instructions somewhere on installing a more recent Tika build
into Solr?

Thanks,
Liam