Re: Using POS payloads for chunking

2017-06-15 Thread José Tomás Atria
Ah, good to know! I'm actually using lower-level calls, as I'm building the TokenStream by hand from UIMA annotations and not using any analyzer, but I'll keep that in mind for future projects. Thanks! On Thu, Jun 15, 2017 at 12:10 PM Erick Erickson wrote: > José: > > Do note that, while the byt

Re: Using POS payloads for chunking

2017-06-15 Thread Erick Erickson
José: Do note that, while the bytearray isn't limited, prior to LUCENE-7705 most of the tokenizers you would use limited the incoming token to 256 at most. This is not at all a _Lucene_ limitation at a low level, rather if you're indexing data with a delimited payload (say abc|your_payload_here) t
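The delimited-payload convention Erick describes can be sketched as follows. This is an illustrative Python sketch, not Lucene code: in Lucene the splitting is done by DelimitedPayloadTokenFilter in Java, but the point survives translation — the tokenizer sees the whole "term|payload" string, so a long payload could push the combined token past a tokenizer's maximum token length (256 characters was a common pre-LUCENE-7705 default) even though the payload byte array itself is unbounded.

```python
# Sketch of the "term|payload" convention. MAX_TOKEN_LENGTH models the
# tokenizer-side limit Erick mentions; the value is illustrative.
MAX_TOKEN_LENGTH = 256

def split_delimited_payload(raw: str, delimiter: str = "|"):
    """Split a 'term|payload' token into (term, payload bytes).

    Raises if the *combined* token is too long, which is the limit the
    whole token (term + delimiter + payload) runs into, not the payload.
    """
    if len(raw) > MAX_TOKEN_LENGTH:
        raise ValueError(f"token of {len(raw)} chars exceeds tokenizer limit")
    term, _, payload = raw.partition(delimiter)
    return term, payload.encode("utf-8") if payload else b""

term, payload = split_delimited_payload("abc|NN")
```

A token with no delimiter simply yields an empty payload, mirroring the filter's behavior of indexing the term without one.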

Re: Using POS payloads for chunking

2017-06-15 Thread José Tomás Atria
Hi Markus, thanks for your response! Now I feel stupid: that is clearly a much simpler approach, and it has the added benefit that it would not require me to meddle in the scoring process, which I'm still a bit terrified of. Thanks for the tip. I guess the question is still valid though? i.e. h
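The simpler approach José is referring to — Markus's suggestion to index each POS tag as a token at the same position as its word, synonym-style — can be sketched as below. This is a conceptual Python sketch; in actual Lucene code the stacking is expressed by setting PositionIncrementAttribute to 0 on the tag token within a TokenStream.

```python
# Synonym-style stacking: the word advances the token position by 1,
# and its POS tag is emitted with a position increment of 0, so both
# occupy the same position (just like a synonym would).
def pos_tokens(tagged_words):
    """tagged_words: list of (word, pos_tag) pairs.
    Returns (term, position_increment) tuples."""
    out = []
    for word, tag in tagged_words:
        out.append((word, 1))  # the word itself, new position
        out.append((tag, 0))   # POS tag stacked at the same position
    return out

stream = pos_tokens([("dogs", "NNS"), ("bark", "VBP")])
```

With tags indexed this way, phrase and span queries can match on tags and words interchangeably, with no custom scoring needed — which is exactly the appeal José notes.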

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
encode. > > > > Thanks! > > Markus > > > > -Original message- > > > From:Erick Erickson > > > Sent: Wednesday 14th June 2017 23:29 > > > To: java-user > > > Subject: Re: Using POS payloads for chunking > > > > > &

Re: Using POS payloads for chunking

2017-06-14 Thread Tommaso Teofili
s. Payloads are versatile! > > > > > > The downside of payloads is that they are limited to 8 bits. Although > we can easily fit our reduced treebank in there, we also use single bits to > signal for compound/subword, and stemmed/unstemmed and some others. > >

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
-Original message- > From:Erick Erickson > Sent: Wednesday 14th June 2017 23:29 > To: java-user > Subject: Re: Using POS payloads for chunking > > Markus: > > I don't believe that payloads are limited in size at all. LUCENE-7705 > was done in part because there

Re: Using POS payloads for chunking

2017-06-14 Thread Erick Erickson
ds, > Markus > > -Original message- >> From:Erik Hatcher >> Sent: Wednesday 14th June 2017 23:03 >> To: java-user@lucene.apache.org >> Subject: Re: Using POS payloads for chunking >> >> Markus - how are you encoding payloads as bitsets and use them for scori

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
23:03 > To: java-user@lucene.apache.org > Subject: Re: Using POS payloads for chunking > > Markus - how are you encoding payloads as bitsets and use them for scoring? > Curious to see how folks are leveraging them. > > Erik > > > On Jun 14, 2017, at 4:45 PM, Mar

Re: Using POS payloads for chunking

2017-06-14 Thread Erik Hatcher
Markus - how are you encoding payloads as bitsets and using them for scoring? Curious to see how folks are leveraging them. Erik > On Jun 14, 2017, at 4:45 PM, Markus Jelsma wrote: > > Hello, > > We use POS-tagging too, and encode them as payload bitsets for scoring, which > is, as f

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello, We use POS-tagging too, and encode them as payload bitsets for scoring, which is, as far as I know, the only possibility with payloads. So, instead of encoding them as payloads, why not index your treebank's POS tags as tokens on the same position, like synonyms. If you do that, you can
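The single-byte bitset layout Markus describes elsewhere in the thread (a reduced treebank tag plus single flag bits for compound/subword and stemmed/unstemmed, all within 8 bits) might look like the sketch below. The exact bit assignments here are an assumption for illustration, not Markus's actual layout.

```python
# Hypothetical layout: low 6 bits hold a reduced-treebank tag id,
# the top 2 bits are single-bit flags. Bit positions are assumed.
FLAG_COMPOUND = 1 << 7  # compound/subword flag (assumed position)
FLAG_STEMMED = 1 << 6   # stemmed/unstemmed flag (assumed position)
TAG_MASK = 0x3F         # 6 bits left for the reduced treebank

def encode_payload(tag_id: int, compound: bool = False, stemmed: bool = False) -> int:
    """Pack a reduced POS tag id and flags into one payload byte."""
    if not 0 <= tag_id <= TAG_MASK:
        raise ValueError("tag id must fit in 6 bits")
    b = tag_id
    if compound:
        b |= FLAG_COMPOUND
    if stemmed:
        b |= FLAG_STEMMED
    return b

def decode_payload(b: int):
    """Unpack a payload byte back into (tag_id, compound, stemmed)."""
    return b & TAG_MASK, bool(b & FLAG_COMPOUND), bool(b & FLAG_STEMMED)
```

On the scoring side, a payload-aware query (e.g. Lucene's PayloadScoreQuery with a custom decoder) would read this byte back and boost or filter on the decoded bits.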